DETECTING POSITION OF WORD BREAKS IN A TEXTUAL LINE IMAGE

US 20110243445A1
Filed: 03/30/2010
Published: 10/06/2011
Est. Priority Date: 03/30/2010
Status: Active Grant


Abstract

Line segmentation in an OCR process is performed to detect the positions of words within an input textual line image by extracting features from the input to locate breaks and then classifying the breaks into one of two break classes which include inter-word breaks and inter-character breaks. An output including the bounding boxes of the detected words and a probability that a given break belongs to the identified class can then be provided to downstream OCR or other components for post-processing. Advantageously, by reducing line segmentation to the extraction of features, including the position of each break and the number of break features, and break classification, the task of line segmentation is made less complex but with no loss of generality.
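
As a rough illustration of the flow the abstract describes (locate breaks, classify each one, then emit word extents together with probabilities), here is a minimal Python sketch. It assumes the line image has already been reduced to a per-column ink flag, and it substitutes a hand-set logistic on break width relative to the median break width for the trained classifier. `Break`, `find_breaks`, `classify`, `segment_words`, and the `steepness` and `threshold` parameters are hypothetical names chosen for this example, not taken from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Break:
    start: int            # first ink-free column of the gap
    end: int              # one past the last ink-free column
    p_word: float = 0.0   # probability that the gap is an inter-word break

    @property
    def width(self):
        return self.end - self.start


def find_breaks(ink_columns):
    """Return the interior runs of ink-free columns (gaps between ink)."""
    breaks, start = [], None
    for x, has_ink in enumerate(ink_columns):
        if not has_ink and start is None:
            start = x
        elif has_ink and start is not None:
            if start > 0:          # a run starting at column 0 is the left margin
                breaks.append(Break(start, x))
            start = None
    return breaks                  # an unclosed final run is the right margin


def classify(breaks, steepness=2.0):
    """Assumed stand-in for the trained classifier: logistic on break width."""
    if not breaks:
        return breaks
    widths = sorted(b.width for b in breaks)
    median = widths[len(widths) // 2]          # upper median
    for b in breaks:
        b.p_word = 1.0 / (1.0 + math.exp(-steepness * (b.width - 1.5 * median)))
    return breaks


def segment_words(ink_columns, breaks, threshold=0.5):
    """Return (left, right) column extents of words; assumes some ink exists."""
    first = ink_columns.index(True)
    last = len(ink_columns) - 1 - ink_columns[::-1].index(True)
    words, left = [], first
    for b in breaks:
        if b.p_word >= threshold:              # treat as an inter-word break
            words.append((left, b.start))
            left = b.end
    words.append((left, last + 1))
    return words
```

For example, with `cols = [c == "#" for c in "##.##...#.##"]`, calling `segment_words(cols, classify(find_breaks(cols)))` splits the line at the three-column gap only. A real system would return full bounding boxes and keep `p_word` alongside each break for downstream stages.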

20 Claims

1. A method for segmenting words from a textual line image, the method comprising the steps of:
- extracting features from the textual line image using a featurization component;
- calculating breaks using the extracted features;
- using a classifier for classifying each of the breaks into classes, the classes including an inter-word break class and an inter-character break class, and for determining probabilities that classified breaks are members of the classes; and
- segmenting words from the textual line image using the breaks and probabilities.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14):
  - 2. The method of claim 1 in which the extracted features are selected from ones of absolute features, relative line features, relative break features, relative ink features, relative ink-to-ink features, relative break proximity features, or word recognition features.
  - 3. The method of claim 2 in which the absolute features are selected from one or more of break width in pixels, distribution of all break widths in pixels, x-height in pixels, stroke width in pixels, textual line image height in pixels, textual line image width in pixels, total break width in pixels, ink width in pixels, ink height in pixels, distribution of ink-to-ink widths in pixels, or ink-to-ink area.
  - 4. The method of claim 3 in which the distribution of all break widths includes at least one of 90th percentile of the distribution, 50th percentile of the distribution, 10th percentile of the distribution, or a number of breaks in the textual line image.
  - 5. The method of claim 3 in which the distribution of ink-to-ink widths includes at least one of 100th percentile of the distribution, 90th percentile of the distribution, 50th percentile of the distribution, 10th percentile of the distribution, or 0th percentile of the distribution.
  - 6. The method of claim 2 in which the relative line features are selected from one or more of estimated number of characters, number of breaks per estimated number of characters, all breaks width per line width, or median break width per x-height.
  - 7. The method of claim 2 in which the relative break features are selected from one or more of break width per x-height, break width per 90th percentile break distribution, break width per 50th percentile break distribution, break width per 10th percentile break distribution, break width per previous break width, or break width per next break width.
  - 8. The method of claim 2 in which the relative ink features are selected from one or more of distance from ink bottom to baseline per x-height and distance from ink top to x-height per x-height.
  - 9. The method of claim 2 in which the relative ink-to-ink features are selected from one or more of 100th percentile of ink-to-ink width distribution per x-height, 90th percentile of ink-to-ink width distribution per x-height, 60th percentile of ink-to-ink width distribution per x-height, 10th percentile of ink-to-ink width distribution per x-height, 0th percentile of ink-to-ink width distribution per x-height, 100th percentile of ink-to-ink width distribution per median break width, 90th percentile of ink-to-ink width distribution per median break width, 60th percentile of ink-to-ink width distribution per median break width, 10th percentile of ink-to-ink width distribution per median break width, 0th percentile of ink-to-ink width distribution per median break width, or ink-to-ink area per effective ink-to-ink height.
  - 10. The method of claim 2 in which the relative break proximity features are selected from one or more of surrounding break width per x-height or surrounding break width per median break width.
  - 11. The method of claim 2 in which the word recognition features are selected from one or more of word confidence, character confidence for each character in a word, word frequency as reported by a language model, advanced language model features, or word length in characters.
  - 12. The method of claim 1 in which the classifier is selected from one of decision tree classifier, AdaBoost classifier that is configured on top of the decision tree classifier, clustering classifier, neural network classifier, or iterative gradient descender classifier.
  - 13. The method of claim 1 in which the classifier is trained using results provided by engines which are located upstream or downstream of the featurization component and classifier.
  - 14. The method of claim 1 in which the classifier is trained using an independent scope implementation.
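
To make the break features of claims 3, 4, and 7 concrete, the following Python sketch builds one feature row per break from a list of break widths and an x-height, both in pixels. It is an illustration under assumptions: the nearest-rank percentile rule, the use of 0.0 when a break has no previous or next neighbor, and the names `percentile`, `break_features`, and the dictionary keys are choices made for this example and are not specified in the claims.

```python
def percentile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list; q is an integer 0..100."""
    rank = max(1, (q * len(sorted_vals) + 99) // 100)   # ceil(q * n / 100), at least 1
    return sorted_vals[rank - 1]


def break_features(widths, x_height):
    """One feature dict per break; widths and x_height must be positive."""
    dist = sorted(widths)
    p90, p50, p10 = (percentile(dist, q) for q in (90, 50, 10))
    rows = []
    for i, w in enumerate(widths):
        prev_w = widths[i - 1] if i > 0 else None
        next_w = widths[i + 1] if i + 1 < len(widths) else None
        rows.append({
            # absolute features (claims 3 and 4)
            "width": w,
            "p90": p90, "p50": p50, "p10": p10,
            "num_breaks": len(widths),
            # relative break features (claim 7)
            "width_per_x_height": w / x_height,
            "width_per_p90": w / p90,
            "width_per_p50": w / p50,
            "width_per_p10": w / p10,
            # assumption: 0.0 marks a missing neighbor at the line ends
            "width_per_prev": w / prev_w if prev_w else 0.0,
            "width_per_next": w / next_w if next_w else 0.0,
        })
    return rows
```

Calling `break_features([2, 3, 2, 12, 3], x_height=10)` yields five rows. The 12-pixel break stands out on every relative feature, which is the kind of signal a classifier can use to separate inter-word breaks from inter-character breaks regardless of font size.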

15. A method for segmenting and recognizing words in a textual line image, the method comprising the steps of:
- applying featurization to the textual line image to extract numeric features from which breaks are calculated;
- classifying the breaks into one of two classes including an inter-word break class or an inter-character break class;
- determining probabilities that the classified breaks are validly classified into the one of two classes;
- extracting word features from words in the textual line image, the word features including at least one of word confidence, character confidence, word frequency, grammar, or word length; and
- selecting a line segmentation using the extracted numeric features and the extracted word features.
- Dependent claims (16):
  - 16. The method of claim 15 including a further step of using the probabilities when selecting the line segmentation.
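
One way to picture the selection step of claims 15 and 16 is to score every candidate labeling of the breaks with both the break probabilities and a word feature, then keep the best one. The Python sketch below does this exhaustively for a short line. It is only an illustration: the pieces between breaks are assumed to be already recognized as text, word frequency comes from a toy `LEXICON`, and the additive log score, `UNKNOWN`, `words_for`, `score`, and `select_segmentation` are hypothetical choices, not the patent's method.

```python
import math
from itertools import product

LEXICON = {"the": 0.05, "cat": 0.01}   # hypothetical word frequencies
UNKNOWN = 1e-6                         # assumed frequency for out-of-lexicon words


def words_for(pieces, labels):
    """Join pieces into words; labels[k] is True if break k is inter-word."""
    words, current = [], pieces[0]
    for piece, is_word_break in zip(pieces[1:], labels):
        if is_word_break:
            words.append(current)
            current = piece
        else:
            current += piece
    words.append(current)
    return words


def score(pieces, p_word, labels):
    """Log score: break probabilities plus word frequencies (0 < p < 1)."""
    s = sum(math.log(p if lab else 1.0 - p) for p, lab in zip(p_word, labels))
    s += sum(math.log(LEXICON.get(w, UNKNOWN)) for w in words_for(pieces, labels))
    return s


def select_segmentation(pieces, p_word):
    """Try all 2**n labelings of the n breaks and return the best word list."""
    candidates = product([False, True], repeat=len(p_word))
    best = max(candidates, key=lambda labels: score(pieces, p_word, labels))
    return words_for(pieces, best)
```

With `pieces = ["t", "he", "c", "at"]` and `p_word = [0.2, 0.55, 0.6]`, thresholding the probabilities at 0.5 alone would split the line into "the", "c", "at", whereas `select_segmentation` returns `["the", "cat"]` because the word feature outweighs the weakly confident third break. Exhaustive enumeration grows as 2^n, so it is only practical for very short lines.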

17. An optical character recognition system architecture, comprising:
- one or more pre-processing stages configured for providing a gray-scale textual line image;
- a line segmentation engine that implements a featurization component and a break classifier, the featurization component being configured for extracting features from the textual line image to calculate breaks in the textual line image, and the break classifier being configured for i) classifying the breaks into classes including an inter-word break class and an inter-character break class, and for ii) determining probabilities that given breaks are members of the classes; and
- one or more post-processing stages configured for receiving the classified breaks and probabilities and for detecting words in the textual line image using the received classified breaks and probabilities.
- Dependent claims (18, 19, 20):
  - 18. The optical character recognition system architecture of claim 17 further including a word break lattice engine configured for generating a word break lattice including one or more words using the breaks in the textual line image.
  - 19. The optical character recognition system architecture of claim 18 further including a word recognizer that is combined with the line segmentation engine, the word recognizer being configured to extract word features from each of the words in the word lattice.
  - 20. The optical character recognition system architecture of claim 19 further including a word break beam search engine configured for picking a line segmentation using the extracted word features and the extracted textual line features.
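
The lattice and beam search of claims 18 to 20 can be sketched as follows. Each lattice edge is a candidate word spanning one or more consecutive pieces of ink, a stand-in word recognizer assigns it a feature score, and a beam search walks the lattice from left to right keeping only the best few partial segmentations. All names here (`build_lattice`, `word_features`, `beam_search`, `max_span`, `beam_width`), the toy lexicon, and the additive log score are assumptions made for illustration and do not come from the patent.

```python
import math
from collections import defaultdict


def build_lattice(pieces, max_span=4):
    """Edges (i, j, word): a candidate word covering pieces[i:j]."""
    n = len(pieces)
    return [(i, j, "".join(pieces[i:j]))
            for i in range(n)
            for j in range(i + 1, min(n, i + max_span) + 1)]


def word_features(word, lexicon):
    """Stand-in for a word recognizer: log frequency, tiny for unknown words."""
    return math.log(lexicon.get(word, 1e-6))


def beam_search(pieces, p_word, lexicon, beam_width=3):
    """Pick a segmentation; p_word[k] is for the break after pieces[k]."""
    n = len(pieces)
    edges = defaultdict(list)
    for i, j, word in build_lattice(pieces):
        edges[i].append((j, word))
    beams = defaultdict(list)
    beams[0] = [(0.0, [])]
    for i in range(n):
        # prune: keep only the best hypotheses ending at position i
        beams[i] = sorted(beams[i], key=lambda h: h[0], reverse=True)[:beam_width]
        for score, words in beams[i]:
            for j, word in edges[i]:
                s = score + word_features(word, lexicon)
                # breaks inside the word are treated as inter-character
                s += sum(math.log(1.0 - p_word[k]) for k in range(i, j - 1))
                # the break that closes the word is inter-word, unless at line end
                if j < n:
                    s += math.log(p_word[j - 1])
                beams[j].append((s, words + [word]))
    return max(beams[n], key=lambda h: h[0])[1]
```

For instance, `beam_search(["in", "k", "to", "p"], [0.3, 0.7, 0.4], {"ink": 0.01, "top": 0.02})` returns `["ink", "top"]`. Unlike exhaustive enumeration, the work per position is bounded by `beam_width` times the number of outgoing edges, at the cost of possibly pruning the best path when the beam is too narrow.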

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Dresevic, Bodin; Radakovic, Bogdan; Uzelac, Aleksandar; Galic, Sasa

Granted Patent

US 8,345,978 B2
US Class Current

382/177
CPC Class Codes

G06V 30/10   Character recognition

G06V 30/15   Cutting or merging image el...

G06V 30/153   using recognition of charac...
