DETECTING POSITION OF WORD BREAKS IN A TEXTUAL LINE IMAGE
Abstract
Line segmentation in an OCR process is performed to detect the positions of words within an input textual line image by extracting features from the input to locate breaks and then classifying the breaks into one of two break classes which include inter-word breaks and inter-character breaks. An output including the bounding boxes of the detected words and a probability that a given break belongs to the identified class can then be provided to downstream OCR or other components for post-processing. Advantageously, by reducing line segmentation to the extraction of features, including the position of each break and the number of break features, and break classification, the task of line segmentation is made less complex but with no loss of generality.
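The abstract does not say how breaks are located from the extracted features. As an illustrative sketch only, the Python below assumes a binarized line image (1 for ink, 0 for background) and treats every interior run of ink-free columns as a break, recording its position and width. The names `Break` and `find_breaks` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Break:
    start: int  # first ink-free column of the gap (inclusive)
    end: int    # one past the last ink-free column (exclusive)

    @property
    def width(self) -> int:
        return self.end - self.start


def find_breaks(image: List[List[int]]) -> List[Break]:
    """Return interior runs of ink-free columns in a binary line image.

    `image` is a list of equal-length rows; 1 marks ink, 0 marks background.
    Leading and trailing margins are not reported as breaks.
    """
    if not image:
        return []
    n_cols = len(image[0])
    # Vertical projection: a column has ink if any row has a 1 in it.
    has_ink = [any(row[c] for row in image) for c in range(n_cols)]

    breaks: List[Break] = []
    gap_start = None   # column where the current blank run began
    seen_ink = False   # becomes True once the left margin has been passed
    for col, ink in enumerate(has_ink):
        if ink:
            if gap_start is not None and seen_ink:
                breaks.append(Break(gap_start, col))
            gap_start = None
            seen_ink = True
        elif gap_start is None:
            gap_start = col
    return breaks
```

Each `Break` carries the kind of per-break feature (position, width) that a downstream classifier could consume. A gray-scale input would need thresholding first, which this sketch leaves out.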
20 Claims
1. A method for segmenting words from a textual line image, the method comprising the steps of:
- extracting features from the textual line image using a featurization component;
- calculating breaks using the extracted features;
- using a classifier for classifying each of the breaks into classes, the classes including an inter-word break class and an inter-character break class, and for determining probabilities that classified breaks are members of the classes; and
- segmenting words from the textual line image using the breaks and probabilities.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
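The classification and segmentation steps of claim 1 can be sketched as follows. This is a toy stand-in, not the patent's classifier: it assumes breaks are given as (start, end) column spans and scores each one with a logistic function of its width relative to twice the line's median gap width. The function names, the `steepness` parameter, and the median heuristic are all assumptions made for illustration. A trained classifier over richer features would replace `classify_breaks` in practice.

```python
import math
from statistics import median
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) columns, end exclusive


def classify_breaks(breaks: List[Span], steepness: float = 2.0) -> List[Tuple[str, float]]:
    """Label each break and return the probability of the chosen label.

    Heuristic: the wider a gap is relative to twice the median gap width
    of the line, the more likely it separates two words.
    """
    if not breaks:
        return []
    widths = [end - start for start, end in breaks]
    threshold = 2.0 * median(widths)
    results = []
    for w in widths:
        # Logistic score: probability that this gap is an inter-word break.
        p_word = 1.0 / (1.0 + math.exp(-steepness * (w - threshold)))
        if p_word >= 0.5:
            results.append(("inter-word", p_word))
        else:
            results.append(("inter-character", 1.0 - p_word))
    return results


def segment_words(line_span: Span, breaks: List[Span],
                  labels: List[Tuple[str, float]], min_prob: float = 0.5) -> List[Span]:
    """Cut the line at breaks labeled inter-word with probability >= min_prob."""
    words = []
    word_start = line_span[0]
    for (b_start, b_end), (label, prob) in zip(breaks, labels):
        if label == "inter-word" and prob >= min_prob:
            words.append((word_start, b_start))
            word_start = b_end
    words.append((word_start, line_span[1]))
    return words
```

For a line spanning columns 0 to 25 with gaps at `[(3, 4), (8, 9), (12, 18), (21, 22)]`, only the six-column gap is labeled inter-word, so `segment_words` yields two word spans. The median heuristic fails on lines where most gaps are word gaps, which is one reason a learned classifier is preferable.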
15. A method for segmenting and recognizing words in a textual line image, the method comprising the steps of:
- applying featurization to the textual line image to extract numeric features from which breaks are calculated;
- classifying the breaks into one of two classes including an inter-word break class or an inter-character break class;
- determining probabilities that the classified breaks are validly classified into the one of two classes;
- extracting word features from words in the textual line image, the word features including at least one of word confidence, character confidence, word frequency, grammar, or word length; and
- selecting a line segmentation using the extracted numeric features and the extracted word features.
Dependent claims: 16
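One way to picture the final selection step of claim 15 is a search over candidate segmentations, each scored by combining break probabilities with word-level evidence. The sketch below is an assumption about how such a combination could work, not the claimed method: it sums log-probabilities of the chosen break labels with a caller-supplied `word_score` (a hypothetical callable standing in for features such as word frequency or recognizer confidence) and searches exhaustively, which is only practical for short lines.

```python
import math
from itertools import product
from typing import Callable, List, Sequence, Tuple


def select_segmentation(
    chars: Sequence[str],
    p_word_break: Sequence[float],
    word_score: Callable[[str], float],
) -> Tuple[List[str], float]:
    """Pick the split of `chars` into words with the highest combined score.

    `chars` must be non-empty. `p_word_break[i]` is the probability that the
    break after chars[i] is an inter-word break, so it has len(chars) - 1 entries.
    """
    assert len(p_word_break) == len(chars) - 1
    best_words: List[str] = []
    best_score = -math.inf
    # Try every labeling of the breaks (True = inter-word, False = inter-character).
    for cuts in product([False, True], repeat=len(p_word_break)):
        score = 0.0
        words, current = [], chars[0]
        for ch, cut, p in zip(chars[1:], cuts, p_word_break):
            # Break evidence: log-probability of the label chosen for this break,
            # clamped so that a probability of exactly 0 does not raise.
            score += math.log(max(p if cut else 1.0 - p, 1e-9))
            if cut:
                words.append(current)
                current = ch
            else:
                current += ch
        words.append(current)
        # Word evidence: higher is better, e.g. a log word frequency.
        score += sum(word_score(w) for w in words)
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score


# Usage: the middle break looks like an inter-character break on its own
# (0.4 < 0.5), but the word evidence favors splitting there.
lexicon = {"it": -1.0, "is": -1.0}
words, _ = select_segmentation(
    list("itis"), [0.1, 0.4, 0.1], lambda w: lexicon.get(w, -5.0)
)
```

Here `words` comes out as `["it", "is"]`, showing how word features can overturn a borderline break classification. A dynamic-programming or beam search would replace the exhaustive loop on realistic line lengths.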
17. An optical character recognition system architecture, comprising:
- one or more pre-processing stages configured for providing a gray-scale textual line image;
- a line segmentation engine that implements a featurization component and a break classifier, the featurization component being configured for extracting features from the textual line image to calculate breaks in the textual line image, and the break classifier being configured for i) classifying the breaks into classes including an inter-word break class and an inter-character break class, and for ii) determining probabilities that given breaks are members of the classes; and
- one or more post-processing stages configured for receiving the classified breaks and probabilities and for detecting words in the textual line image using the received classified breaks and probabilities.
Dependent claims: 18, 19, 20
Specification