WORD RECOGNITION OF TEXT UNDERGOING AN OCR PROCESS
Abstract
A method for identifying words in a textual image undergoing optical character recognition includes receiving a bitmap of an input image which includes textual lines that have been segmented by a plurality of chop lines. The chop lines are each associated with a confidence level reflecting a degree to which the respective chop line properly segments the textual line into individual characters. One or more words are identified in one of the textual lines based at least in part on the textual lines and a first subset of the plurality of chop lines which have a chop line confidence level above a first threshold value. If the first word is not associated with a sufficiently high word confidence level, at least a second word in the textual line is identified based at least in part on a second subset of the plurality of chop lines which have a confidence level above a second threshold value lower than the first threshold value.
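The two-pass thresholding described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ChopLine`, `symbol_spans`, `identify_words` and the `recognize_word` callback are hypothetical, confidence levels are assumed to lie in the range 0 to 1, and the threshold values are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChopLine:
    """Hypothetical chop line: a candidate cut between two characters."""
    x: int             # horizontal pixel position of the cut in the text line
    confidence: float  # assumed 0..1; how well the cut separates characters


def symbol_spans(line_width, chops, threshold):
    """Return (left, right) pixel spans bounded by chop lines above `threshold`."""
    cuts = sorted(c.x for c in chops if c.confidence > threshold)
    edges = [0] + cuts + [line_width]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b > a]


def identify_words(line_width, chops, recognize_word,
                   chop_thresholds=(0.8, 0.4), word_threshold=0.7):
    """Try each chop-line threshold in descending order.

    `recognize_word` is an assumed callback taking a list of spans and
    returning (word, word_confidence). A lower threshold is tried only if
    the word found so far is not confident enough.
    """
    attempts = []
    for t in chop_thresholds:
        word, conf = recognize_word(symbol_spans(line_width, chops, t))
        attempts.append((word, conf, t))
        if conf >= word_threshold:
            break  # sufficiently confident; no further pass needed
    return attempts
```

The function returns every attempt rather than choosing among them, because the abstract does not say how a final word is selected from the first and second candidates.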
20 Claims
1. A method for identifying words in a textual image undergoing an OCR process, comprising:
(a) receiving a bitmap of an input image that includes textual lines that have been segmented by chop lines to define symbols therebetween, wherein each of the chop lines is associated with a chop line confidence level reflecting a degree to which the respective chop line properly segments the textual line into individual characters;
(b) maintaining a data structure that stores data elements including the bitmap, the chop lines with their respective chop line confidence levels and the symbols;
(c) producing a first set of candidate characters with character confidence levels associated therewith from a first subset of the data elements in the data structure, the first subset of data elements having respective candidate confidence levels that each exceed a respective one of a first set of data element threshold values;
(d) updating the data structure by further including the first set of candidate characters with their respective character confidence levels;
(e) identifying at least a first word from the first set of candidate characters, wherein the first word has a first word confidence level associated therewith;
(f) wherein if the first word confidence level is below a first word threshold level, updating the data structure to further include the first word and its first word confidence level and
(g) repeating steps (c)-(e) for a second subset of the data elements in the updated data structure having respective data element confidence levels that each exceed a respective one of a second set of data element threshold values lower than the first set of data element threshold values to thereby produce at least a second word and a second word confidence level associated therewith.
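Steps (b) through (g) of claim 1 can be read as a loop over a shared store of data elements. The sketch below is one possible reading and rests on several assumptions: the names `Element`, `Store` and `run_passes` are hypothetical, thresholds are modeled as a per-kind dictionary (a kind with no threshold is simply excluded from the subset), and `classify` and `search` stand in for the character recognizer and word search, which the claim does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical data element held in the shared data structure."""
    kind: str          # "chop", "symbol", "char" or "word" (assumed labels)
    value: object
    confidence: float


@dataclass
class Store:
    """Step (b): holds the bitmap plus all data elements produced so far."""
    bitmap: object
    elements: list = field(default_factory=list)

    def subset(self, thresholds):
        """Elements whose confidence exceeds the threshold set for their kind."""
        return [e for e in self.elements
                if e.kind in thresholds and e.confidence > thresholds[e.kind]]


def run_passes(store, classify, search, threshold_sets, word_threshold):
    """Run steps (c)-(e) once per threshold set, from strictest to loosest."""
    words = []
    for thresholds in threshold_sets:
        chars = classify(store.subset(thresholds))  # step (c): candidate chars
        store.elements.extend(chars)                # step (d): record them
        word = search(chars)                        # step (e): candidate word
        words.append(word)
        if word.confidence >= word_threshold:
            break                                   # confident enough; stop
        store.elements.append(word)                 # step (f): record weak word
    return words                                    # step (g) is the next pass
```

Passing `classify` and `search` as arguments keeps the loop independent of any particular recognizer; later threshold sets could also list thresholds for "char" or "word" so that results recorded in an earlier pass feed the next one.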
11. A system for identifying words in a textual image undergoing an OCR process, comprising:
an input component for receiving a bitmap of an input image that includes text lines that have been segmented by chop lines to define symbols therebetween, wherein a confidence level reflecting chop line accuracy is associated with each chop line;
a data structure for storing data elements that include the bitmap, the chop lines with their respective chop line confidence levels and the symbols;
a character recognition component for producing a first set of candidate characters with confidence levels associated therewith from a first subset of the data elements in the data structure having respective confidence levels that each exceed a respective one of a first set of data element threshold values, wherein the character recognition component is configured to cause the data structure to be updated by further including in the data structure the first set of candidate characters with their respective character confidence levels; and
a word search component for identifying at least a first word from the first set of candidate characters, wherein the first word has a first word confidence level associated therewith, wherein, the word recognition component is configured to cause the data structure to be updated to further include the first word and its first word confidence level if the first word confidence level is below a first word threshold level, wherein the character recognition component and the word search component are further configured to produce a second set of candidate characters and at least a second word, respectively, from data elements in the updated data structure which have respective confidence levels that each exceed a respective one of a second set of data element threshold values less than the first set of data element threshold values.
15. A medium comprising instructions executable by a computing system, wherein the instructions configure the computing system to perform a method for identifying words in a textual image undergoing optical character recognition, comprising:
receiving a bitmap of an input image that includes textual lines that have been segmented by a plurality of chop lines that are each associated with a confidence level reflecting a degree to which the respective chop line properly segments the textual line into individual characters;
identifying at least a first word in one of the textual lines based at least in part on the textual lines and a first subset of the plurality of chop lines which have a chop line confidence level above a first threshold value; and
if the first word is not associated with a sufficiently high word confidence level, identifying at least a second word in the one textual line based at least in part on a second subset of the plurality of chop lines which have a confidence level above a second threshold value lower than the first threshold value.
Specification