OCR error correction methods and apparatus utilizing contextual comparison
Abstract
The present invention includes methods of correcting optical character recognition errors occurring during recognition of alphanumeric character strings contained within one or more predetermined types of alphanumeric character fields. The methods may be practiced with a document processing system having (1) an optical character recognition device for scanning documents and outputting bit-map image data; (2) a recognition engine for converting the bit-map image data into possibly correct alphanumeric characters with associated confidence values; and (3) at least one lexicon of character strings consisting of a list of at least a portion of all of the possible character string values for each of the fields being processed. The present invention corrects OCR errors by performing a contextual comparison analysis between the alphanumeric characters outputted from the recognition engine and the lexicon of character strings. A number of preferred embodiments, and several examples of the type of information which can be processed by those embodiments, are disclosed.
31 Claims
1. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the lexicon as lexicon strings, the document processing system also having a recognition engine for generating at least one phantom character data table consisting of a set of cognate pairs of phantom characters and associated confidence values for each position of the field, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
receiving at least one phantom character data table from the recognition engine;
generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field, and wherein each numeric value results from mathematical combination of the confidence values associated with each phantom character which matches a lexicon character within a predetermined number of positions of the corresponding position of the lexicon string, if none of the phantom characters received for a given position of the alphanumeric character string matches a lexicon character within the predetermined number of positions of the corresponding position of the lexicon string, a predetermined default confidence value is substituted for the phantom character confidence value in the mathematical combinations;
comparing the resulting numeric values generated for each lexicon string; and
selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
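Illustrative sketch (not part of the claim language): the steps of claim 1 can be pictured as a scoring loop over the lexicon. The code below is a minimal Python sketch under several assumptions that the claim leaves open: the phantom character data table is modelled as one list of (character, confidence) pairs per field position, the "mathematical combination" is taken to be a simple product, and the names `score_lexicon_string`, `select_lexicon_string`, `window`, and `default_conf` are hypothetical.

```python
from math import prod

# Assumed layout: table[pos] is a list of (phantom_character, confidence) pairs.
def score_lexicon_string(table, lexicon_string, window=0, default_conf=0.01):
    """Combine per-position confidences into one numeric value (assumed: a product)."""
    factors = []
    for pos, candidates in enumerate(table):
        # Lexicon characters within `window` positions of the corresponding position.
        nearby = lexicon_string[max(0, pos - window): pos + window + 1]
        matches = [conf for ch, conf in candidates if ch in nearby]
        # No phantom character matched: substitute the default confidence value.
        factors.append(max(matches) if matches else default_conf)
    return prod(factors)

def select_lexicon_string(table, lexicon, window=0, default_conf=0.01):
    """Compare the numeric values and return the best-scoring lexicon string."""
    return max(
        lexicon,
        key=lambda s: score_lexicon_string(table, s, window, default_conf),
    )

# Hypothetical three-position field with two phantom characters per position.
table = [
    [("C", 0.9), ("G", 0.1)],
    [("A", 0.6), ("4", 0.4)],
    [("T", 0.8), ("I", 0.2)],
]
print(select_lexicon_string(table, ["CAT", "CAR", "GAP"]))  # prints "CAT"
```

With `window=0` a phantom character must match the lexicon character at exactly the same position; a larger window tolerates inserted or dropped characters at the cost of more accidental matches.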

2. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the lexicon as lexicon strings, the document processing system also having a recognition engine for generating at least one phantom character data table consisting of a set of cognate pairs of phantom characters and associated confidence values for each position of the field, wherein the lexicon character strings listed in the lexicon have associated frequency values, each frequency value relating to the frequency with which its associated lexicon character string is actually utilized when compared with the set of all possible alphanumeric character strings, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
receiving at least one phantom character data table from the recognition engine;
generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field, and wherein each numeric value results from mathematical combination of the confidence values associated with each phantom character which matches a lexicon character within a predetermined number of positions of the corresponding position of the lexicon string and the frequency value associated with each lexicon string, if none of the phantom characters received for a given position of the alphanumeric character string matches a lexicon character within the predetermined number of positions of the corresponding position of the lexicon string, a predetermined default confidence value is substituted for the phantom character confidence value in the mathematical combination;
comparing the resulting numeric values generated for each lexicon string; and
selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.
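Illustrative sketch (not part of the claim language): claim 2 adds a per-string frequency value to the combination. The Python sketch below assumes the frequency is simply multiplied into a product of confidences, and that the lexicon is a dictionary mapping each lexicon string to its relative frequency; the function names, the dictionary layout, and the default confidence of 0.01 are all hypothetical choices made for illustration.

```python
from math import prod

def score_with_frequency(table, lexicon_string, frequency,
                         window=0, default_conf=0.01):
    """Product of matched confidences, weighted by the string's frequency value."""
    factors = []
    for pos, candidates in enumerate(table):
        nearby = lexicon_string[max(0, pos - window): pos + window + 1]
        matches = [conf for ch, conf in candidates if ch in nearby]
        # Fall back to the default confidence when nothing matches at this position.
        factors.append(max(matches) if matches else default_conf)
    return frequency * prod(factors)

def select_with_frequency(table, lexicon, window=0, default_conf=0.01):
    """`lexicon` maps each lexicon string to its (assumed) relative frequency."""
    return max(
        lexicon,
        key=lambda s: score_with_frequency(
            table, s, lexicon[s], window, default_conf
        ),
    )

# The first position is ambiguous between "M" and "N"; frequency breaks the tie.
table = [[("M", 0.5), ("N", 0.5)], [("A", 0.9)], [("Y", 0.9)]]
lexicon = {"MAY": 0.7, "NAY": 0.05}
print(select_with_frequency(table, lexicon))  # prints "MAY"
```

When the recognition engine is equally unsure between two readings, the frequency term lets the more commonly used lexicon string win.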

13. For use with a document processing system having an optical character recognition device for scanning documents with a composite alphanumeric character string contained in a composite field consisting of at least two related sub-fields wherein each sub-field has a number of character positions, the document processing system also having a memory with a lexicon of composite lexicon strings, each composite lexicon string consisting of at least two lexicon sub-strings, wherein at least a portion of all possible alphanumeric character strings for at least one sub-field can be listed in the lexicon, the document processing system also having a recognition engine for generating at least one phantom character data table for each sub-field of the composite field, each data table consisting of a set of cognate pairs of phantom characters with associated confidence values for each position of the sub-field, a method of selecting the composite lexicon string which most accurately represents a composite alphanumeric character string contained within the composite field, said method comprising the steps of:
receiving a first phantom character data table from the recognition engine for the first sub-field of the composite field;
generating a set of first phantom character sub-strings from the data in the first data table, said first phantom character sub-strings possibly accurately representing the alphanumeric character sub-string contained within the first sub-field of the composite field;
generating a first numeric value for each of at least some of the first phantom character sub-strings, wherein each of the first numeric values relates to the probability that its associated phantom character sub-string accurately represents the alphanumeric character sub-string contained within the first sub-field;
receiving at least one phantom character data table from the recognition engine for each of the other sub-fields;
generating additional numeric values for at least some of the lexicon sub-strings of each of the other sub-fields from at least some of the composite lexicon strings having a first sub-string which matches one of the phantom character sub-strings for the first sub-field, wherein each additional numeric value relates to the probability that its associated lexicon sub-string accurately represents the alphanumeric character sub-string contained within one of the other sub-fields;
generating a composite numeric value for each of at least some of the composite lexicon strings, wherein each composite numeric value relates to the probability that its associated composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field;
comparing the composite numeric values generated for each composite lexicon string; and
selecting the composite lexicon string having an associated composite numeric value indicating that the selected composite lexicon string most accurately represents the composite alphanumeric character string contained within the composite field.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22
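Illustrative sketch (not part of the claim language): the composite-field method of claim 13 first expands the first sub-field's data table into candidate sub-strings, then scores only those composite lexicon strings whose first sub-string is among the candidates. The Python sketch below assumes products for every "numeric value", exact-position matching, a `top_n` cut-off on candidates, and composite lexicon entries stored as tuples of sub-strings; all names and the two-sub-field example data are hypothetical.

```python
from itertools import product
from math import prod

def phantom_substrings(table, top_n=10):
    """Enumerate candidate sub-strings with their joint confidence (assumed: product)."""
    scored = {}
    for combo in product(*table):
        text = "".join(ch for ch, _ in combo)
        scored[text] = prod(conf for _, conf in combo)
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

def score_substring(table, sub, default_conf=0.01):
    """Score one lexicon sub-string against a sub-field's data table."""
    factors = []
    for pos, candidates in enumerate(table):
        target = sub[pos] if pos < len(sub) else None
        matches = [conf for ch, conf in candidates if ch == target]
        factors.append(max(matches) if matches else default_conf)
    return prod(factors)

def select_composite(first_table, other_tables, composite_lexicon):
    """Return the best (entry, composite_value) among entries whose first
    sub-string matches a phantom sub-string of the first sub-field."""
    first_scores = phantom_substrings(first_table)
    best, best_value = None, 0.0
    for entry in composite_lexicon:
        first, *others = entry
        if first not in first_scores:
            continue  # first sub-string is not among the phantom candidates
        value = first_scores[first]
        for table, sub in zip(other_tables, others):
            value *= score_substring(table, sub)
        if value > best_value:
            best, best_value = entry, value
    return best, best_value

# Hypothetical composite field: a two-digit code followed by a two-letter code.
first_table = [[("1", 0.8), ("7", 0.2)], [("0", 0.6), ("8", 0.4)]]
other_tables = [[[("N", 0.7), ("M", 0.3)], [("Y", 0.9)]]]
lexicon = [("10", "NY"), ("18", "MY"), ("99", "NY")]
print(select_composite(first_table, other_tables, lexicon)[0])  # prints ('10', 'NY')
```

Filtering on the first sub-field keeps the number of composite lexicon strings that need full scoring small, which is the apparent motivation for treating that sub-field differently from the others.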

23. For use with a document processing system having an optical character recognition device for scanning documents with a composite alphanumeric character string contained in a composite field consisting of at least two related sub-fields wherein each sub-field has a number of character positions, the document processing system also having a memory with a lexicon of composite lexicon strings, each composite lexicon string consisting of at least two lexicon sub-strings contained within at least two lexicon sub-fields of the composite lexicon field, at least some of the composite lexicon strings including a plurality of alternative lexicon sub-strings for a single lexicon sub-field, wherein at least a portion of all possible alphanumeric character strings can be listed in the lexicon, a method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within the composite field, said method comprising the steps of:
generating a numeric value for at least one of the lexicon sub-strings from each lexicon sub-field of at least some of the composite lexicon strings, wherein each of the numeric values relates to the probability that its associated lexicon sub-string accurately represents the alphanumeric character sub-string contained within one of the alphanumeric character string sub-fields;
generating an amalgamated composite lexicon string for each of at least some of the composite lexicon strings by collecting the lexicon sub-strings of each of the lexicon sub-fields having a numeric value indicating that its associated lexicon sub-string most accurately represents the alphanumeric character sub-string for one lexicon sub-field;
generating a composite numeric value for each of at least some of the amalgamated composite lexicon strings, wherein each composite numeric value relates to the probability that its associated amalgamated composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field;
comparing the composite numeric values generated for each amalgamated composite lexicon string; and
selecting the amalgamated composite lexicon string having an associated composite numeric value indicating that the selected amalgamated composite lexicon string most accurately represents the composite alphanumeric character string contained within the composite alphanumeric character string field.
Dependent claims: 24, 25, 26, 27, 28, 29, 30
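Illustrative sketch (not part of the claim language): claim 23 handles composite lexicon strings that list several alternative sub-strings for one sub-field. The Python sketch below picks the best-scoring alternative per sub-field, collects those into an amalgamated string, and compares the amalgamated strings. Claim 23 does not say how the per-sub-string numeric values are produced; scoring them against per-sub-field (character, confidence) tables, combining by product, and every name used here are assumptions made for illustration.

```python
from math import prod

def score_substring(table, sub, default_conf=0.01):
    """Assumed scoring: product of confidences of exact-position matches."""
    factors = []
    for pos, candidates in enumerate(table):
        target = sub[pos] if pos < len(sub) else None
        matches = [conf for ch, conf in candidates if ch == target]
        factors.append(max(matches) if matches else default_conf)
    return prod(factors)

def amalgamate(tables, entry):
    """`entry` holds one list of alternative sub-strings per lexicon sub-field.
    Keep the best alternative of each sub-field and multiply their values."""
    chosen, value = [], 1.0
    for table, alternatives in zip(tables, entry):
        best_sub = max(alternatives, key=lambda s: score_substring(table, s))
        chosen.append(best_sub)
        value *= score_substring(table, best_sub)
    return tuple(chosen), value

def select_amalgamated(tables, composite_lexicon):
    """Compare the composite values of all amalgamated strings; return the best."""
    return max(
        (amalgamate(tables, entry) for entry in composite_lexicon),
        key=lambda result: result[1],
    )

# Hypothetical data: two sub-fields, with alternatives listed for some sub-fields.
tables = [
    [[("S", 0.9)], [("T", 0.6), ("A", 0.4)]],
    [[("N", 0.8), ("M", 0.2)], [("Y", 0.9)]],
]
composite_lexicon = [
    [["SA", "ST"], ["NY"]],
    [["SX"], ["MY", "NY"]],
]
print(select_amalgamated(tables, composite_lexicon)[0])  # prints ('ST', 'NY')
```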

31. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a predetermined and static lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the static lexicon as lexicon strings, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field;
comparing the resulting numeric values generated for each lexicon string; and
selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.

Specification