OCR error correction methods and apparatus utilizing contextual comparison

US 5,850,480 A
Filed: 05/30/1996
Issued: 12/15/1998
Est. Priority Date: 05/30/1996
Status: Expired due to Fees

Abstract

The present invention includes methods of correcting optical character recognition errors occurring during recognition of alphanumeric character strings contained within one or more predetermined types of alphanumeric character fields. The methods may be practiced with a document processing system having (1) an optical character recognition device for scanning documents and outputting bit-map image data; (2) a recognition engine for converting the bit-map image data into possibly correct alphanumeric characters with associated confidence values; and (3) at least one lexicon of character strings consisting of a list of at least a portion of all of the possible character string values for each of the fields being processed. The present invention corrects OCR errors by performing a contextual comparison analysis between the alphanumeric characters outputted from the recognition engine and the lexicon of character strings. A number of preferred embodiments, and several examples of the type of information which can be processed by those embodiments, are disclosed.

307 Citations

31 Claims

1. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the lexicon as lexicon strings, the document processing system also having a recognition engine for generating at least one phantom character data table consisting of a set of cognate pairs of phantom characters and associated confidence values for each position of the field, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
- receiving at least one phantom character data table from the recognition engine;
  
  generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field, and wherein each numeric value results from mathematical combination of the confidence values associated with each phantom character which matches a lexicon character within a predetermined number of positions of the corresponding position of the lexicon string, if none of the phantom characters received for a given position of the alphanumeric character string matches a lexicon character within the predetermined number of positions of the corresponding position of the lexicon string, a predetermined default confidence value is substituted for the phantom character confidence value in the mathematical combinations;
  
  comparing the resulting numeric values generated for each lexicon string; and
  
  selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.
- Dependent Claims (3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 3. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, further comprising the step of:
    - modifying the phantom character data table by replacing the phantom character data from at least one position of said phantom character data table with all possible alphanumeric character values and at least one predetermined default confidence value.
  - 4. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, further comprising the step of:
    - modifying the phantom character data table to include at least one additional phantom character data table position.
  - 5. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 4, wherein the data contained within the additional positions of the phantom character data table consists of a dummy character and a predetermined associated default confidence value.
  - 6. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 4, wherein the data contained within the additional positions of the phantom character data table consists of all possible alphanumeric character values and at least one predetermined associated default confidence value.
  - 7. A method of selecting the lexicon character string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, further comprising the steps of:
    - generating a distance value relating to the probability that the selected lexicon string accurately represents the alphanumeric character string contained within the field; and
      
      outputting the selected lexicon character string if the distance value is one of above or below a predetermined threshold value and transmitting a signal indicating indeterminate results if the distance value is the other of above or below the threshold value.
  - 8. A method of selecting the lexicon character string which most accurately represents an alphanumeric character string contained within a field as recited in claim 7, wherein the numeric values of said step of generating a numeric value result from mathematical combination of the confidence values associated with each phantom character which matches the lexicon character in the corresponding position of the lexicon string and the frequency value associated with each lexicon string, if none of the phantom characters received for a given position of the alphanumeric character string matches the character in the corresponding position of the lexicon character string, a predetermined default confidence value is substituted for the phantom character confidence value in said mathematical combination.
  - 9. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, wherein said step of generating a numeric value comprises:
    - generating a numeric value for each lexicon string having the same number of character positions as the phantom character data table if either, at least one of the phantom characters in the first position of the phantom character data table matches the lexicon character in the corresponding position of the lexicon string, or at least one of the phantom characters in the second position of the phantom character data table matches the lexicon character in the corresponding position of the lexicon string, generating a plurality of numeric values for each lexicon string having at least one more character position than the phantom character data table, each of said numeric values being generated while at least one position of each lexicon string is masked; and
      
      generating a plurality of numeric values for each lexicon string having at least one less character position than the phantom character data table if either, at least one of the phantom characters in the first position of the phantom character data table matches the lexicon character in the corresponding position of the lexicon string, or at least one of the phantom characters in the second position of the phantom character data table matches the character in the corresponding position of the lexicon string, each of the numeric values being generated while at least one character position of the phantom character data table is masked.
  - 10. A method of selecting the lexicon character string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, wherein the predetermined number of positions of said step of generating a numeric value equals one position, and wherein the numeric values result from mathematical combinations of the phantom character confidence values and the default confidence values selected using a recursive trinary-tree matching algorithm.
  - 11. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 1, wherein the numerical values of said step of generating a numeric value result from mathematical combination of the phantom character confidence values and the default confidence values selected using a recursive trinary tree matching algorithm.
  - 12. A method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within a field as recited in claim 11, wherein the predetermined number of positions of said step of generating a numerical value equals one position.
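
As an illustration of claims 1 and 7, the sketch below scores lexicon strings against a phantom character data table. The claims do not specify the "mathematical combination", the default confidence value, or the distance measure, so everything here is an assumption: the combination is taken to be a product, the default is an arbitrary 0.01, and the distance value is taken to be the gap between the best and second-best scores. All names (`position_confidence`, `score_lexicon_string`, `select`, `DEFAULT_CONF`) are hypothetical, the positional window is a simple stand-in for the recursive trinary-tree matching of claims 10 to 12, and lexicon strings are assumed to have the same length as the table (claims 4 to 6 and 9 address length mismatches, which this sketch does not).

```python
from math import prod

DEFAULT_CONF = 0.01  # assumed default confidence; the claims give no value

# A phantom character data table: one dict per field position, mapping each
# candidate ("phantom") character to the recognition engine's confidence.
Table = list[dict[str, float]]


def position_confidence(table: Table, pos: int, lexicon_string: str, window: int) -> float:
    """Best confidence of any phantom character at `pos` that matches a lexicon
    character within `window` positions of `pos`; the default if none matches."""
    lo = max(0, pos - window)
    hi = min(len(lexicon_string), pos + window + 1)
    nearby = set(lexicon_string[lo:hi])
    matches = [conf for ch, conf in table[pos].items() if ch in nearby]
    return max(matches) if matches else DEFAULT_CONF


def score_lexicon_string(table: Table, lexicon_string: str, window: int = 0) -> float:
    """Combine the per-position confidences (assumed here to be a product)."""
    return prod(position_confidence(table, p, lexicon_string, window)
                for p in range(len(table)))


def select(table: Table, lexicon: list[str], window: int = 0, min_gap: float = 0.1):
    """Return the best-scoring lexicon string, or None ("indeterminate") when
    the gap to the runner-up, an assumed distance value, is below `min_gap`."""
    scores = sorted(((score_lexicon_string(table, s, window), s) for s in lexicon),
                    reverse=True)
    best_score, best = scores[0]
    second_score = scores[1][0] if len(scores) > 1 else 0.0
    return best if best_score - second_score >= min_gap else None


# Hypothetical three-position field as reported by a recognition engine.
table = [
    {"C": 0.9, "G": 0.1},
    {"A": 0.6, "O": 0.4},
    {"T": 0.8, "I": 0.2},
]
print(select(table, ["CAT", "COT", "DOG"]))  # prints "CAT"
```

Raising `min_gap` makes the selector more conservative: with only `"CAT"` and `"COT"` in the lexicon and `min_gap=0.5`, the two scores are too close and the function reports an indeterminate result instead of guessing.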

2. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the lexicon as lexicon strings, the document processing system also having a recognition engine for generating at least one phantom character data table consisting of a set of cognate pairs of phantom characters and associated confidence values for each position of the field, wherein the lexicon character strings listed in the lexicon have associated frequency values, each frequency value relating to the frequency with which its associated lexicon character string is actually utilized when compared with the set of all possible alphanumeric character strings, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
- receiving at least one phantom character data table from the recognition engine;
  
  generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field, and wherein each numeric value results from mathematical combination of the confidence values associated with each phantom character which matches a lexicon character within a predetermined number of positions of the corresponding position of the lexicon string and the frequency value associated with each lexicon string, if none of the phantom characters received for a given position of the alphanumeric character string matches a lexicon character within the predetermined number of positions of the corresponding position of the lexicon string, a predetermined default confidence value is substituted for the phantom character confidence value in the mathematical combination;
  
  comparing the resulting numeric values generated for each lexicon string; and
  
  selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.
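
Claim 2 adds a per-string frequency value to the combination. A minimal sketch of that idea follows, assuming the frequency is simply multiplied into a product of exact-position confidences; the claim only says "mathematical combination", so the multiplication, the 0.01 default, the sample surnames and their frequencies, and the name `weighted_score` are all hypothetical. Lexicon strings are assumed to be the same length as the table.

```python
from math import prod

DEFAULT_CONF = 0.01  # assumed default confidence; not specified in the claim


def weighted_score(table, lexicon_string, frequency):
    """Product of the confidences of phantom characters that match the lexicon
    character in the same position (default if none), times the frequency."""
    confs = [table[p].get(ch, DEFAULT_CONF) for p, ch in enumerate(lexicon_string)]
    return prod(confs) * frequency


# Hypothetical lexicon of surnames with relative usage frequencies.
lexicon = {"SMITH": 0.010, "SMYTH": 0.0002}

# The engine slightly prefers "Y" over "I" in the third position.
table = [{"S": 0.9}, {"M": 0.9}, {"Y": 0.55, "I": 0.45}, {"T": 0.9}, {"H": 0.9}]

best = max(lexicon, key=lambda s: weighted_score(table, s, lexicon[s]))
print(best)  # prints "SMITH"
```

Without the frequency term the engine's slight preference for "Y" would select "SMYTH"; the frequency value lets the far more common string win when the recognition evidence is nearly balanced.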

13. For use with a document processing system having an optical character recognition device for scanning documents with a composite alphanumeric character string contained in a composite field consisting of at least two related sub-fields wherein each sub-field has a number of character positions, the document processing system also having a memory with a lexicon of composite lexicon strings, each composite lexicon string consisting of at least two lexicon sub-strings, wherein at least a portion of all possible alphanumeric character strings for at least one sub-field can be listed in the lexicon, the document processing system also having a recognition engine for generating at least one phantom character data table for each sub-field of the composite field, each data table consisting of a set of cognate pairs of phantom characters with associated confidence values for each position of the sub-field, a method of selecting the composite lexicon string which most accurately represents a composite alphanumeric character string contained within the composite field, said method comprising the steps of:
- receiving a first phantom character data table from the recognition engine for the first sub-field of the composite field;
  
  generating a set of first phantom character sub-strings from the data in the first data table, said first phantom character sub-strings possibly accurately representing the alphanumeric character sub-string contained within the first sub-field of the composite field;
  
  generating a first numeric value for each of at least some of the first phantom character sub-strings, wherein each of the first numeric values relates to the probability that its associated phantom character sub-string accurately represents the alphanumeric character sub-string contained within the first sub-field;
  
  receiving at least one phantom character data table from the recognition engine for each of the other sub-fields;
  
  generating additional numeric values for at least some of the lexicon sub-strings of each of the other sub-fields from at least some of the composite lexicon strings having a first sub-string which matches one of the phantom character sub-strings for the first sub-field, wherein each additional numeric value relates to the probability that its associated lexicon sub-string accurately represents the alphanumeric character sub-string contained within one of the other sub-fields;
  
  generating a composite numeric value for each of at least some of the composite lexicon strings, wherein each composite numeric value relates to the probability that its associated composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field;
  
  comparing the composite numeric values generated for each composite lexicon string; and
  
  selecting the composite lexicon string having an associated composite numeric value indicating that the selected composite lexicon string most accurately represents the composite alphanumeric character string contained within the composite field.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22)
  - 14. A method of selecting the composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 13, wherein said step of generating a set of first phantom character sub-strings comprises:
    - generating a first set of derivative data tables, the first set of derivative data tables consisting of the phantom character table data of the first data table if the number of positions in the first data table equals some predetermined value, a plurality of first derivative data tables created by masking at least one position of the first data table if the number of positions in the data table is greater than the predetermined value, and at least one first derivative data table created by inserting dummy characters and at least one default confidence value into at least one position of the first data table if the number of positions in the first data table is less than the predetermined value;
      
      generating a plurality of phantom character sub-strings for the first sub-field from the first set of derivative phantom character data tables.
  - 15. A method of selecting the composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 14, wherein said step of generating a plurality of phantom character sub-strings comprises:
    - generating all possible phantom character sub-strings which can be created from the first set of derivative phantom character data tables, wherein the phantom character sub-strings are generated while the data in each position of each derivative data table is replaced with all possible character values and at least one default confidence value, one position at a time.
  - 16. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 13, wherein said step of generating additional numeric values includes the step of:
    - generating an additional set of derivative data tables for each of at least one of the other sub-fields, at least one of the additional sets of phantom character data tables comprising, the additional data table if the number of positions in the additional data table is equal to some predetermined value, a plurality of additional derivative data tables created by masking at least one position of the additional data table if the number of positions in the additional data table is greater than the predetermined value, and at least one additional derivative data table created by inserting dummy characters and at least one associated default confidence value into each of at least one position of the additional data table if the number of positions in the additional data table is less than the predetermined value.
  - 17. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 16, wherein said step of generating additional numeric values includes the step of generating a set of all possible additional phantom character sub-strings which can be created from each of at least one additional set of derivative data tables, wherein a plurality of phantom character sub-strings are generated while the data in each position of each additional derivative data table is individually replaced with all possible character values and at least one default confidence value, and wherein the additional numeric values of said step of generating additional numeric values result from mathematical combination of the confidence values associated with each phantom character in a given position of one of the derivative data tables which matches the lexicon character in the corresponding position of the corresponding lexicon sub-string, if none of the phantom characters in a given position of one of the derivative data tables matches the character in the corresponding position of the corresponding lexicon sub-string, a predetermined default confidence value is substituted for the phantom character confidence value in said mathematical combination.
  - 18. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 14, wherein at least one of the additional sets of data tables consists of, a dummy character data table if the number of positions in the additional data table is above or below a predetermined reference number by some predetermined range value.
  - 19. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 14, wherein the numeric values of said step of generating a first numeric value result from mathematical combination of the phantom character values and the default confidence values selected using a recursive trinary-tree matching algorithm.
  - 20. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 14, wherein said additional numeric values result from mathematical combination of the confidence values associated with each phantom character in a given position of one of the other data tables which matches a lexicon character within one position of the corresponding position of the corresponding lexicon sub-string, if none of the phantom characters in a given position of the phantom character data table matches a lexicon character within one position of the corresponding position of the corresponding lexicon sub-string, a predetermined default confidence value is substituted for the phantom character confidence value in said mathematical combination, and wherein the numeric values of said step of generating numeric values result from mathematical combination of the phantom character confidence values and the default confidence values selected using a recursive trinary-tree matching algorithm.
  - 21. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained in a composite field as recited in claim 14, further comprising the steps of:
    - generating a distance value relating to the probability that the selected composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field; and
      
      outputting the selected composite lexicon character string if the distance value is one of above or below a predetermined threshold value and transmitting a signal indicating indeterminate results if the distance value is the other of above or below the threshold value.
  - 22. A method of selecting the composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 13, wherein the lexicon sub-strings of at least one of the sub-fields have associated frequency values, each frequency value relating to the frequency with which its associated lexicon character sub-string is actually utilized when compared with the set of all possible alphanumeric character sub-strings, and wherein each of the composite numeric values of said step of generating composite numeric values results from mathematical combination of the first numeric value, the additional numeric values associated with each selected lexicon sub-string of at least one of the other sub-fields, and the frequency values associated with each lexicon sub-string of at least one of the other sub-fields.
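
The composite-field method of claim 13 can be sketched roughly as follows, assuming two sub-fields, product-style combinations throughout, and exact-position matching. None of these choices is dictated by the claims; the function names (`candidates`, `score`, `select_composite`), the 0.01 default, and the sample lexicon of code and abbreviation pairs are hypothetical. The masking and dummy-character handling of claims 14 to 18 is left out, so sub-strings whose length differs from their table simply score zero.

```python
from itertools import product
from math import prod

DEFAULT_CONF = 0.01  # assumed default confidence


def candidates(table):
    """All phantom character sub-strings the first table can spell, each with
    a first numeric value (assumed to be the product of its confidences)."""
    out = {}
    for combo in product(*(pos.items() for pos in table)):
        out["".join(ch for ch, _ in combo)] = prod(conf for _, conf in combo)
    return out


def score(table, sub_string):
    """Exact-position product score; sub-strings of the wrong length get zero."""
    if len(sub_string) != len(table):
        return 0.0
    return prod(table[p].get(ch, DEFAULT_CONF) for p, ch in enumerate(sub_string))


def select_composite(first_table, other_table, composite_lexicon):
    """Score only composite entries whose first sub-string was generated from
    the first table, then pick the entry with the best composite value."""
    first = candidates(first_table)
    scored = {
        (a, b): first[a] * score(other_table, b)
        for a, b in composite_lexicon
        if a in first
    }
    return max(scored, key=scored.get) if scored else None


# Hypothetical composite field: a three-digit code and a two-letter abbreviation.
first_table = [{"0": 0.7, "8": 0.3}, {"6": 0.9}, {"1": 0.6, "7": 0.4}]
other_table = [{"C": 0.8, "G": 0.2}, {"T": 0.7, "I": 0.3}]
composite_lexicon = [("061", "CT"), ("867", "NM"), ("067", "CT"), ("100", "NY")]

print(select_composite(first_table, other_table, composite_lexicon))  # prints ('061', 'CT')
```

Filtering on the first sub-field before scoring the others keeps the search small: the entry `("100", "NY")` is never scored because "100" cannot be spelled from the first table.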

23. For use with a document processing system having an optical character recognition device for scanning documents with a composite alphanumeric character string contained in a composite field consisting of at least two related sub-fields wherein each sub-field has a number of character positions, the document processing system also having a memory with a lexicon of composite lexicon strings, each composite lexicon string consisting of at least two lexicon sub-strings contained within at least two lexicon sub-fields of the composite lexicon field, at least some of the composite lexicon strings including a plurality of alternative lexicon sub-strings for a single lexicon sub-field, wherein at least a portion of all possible alphanumeric character strings can be listed in the lexicon, a method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within the composite field, said method comprising the steps of:
- generating a numeric value for at least one of the lexicon sub-strings from each lexicon sub-field of at least some of the composite lexicon strings, wherein each of the numeric values relates to the probability that its associated lexicon sub-string accurately represents the alphanumeric character sub-string contained within one of the alphanumeric character string sub-fields;
  
  generating an amalgamated composite lexicon string for each of at least some of the composite lexicon strings by collecting the lexicon sub-strings of each of the lexicon sub-fields having a numeric value indicating that its associated lexicon sub-string most accurately represents the alphanumeric character sub-string for one lexicon sub-field;
  
  generating a composite numeric value for each of at least some of the amalgamated composite lexicon strings, wherein each composite numeric value relates to the probability that its associated amalgamated composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field;
  
  comparing the composite numeric values generated for each amalgamated composite lexicon string; and
  
  selecting the amalgamated composite lexicon string having an associated composite numeric value indicating that the selected amalgamated composite lexicon string most accurately represents the composite alphanumeric character string contained within the composite alphanumeric character string field.
- Dependent Claims (24, 25, 26, 27, 28, 29, 30)
  - 24. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field, as recited in claim 23, wherein the document processing system also has a recognition engine for generating at least one phantom character data table for each sub-field of the composite field, each data table consisting of a set of cognate pairs of phantom characters and associated confidence values for each position of the alphanumeric character string field;
    - wherein said method further comprises the step of receiving at least one phantom character data table from the recognition engine; and
      
      wherein the numeric values from said step of generating a numeric value results from mathematical combination of the confidence values associated with each phantom character which matches a lexicon character within a predetermined number of positions of the corresponding position of the lexicon sub-string if none of the phantom characters received for a given position of the alphanumeric character string matches a lexicon character within the predetermined number of positions of the corresponding position of the lexicon string, a predetermined default confidence value is substituted for the phantom character confidence value in the mathematical combination.
  - 25. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field, as recited in claim 24, wherein at least some of the phantom character data tables generated by the recognition engine consist of a numeric part, an alphabetic part and an alphanumeric part;
    - wherein said step of generating a numeric value further comprises the step of determining whether each lexicon character is numeric, alphabetic, or alphanumeric; and
      
      wherein said step of receiving at least one phantom character data table comprises, receiving the part of at least one phantom character data table which is of the same type as each lexicon character.
  - 26. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field, as recited in claim 24, further comprising the step of:
    - modifying at least one of the phantom character data tables by replacing the phantom character data from at least one position of said phantom character data table with all possible alphanumeric character values and at least one predetermined default confidence value.
  - 27. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field, as recited in claim 24, further comprising the step of modifying at least one of the phantom character data tables to include at least one additional phantom character data table position, wherein the data contained within the additional positions of the data table consists of a dummy character and a predetermined associated default confidence value.
  - 28. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field, as recited in claim 24, wherein the numeric values of said step of generating numeric values result from mathematical combination of the phantom character confidence values and the default confidence values selected using a recursive trinary tree matching algorithm.
  - 29. A method of selecting an amalgamated composite lexicon string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 24, further comprising the steps of:
    - generating a distance value relating to the probability that the selected amalgamated composite lexicon string accurately represents the composite alphanumeric character string contained within the composite field; and
      
      outputting the selected amalgamated composite lexicon character string if the distance value is one of above or below a predetermined threshold value and transmitting a signal indicating indeterminate results if the distance value is the other of above or below the threshold value.
  - 30. A method of selecting an amalgamated composite lexicon character string which most accurately represents a composite alphanumeric character string contained within a composite field as recited in claim 24, wherein the lexicon sub-strings of at least one of the sub-fields have associated frequency values, each frequency value relating to the frequency with which its associated lexicon character sub-string is actually utilized when compared with the set of all possible alphanumeric character sub-strings;
    - and wherein each of the composite numeric values of said step of generating composite numeric values results from mathematical combination of the numeric values associated with each selected lexicon sub-string and the frequency values associated with each selected lexicon sub-string.
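
To make the "amalgamated composite lexicon string" of claim 23 concrete, the sketch below treats each composite lexicon entry as a list of sub-fields, where each sub-field holds one or more alternative sub-strings. For every entry it keeps the best-scoring alternative per sub-field and multiplies those scores into a composite value. The product combination, the 0.01 default, the zero score for length mismatches, the sample data, and the names `score`, `amalgamate` and `select_amalgamated` are assumptions made for illustration only.

```python
from math import prod

DEFAULT_CONF = 0.01  # assumed default confidence


def score(table, sub_string):
    """Exact-position product score; sub-strings of the wrong length get zero."""
    if len(sub_string) != len(table):
        return 0.0
    return prod(table[p].get(ch, DEFAULT_CONF) for p, ch in enumerate(sub_string))


def amalgamate(entry, tables):
    """Pick the best alternative for each sub-field of one composite entry and
    return (amalgamated string as a tuple, composite numeric value)."""
    picks = []
    for alternatives, table in zip(entry, tables):
        picks.append(max(alternatives, key=lambda s: score(table, s)))
    composite = prod(score(t, s) for s, t in zip(picks, tables))
    return tuple(picks), composite


def select_amalgamated(lexicon, tables):
    """Compare the composite values of all amalgamated strings; keep the best."""
    return max((amalgamate(entry, tables) for entry in lexicon), key=lambda r: r[1])


# Hypothetical two-sub-field lexicon; the first sub-field lists alternatives.
lexicon = [
    (["NYC", "NEW"], ["NY"]),
    (["LA", "LAX"], ["CA"]),
]
tables = [
    [{"N": 0.9}, {"Y": 0.6, "E": 0.4}, {"C": 0.7, "W": 0.3}],  # first sub-field
    [{"N": 0.8, "M": 0.2}, {"Y": 0.9}],                        # second sub-field
]

picks, value = select_amalgamated(lexicon, tables)
print(picks)  # prints ('NYC', 'NY')
```

Claim 30's frequency values could be folded in by multiplying each picked sub-string's score by its frequency before forming the composite value, in the same spirit as the single-field case.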

31. For use with a document processing system having an optical character recognition device for scanning documents with one or more discrete alphanumeric characters collectively forming an alphanumeric character string contained in a field having a number of character positions, the document processing system also having a memory with a predetermined and static lexicon of character strings wherein at least a portion of all of the possible alphanumeric character strings are listed in the static lexicon as lexicon strings, a method of selecting the lexicon string which most accurately represents an alphanumeric character string contained within the field, said method comprising the steps of:
- generating a numeric value for each of at least some of the lexicon strings, wherein each numeric value relates to the probability that its associated lexicon string accurately represents the alphanumeric character string contained within the field;
  
  comparing the resulting numeric values generated for each lexicon string; and
  
  selecting the lexicon string having a resulting associated numeric value indicating that the selected lexicon string most accurately represents the alphanumeric character string contained within the field.

Current Assignee: Scan-Optics LLC
Original Assignee: Scan-Optics, Incorporated (Scan-Optics LLC)
Inventors: Scanlon, Edward Francis
Primary Examiner(s): SHALWALA, BIPIN H
Application Number: US08/656,417
Time in Patent Office: 929 Days
Field of Search: 382/159, 382/161, 382/184, 382/187, 382/230, 382/231, 382/310, 382/309, 382/229, 707/500, 707/530
US Class Current: 382/229
CPC Class Codes: G06V 30/10 (Character recognition), G06V 30/268 (Lexical context)
