Confusion set based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique

US 6,205,261 B1
Filed: 02/05/1998
Issued: 03/20/2001
Est. Priority Date: 02/05/1998
Status: Expired due to Term

Abstract

A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition (“OCR”) technique. If an incorrect word is found in the electronic document, the present invention generates at least one reference word and selects the reference word that is the most likely correct replacement for the incorrect word. This selection is accomplished by comparing each character member of every reference word to a plurality of confusion sets. On the basis of this comparison, the reference words are reduced to a smaller candidate set of reference words, from which a reference word for replacing the incorrect word is selected on the basis of predetermined criteria.
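The reduction from reference words to candidate words can be illustrated with a minimal sketch of the elimination test recited in claim 1, steps e) through g). The confusion sets, the word list, and the function names below are hypothetical, and the sketch assumes that words are compared position by position and that only reference words of the same length as the misrecognized word are considered.

```python
from typing import Iterable, List, Sequence

# Hypothetical confusion sets: characters an OCR engine might mix up.
CONFUSION_SETS = [
    frozenset("il1"),
    frozenset("oO0"),
    frozenset("ce"),
]


def same_confusion_set(a: str, b: str, sets: Sequence[frozenset]) -> bool:
    """Return True if at least one confusion set contains both characters."""
    return any(a in s and b in s for s in sets)


def filter_candidates(
    misrecognized: str,
    reference_words: Iterable[str],
    sets: Sequence[frozenset] = CONFUSION_SETS,
) -> List[str]:
    """Keep reference words whose every differing position is explained
    by a shared confusion set; eliminate all others."""
    candidates = []
    for ref in reference_words:
        # Assumption: only same-length words are compared position by position.
        if len(ref) != len(misrecognized):
            continue
        explained = all(
            m == r or same_confusion_set(m, r, sets)
            for m, r in zip(misrecognized, ref)
        )
        if explained:
            candidates.append(ref)
    return candidates


candidates = filter_candidates("c1ear", ["clear", "clean", "cheer"])
print(candidates)
```

In this example "clean" is eliminated because "r" and "n" share no confusion set, and "cheer" is eliminated because "1" and "h" share none. Only "clear" survives, since its single differing position pairs "1" with "l".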

43 Claims

1. A method of recognizing at least one word in a document, the word including at least one predetermined character member, the method comprising the steps of:
- a) providing a recognized word based on the word in the document;
  
  b) determining whether the recognized word has been misrecognized;
  
  c) providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
  
  d) providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members, a content of each confusion set being independent of the recognized word;
  
  e) comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
  
  f) eliminating the current reference word if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the current reference word and if the character members from the corresponding character sequences in the misrecognized word and the current reference word are not from the same confusion set;
  
  g) repeating steps e) and f) for each reference word of the set of reference words, the remaining non-eliminated reference words comprising a set of candidate reference words; and
  
  h) selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word.
2. The method according to claim 1, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one alphanumeric character.
3. The method according to claim 1, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one number.
4. The method according to claim 1, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one alphabetical letter.
5. The method according to claim 1, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one typographic character.
6. The method according to claim 1, wherein:
7. The method according to claim 1, wherein:
- the step b) comprises determining whether the recognized word comprises at least one of a correctly spelled word, a grammatically correct word, and a contextually correct word.
8. The method according to claim 1, wherein the confusion sets are derived from at least one confusion matrix.
9. The method according to claim 8, wherein each one of the at least one confusion matrix corresponds to a different font.
10. The method of claim 1, wherein the at least one word in the document comprises a printed word appearing on a physical document.
11. The method of claim 1, wherein the at least one word in the document comprises a handwritten word appearing on a physical document.
12. The method according to claim 1, wherein the recognized word is provided by an optical character recognition technique.
13. The method according to claim 1, wherein the step c) comprises:
- i) determining for each character sequence of the misrecognized word the confusion set to which the character member occupying the character sequence belongs;
  
  ii) replacing the character member in each character sequence with each character member of the corresponding confusion set determined in step i), each replacement operation producing a different character string; and
  
  iii) eliminating from the character strings those character strings comprising non-words, the remaining character strings corresponding to the set of reference words.
14. The method according to claim 1, wherein the step h) comprises:
- i) assigning to each candidate reference word a predetermined grammar probability;
  
  ii) reducing the set of candidate reference words to each reference word having the highest grammar probability; and
  
  iii) selecting, if more than one of the set of candidate reference words are associated with the highest grammar probability, the most contextually correct candidate reference word.
15. The method according to claim 1, wherein the step h) comprises:
- i) assigning an associative weighting to each one of the set of candidate reference words; and
  
  ii) selecting the reference word with the highest associative weighting.
16. The method according to claim 15, wherein the step i) comprises:
- iii) assigning a character change weighting and a character identity weighting to each one of the plurality of confusion sets;
  
  iv) obtaining a first one of the set of candidate reference words;
  
  v) determining for each character sequence of the candidate reference word the confusion set to which the character member occupying the character sequence belongs;
  
  vi) determining for each character sequence of the candidate reference word whether the character member included therein is the same as the character member of the corresponding character sequence of the misrecognized word;
  
  vii) assigning to each character sequence of the candidate reference word one of the character change weighting and the character identity weighting of the confusion set associated with the character member occupying each character sequence of the candidate reference word;
  
  viii) determining an associative weighting for the candidate reference word on the basis of the character weightings assigned to each character sequence in step vii); and
  
  ix) repeating steps v)-viii) for each candidate reference word.
17. The method according to claim 16, wherein the step viii) comprises multiplying together each of the one of the character change weightings and character identity weightings assigned to each character sequence of the candidate reference word.
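The weighting scheme of claims 15 through 17 can be sketched as follows. Each confusion set carries a character change weighting and a character identity weighting; each position of a candidate word receives one of the two, and the per-position weightings are multiplied together. All numeric values, the set contents, and the fallback weights for characters outside every set are assumptions made for illustration only.

```python
from math import prod
from typing import List, Tuple

# (members, change weighting, identity weighting) -- hypothetical values.
WEIGHTED_SETS: List[Tuple[frozenset, float, float]] = [
    (frozenset("hnb"), 0.3, 0.7),
    (frozenset("ceo"), 0.2, 0.8),
]
# Assumption: characters found in no confusion set use these fallback weights.
DEFAULT_WEIGHTS = (0.05, 0.95)


def weights_for(ch: str) -> Tuple[float, float]:
    """Return (change, identity) weightings of the set containing ch."""
    for members, change, identity in WEIGHTED_SETS:
        if ch in members:
            return change, identity
    return DEFAULT_WEIGHTS


def associative_weighting(misrecognized: str, candidate: str) -> float:
    """Multiply the per-position weightings of a candidate word."""
    per_position = []
    for m, c in zip(misrecognized, candidate):
        change, identity = weights_for(c)
        # Same character: identity weighting; different character: change weighting.
        per_position.append(identity if m == c else change)
    return prod(per_position)


def select_replacement(misrecognized: str, candidates: List[str]) -> str:
    """Pick the candidate with the highest associative weighting."""
    return max(candidates, key=lambda w: associative_weighting(misrecognized, w))


print(select_replacement("hcn", ["ben", "hen"]))
```

With these illustrative numbers, "hen" scores 0.7 × 0.2 × 0.7 and "ben" scores 0.3 × 0.2 × 0.7, so "hen" is selected because it changes fewer characters of the misrecognized string.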

18. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character member, the apparatus comprising:
- a) first means for providing a recognized word based on the word in the document;
  
  b) first means for determining whether the recognized word has been misrecognized;
  
  c) second means for providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
  
  d) third means for providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members, a content of each confusion set being independent of the recognized word;
  
  e) means for comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
  
  f) first means for eliminating any one of the set of reference words if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the reference word and if the character members from the corresponding character sequences in the misrecognized word and the reference word are not from the same confusion set, the remaining non-eliminated reference words comprising a set of candidate reference words; and
  
  g) means for selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word.
- 19. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one alphanumeric character.
- 20. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one alphabetical letter.
- 21. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one typographic character.
- 22. The apparatus according to claim 18, wherein:
  - the first means for determining comprises means for determining whether the recognized word is spelled correctly.
- 23. The apparatus according to claim 18, wherein:
  - the first means for determining comprises means for determining whether the recognized word comprises at least one of a correctly spelled word, a grammatically correct word, and a contextually correct word.
- 24. The apparatus according to claim 18, wherein the confusion sets are derived from at least one confusion matrix.
- 26. The apparatus of claim 18, wherein the at least one word in the document comprises a printed word appearing on a physical document.
- 27. The apparatus of claim 18, wherein the at least one word in the document comprises a handwritten word appearing on a physical document.
- 28. The apparatus according to claim 18, wherein the recognized word is provided by an optical character recognition technique.
- 29. The apparatus according to claim 18, wherein the second means for providing comprises:
  - i) fourth means for determining for each character sequence of the misrecognized word the confusion set to which the character member occupying the character sequence belongs;
    
    ii) means for replacing the character member in each character sequence with each character member of the corresponding confusion set determined by the fourth means for determining, each replacement operation producing a different character string; and
    
    iii) second means for eliminating from the character strings those character strings comprising non-words, the remaining character strings corresponding to the set of reference words.
- 30. The apparatus according to claim 18, wherein the means for reducing comprises:
  - i) means for assigning to each candidate reference a predetermined grammar probability;
    
    ii) means for reducing the set of candidate reference words to each reference word having the highest grammar probability; and
    
    iii) means for selecting, if more than one of the set of candidate reference words are associated with the highest grammar probability, the most contextually correct candidate reference word.
- 31. The apparatus according to claim 18, wherein the means for reducing comprises:
  - i) first means for assigning an associative weighting to each one of the set of candidate reference words; and
    
    ii) means for selecting the reference word with the highest associative weighting.
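Claim 29, like method claim 13, builds the set of reference words by substituting confusion-set members into the misrecognized word and discarding non-words. A minimal sketch follows. It reads the claim as allowing substitutions at several positions at once (a Cartesian product over positions), which is one possible interpretation; the dictionary, the confusion sets, and the function names are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Sequence, Set


def options_for(ch: str, sets: Sequence[frozenset]) -> List[str]:
    """Step i): find the confusion set of ch and return its members.
    Assumption: a character in no set can only stand for itself."""
    for s in sets:
        if ch in s:
            return sorted(s)
    return [ch]


def generate_reference_words(
    misrecognized: str,
    sets: Sequence[frozenset],
    dictionary: Iterable[str],
) -> List[str]:
    """Steps ii) and iii): substitute set members, then drop non-words."""
    known: Set[str] = set(dictionary)
    per_position = [options_for(ch, sets) for ch in misrecognized]
    # Every combination of per-position substitutions yields one string.
    strings = {"".join(combo) for combo in product(*per_position)}
    return sorted(w for w in strings if w in known)


sets = [frozenset("il1"), frozenset("ce")]
words = generate_reference_words("c1ear", sets, {"clear", "clean", "elect"})
print(words)
```

The number of generated strings is the product of the confusion-set sizes at each position, so a practical implementation would likely bound the word length or prune early against a dictionary prefix structure.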

25. The apparatus according to claim 24, wherein each one of the at least one confusion matrix corresponds to a different font.

32. The apparatus according to claim 31, wherein the means for assigning comprises:
- iii) second means for assigning one of a character change weighting and a character identity weighting to each one of the plurality of confusion sets;
  
  iv) means for obtaining each one of the set of candidate reference words;
  
  v) fifth means for determining for each character sequence of the candidate reference word the confusion set to which the character member occupying the character sequence belongs;
  
  vi) sixth means for determining for each character sequence of the candidate reference word whether the character member included therein is the same as the character member of the corresponding character sequence of the misrecognized word;
  
  vii) third means for assigning to each character sequence of the candidate reference word one of the character change weighting and the character identity weighting of the confusion set associated with the character member occupying each character sequence of the candidate reference word; and
  
  viii) seventh means for determining an associative weighting for each candidate reference word on the basis of the character weightings assigned to each character sequence by the third means for assigning.

33. The apparatus according to claim 32, wherein the seventh means for determining comprises means for multiplying together each of the one of the character change weightings and character identity weightings assigned to each character sequence of each candidate reference word.

34. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
- a scanning device; and
  
  a processing device in communication with the scanning device, wherein the processing device comprises:
  
  a central processing unit, an optical character recognition module in communication with the central processing unit, a confusion matrix memory in communication with the central processing unit, a word checking module in communication with the central processing unit, a confusion set generating module in communication with the central processing unit, each confusion set having a content that is independent of an input to the scanning device, a confusion set memory in communication with the central processing unit, and a word correction module in communication with the central processing unit.

35. The apparatus according to claim 34, wherein the word checking module comprises at least one of a spell checking module, a grammar checking module, and a natural language module.
- 42. The apparatus according to claim 35, wherein the confusion set generating module derives a plurality of confusion sets from a previously generated confusion matrix.

36. The apparatus according to claim 35, wherein the processing device further comprises:
- a data input device in communication with the central processing unit; and
  
  a display device in communication with the central processing unit.

37. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
- a scanning device; and
  
  a processing device in communication with the scanning device, wherein the processing device comprises:
  
  a central processing unit, an optical character recognition unit in communication with the central processing unit, a word checking module in communication with the central processing unit, a confusion set memory in communication with the central processing unit and provided with a plurality of predetermined confusion sets, each predetermined confusion set having a content that is independent of an input to the scanning device, and a word correction module in communication with the central processing unit.

38. The apparatus according to claim 37, wherein the word checking module comprises at least one of a spell checking module, a grammar checking module, and a natural language module.
- 39. The apparatus according to claim 38, wherein the processing device further comprises:
  - a data input device in communication with the central processing unit; and
    
    a display device in communication with the central processing unit.
- 43. The apparatus according to claim 38, wherein the plurality of predetermined confusion sets are derived from a previously generated confusion matrix.

40. A method of recognizing at least one word in a document, the word including at least one predetermined character member, the method comprising the steps of:
- a) providing a recognized word based on the word in the document;
  
  b) determining whether the recognized word has been misrecognized;
  
  c) providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
  
  d) providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members;
  
  e) comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
  
  f) eliminating the current reference word if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the current reference word and if the character members from the corresponding character sequences in the misrecognized word and the current reference word are not from the same confusion set;
  
  g) repeating steps e) and f) for each reference word of the set of reference words, the remaining non-eliminated reference words comprising a set of candidate reference words; and
  
  h) selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word, wherein the confusion sets are derived from at least one confusion matrix, and wherein the at least one confusion matrix is generated prior to deriving the plurality of confusion sets therefrom.
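Claim 40 requires that the confusion sets be derived from a previously generated confusion matrix, but the claims do not say how. One plausible approach, offered here purely as an assumption, is to treat the matrix as a table of recognition probabilities, link any two characters whose confusion probability reaches a threshold, and collect the linked characters into groups. The matrix values, the threshold, and the function name are hypothetical.

```python
from typing import Dict, List


def derive_confusion_sets(
    matrix: Dict[str, Dict[str, float]], threshold: float = 0.1
) -> List[frozenset]:
    """Group characters that are confused with probability >= threshold.
    matrix[actual][recognized] is an assumed recognition probability."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for actual, row in matrix.items():
        find(actual)
        for recognized, p in row.items():
            find(recognized)
            if actual != recognized and p >= threshold:
                union(actual, recognized)

    groups: Dict[str, set] = {}
    for ch in list(parent):
        groups.setdefault(find(ch), set()).add(ch)
    # Keep only groups with two or more members, since each claimed
    # confusion set groups a plurality of characters.
    sets = [frozenset(g) for g in groups.values() if len(g) > 1]
    return sorted(sets, key=lambda s: sorted(s))


matrix = {
    "l": {"l": 0.80, "1": 0.15, "i": 0.05},
    "1": {"1": 0.68, "l": 0.20, "i": 0.12},
    "i": {"i": 0.90, "l": 0.05, "1": 0.05},
    "a": {"a": 1.00},
}
print(derive_confusion_sets(matrix))
```

Because the grouping depends only on the matrix and the threshold, the resulting sets are fixed before any word is recognized, which is consistent with the claim 1 requirement that set content be independent of the recognized word.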

41. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character member, the apparatus comprising:
- a) first means for providing a recognized word based on the word in the document;
  
  b) first means for determining whether the recognized word has been misrecognized;
  
  c) second means for providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
  
  d) third means for providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members;
  
  e) means for comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
  
  f) first means for eliminating any one of the set of reference words if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the reference word and if the character members from the corresponding character sequences in the misrecognized word and the reference word are not from the same confusion set, the remaining non-eliminated reference words comprising a set of candidate reference words; and
  
  g) means for selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word, and wherein the at least one confusion matrix is generated prior to deriving the plurality of confusion sets therefrom.

Current Assignee: AT&T Corporation (AT&T, Inc.)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Goldberg, Randy G.
Primary Examiner(s): Tran, Phuoc
Assistant Examiner(s): Mariam, Daniel G.
Application Number: US09/018,575
Time in Patent Office: 1,139 Days
Field of Search: 382/187, 382/159, 382/228, 382/309, 382/310, 382/227, 382/189, 382/181, 704/251
US Class Current: 382/310
CPC Class Codes:
G06V 10/98 Detection or correction of ...
G06V 30/2504 Coarse or fine approaches, ...
