Confusion set based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
Abstract
A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition (“OCR”) technique. If an incorrect word is found in the electronic document, the present invention generates at least one reference word and selects the reference word that is the most likely correct replacement for the incorrect word. This selection is accomplished by comparing each character member of every reference word to a plurality of confusion sets. On the basis of this comparison, the reference words are reduced to a smaller candidate set of reference words, from which a reference word for replacing the incorrect word is selected on the basis of predetermined criteria.
176 Citations
43 Claims
-
1. A method of recognizing at least one word in a document, the word including at least one predetermined character member, the method comprising the steps of:
-
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word has been misrecognized;
c) providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
d) providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members, a content of each confusion set being independent of the recognized word;
e) comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
f) eliminating the current reference word if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the current reference word and if the character members from the corresponding character sequences in the misrecognized word and the current reference word are not from the same confusion set;
g) repeating steps e) and f) for each reference word of the set of reference words, the remaining non-eliminated reference words comprising a set of candidate reference words; and
h) selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word.
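Steps e) through g) of claim 1 can be illustrated with a minimal Python sketch. It reads "character sequence" as a character position, compares only reference words of the same length as the misrecognized word, and uses made-up confusion sets; all three are assumptions for illustration, since the claim does not list the contents of any confusion set. The names `CONFUSION_SETS`, `same_confusion_set`, and `filter_candidates` are hypothetical.

```python
# Hypothetical confusion sets; the claim does not specify their contents.
CONFUSION_SETS = [set("il1"), set("o0"), set("ce"), set("uv")]


def same_confusion_set(a, b, confusion_sets):
    """Return True if characters a and b appear together in some confusion set."""
    return any(a in s and b in s for s in confusion_sets)


def filter_candidates(misrecognized, reference_words, confusion_sets):
    """Steps e) to g): drop reference words with a mismatch outside any confusion set."""
    candidates = []
    for ref in reference_words:
        # Assumption: only equal-length words are compared position by position.
        if len(ref) != len(misrecognized):
            continue
        eliminated = False
        for got, want in zip(misrecognized, ref):
            # Step f): a differing pair that shares no confusion set eliminates the word.
            if got != want and not same_confusion_set(got, want, confusion_sets):
                eliminated = True
                break
        if not eliminated:
            candidates.append(ref)
    return candidates


candidates = filter_candidates(
    "c1ock", ["clock", "click", "chock", "crock"], CONFUSION_SETS
)
print(candidates)
```

In this example only "clock" survives, because "1" and "l" share a confusion set while the other reference words each contain a mismatch that no set explains. Step h) would then choose among the survivors by the predetermined criteria.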
6. The method according to claim 1, wherein:
the step b) comprises determining whether the recognized word is spelled correctly.
-
-
7. The method according to claim 1, wherein:
the step b) comprises determining whether the recognized word comprises at least one of a correctly spelled word, a grammatically correct word, and a contextually correct word.
-
8. The method according to claim 1, wherein the confusion sets are derived from at least one confusion matrix.
-
9. The method according to claim 8, wherein each one of the at least one confusion matrix corresponds to a different font.
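Claims 8 and 9 state that the confusion sets are derived from one or more confusion matrices, one per font, but do not say how. The following Python sketch shows one possible derivation, offered purely as an assumption: the matrix is a nested dictionary of confusion probabilities (rows taken to be the actual character, columns the recognized character), and characters are grouped whenever the probability in either direction meets a threshold. The function name, the threshold rule, the sample probabilities, and the font label are all hypothetical.

```python
def derive_confusion_sets(matrix, threshold):
    """Group characters whose confusion probability meets the threshold."""
    chars = sorted(matrix)
    parent = {c: c for c in chars}

    def find(c):
        # Union-find lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a in chars:
        for b, p in matrix[a].items():
            if a != b and b in parent and p >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for c in chars:
        groups.setdefault(find(c), set()).add(c)
    # Keep only groups with at least two members (a "plurality" of characters).
    return sorted((g for g in groups.values() if len(g) >= 2), key=min)


# Hypothetical confusion matrix for a single font.
MATRIX = {
    "l": {"l": 0.80, "1": 0.12, "i": 0.08},
    "1": {"1": 0.85, "l": 0.15},
    "i": {"i": 0.90, "l": 0.10},
    "o": {"o": 0.88, "0": 0.12},
    "0": {"0": 0.90, "o": 0.10},
    "m": {"m": 1.00},
}

# One set of confusion sets per font, following claim 9.
sets_by_font = {"hypothetical-font": derive_confusion_sets(MATRIX, threshold=0.09)}
print(sets_by_font)
```

Because the sets depend only on the matrix, they can be computed once per font and reused for every recognized word, which is consistent with the requirement in claim 1 that the content of each confusion set be independent of the recognized word.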
-
10. The method of claim 1, wherein the at least one word in the document comprises a printed word appearing on a physical document.
-
11. The method of claim 1, wherein the at least one word in the document comprises a handwritten word appearing on a physical document.
-
12. The method according to claim 1, wherein the recognized word is provided by an optical character recognition technique.
-
13. The method according to claim 1, wherein the step c) comprises:
-
i) determining for each character sequence of the misrecognized word the confusion set to which the character member occupying the character sequence belongs;
ii) replacing the character member in each character sequence with each character member of the corresponding confusion set determined in step i), each replacement operation producing a different character string; and
iii) eliminating from the character strings those character strings comprising non-words, the remaining character strings corresponding to the set of reference words.
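The three sub-steps of claim 13 can be sketched in Python as follows. The claim leaves open whether replacements are made one position at a time or in combination; this sketch assumes every combination is generated, so that words with more than one misread character are still reachable. The confusion sets, the dictionary, and the function name `generate_reference_words` are hypothetical.

```python
from itertools import product

# Hypothetical confusion sets and word list, for illustration only.
CONFUSION_SETS = [set("il1"), set("o0"), set("ce")]
DICTIONARY = {"clock", "click", "cloak", "elect"}


def generate_reference_words(misrecognized, confusion_sets, dictionary):
    """Build reference words by swapping characters within their confusion sets."""
    options = []
    for ch in misrecognized:
        # Step i): find the confusion set(s) containing this character.
        # Assumption: a character in no set is left unchanged.
        alternatives = {ch}
        for s in confusion_sets:
            if ch in s:
                alternatives |= s
        options.append(sorted(alternatives))
    # Step ii): each combination of replacements yields a different character string.
    strings = ("".join(combo) for combo in product(*options))
    # Step iii): discard the strings that are not words.
    return sorted(w for w in strings if w in dictionary)


reference_words = generate_reference_words("c1ock", CONFUSION_SETS, DICTIONARY)
print(reference_words)
```

The number of generated strings grows with the product of the confusion set sizes, so a practical system would likely prune early or limit the number of simultaneous replacements; that choice is outside what the claim specifies.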
-
-
14. The method according to claim 1, wherein the step h) comprises:
-
i) assigning to each candidate reference word a predetermined grammar probability;
ii) reducing the set of candidate reference words to each reference word having the highest grammar probability; and
iii) selecting, if more than one of the set of candidate reference words are associated with the highest grammar probability, the most contextually correct candidate reference word.
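The two-stage selection of claim 14 can be sketched as below. The claim does not say where the grammar probabilities or the measure of contextual correctness come from, so both are passed in as plain lookup tables with invented values; the function name `select_by_grammar` is likewise hypothetical.

```python
def select_by_grammar(candidates, grammar_prob, context_score):
    """Pick the candidate with the highest grammar probability, breaking ties by context."""
    # Steps i) and ii): keep only the candidates with the highest grammar probability.
    best = max(grammar_prob[w] for w in candidates)
    top = [w for w in candidates if grammar_prob[w] == best]
    if len(top) == 1:
        return top[0]
    # Step iii): among tied candidates, choose the most contextually correct one.
    return max(top, key=lambda w: context_score[w])


# Hypothetical scores for three candidate reference words.
grammar_prob = {"clock": 0.6, "cloak": 0.6, "click": 0.3}
context_score = {"clock": 0.9, "cloak": 0.2, "click": 0.5}

choice = select_by_grammar(["clock", "cloak", "click"], grammar_prob, context_score)
print(choice)
```

Here "clock" and "cloak" tie on grammar probability, so the context score decides between them.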
-
-
15. The method according to claim 1, wherein the step h) comprises:
-
i) assigning an associative weighting to each one of the set of candidate reference words; and
ii) selecting the reference word with the highest associative weighting.
-
-
16. The method according to claim 15, wherein the step i) comprises:
-
iii) assigning a character change weighting and a character identity weighting to each one of the plurality of confusion sets;
iv) obtaining a first one of the set of candidate reference words;
v) determining for each character sequence of the candidate reference word the confusion set to which the character member occupying the character sequence belongs;
vi) determining for each character sequence of the candidate reference word whether the character member included therein is the same as the character member of the corresponding character sequence of the misrecognized word;
vii) assigning to each character sequence of the candidate reference word one of the character change weighting and the character identity weighting of the confusion set associated with the character member occupying each character sequence of the candidate reference word;
viii) determining an associative weighting for the candidate reference word on the basis of the character weightings assigned to each character sequence in step vii); and
ix) repeating steps v)-viii) for each candidate reference word.
-
-
17. The method according to claim 16, wherein the step viii) comprises multiplying together each of the one of the character change weightings and character identity weightings assigned to each character sequence of the candidate reference word.
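The associative weighting of claims 15 through 17 can be worked through in a short Python sketch. Each confusion set carries a character change weighting and a character identity weighting; every position of a candidate receives one or the other depending on whether its character matches the misrecognized word, and the per-position weightings are multiplied together. The weight values, the fallback `DEFAULT_WEIGHTS` for characters in no listed set, and all function names are assumptions made for illustration.

```python
from math import prod

# (members, change weighting, identity weighting); hypothetical values.
WEIGHTED_SETS = [
    (set("bh"), 0.10, 0.90),
    (set("o0"), 0.20, 0.80),
]
# Assumption: characters that appear in no listed set use these weightings.
DEFAULT_WEIGHTS = (0.05, 0.95)


def weights_for(ch):
    """Step v): look up the weightings of the confusion set containing ch."""
    for members, change, identity in WEIGHTED_SETS:
        if ch in members:
            return change, identity
    return DEFAULT_WEIGHTS


def associative_weighting(candidate, misrecognized):
    """Steps vi) to viii): multiply the per-position weightings (claim 17)."""
    factors = []
    for cand_ch, got_ch in zip(candidate, misrecognized):
        change, identity = weights_for(cand_ch)
        # Identity weighting if the characters match, change weighting otherwise.
        factors.append(identity if cand_ch == got_ch else change)
    return prod(factors)


def select_replacement(candidates, misrecognized):
    """Claim 15, step ii): choose the candidate with the highest weighting."""
    return max(candidates, key=lambda w: associative_weighting(w, misrecognized))


replacement = select_replacement(["bone", "hone"], "b0ne")
print(replacement, associative_weighting("bone", "b0ne"))
```

For the misrecognized word "b0ne", the candidate "bone" needs one character change (0.90 × 0.20 × 0.95 × 0.95) while "hone" needs two (0.10 × 0.20 × 0.95 × 0.95), so "bone" receives the higher associative weighting under these invented values.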
-
18. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character member, the apparatus comprising:
-
a) first means for providing a recognized word based on the word in the document;
b) first means for determining whether the recognized word has been misrecognized;
c) second means for providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
d) third means for providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members, a content of each confusion set being independent of the recognized word;
e) means for comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
f) first means for eliminating any one of the set of reference words if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the reference word and if the character members from the corresponding character sequences in the misrecognized word and the reference word are not from the same confusion set, the remaining non-eliminated reference words comprising a set of candidate reference words; and
g) means for selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the select candidate reference word comprising a replacement word for the misrecognized word.
19. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one number.
-
20. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one alphabetical letter.
-
21. The apparatus according to claim 18, wherein each one of the at least one word in the document, the recognized word, and each reference word comprises at least one typographic character.
-
22. The apparatus according to claim 18, wherein:
the first means for determining comprises means for determining whether the recognized word is spelled correctly.
-
23. The apparatus according to claim 18, wherein:
the first means for determining comprises means for determining whether the recognized word comprises at least one of a correctly spelled word, a grammatically correct word, and a contextually correct word.
-
24. The apparatus according to claim 18, wherein the confusion sets are derived from at least one confusion matrix.
-
26. The apparatus of claim 18, wherein the at least one word in the document comprises a printed word appearing on a physical document.
-
27. The apparatus of claim 18, wherein the at least one word in the document comprises a handwritten word appearing on a physical document.
-
28. The apparatus according to claim 18, wherein the recognized word is provided by an optical character recognition technique.
-
29. The apparatus according to claim 18, wherein the second means for providing comprises:
-
i) fourth means for determining for each character sequence of the misrecognized word the confusion set to which the character member occupying the character sequence belongs;
ii) means for replacing the character member in each character sequence with each character member of the corresponding confusion set determined by the fourth means for determining, each replacement operation producing a different character string; and
iii) second means for eliminating from the character strings those character strings comprising non-words, the remaining character strings corresponding to the set of reference words.
-
-
30. The apparatus according to claim 18, wherein the means for reducing comprises:
-
i) means for assigning to each candidate reference a predetermined grammar probability;
ii) means for reducing the set of candidate reference words to each reference word having the highest grammar probability; and
iii) means for selecting, if more than one of the set of candidate reference words are associated with the highest grammar probability, the most contextually correct candidate reference word.
-
-
31. The apparatus according to claim 18, wherein the means for reducing comprises:
-
i) first means for assigning an associative weighting to each one of the set of candidate reference words; and
ii) means for selecting the reference word with the highest associating weighting.
-
-
-
25. The apparatus according to claim 24, wherein each one of the at least one confusion matrix corresponds to a different font.
-
32. The apparatus according to claim 31, wherein the means for assigning comprises:
-
iii) second means for assigning one of a character change weighting and a character identity weighting to each one of the plurality of confusion sets;
iv) means for obtaining each one of the set of candidate reference words;
v) fifth means for determining for each character sequence of the candidate reference word the confusion set to which the character member occupying the character sequence belongs;
vi) sixth means for determining for each character sequence of the candidate reference word whether the character member included therein is the same as the character member of the corresponding character sequence of the misrecognized word;
vii) third means for assigning to each character sequence of the candidate reference word one of the character change weighting and the character identity weighting of the confusion set associated with the character member occupying each character sequence of the candidate reference word; and
viii) seventh means for determining an associative weighting for each candidate reference word on the basis of the character weightings assigned to each character sequence by the third means for assigning.
-
-
33. The apparatus according to claim 32, wherein the seventh means for determining comprises means for multiplying together each of the one of the character change weightings and character identity weightings assigned to each character sequence of each candidate reference word.
-
34. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
-
a scanning device; and
a processing device in communication with the scanning device, wherein the processing device comprises:
a central processing unit,
an optical character recognition module in communication with the central processing unit,
a confusion matrix memory in communication with the central processing unit,
a word checking module in communication with the central processing unit,
a confusion set generating module in communication with the central processing unit, each confusion set having a content that is independent of an input to the scanning device,
a confusion set memory in communication with the central processing unit, and
a word correction module in communication with the central processing unit.
-
35. The apparatus according to claim 34, wherein the word checking module comprises at least one of a spell checking module, a grammar checking module, and a natural language module.
-
36. The apparatus according to claim 35, wherein the processing device further comprises:
-
a data input device in communication with the central processing unit; and
a display device in communication with the central processing unit.
-
-
37. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
-
a scanning device; and
a processing device in communication with the scanning device, wherein the processing device comprises:
a central processing unit,
an optical character recognition unit in communication with the central processing unit,
a word checking module in communication with the central processing unit,
a confusion set memory in communication with the central processing unit and provided with a plurality of predetermined confusion sets, each predetermined confusion set having a content that is independent of an input to the scanning device, and
a word correction module in communication with the central processing unit.
-
38. The apparatus according to claim 37, wherein the word checking module comprises at least one of a spell checking module, a grammar checking module, and a natural language module.
-
40. A method of recognizing at least one word in a document, the word including at least one predetermined character member, the method comprising the steps of:
-
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word has been misrecognized;
c) providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
d) providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members;
e) comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
f) eliminating the current reference word if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the current reference word and if the character members from the corresponding character sequences in the misrecognized word and the current reference word are not from the same confusion set;
g) repeating steps e) and f) for each reference word of the set of reference words, the remaining non-eliminated reference words comprising a set of candidate reference words; and
h) selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the selected candidate reference word comprising a replacement word for the misrecognized word, wherein the confusion sets are derived from at least one confusion matrix, and wherein the at least one confusion matrix is generated prior to deriving the plurality of confusion sets therefrom.
-
-
41. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character member, the apparatus comprising:
-
a) first means for providing a recognized word based on the word in the document;
b) first means for determining whether the recognized word has been misrecognized;
c) second means for providing, if the recognized word has been misrecognized, a set of reference words, each reference word comprising a different set of predetermined character members;
d) third means for providing a plurality of confusion sets, each confusion set grouping together a different plurality of character members;
e) means for comparing at least one character sequence of the misrecognized word with a corresponding character sequence of a current one of the set of reference words to determine which corresponding character sequences do not include the same character members;
f) first means for eliminating any one of the set of reference words if the character member of any character sequence of the misrecognized word does not correspond to the character member of the corresponding character sequence in the reference word and if the character members from the corresponding character sequences in the misrecognized word and the reference word are not from the same confusion set, the remaining non-eliminated reference words comprising a set of candidate reference words; and
g) means for selecting one of the set of candidate reference words in accordance with a set of predetermined criteria, the select candidate reference word comprising a replacement word for the misrecognized word, and wherein the at least one confusion matrix is generated prior to deriving the plurality of confusion sets therefrom.
-
Specification