Method and apparatus for performing an automatic correction of misrecognized words produced by an optical character recognition technique by using a Hidden Markov Model based algorithm
First Claim
1. A method of recognizing at least one word in a document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word matches the word in the document, the recognized word comprising a misrecognized word in the absence of a match with the word in the document;
c) if the determining step produces the misrecognized word, generating a set of reference words, wherein each one of the reference words comprises a different set of characters;
d) determining for each of the reference words a corresponding replacement word value;
e) prior to selecting the reference word for replacing the misrecognized word, reducing an amount of reference words from the set of reference words in order to form a subset of reference words from the set of reference words by eliminating according to an elimination operation any reference word that does not belong to a predetermined vocabulary, the subset of reference words being limited to only those reference words that belong to the predetermined vocabulary, the elimination operation being different than the generating of the set of reference words; and
f) selecting from the subset of reference words one reference word for replacing the misrecognized word based on the replacement word values of the corresponding reference words in the subset of the reference words.
Abstract
A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition (“OCR”) technique. Each recognized word is generated by first producing, for each character position of the corresponding word in the original document, the N-best characters for occupying that character position. If an incorrect word is found in the electronic document, the present invention generates a plurality of reference words from which one is selected for replacing the incorrect word. This selected reference word is determined by the present invention to be the reference word that is the most likely correct replacement for the incorrect recognized word. This selection is accomplished by computing for each reference word a replacement word value. The reference word that is selected to replace the incorrect recognized word corresponds to the highest replacement word value.
35 Claims
-
1. A method of recognizing at least one word in a document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word matches the word in the document, the recognized word comprising a misrecognized word in the absence of a match with the word in the document;
c) if the determining step produces the misrecognized word, generating a set of reference words, wherein each one of the reference words comprises a different set of characters;
d) determining for each of the reference words a corresponding replacement word value;
e) prior to selecting the reference word for replacing the misrecognized word, reducing an amount of reference words from the set of reference words in order to form a subset of reference words from the set of reference words by eliminating according to an elimination operation any reference word that does not belong to a predetermined vocabulary, the subset of reference words being limited to only those reference words that belong to the predetermined vocabulary, the elimination operation being different than the generating of the set of reference words; and
f) selecting from the subset of reference words one reference word for replacing the misrecognized word based on the replacement word values of the corresponding reference words in the subset of the reference words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 32, 34
i) providing, for every character position of the word in the document, a set of N-best characters for occupying that character position, wherein N is an integer greater than zero; and
ii) generating the recognized word based on the set of N-best characters provided for every character position of the word in the document.
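Steps i) and ii) can be illustrated with a minimal sketch. It assumes each character position carries an N-best list of (character, score) pairs and that the recognized word takes the top-scoring character at every position. The claim only requires that the word be "based on" the N-best sets, so the top-choice rule, the data layout, the sample scores, and the names `NBest` and `recognized_word` are all illustrative assumptions.

```python
from typing import List, Tuple

# One N-best list per character position: (candidate character, score) pairs.
NBest = List[Tuple[str, float]]


def recognized_word(positions: List[NBest]) -> str:
    """Build a word by taking the highest-scoring candidate at each position."""
    return "".join(max(cands, key=lambda c: c[1])[0] for cands in positions)


# Hypothetical OCR output for a three-letter word, with N = 2.
positions = [
    [("d", 0.9), ("o", 0.1)],
    [("0", 0.6), ("o", 0.4)],  # the digit zero narrowly outscores the letter "o"
    [("g", 0.8), ("q", 0.2)],
]
word = recognized_word(positions)
```

With these made-up scores the recognized word is "d0g", which a later checking step could flag as misrecognized.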
-
-
8. The method according to claim 7, wherein the step b) comprises determining whether the recognized word is spelled correctly, and wherein each of the reference words is determined based on the set of N-best characters provided for every character position of the word in the document.
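One way to derive reference words from the N-best characters is to enumerate every combination that picks one candidate per position. The sketch below assumes this exhaustive enumeration; the claim says only that the reference words are "determined based on" the N-best sets, so the strategy and the name `reference_words` are assumptions made for illustration.

```python
from itertools import product
from typing import List


def reference_words(nbest_chars: List[List[str]]) -> List[str]:
    """Enumerate every word formed by picking one N-best character per position."""
    return ["".join(combo) for combo in product(*nbest_chars)]


# Hypothetical N-best characters (N = 2) for a three-letter word.
nbest_chars = [["d", "o"], ["0", "o"], ["g", "q"]]
candidates = reference_words(nbest_chars)
# 2 * 2 * 2 = 8 candidates; they are distinct as long as the characters
# listed for each position are distinct.
```

The number of candidates grows as N raised to the word length, so a practical system would likely prune low-scoring combinations early.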
-
9. The method according to claim 8, further comprising the step of associating a character value with each one of the N-best characters provided for every character position of the word in the document.
-
10. The method according to claim 9, wherein the step d) comprises, for each reference word, determining a reference word recognition value based on each of the character values associated with each of the N-best characters used to form the reference word, each replacement word value for each reference word comprising a corresponding reference word recognition value.
-
11. The method according to claim 10, wherein each reference word recognition value is determined by multiplying the character values obtained for the corresponding reference word.
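The multiplication in claim 11 can be shown with a short worked sketch. It assumes the character values are probabilities stored in one dictionary per character position; the function name `recognition_value` and the sample numbers are hypothetical.

```python
import math
from typing import Dict, List


def recognition_value(word: str, char_values: List[Dict[str, float]]) -> float:
    """Multiply the character values of the characters that spell `word`."""
    return math.prod(char_values[i][ch] for i, ch in enumerate(word))


# Hypothetical character values for each of three character positions.
char_values = [
    {"d": 0.9, "o": 0.1},
    {"0": 0.6, "o": 0.4},
    {"g": 0.8, "q": 0.2},
]
value_dog = recognition_value("dog", char_values)  # 0.9 * 0.4 * 0.8
value_d0g = recognition_value("d0g", char_values)  # 0.9 * 0.6 * 0.8
```

Here "d0g" scores higher than "dog" on recognition value alone, which is why a separate vocabulary check is needed before a replacement is chosen. For long words, summing logarithms of the values instead of multiplying them avoids floating-point underflow while preserving the ranking.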
-
12. The method according to claim 11, wherein the step e) comprises including in the subset of reference words only those reference words that are present in a dictionary.
-
13. The method according to claim 12, wherein the reference word selected in step f) corresponds to the reference word associated with a highest replacement word value.
-
14. The method according to claim 13, wherein the reference word selected to replace the misrecognized word in step f) has a replacement word value at least as great as a predetermined threshold.
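Claims 12 through 14 together describe a filter, a maximum, and a threshold test. The sketch below combines them under stated assumptions: replacement word values are already computed and held in a dictionary keyed by reference word, the vocabulary is a plain set of strings, and returning `None` when no word qualifies is a choice made here, since the claims do not say what happens in that case. The name `select_replacement` is hypothetical.

```python
from typing import Dict, Optional, Set


def select_replacement(values: Dict[str, float],
                       dictionary: Set[str],
                       threshold: float) -> Optional[str]:
    """Return the best in-dictionary reference word, or None if none qualifies."""
    # Step e): keep only reference words found in the dictionary.
    subset = {w: v for w, v in values.items() if w in dictionary}
    if not subset:
        return None
    # Step f): choose the word with the highest replacement word value.
    best = max(subset, key=subset.get)
    # Claim 14: accept the word only if its value reaches the threshold.
    return best if subset[best] >= threshold else None


# Hypothetical replacement word values and dictionary.
values = {"d0g": 0.432, "dog": 0.288, "doq": 0.072, "oog": 0.032}
dictionary = {"dog", "cat", "dig"}

accepted = select_replacement(values, dictionary, threshold=0.1)
rejected = select_replacement(values, dictionary, threshold=0.5)
```

In this example "d0g" has the highest value but is eliminated by the dictionary filter, so "dog" is selected when the threshold is 0.1 and nothing is selected when the threshold is 0.5.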
-
15. The method according to claim 9, wherein for each reference word the step d) comprises:
i) determining a reference word recognition value based on each of the character values associated with each of the N-best characters used to form the reference word,
ii) generating a plurality of grammar values based on the misrecognized word, each grammar value corresponding to a different part of speech, and
iii) determining for the reference word a composite value based on the corresponding reference word recognition value and on one of the plurality of grammar values corresponding to the part of speech of the reference word, each replacement word value comprising a corresponding composite value.
-
-
16. The method according to claim 15, wherein the step b) further comprises:
-
i) determining, if the recognized word is correctly spelled, whether the recognized word is grammatically correct; and
ii) determining, if the recognized word is spelled correctly and is grammatically correct, whether the recognized word is contextually correct.
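The ordering of the checks in claim 16 can be sketched as a short cascade in which each later check runs only if the earlier ones pass. The three checker functions are hypothetical placeholders supplied by the caller, and treating a failure at any stage as a misrecognition is an assumption of this sketch rather than something the claim states.

```python
from typing import Callable

Check = Callable[[str], bool]


def is_misrecognized(word: str, spelled_ok: Check,
                     grammar_ok: Check, context_ok: Check) -> bool:
    """Run spelling, grammar, and context checks in order, stopping at the first failure."""
    if not spelled_ok(word):
        return True
    if not grammar_ok(word):  # reached only for correctly spelled words
        return True
    return not context_ok(word)  # reached only if spelling and grammar pass


# Placeholder checkers; a real system would inspect the surrounding sentence.
vocabulary = {"dog", "eat"}
spelled_ok = lambda w: w in vocabulary
grammar_ok = lambda w: w != "eat"  # pretend a verb does not fit this slot
context_ok = lambda w: True

flags = {w: is_misrecognized(w, spelled_ok, grammar_ok, context_ok)
         for w in ("d0g", "eat", "dog")}
```

With these placeholders, "d0g" fails the spelling check, "eat" passes spelling but fails the grammar check, and "dog" passes all three.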
-
-
17. The method according to claim 15, wherein each reference word recognition value is determined by multiplying together each of the character values associated with each of the N-best characters used to form the reference word, and wherein each composite value is determined by multiplying each reference word recognition value with one of the plurality of grammar values associated with the part of speech of the corresponding reference word.
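The composite value of claims 15 and 17 can be illustrated as a single multiplication per reference word. The sketch assumes the recognition values, the part of speech of each reference word, and the grammar values are already available as dictionaries; how the grammar values are generated is not shown, and all names and numbers below are hypothetical.

```python
from typing import Dict


def composite_value(word: str,
                    recognition_values: Dict[str, float],
                    part_of_speech: Dict[str, str],
                    grammar_values: Dict[str, float]) -> float:
    """Multiply a word's recognition value by the grammar value of its part of speech."""
    return recognition_values[word] * grammar_values[part_of_speech[word]]


# Hypothetical inputs for two in-vocabulary reference words.
recognition_values = {"dog": 0.288, "dig": 0.300}
part_of_speech = {"dog": "noun", "dig": "verb"}
# Hypothetical grammar values for the position of the misrecognized word.
grammar_values = {"noun": 0.7, "verb": 0.2}

scores = {w: composite_value(w, recognition_values, part_of_speech, grammar_values)
          for w in recognition_values}
best = max(scores, key=scores.get)
```

Although "dig" has the slightly higher recognition value, the grammar value for a noun lifts "dog" to the higher composite value (0.2016 against 0.06), so "dog" would be selected.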
-
18. The method according to claim 9, wherein each character value is a probability value.
-
19. The method according to claim 1, wherein the at least one word in the document comprises a printed word appearing on a physical document.
-
20. The method according to claim 1, wherein the at least one word in the document comprises a handwritten word appearing on a physical document.
-
21. The method according to claim 1, wherein the recognized word is provided by an optical character recognition technique.
-
32. The method according to claim 1, wherein:
-
the step a) includes the step of generating for each character position of the word in the document a set of characters, each word in the document being associated with a different group of sets of characters, and each reference word associated with the misrecognized word is derived from the group of sets of characters generated for the misrecognized word.
-
-
34. The method according to claim 1, wherein the step e) includes the step of:
i) eliminating from the set of reference words each reference word that does not satisfy a predetermined criterion, the subset of reference words including each non-eliminated reference word.
-
22. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
-
a) means for providing a recognized word based on the word in the document;
b) first determining means for determining whether the recognized word matches the word in the document, the recognized word comprising a misrecognized word in the absence of a match with the word in the document;
c) means for generating a set of reference words if the first determining means produces the misrecognized word, wherein each one of the reference words comprises a different set of characters;
d) second determining means for determining for each of the reference words a corresponding replacement word value;
e) means for reducing, prior to selecting the reference word for replacing the misrecognized word, an amount of reference words from the set of reference words in order to form a subset of reference words from the set of reference words by eliminating according to an elimination operation any reference word that does not belong to a predetermined vocabulary, the subset of reference words being limited to only those reference words that belong to the predetermined vocabulary, the elimination operation being different than the generating of the set of reference words; and
f) means for selecting from the subset of reference words one reference word for replacing the misrecognized word based on the replacement word values of the corresponding reference words in the subset of the reference words.
Dependent claims: 23, 24, 25, 26, 27, 28, 33, 35
i) means for providing, for every character position of the word in the document, a set of N-best characters for occupying that character position, wherein N is an integer greater than zero; and
ii) means for generating the recognized word based on the set of N-best characters provided for every character position of the word in the document.
-
-
24. The apparatus according to claim 23, wherein the first determining means comprises means for determining whether the recognized word is spelled correctly, wherein each of the reference words is determined based on the set of N-best characters provided for every character position of the word in the document.
-
25. The apparatus according to claim 24, further comprising means for associating a character value with each one of the N-best characters provided for every character position of the word in the document.
-
26. The apparatus according to claim 25, wherein the second determining means comprises, for each reference word, means for determining a reference word recognition value based on each of the character values associated with each of the N-best characters used to form the reference word, each replacement word value for each reference word comprising a corresponding reference word recognition value.
-
27. The apparatus according to claim 25, wherein the second determining means comprises:
i) means for determining a reference word recognition value based on each of the character values associated with each of the N-best characters used to form the reference word,
ii) means for generating a plurality of grammar values based on the incorrect recognized word, each grammar value corresponding to a different part of speech, and
iii) means for determining for the reference word a composite value based on the corresponding reference word recognition value and on one of the plurality of grammar values corresponding to the part of speech of the reference word, each replacement word value comprising a corresponding composite value.
-
-
28. The apparatus according to claim 27, wherein the first determining means comprises:
-
i) means for determining, if the recognized word is correctly spelled, whether the recognized word is grammatically correct; and
ii) means for determining, if the recognized word is spelled correctly and is grammatically correct, whether the recognized word is contextually correct.
-
-
33. The apparatus according to claim 22, wherein:
-
the means for providing the recognized word includes means for generating for each character position of the word in the document a set of characters, each word in the document being associated with a different group of sets of characters, and each reference word associated with the misrecognized word is derived from the group of sets of characters generated for the misrecognized word.
-
-
35. The apparatus according to claim 22, wherein the means for reducing the amount of reference words includes:
i) means for eliminating from the set of reference words each reference word that does not satisfy a predetermined criterion, the subset of reference words including each non-eliminated reference word.
-
29. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
-
a scanning device; and
a processing device in communication with the scanning device, wherein the processing device comprises:
a central processing unit,
an optical character recognition module in communication with the central processing unit,
a character memory in communication with the central processing unit,
a word checking module in communication with the central processing unit, and
a word correction module in communication with the central processing unit for generating a set of reference words if the processing device misrecognizes the word in the document,
wherein before the processing device selects one of the reference words as a replacement for the misrecognized word, the processing device reduces an amount of reference words from the set of reference words in order to form a subset of reference words from the set of reference words by eliminating any reference word that does not belong to a predetermined vocabulary, the subset of reference words being limited to only those reference words that belong to the predetermined vocabulary, and wherein the processing device selects the replacement for the misrecognized word from the subset of reference words.
Dependent claims: 30, 31
a data input device in communication with the central processing unit; and
a display device in communication with the central processing unit.
-
Specification