Confusion matrix based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
Abstract
A method and apparatus for correcting misrecognized words appearing in electronic documents that have been generated by scanning an original document in accordance with an optical character recognition ("OCR") technique. If an incorrect word is found in the electronic document, the present invention generates at least one reference word and selects the reference word that is the most likely correct replacement for the incorrect word. This selection is accomplished by performing a probabilistic determination that assigns to each reference word a replacement word recognition probability. The probabilistic determination is carried out on the basis of a pre-stored confusion matrix that stores a plurality of probability values. The confusion matrix is used to associate each character of a recognized word in the electronic document with a corresponding character of a word in the original document on the basis of these probability values.
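The selection described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the toy matrix `CONFUSION`, the floor value `DEFAULT`, and the function names are all hypothetical, and the sketch assumes that reference words have the same length as the recognized word and that per-character probabilities are combined by multiplication.

```python
from math import prod

# Hypothetical confusion matrix: CONFUSION[(original, recognized)] is the
# assumed probability that `original` on the page is read as `recognized`.
CONFUSION = {
    ("c", "c"): 0.90, ("c", "e"): 0.08,
    ("e", "e"): 0.92, ("e", "c"): 0.05,
    ("a", "a"): 0.95,
    ("t", "t"): 0.97, ("t", "l"): 0.02,
    ("l", "l"): 0.90, ("l", "t"): 0.03,
    ("r", "r"): 0.96,
}
DEFAULT = 0.001  # assumed floor for pairs missing from the toy matrix


def word_probability(reference, recognized, matrix=CONFUSION):
    """Multiply per-character confusion probabilities (same-length words only)."""
    if len(reference) != len(recognized):
        return 0.0  # simplification: no insertions or deletions are modeled
    return prod(matrix.get((o, r), DEFAULT) for o, r in zip(reference, recognized))


def best_replacement(recognized, references, matrix=CONFUSION):
    """Return the (reference, probability) pair with the highest probability."""
    scored = [(ref, word_probability(ref, recognized, matrix)) for ref in references]
    return max(scored, key=lambda pair: pair[1], default=None)


# "eal" is not a dictionary word; score three candidate reference words.
word, probability = best_replacement("eal", ["cat", "eat", "ear"])
print(word, probability)
```

With these made-up probabilities, "eat" scores highest because only its last character requires a t-to-l confusion, which the toy matrix treats as fairly likely.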
116 Citations
33 Claims
1. A method of recognizing at least one word in a document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word is correct;
c) generating, if the recognized word is incorrect, at least one reference word, each reference word comprising a different set of predetermined characters and being generated according to a process that is independent of a content of at least one confusion matrix;
d) determining for each reference word a corresponding replacement word value based on a calculation involving a mathematical function being applied to a content of the at least one confusion matrix; and
e) replacing the incorrect recognized word with the reference word most likely matching the at least one word in the document based on the corresponding replacement word value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. A method of recognizing at least one word in a document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a recognized word based on the word in the document;
b) determining whether the recognized word is correct;
c) providing, if the recognized word is incorrect, at least one reference word, each reference word comprising a different set of predetermined characters;
d) determining for each reference word a corresponding replacement word value based on a calculation derived from a content of at least one confusion matrix; and
e) replacing the incorrect recognized word with the reference word most likely matching the at least one word in the document based on the corresponding replacement word value, wherein;
the at least one confusion matrix includes a plurality of character values,
the step d) comprises, if the recognized word is incorrect;
  i) obtaining for each character sequence in the at least one reference word, a character value representing a likelihood that the character sequence of the at least one reference word is recognized as the corresponding character sequence in the incorrect recognized word, wherein each of the character sequence of the at least one reference word and the character sequence of the incorrect recognized word comprises at least one character position,
  ii) establishing a reference word recognition value based on the obtained character values,
  iii) generating a plurality of grammar values based on the incorrect recognized word, each grammar value corresponding to a different part of speech,
  iv) determining for the reference word a composite value based on the corresponding reference word recognition value and on one of the plurality of grammar values corresponding to the part of speech of the reference word, and
  v) repeating steps i)-iv) for each reference word, each reference word being associated with a corresponding composite value, and each replacement word value comprising a corresponding composite value,
each reference word recognition value is established by multiplying the character values obtained for the corresponding reference word, and
each composite value is determined by multiplying each reference word recognition value with one of the plurality of grammar values corresponding to the part of speech of the associated reference word.
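The two multiplications recited in claim 18 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the character values, the grammar values, the part-of-speech table `pos_of`, and all function names are hypothetical, and each character sequence is taken to be a single character position in words of equal length.

```python
from math import prod

# Hypothetical character values: (reference char, recognized char) -> likelihood.
CHAR_VALUES = {
    ("c", "e"): 0.08, ("e", "e"): 0.92,
    ("a", "a"): 0.95, ("t", "l"): 0.02,
}
FLOOR = 0.001  # assumed value for pairs absent from the toy table


def recognition_value(reference, recognized, char_values=CHAR_VALUES):
    """Steps i)-ii): multiply the character values obtained for one reference word."""
    if len(reference) != len(recognized):
        return 0.0  # simplification: only one-to-one character positions
    return prod(char_values.get((o, r), FLOOR) for o, r in zip(reference, recognized))


def composite_values(recognized, references, grammar_values, pos_of):
    """Steps iii)-v): recognition value times the grammar value for each word's part of speech."""
    return {
        ref: recognition_value(ref, recognized) * grammar_values.get(pos_of[ref], 0.0)
        for ref in references
    }


# Hypothetical grammar values for the slot occupied by the misrecognized word:
# the surrounding context is assumed to favor a noun strongly.
grammar = {"noun": 0.9, "verb": 0.05}
parts_of_speech = {"eat": "verb", "cat": "noun"}

scores = composite_values("eal", ["eat", "cat"], grammar, parts_of_speech)
best = max(scores, key=scores.get)
print(best, scores)
```

In this made-up example "eat" has the higher recognition value, but the grammar value for nouns outweighs it, so "cat" ends up with the higher composite value. That interaction is the reason for combining the two factors.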
19. A method of recognizing at least one word in a document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a recognized word based on the word in the document,
b) determining whether the recognized word is correct,
c) providing, if the recognized word is incorrect, at least one reference word, each reference word comprising a different set of predetermined characters,
d) determining for each reference word a corresponding replacement word value based on a calculation derived from a content of at least one confusion matrix; and
e) replacing the incorrect recognized word with the reference word most likely matching the at least one word in the document based on the corresponding replacement word value,
wherein before the step e), the method further comprises the step of determining whether any of the at least one reference word is associated with a replacement word value higher than a predetermined threshold, and
wherein the reference word used to replace the incorrect recognized word in step e) is associated with a replacement word value that is higher than the predetermined threshold.
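The threshold test of claim 19 can be sketched as a small gate in front of the replacement step. The function name and the sample values are hypothetical. The claim as quoted here does not say what happens when no reference word clears the threshold, so leaving the recognized word unchanged in that case is an assumption of this sketch.

```python
def replace_if_confident(recognized, replacement_values, threshold):
    """Replace `recognized` only when the best reference word exceeds `threshold`.

    `replacement_values` maps each reference word to its replacement word value.
    """
    if not replacement_values:
        return recognized
    best = max(replacement_values, key=replacement_values.get)
    if replacement_values[best] > threshold:
        return best
    # Assumption: with no value above the threshold, keep the word as recognized.
    return recognized


values = {"eat": 0.017, "cat": 0.0015}
print(replace_if_confident("eal", values, threshold=0.01))  # replacement accepted
print(replace_if_confident("eal", values, threshold=0.05))  # word left unchanged
```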
20. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
means for providing a recognized word based on the word in the document;
first determining means for determining whether the recognized word is correct;
means for generating if the recognized word is incorrect, at least one reference word, each reference word comprising a different set of predetermined characters and being generated according to a process that is independent of a content of at least one confusion matrix;
second determining means for determining for each reference word a corresponding replacement word value based on a calculation involving a mathematical function being applied to a content of the at least one confusion matrix; and
means for replacing the incorrect recognized word with the reference word most likely matching the at least one word in the document based on the corresponding replacement word value.
Dependent claims: 21, 22, 23, 24, 25, 26
27. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character, the apparatus comprising:
a scanning device for providing a recognized word based on the word in the document; and
a processing device in communication with the scanning device, wherein the processing device comprises;
  a central processing unit,
  an optical character recognition module in communication with the control processing unit,
  a confusion matrix memory in communication with the central processing unit,
  a word checking module in communication with the central processing unit and for generating a reference word according to a process that is independent of a content of the confusion matrix memory, and
  a word correction module in communication with the central processing unit,
wherein, if the central processing unit operating in accordance with the word checking module determines that the recognized word is incorrect, the central processing unit, operating in accordance with the word correction module and a content of the confusion matrix memory, replaces the recognized with a reference word associated with a replacement word value generated in accordance with a calculation involving a mathematical function being applied to the content of the confusion matrix memory.
Dependent claims: 28, 29
30. An apparatus for recognizing at least one word in a document, the word including at least one predetermined character the apparatus comprising:
a scanning device for providing a recognized word based on the word in the document; and
a processing device in communication with the scanning device, wherein the processing device comprises;
  a central processing unit,
  an optical character recognition module in communication with the control processing unit,
  a confusion matrix memory in communication with the central processing unit,
  a word checking module in communication with the central processing unit,
  a word correction module in communication with the central processing unit,
  a data input device in communication with the central processing unit, and
  a display device in communication with the central processing unit, wherein;
if the central processing unit operating in accordance with the word checking module determines that the recognized word is incorrect, the central processing unit, operating in accordance with the word correction module and a content of the confusion matrix memory replaces the recognized with a reference word associated with a replacement word value generated in accordance with the content of the confusion matrix memory,
the word checking module comprises at least one of a spell checking module, a grammar checking module, and a natural language module,
the confusion matrix memory stores a plurality of confusion matrices,
the data input device includes a selecting device for selecting one of the plurality of confusion matrices, and
the reference word is generated in accordance with the selected confusion matrix.
31. A method of recognizing at least one word in an original document, the word including at least one predetermined character, the method comprising the steps of:
a) providing a first displayable electronic document including at least a recognized word based on the word in the original document;
b) applying a word error detection application to the first displayable electronic document to determine whether the recognized word is correct; and
c) if the recognized word is incorrect, performing the steps of;
d) providing at least one reference word, each reference word comprising a different set of predetermined characters,
e) determining for each reference word a corresponding replacement word value, and
f) providing a second displayable electronic document in which the incorrect recognized word is replaced with the reference word most likely matching the at least one word in the original document based on the corresponding replacement word value.
32. An apparatus for recognizing at least one word in an original document, the word including at least one predetermined character, the apparatus comprising:
means for providing a first displayable electronic document including at least a recognized word based on the word in the original document;
means for applying a word error detection application to the first displayable electronic document to determine whether the recognized word is correct;
means for providing, if the recognized word is incorrect, at least one reference word, each reference word comprising a different set of predetermined characters;
second determining means for determining, if the recognized word is incorrect, for each reference word a corresponding replacement word value; and
means for providing, if the recognized word is incorrect, a second displayable electronic document in which the incorrect recognized word is replaced with the reference word most likely matching the at least one word in the original document based on the corresponding replacement word value.
33. An apparatus for recognizing at least one word in an original document, the word including at least one predetermined character, the apparatus comprising:
a scanning device for producing at least a first displayable electronic document including at least a recognized word based on the word in the original document; and
a processing device in communication with the scanning device, wherein the processing device comprises;
  a central processing unit,
  an optical character recognition module in communication with the control processing unit,
  a confusion matrix memory in communication with the central processing unit,
  a word checking module in communication with the central processing unit, and
  a word correction module in communication with the central processing unit,
wherein, if the central processing unit operating in accordance with the word checking module determines that the recognized word in the first displayable electronic document is incorrect, the central processing unit, operating in accordance with the word correction module and a content of the confusion matrix memory, causes the scanning device to produce a second displayable electronic document in which the incorrect recognized word is replaced with a reference word generated in accordance with the content of the confusion matrix memory.
Specification