METHOD AND SYSTEM FOR VERIFICATION OF UNCERTAINLY RECOGNIZED WORDS IN AN OCR SYSTEM
Abstract
The present invention provides a method and system for confirming uncertainly recognized words, as reported by an Optical Character Recognition process, by using spelling alternatives as search arguments for an Internet search engine. The measured number of hits for each spelling alternative is used to provide a confirmation measure for the most probable spelling alternative. Whenever the confirmation measure is inconclusive, a plurality of search strategies are used to reach a measured result comprising zero hits for every spelling alternative except one, which is used as the correct alternative.
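As an illustration of the confirmation measure described in the abstract, the following is a minimal sketch only. It assumes that hit counts have already been obtained for each spelling alternative and that the measure is each alternative's share of the total hits; the function names `relative_hits` and `sole_survivor` are hypothetical and do not appear in the patent.

```python
def relative_hits(hits):
    """Map each spelling alternative to its share of the total hit count.

    Assumption: the confirmation measure is a simple share of all hits.
    """
    total = sum(hits.values())
    if total == 0:
        return {word: 0.0 for word in hits}
    return {word: count / total for word, count in hits.items()}


def sole_survivor(hits):
    """Return the only alternative with a nonzero hit count, else None.

    This mirrors the conclusive case in the abstract: zero hits for every
    spelling alternative except one.
    """
    nonzero = [word for word, count in hits.items() if count > 0]
    return nonzero[0] if len(nonzero) == 1 else None


if __name__ == "__main__":
    counts = {"clear": 900, "c1ear": 100}  # made-up hit counts
    print(relative_hits(counts))
    print(sole_survivor({"clear": 5, "c1ear": 0}))
```

When `sole_survivor` returns `None`, the result is inconclusive in the sense of the abstract, and a further search strategy would be needed.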
113 Claims
1-57. (canceled)
58. A method for resolving contradicting output data from an Optical Character Recognition (OCR) system, wherein the output data comprises at least one word with at least one uncertainly recognized character, wherein the at least one uncertainly recognized character is reported in the output data together with probable alternatives for the at least one uncertainly recognized character, and the words wherein the at least one uncertainly recognized character has been encountered in an image of a text being processed by the OCR system, the method comprises the steps of:
using an Internet search engine with search arguments established according to a search strategy comprising:
a) providing initial search arguments by forming spelling alternatives for the words comprising the at least one uncertainly recognized character by substituting the at least one uncertainly recognized character with the reported probable alternatives for the at least one character, one by one, and in possible combinations in each encountered word, or by removing a character, thereby forming a plurality of spelling alternatives, and then measuring and recording the number of hits for search results of each respective spelling alternative that has been formed in this manner,
b) comparing the measured number of hits for each of the spelling alternatives with an upper predefined relative threshold level and a lower predefined relative threshold level, wherein each of the respective comparisons of the plurality of measurements falls into one of three possible outcomes:
i) if the measurement of a spelling alternative is above the predefined relative upper threshold level, the corresponding spelling alternative for this measurement is the correct spelling alternative for the word, and terminating the Internet search,
ii) if the measurement of a spelling alternative is below the lower predefined relative threshold level, the corresponding spelling alternative for this measurement is deemed non-existing, and the word with this spelling alternative is discarded from further investigations, and continuing with other spelling alternatives that have been formed as search arguments for the Internet search engine,
iii) if the measurement of a spelling alternative falls between the upper relative threshold level and the lower relative threshold level, exiting the Internet search engine and modifying the search strategy providing further search arguments as a combination of members of the remaining spelling alternatives and other words encountered in the document, other character alternatives for the at least one uncertainly recognized character, phrases, adapting the upper relative threshold level, adapting the lower relative threshold level, and/or other information related to the output data from the OCR system, before continuing using the search strategy providing further measurements and comparisons for resolving the contradicting output data,
c) continuing processing step b) a number of predefined times, or until there is only one spelling alternative left, whichever occurs first, providing an iteration amongst a plurality of different search arguments used in the search strategy before terminating step b), and using the remaining spelling alternative having the highest measurement above the upper relative threshold level as the correct spelling alternative.
Dependent claims: 59-85.
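Steps a) and b) of claim 58 can be illustrated with a short sketch. It is not the patented implementation. It assumes that the OCR output can be represented as a mapping from character positions to candidate characters, that an empty string stands for removing a character, and that a "relative" threshold is compared against each alternative's share of the total hits. The names `spelling_alternatives` and `classify` are hypothetical.

```python
from itertools import product


def spelling_alternatives(word, uncertain):
    """Form every spelling obtained by substituting the uncertain positions.

    `uncertain` maps a character index in `word` to the candidate characters
    reported for that position (hypothetical input format). An empty string
    as a candidate models removing the character.
    """
    positions = sorted(uncertain)
    results = []
    for combo in product(*(uncertain[p] for p in positions)):
        chars = list(word)
        for pos, ch in zip(positions, combo):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate not in results:  # keep first occurrence, preserve order
            results.append(candidate)
    return results


def classify(hits, upper, lower):
    """Place each alternative into one of the three outcomes of step b)."""
    total = sum(hits.values())
    outcome = {"accepted": [], "discarded": [], "undecided": []}
    for word, count in hits.items():
        share = count / total if total else 0.0
        if share > upper:          # outcome i): taken as the correct spelling
            outcome["accepted"].append(word)
        elif share < lower:        # outcome ii): deemed non-existing
            outcome["discarded"].append(word)
        else:                      # outcome iii): search strategy must change
            outcome["undecided"].append(word)
    return outcome


if __name__ == "__main__":
    print(spelling_alternatives("c1ear", {1: ["1", "l", "i"]}))
    made_up_hits = {"clear": 950, "c1ear": 10, "ciear": 40}
    print(classify(made_up_hits, upper=0.9, lower=0.02))
```

The hit counts here are invented for the example; in the claimed method they would be measured from search engine results for each spelling alternative.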
86. A system for resolving contradicting output data from an Optical Character Recognition (OCR) system, wherein the output data comprises at least one word with at least one uncertainly recognized character, wherein the at least one uncertainly recognized character is reported in the output data together with probable alternatives for the at least one uncertainly recognized character, and the words wherein this at least one uncertainly recognized character has been encountered in an image of a text being processed by the OCR system, the system comprises:
a system component using an Internet search engine with search arguments established according to a search strategy comprising:
a) the system component provides initial search arguments by forming spelling alternatives for the words comprising the at least one uncertainly recognized character by substituting the at least one uncertainly recognized character with the reported probable alternatives for the at least one character, one by one, and in possible combinations in each encountered word, or by removing a character, thereby forming a plurality of spelling alternatives, and then measuring and recording the number of hits for search results of each respective spelling alternative that has been formed in this manner,
b) the system component compares the measured number of hits for each of the spelling alternatives with an upper predefined relative threshold level and a lower predefined relative threshold level, wherein each of the respective comparisons of the plurality of measurements falls into one of three possible outcomes:
i) if the measurement of a spelling alternative is above the predefined relative upper threshold level, the corresponding spelling alternative for this measurement is the correct spelling alternative for the word, and the system component terminates the Internet search,
ii) if the measurement of a spelling alternative is below the lower predefined relative threshold level, the corresponding spelling alternative for this measurement is deemed non-existing, and the word with this spelling alternative is discarded from further investigations, and the system component continues with other spelling alternatives that have been formed as search arguments for the Internet search engine,
iii) if the measurement of a spelling alternative falls between the upper relative threshold level and the lower relative threshold level, the system component exits the Internet search engine and modifies the search strategy providing further search arguments as a combination of members of the remaining spelling alternatives and other words encountered in the document, other character alternatives for the at least one uncertainly recognized character, phrases, adapting the upper relative threshold level, adapting the lower relative threshold level, and/or other information related to the output data from the OCR system, before continuing using the search strategy providing further measurements and comparisons for resolving the contradicting output data,
c) the system component continues processing step b) a number of predefined times, or until there is only one spelling alternative left, whichever occurs first, providing an iteration amongst a plurality of different search arguments used in the search strategy before terminating step b), and using the remaining spelling alternative having the highest measurement above the upper relative threshold level as the correct spelling alternative.
Dependent claims: 87-113.
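The iteration in step c) of claim 86 might be organized along the following lines. This is a simplified sketch under stated assumptions: the search engine is represented by a caller-supplied function `count_hits(query)` (hypothetical, no real search API is used), the only strategy modification shown is appending context words from the document to the query, and thresholds are shares of the total hits in each round. The claim also allows adapting the thresholds and using other character alternatives, which this sketch omits.

```python
def resolve(alternatives, count_hits, context_words,
            upper=0.9, lower=0.01, max_rounds=3):
    """Iterate over search strategies until one alternative is confirmed.

    `count_hits` is a hypothetical callable returning the number of hits
    for a query string. Returns the chosen spelling, or None if unresolved.
    """
    remaining = list(alternatives)
    for round_no in range(max_rounds):
        if len(remaining) <= 1:
            break
        # Round 0 searches the bare word; later rounds add context words.
        context = " ".join(context_words[:round_no])
        hits = {w: count_hits(f"{w} {context}".strip()) for w in remaining}
        total = sum(hits.values())
        if total == 0:
            continue  # nothing measured this round; try the next strategy
        shares = {w: hits[w] / total for w in remaining}
        best = max(remaining, key=shares.get)
        if shares[best] > upper:
            return best  # above the upper threshold: accept and stop
        # Drop alternatives below the lower threshold, keep the rest.
        remaining = [w for w in remaining if shares[w] >= lower]
    return remaining[0] if len(remaining) == 1 else None


if __name__ == "__main__":
    # Made-up hit counts standing in for a search engine.
    fake_index = {"clear": 600, "ciear": 400, "c1ear": 0,
                  "clear sky": 99, "ciear sky": 1}
    print(resolve(["clear", "ciear", "c1ear"],
                  lambda q: fake_index.get(q, 0),
                  ["sky"], upper=0.9, lower=0.05))
```

Passing the hit-count function as a parameter keeps the sketch independent of any particular search engine and makes the loop easy to exercise with fixed counts.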
Specification