System and method for improved string matching under noisy channel conditions
Abstract
Described is a system and method for improving string matching in a noisy channel environment. The invention provides a method for identifying string candidates and analyzing the probability that the string candidate matches a user-defined string. In one implementation, a find engine receives a query string, converts an image file into a textual file, and identifies each instance of the query string in the textual file. The find engine identifies candidates within the textual file that may match the query string. The find engine refers to a confusion table to help identify whether candidates that are near matches to the query string are actually matches to the query string but for a common recognition error. Candidates meeting a probability threshold are identified as matches to the query string. The invention further provides for analysis options including word heuristics, language models, and OCR confidences.
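The confusion-table check described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the table entries, the `P_CORRECT` value, the threshold, and the function names are all hypothetical, and the dynamic-programming alignment is simply one way to score a candidate against a query.

```python
# Hypothetical confusion table: (correct string, error string) -> likelihood
# that the recognizer emits the error string in place of the correct one.
CONFUSIONS = {
    ("m", "rn"): 0.05,
    ("l", "1"): 0.04,
    ("cl", "d"): 0.02,
}
P_CORRECT = 0.9  # assumed probability that one character is recognized correctly


def match_probability(query, candidate, confusions=CONFUSIONS, p_correct=P_CORRECT):
    """Return the best-alignment probability that `candidate` is a
    (possibly misrecognized) rendering of `query`; 0.0 if no alignment exists."""
    n, m = len(query), len(candidate)
    # best[i][j] = best probability of query[:i] having produced candidate[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            p = best[i][j]
            if p == 0.0:
                continue
            # Case 1: the next character was recognized correctly.
            if i < n and j < m and query[i] == candidate[j]:
                best[i + 1][j + 1] = max(best[i + 1][j + 1], p * p_correct)
            # Case 2: a known recognition error replaced `correct` with `error`.
            for (correct, error), likelihood in confusions.items():
                if query.startswith(correct, i) and candidate.startswith(error, j):
                    a, b = i + len(correct), j + len(error)
                    best[a][b] = max(best[a][b], p * likelihood)
    return best[n][m]


def is_match(query, candidate, threshold=0.01):
    """Accept the candidate when its probability meets the (assumed) threshold."""
    return match_probability(query, candidate) >= threshold
```

Under these assumed numbers, `is_match("modern", "rnodern")` accepts the candidate because the table explains "rn" as a misrecognized "m", while `is_match("modern", "xodern")` rejects it because no table entry accounts for the "x".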
25 Claims
1. A computer-readable medium having computer-executable components for locating a query string in a document image file, comprising:
a search component in communication with an image file and being configured to transform the image file into a textual file, the textual file including textual data corresponding to graphical representations of the textual data within the image file; and
a confusion table identifying errors that could occur during the transformation of the image file to the textual file, each error in the confusion table having an associated likelihood that the error would occur, wherein the search component is configured to locate instances of a query string within the textual file by comparing the query string to a candidate string in the textual file and determining a probability that the candidate string matches the query string and using the confusion table.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-readable medium, having computer-executable instructions for performing steps, comprising:
receiving a request to locate instances of a query string in a document image file;
transforming the document image file into a document text file using a recognition process;
parsing the document text file to identify a candidate data string that differs from the query string by less than a predetermined factor; and
analyzing the candidate data string to identify a probability that the candidate data string matches the query string.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
comparing a portion of the candidate data string to a table associating error strings with correct strings, wherein the error strings result from the recognition process misrecognizing the correct strings; and
if the portion of the candidate data string corresponds to an error string, determining if replacing the portion with the correct string corresponding to the error string increases the probability that the candidate data string matches the query string, and if so, identifying the candidate data string as a match.
17. The computer-readable medium of claim 16, wherein each error string/correct string pair has an associated probability that describes a likelihood that the correct string was misrecognized as the error string, and wherein determining if replacing the portion increases the probability that the candidate matches the query string is based on the likelihood that the correct string was misrecognized as the error string.
18. The computer-readable medium of claim 16, wherein analyzing the candidate data string further comprises evaluating whether a confidence value associated with the portion of the candidate data string affects the probability that the candidate data string matches the query string, the confidence value quantifying a confidence that characters in the document text file are an accurate representation of the corresponding characters in the document image file.
19. The computer-readable medium of claim 16, wherein analyzing the candidate data string further comprises including word heuristics in the analysis.
20. The computer-readable medium of claim 16, wherein analyzing the candidate data string further comprises including process language models in the analysis.
21. A computer-implemented method for locating strings in a document, comprising:
receiving a request to locate instances of a query string in a document image file;
transforming the document image file into a document text file using a recognition process;
performing a fast approximate string match on the document text file to identify a candidate data string that differs from the query string by less than a predetermined factor; and
analyzing the candidate data string to identify a probability that the candidate data string matches the query string.
Dependent claims: 22, 23, 24, 25
comparing a portion of the candidate data string to a confusion table that associates error strings with correct strings, wherein the error strings result from the recognition process misrecognizing the correct strings, and further wherein each error string/correct string pair has an associated probability that describes a likelihood that the correct string was misrecognized as the error string; and
if the portion of the candidate data string corresponds to an error string, determining if replacing the portion with the correct string corresponding to the error string increases the probability that the candidate data string matches the query string, and if so, identifying the candidate data string as a match.
23. The computer-implemented method of claim 21, wherein analyzing the candidate data string comprises evaluating whether a confidence value associated with the portion of the candidate data string affects the probability that the candidate data string matches the query string, the confidence value quantifying a confidence that characters in the document text file are an accurate representation of the corresponding characters in the document image file.
24. The computer-readable medium of claim 21, wherein analyzing the candidate data string comprises including word heuristics in the analysis.
25. The computer-readable medium of claim 21, wherein analyzing the candidate data string comprises including process language models in the analysis.
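The candidate-identification step recited in claims 13 and 21 (finding strings that differ from the query "by less than a predetermined factor") might look like the sketch below. It assumes, purely for illustration, that the difference is measured as Levenshtein edit distance over whitespace-separated words; the claims do not specify a metric, and `edit_distance`, `find_candidates`, and the `limit` parameter are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def find_candidates(text, query, limit=3):
    """Return (offset, word) pairs for words in the recognized text whose
    edit distance to the query is strictly less than `limit` (an assumed
    stand-in for the 'predetermined factor')."""
    results = []
    offset = 0
    for word in text.split():
        offset = text.index(word, offset)  # position of this word in the text
        if edit_distance(word.lower(), query.lower()) < limit:
            results.append((offset, word))
        offset += len(word)
    return results
```

A call such as `find_candidates(recognized_text, "modern")` would return near matches like "rnodern" alongside exact hits; each candidate would then be passed to the probability analysis described in the claims before being reported as a match.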
Specification