Merging three optical character recognition outputs for improved precision using a minimum edit distance function
First Claim
1. A method for determining the content of an ancestral printed text from outputs of text to data conversion devices, each of said pages of printed text being made up of lines of symbols, and each of said outputs representing text symbols within strings-of-lines, said method comprising the steps of:
- generating three separate ones of said outputs, A, B, and C, each of said three outputs being derived from a common one of said pages of said printed text, and each of said outputs being generated by an optical character recognition device;
merging said outputs A, B, and C by,determining edit distances of the strings of lines, each line considered as a single unit, between the outputs A and B, A and C, and B and C,recovering an optimal alignment between outputs A, B, and C by searching for the minimum edit distance D(A,B,C) by backtracking through the determined edit distances, andproviding a derived output representing said content of said common one of said pages of printed text, the string-of-lines said derived output representing said optimal alignment; and
recording said derived output on a memory medium.
1 Assignment
0 Petitions
Accused Products
Abstract
Three OCR systems are employed for text conversion and the results generated from each of the three are merged using a edit distance algorithm to estimate a correct common text ancestor. To make the process computationally feasible for large strings such as pages of documentation with 3,000 characters, the method is executed in two stages. The first procedure is carried out with each page considered as a string of lines. Where differences exist using the edit distance between the lines on a page to find the optimal alignment of the lines. In the event that choice must be made among three non-null lines, the procedure then is invoked on the three lines , by using the edit distance between the characters on a line to find the optimal alignment. The number of computations required of the procedure is further reduced by comer-cutting that hueristically determines an upper bound on the edit distance and limits calculations to those which do not exceed the upper bound.
-
Citations
21 Claims
-
1. A method for determining the content of an ancestral printed text from outputs of text to data conversion devices, each of said pages of printed text being made up of lines of symbols, and each of said outputs representing text symbols within strings-of-lines, said method comprising the steps of:
-
generating three separate ones of said outputs, A, B, and C, each of said three outputs being derived from a common one of said pages of said printed text, and each of said outputs being generated by an optical character recognition device; merging said outputs A, B, and C by, determining edit distances of the strings of lines, each line considered as a single unit, between the outputs A and B, A and C, and B and C, recovering an optimal alignment between outputs A, B, and C by searching for the minimum edit distance D(A,B,C) by backtracking through the determined edit distances, and providing a derived output representing said content of said common one of said pages of printed text, the string-of-lines said derived output representing said optimal alignment; and recording said derived output on a memory medium. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method for determining the content of an ancestral printed text having pages composed of lines of symbols, from the outputs representing text symbols with strings-of-lines of a text to data conversion device comprising the steps of:
-
generating three of said outputs, A, B, and C, each of said outputs being derived from a common one of said pages of said ancestral printed text and each of said outputs being generated by an optical character recognition device; computing an upper bound, K, on the edit distances between strings of lines represented within said outputs by deriving bounds, KAB, KAC, and KBC, on each respective pair of said outputs A and B, A and C, and B and C and combining said bounds, KAB, KAC, and KBC to derive said upper bound, K, on the total of the edit distances in correspondence with the expression; K=min (KAB +KBC, KAC +KBC, KAB +KAC);
thenmerging said outputs, A, B, and C by determining only those edit distances of value below said upper bound, K, between the strings of lines represented within outputs A and B, A and C, and B and C, then backtracking through the determined edit distances to recover an optimal alignment exhibiting a minimum edit distance D(A,B,C) to provide a merged output representing said content of said common one of said pages of said ancestral printed text; and then recording said merged output within the memory medium of a database.
-
-
9. A method for determining the content of an ancestral printed text having pages composed of lines of symbols, from the outputs representing text symbols within strings-of-lines of a text to data conversion device comprising the steps of:
-
generating three of said outputs, A, B, and C, each of said outputs being derived from a common one of said pages of said ancestral printed text and each of said outputs being generated by an optical character recognition device; normalizing each of said outputs A, B, and C subsequent to said generation thereof by eliminating regions of said outputs not corresponding with said lines of symbols, then, merging said outputs A, B, and C by determining the edit distances between the strings of lines represented within outputs A and B, A and C, and B and C, then backtracking through the determined edit distances to recover an optimal alignment exhibiting a minimum edit distance D(A,B,C) to provide a merged output representing said content of said common one of said pages of said ancestral printed text; and then recording said merged output within the memory medium of a database.
-
-
10. A method for determining the content of an ancestral printed text having pages composed of lines of symbols, from the outputs representing text symbols within strings-of-lines of a text to data conversion device comprising the steps of:
-
generating three of said outputs, A, B, and C, each of said outputs being derived from a common one of said pages of said ancestral printed text and each of said outputs being generated by an optical character recognition device; computing an upper bound, K, on the edit distances between the strings of lines represented within said outputs by; finding the closest matching of groups of said lines of said common one of said pages by determining traces between said outputs A, B, and C, enlarging said groups to lengthen runs of said groups of lines, removing crossings representing traces extending from two lines of one output to one line of another output, then retracing to determine the number of trace based matches between said groups, and estimating the minimum edit distance between said strings-of-lines and setting the value of said upper bound, K, in correspondence therewith, then, merging said outputs A, B, and C by determing only those edit distances of value below said upper bound, K, between strings of lines represented within outputs A and B, A and C, and B and C, then backtracking through the determined edit distances to recover an optimal alignment exhibiting a minimum edit distance D(A,B,C) to provide a merged output representing said content of said common one of said pages of said ancestral printed text; and then recording said merged output within the memory medium of a database.
-
-
11. A method for determining the content of an ancestral printed text having pages composed of lines of symbols, from the outputs representing text symbols within strings-of-lines of a text to data conversion device comprising the steps of:
- generating three of said outputs, A, B, and C, each of said outputs being derived from a common one of said pages of said ancestral printed text, and each of said outputs being generated by an optical character recognition device;
computing an upper bound, K, on the edit distances between the strings of lines represented within said outputs by deriving bounds KAB, KAC, and KBC on the strings-of-lines alignment distance for each respective output pair A and B, A and C, and B and C, said upper bound, K, being for the total strings-of-lines edit distance and being derived in correspondence with the expression; K=min (KAB +KBC, KAC +KBC, KAB +KAC); merging said outputs, A, B, and C by determining only those edit distances of value below said upper bound, K, between the strings of lines represented within outputs A and B, A and C, and B and C, deriving an edit distance array as a three-dimensional matrix having faces corresponding with paired outputs A and B, A and C, and B and C, providing edit distance entries for said array which are derived by determining a lower bound for the cumulative lengths for respective lines, i, j, k of said outputs A, B, and C, and deriving an edit distance entry D(i,j,k) for said array when said lower bound is less than said upper bound, K, then backtracking through the determined edit distances to recover an optimal alignment exhibiting a minimum edit distance D(A,B,C) to provide a merged output representing said content of said common one of said pages of said ancestral printed text; and then recording said merged output within a memory medium of a database. - View Dependent Claims (12, 13, 14, 15)
- generating three of said outputs, A, B, and C, each of said outputs being derived from a common one of said pages of said ancestral printed text, and each of said outputs being generated by an optical character recognition device;
-
16. A system for generating the content of an ancestral printed text, having pages composed of lines of symbols, from outputs within strings of lines representing text symbols of an optical character recognition device, comprising:
-
a first optical character recognition device having an output, A, derived from a given page of said ancestral printed text; a second optical character recognition device having an output, B, derived from said given page of said ancestral printed text; a third optical character recognition device having an output, C, derived from said given page of said ancestral printed text; merge processing means responsive to each of said outputs A, B, and C, as said strings-of-lines, for computing an upper bound, K, on the total alignment distance exhibited by said outputs from bounds, KAB, KAC, and KBC on each respective pair of said outputs A and B, A and C, and B and C, said upper bound, K, being derived in correspondence with the expression; K=min (KAB +KBC, KAC +KBC, KAB +KAC), for deriving an edit distance array as a three-dimensional matrix having faces corresponding with paired outputs A and B, A and C, and B and C, for generating edit distance entries for said array by deriving a lower bound value with respect to said outputs and providing an edit distance entry D(i,j,k) for said array corresponding with the ith line of output A, the jth line of output B, and the kth line of output C when said lower bound value is less than said upper bound, K; for backtracking through said array to recover an optimal alignment of said strings-of-lines exhibiting a minimum edit distance to provide a merged output representing said given page of ancestral printed text; and memory means for receiving and retaining said merged output as an archival database. - View Dependent Claims (17, 18, 19, 20, 21)
-
Specification