Extracting information from symbolically compressed document images
Abstract
A method and apparatus for extracting information from symbolically compressed document images. A deciphering module generates first and second text strings by deciphering respective sequences of template identifiers in first and second symbolically compressed document images. A conditional n-gram module receives the first and second text strings from the deciphering module and extracts n-gram terms therefrom based on a predicate condition. A comparison module generates a measure of similarity between the first and second symbolically compressed document images based on the n-gram terms extracted by the conditional n-gram module.
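As an illustration of the conditional n-gram and comparison modules described in the abstract, the following is a minimal Python sketch and not the patented implementation. The function names, the "character follows a space" predicate, and the use of cosine similarity are all assumptions chosen for the example; the abstract only requires some predicate condition and some measure of similarity.

```python
from collections import Counter
from math import sqrt
from typing import Callable


def conditional_ngrams(text: str, n: int,
                       predicate: Callable[[str, int], bool]) -> list[str]:
    """Keep only the characters whose position satisfies the predicate,
    then form n-grams over the kept characters."""
    selected = [ch for i, ch in enumerate(text) if predicate(text, i)]
    return ["".join(selected[i:i + n]) for i in range(len(selected) - n + 1)]


def follows_space(text: str, i: int) -> bool:
    """Hypothetical predicate: the character is the first letter of a word."""
    return text[i] != " " and (i == 0 or text[i - 1] == " ")


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Assumed similarity measure between two bags of n-gram terms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # The two strings stand in for deciphered text from two document images.
    terms_a = Counter(conditional_ngrams("the quick brown fox", 2, follows_space))
    terms_b = Counter(conditional_ngrams("the quick brown dog", 2, follows_space))
    print(sorted(terms_a))  # ['bf', 'qb', 'tq']
    print(cosine_similarity(terms_a, terms_b))
```

Because the predicate selects only a subset of characters, the resulting terms can stay comparable across two documents even when many of the remaining characters were deciphered incorrectly, which is presumably why the terms are extracted conditionally rather than from every character.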
41 Claims
1. A method comprising:
representing an input document image as a symbolically compressed representation with a sequence of template identifiers;
replacing the template identifiers with alphabet characters according to language statistics to generate a text string representative of text in the input document image; and
extracting conditional n-gram indexing terms from the text string by selecting alphabet characters in the text stream that satisfy a predicate that indicates a subset of combinations of characters in the text string.
Dependent claims: 2-17
18. A document processing system comprising:
a deciphering module to generate a first text string based on a sequence of template identifiers in a first symbolically compressed document image and to generate a second text string based on a sequence of template identifiers in a second symbolically compressed document image;
a conditional n-gram module coupled to receive the first and second text strings from the deciphering module, the conditional n-gram module being configured to extract n-gram indexing terms from the first and second text strings based on a predicate condition that indicates a subset of combinations of characters in the text string; and
a comparison module to generate a measure of similarity between the first and the second symbolically compressed document image based on the indexing terms extracted by the conditional n-gram module.
Dependent claims: 19-37
38. An article of manufacture including one or more computer-readable media that embody a program of instructions to generate a text string from an input document image as a symbolically compressed representation with a sequence of template identifiers, wherein the program of instructions, when executed by one or more processors in one or more processing systems, causes the one or more processors to replace the template identifiers with alphabet characters according to language statistics to generate a text string representative of text in the input document image and to extract one or more conditional n-gram indexing structures from the text string by selecting alphabet characters in the text string that satisfy a predicate that indicates a subset of combinations of characters in the text string.
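The step of replacing template identifiers with alphabet characters according to language statistics can be illustrated with a deliberately simple Python sketch. It treats the sequence of template identifiers as a substitution cipher and assigns characters by frequency rank. This is an assumed simplification for illustration only: the claims do not specify rank matching, the name `decipher` is hypothetical, and the frequency ordering below is an approximate, commonly quoted ordering for English with the space character placed first.

```python
from collections import Counter

# Assumption: approximate English character ordering, most frequent first.
ENGLISH_BY_FREQUENCY = " etaoinshrdlucmfwypvbgkjqxz"


def decipher(template_ids: list[int],
             alphabet_by_frequency: str = ENGLISH_BY_FREQUENCY) -> str:
    """Map each template identifier to a character by frequency rank.

    The most frequent identifier receives the most frequent character,
    and so on. Identifiers beyond the end of the alphabet become '?'.
    """
    counts = Counter(template_ids)
    # Sort by descending count; break ties by identifier for determinism.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    mapping = {
        t: alphabet_by_frequency[rank] if rank < len(alphabet_by_frequency) else "?"
        for rank, t in enumerate(ranked)
    }
    return "".join(mapping[t] for t in template_ids)


if __name__ == "__main__":
    # Identifiers 5, 2 and 8 occur 3, 2 and 1 times respectively.
    print(decipher([5, 2, 5, 5, 2, 8], "abc"))  # abaabc
```

Rank matching of this kind is only reliable on long inputs and will misassign characters with similar frequencies, so the resulting text string should be expected to contain errors.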
Specification