EXTRACTING INFORMATION FROM SYMBOLICALLY COMPRESSED DOCUMENT IMAGES
Abstract
A method and apparatus for extracting information from symbolically compressed document images. A deciphering module generates first and second text strings by deciphering respective sequences of template identifiers in first and second symbolically compressed document images. A conditional n-gram module receives the first and second text strings from the deciphering module and extracts n-gram terms therefrom based on a predicate condition. A comparison module generates a measure of similarity between the first and second symbolically compressed document images based on the n-gram terms extracted by the conditional n-gram module.
40 Claims
1. A method comprising:
representing an input document image with a sequence of template identifiers to reduce storage consumed by the input document image; and
replacing the template identifiers with alphabet characters according to language statistics to generate a text string representative of text in the input document image.
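The replacement step recited in claim 1 can be illustrated with a deliberately simple sketch. It assumes that "language statistics" means a fixed ordering of characters by unigram frequency, and it assigns characters to template identifiers by matching frequency ranks. The frequency ordering, the function name `decipher_by_frequency`, and the "?" placeholder are illustrative assumptions, not details taken from the patent, which may rely on richer statistics.

```python
from collections import Counter

# Assumed ordering of English characters by descending frequency (space first).
# An illustrative stand-in for "language statistics", not taken from the patent.
ENGLISH_BY_FREQUENCY = " etaoinshrdlcumwfgypbvkjxqz"


def decipher_by_frequency(template_ids, alphabet_by_frequency=ENGLISH_BY_FREQUENCY):
    """Map each template identifier to a character by matching frequency ranks.

    Returns the deciphered text and the identifier-to-character mapping.
    """
    counts = Counter(template_ids)
    # Most frequent identifier first; ties are broken by identifier value
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    mapping = {}
    for rank, template_id in enumerate(ranked):
        if rank < len(alphabet_by_frequency):
            mapping[template_id] = alphabet_by_frequency[rank]
        else:
            # More distinct templates than characters: leave unresolved.
            mapping[template_id] = "?"
    text = "".join(mapping[t] for t in template_ids)
    return text, mapping
```

Rank matching of this kind is only reliable on long inputs, since short sequences rarely reproduce the expected frequency ordering.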
16. A method comprising using a hidden Markov model to solve a substitution cipher.
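One way to read claim 16 is to treat the plaintext characters as hidden states and the cipher symbols (here, template identifiers) as observations. The sketch below shows only the decoding half of that idea: Viterbi decoding under the assumption that character bigram probabilities (transitions) and per-character symbol probabilities (emissions) are already available. How the emissions are estimated, for example by iterative re-estimation, is not shown, and all names and parameters are hypothetical rather than drawn from the patent.

```python
def viterbi_decipher(observations, states, log_start, log_trans, log_emit):
    """Return the most likely character string for a sequence of cipher symbols.

    states:          iterable of candidate characters (hidden states)
    log_start[c]:    log P(first character is c)
    log_trans[p][c]: log P(c | previous character p), i.e. bigram statistics
    log_emit[c][o]:  log P(observed symbol o | character c)
    """
    if not observations:
        return ""
    states = list(states)
    # best[c] holds the log-probability of the best path ending in character c.
    best = {c: log_start[c] + log_emit[c][observations[0]] for c in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for c in states:
            # Best predecessor for character c at this position.
            prev = max(states, key=lambda p: best[p] + log_trans[p][c])
            new_best[c] = best[prev] + log_trans[prev][c] + log_emit[c][obs]
            pointers[c] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the most probable final character.
    path = [max(states, key=lambda c: best[c])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))
```

Because the transition probabilities encode language statistics, the decoder can override an ambiguous observation when the surrounding characters make another reading more probable.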
17. A document processing system comprising:
a deciphering module to generate a first text string based on a sequence of template identifiers in a first symbolically compressed document image and to generate a second text string based on a sequence of template identifiers in a second symbolically compressed document image;
a conditional n-gram module coupled to receive the first and second text strings from the deciphering module, the conditional n-gram module being configured to extract n-gram indexing terms from the first and second text strings based on a predicate condition; and
a comparison module to generate a measure of similarity between the first and the second symbolically compressed document image based on the indexing terms extracted by the conditional n-gram module.
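Claim 17 does not fix a particular similarity measure for the comparison module. As a sketch under that caveat, the following assumes cosine similarity over term-frequency vectors built from the extracted indexing terms; the function name `cosine_similarity` and the choice of measure are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two lists of indexing terms (0.0 to 1.0)."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    # Dot product over the terms that occur in both documents.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_sq_a = sum(v * v for v in counts_a.values())
    norm_sq_b = sum(v * v for v in counts_b.values())
    if norm_sq_a == 0 or norm_sq_b == 0:
        # An empty term list has no direction; treat it as dissimilar.
        return 0.0
    return dot / math.sqrt(norm_sq_a * norm_sq_b)
```

A measure based on term counts tolerates some deciphering errors, because a few wrong characters change only a few of the extracted terms.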
33. An apparatus comprising a deciphering module to apply a hidden Markov model to decipher a sequence of template identifiers in a symbolically compressed document to recover a text string from the symbolically compressed document.
34. A method of extracting n-grams from a text, the method comprising:
automatically selecting alphabetic characters in the text that satisfy a predicate; and
concatenating the selected alphabetic characters to form n-grams.
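The two steps of claim 34 can be sketched as follows. The predicate used in the example, "the character immediately follows a space", is an assumption chosen for illustration, as are the function names; the claim itself leaves the predicate open.

```python
def conditional_ngrams(text, predicate, n=3):
    """Select alphabetic characters that satisfy `predicate`, then form n-grams.

    predicate(text, i) returns True if the character at index i is selected.
    """
    selected = [
        text[i]
        for i in range(len(text))
        if text[i].isalpha() and predicate(text, i)
    ]
    # Concatenate each run of n consecutive selected characters.
    return ["".join(selected[i:i + n]) for i in range(len(selected) - n + 1)]


def follows_space(text, i):
    """Example predicate (assumed): the character is preceded by a space."""
    return i > 0 and text[i - 1] == " "


if __name__ == "__main__":
    print(conditional_ngrams("the quick brown fox jumps", follows_space))
```

With this predicate, each n-gram is built from the initial letters of consecutive words, so a single term spans several words of the text.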
36. An article of manufacture including one or more computer-readable media that embody a program of instructions to generate a text string from an input document image represented by a sequence of template identifiers for the purpose of reducing storage consumed by the input document image, wherein the program of instructions, when executed by one or more processors in the processing system, causes the one or more processors to replace the template identifiers with alphabet characters according to language statistics to generate a text string representative of text in the input document image.
40. An article of manufacture including one or more computer-readable media that embody a program of instructions to generate a text string from an input document image represented by a sequence of template identifiers for the purpose of reducing storage consumed by the input document image, wherein the program of instructions, when executed by one or more processors in the processing system, causes the one or more processors to use a hidden Markov model to solve a substitution cipher formed by the sequence of template identifiers.
Specification