Compressed document matching
Abstract
An apparatus and method for determining if a query document matches one or more of a plurality of documents in a database. In a coarse matching stage, a compressed file or other query document is scanned to produce a bit profile. Global statistics such as line spacing and text height are calculated from the bit profile and used to narrow the field of documents to be searched in an image database. The bit profile is cross-correlated with bit profiles of documents in the search space to identify candidates for a detailed matching stage. If multiple candidates are generated in the coarse matching stage, a set of endpoint features is extracted from the query document for detailed matching in the detailed matching stage. Endpoint features contain sufficient information for various levels of processing, including page skew and orientation estimation. In addition, endpoint features are stable, symmetric and easily computable from commonly used compressed files including, but not limited to, CCITT Group 4 compressed files. Endpoint features extracted in the detailed matching stage are used to correctly identify a matching document in a high percentage of cases.
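The coarse matching stage described above can be illustrated with a minimal sketch. It assumes the bit profile is already available as a list holding the number of encoded bits per row of pixels (how those counts are read from a compressed file is not shown), and it uses a zero-mean, unit-norm cross-correlation over a few vertical shifts. The function names, the shift range, and the 0.9 threshold are illustrative assumptions, not values taken from the patent.

```python
from math import sqrt


def normalize(profile):
    """Return a zero-mean, unit-norm copy of a bit profile."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    norm = sqrt(sum(v * v for v in centered))
    # A flat profile has zero norm; leave it as all zeros.
    return [v / norm for v in centered] if norm else centered


def best_correlation(query, candidate, max_shift=5):
    """Peak normalized cross-correlation over small vertical shifts."""
    q, c = normalize(query), normalize(candidate)
    best = -1.0
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for i, qv in enumerate(q):
            j = i + shift
            if 0 <= j < len(c):
                total += qv * c[j]
        best = max(best, total)
    return best


def coarse_candidates(query, database, threshold=0.9):
    """Return ids of stored profiles that correlate strongly with the query.

    `database` is assumed to map a document id to its stored bit profile.
    """
    return [
        doc_id
        for doc_id, profile in database.items()
        if best_correlation(query, profile) >= threshold
    ]
```

Searching over a small range of shifts is one way to tolerate a query page that is vertically offset from the stored copy; documents that survive this filter would then go on to the detailed matching stage.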
49 Citations
51 Claims
-
1. A method of determining if a query document matches one or more documents in a database, the method comprising:
-
generating a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
comparing the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to determine if the query document matches one or more of the first plurality of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
-
-
11. A method of determining if a query document matches one or more documents in a database, the method comprising:
-
identifying up endpoints and down endpoints in the query document, the up endpoints representing tops of features in the query document and the down endpoints representing bottoms of features in the query document;
generating a set of descriptors for the query document based on locations of the up endpoints and the down endpoints; and
comparing the set of descriptors for the query document against respective sets of descriptors associated with the one or more documents in the database to determine if the query document matches at least one of the one or more documents.
Dependent claims: 12, 13, 14, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 40, 41, 42, 43
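As an illustration of the first step of claim 11, the sketch below finds up and down endpoints in a decoded binary image. This is an assumption-laden simplification: the abstract says endpoint features are computable directly from compressed files such as CCITT Group 4, whereas this example works on an uncompressed bitmap (a list of rows of 0/1 values). Here an up endpoint is taken to be a black run with no black pixel directly above it, and a down endpoint a black run with no black pixel directly below it; the helper names are hypothetical.

```python
def runs(row):
    """Return (start, end) index pairs, end exclusive, for each run of 1s."""
    out, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            out.append((start, x))
            start = None
    if start is not None:
        out.append((start, len(row)))
    return out


def find_endpoints(image):
    """Return (ups, downs) as lists of (x, y) positions.

    An up endpoint marks the top of a feature: a black run with only white
    pixels above it. A down endpoint marks the bottom of a feature: a black
    run with only white pixels below it. Rows outside the image count as white.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    blank = [0] * width
    ups, downs = [], []
    for y, row in enumerate(image):
        above = image[y - 1] if y > 0 else blank
        below = image[y + 1] if y + 1 < height else blank
        for start, end in runs(row):
            center = (start + end - 1) // 2  # middle pixel of the run
            if not any(above[start:end]):
                ups.append((center, y))
            if not any(below[start:end]):
                downs.append((center, y))
    return ups, downs
```

For a small "U" shape, the two vertical strokes each produce an up endpoint at the top row and the joined bottom stroke produces a single down endpoint, which matches the "tops of features" and "bottoms of features" wording of the claim.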
-
-
15. A method of determining if a query document matches one or more documents in a database, the method comprising:
-
generating a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the query document;
comparing the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to identify one or more candidate documents;
identifying endpoint features in the query document;
generating a set of descriptors for the query document based on locations of the endpoint features; and
comparing the set of descriptors for the query document against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents.
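The two-stage flow of claim 15 can be sketched as a small driver that first filters by bit profile and then compares descriptors only for the surviving candidates. The record layout (dictionaries with "bit_profile" and "descriptors" keys), the scoring callables, and both thresholds are assumptions made for illustration; the claim itself does not prescribe them.

```python
def descriptor_overlap(query_descriptors, candidate_descriptors):
    """Fraction of query descriptors also present in the candidate.

    A deliberately simple stand-in for a detailed matching score.
    """
    if not query_descriptors:
        return 0.0
    candidate_set = set(candidate_descriptors)
    hits = sum(1 for d in query_descriptors if d in candidate_set)
    return hits / len(query_descriptors)


def match_document(query, database, coarse_score, detailed_score,
                   coarse_threshold=0.9, detailed_threshold=0.8):
    """Return ids of documents that pass both matching stages."""
    # Stage 1: coarse matching on bit profiles selects candidates.
    candidates = [
        doc_id
        for doc_id, doc in database.items()
        if coarse_score(query["bit_profile"], doc["bit_profile"]) >= coarse_threshold
    ]
    # Stage 2: detailed matching on descriptors, candidates only.
    matches = []
    for doc_id in candidates:
        score = detailed_score(query["descriptors"],
                               database[doc_id]["descriptors"])
        if score >= detailed_threshold:
            matches.append(doc_id)
    return matches
```

Passing the scoring functions in as arguments keeps the driver independent of how bit profiles are correlated or how descriptors are compared, so either stage can be swapped without touching the other.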
-
-
19. A method of generating a set of descriptors for identifying a document, the method comprising:
-
identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints.
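The steps of claim 19 can be sketched as follows, under several assumptions. Endpoints are given as (x, y) pairs; a "concentration" is taken to be any scanline holding at least `min_count` endpoints; each concentration of up endpoints is paired with the nearest concentration of down endpoints below it to form a text line; and the descriptors are the horizontal distances between consecutive endpoints on that line. The claim does not specify which endpoints are selected or how distances are measured, so these choices are hypothetical.

```python
from collections import Counter


def peak_rows(endpoints, min_count):
    """Scanlines (y values) holding at least min_count endpoints."""
    counts = Counter(y for _, y in endpoints)
    return sorted(y for y, n in counts.items() if n >= min_count)


def find_text_lines(ups, downs, min_count=3):
    """Pair each up-endpoint concentration with the next down one below it.

    Returns a list of (top_row, bottom_row) tuples, one per text line.
    """
    up_rows = peak_rows(ups, min_count)
    down_rows = peak_rows(downs, min_count)
    lines = []
    for top in up_rows:
        below = [y for y in down_rows if y > top]
        if below:
            lines.append((top, below[0]))
    return lines


def line_descriptors(ups, downs, text_line):
    """Horizontal distances between consecutive endpoints on a text line."""
    top, bottom = text_line
    xs = sorted(
        [x for x, y in ups if y == top] + [x for x, y in downs if y == bottom]
    )
    return [b - a for a, b in zip(xs, xs[1:])]
```

Isolated endpoints, such as the top of an ascender or the bottom of a descender, fall on scanlines with few endpoints and are ignored by the concentration threshold in this sketch.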
-
-
34. A method of generating information that can be used to identify a document, the method comprising:
-
generating a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
performing spectral analysis on the bit profile to determine global statistics of the document.
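One way to read the spectral analysis step of claim 34 is sketched below: take a discrete Fourier transform of the zero-mean bit profile and treat the strongest frequency as the text-line repetition rate, which gives an estimate of line spacing in rows. This is an assumption about how a global statistic might be derived; the function name is hypothetical, and a direct DFT is used only to keep the example dependency-free.

```python
import cmath


def dominant_period(profile):
    """Estimate the dominant period, in rows, of a bit profile.

    Returns None when the profile has no periodic component (for example,
    a constant profile).
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    best_k, best_magnitude = None, 0.0
    # Only frequencies up to n/2 are distinct for a real-valued signal.
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            v * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, v in enumerate(centered)
        )
        magnitude = abs(coefficient)
        if magnitude > best_magnitude:
            best_k, best_magnitude = k, magnitude
    return n / best_k if best_k else None
```

For a synthetic profile in which every eighth row starts a new text line, the strongest component sits at the frequency corresponding to a period of eight rows, so the estimate can be used to narrow the search to stored documents with similar line spacing.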
-
-
44. An article of manufacture including one or more computer-readable media that embody a program of instructions to configure a processing system to determine if a query document matches one or more documents in a database, wherein the program of instructions, when executed by one or more processors in the processing system, causes the one or more processors to:
-
generate a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
compare the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to determine if the query document matches one or more of the first plurality of documents.
Dependent claims: 45, 46
-
-
47. An article of manufacture including one or more computer-readable media that embody a program of instructions to configure a processing system to determine if a query document matches one or more documents in a database, wherein the program of instructions, when executed by one or more processors in the processing system, causes the one or more processors to:
-
identify up endpoints and down endpoints in the query document, the up endpoints representing tops of features in the query document and the down endpoints representing bottoms of features in the query document;
generate a set of descriptors for the query document based on locations of the up endpoints and the down endpoints; and
compare the set of descriptors for the query document against respective sets of descriptors associated with the one or more documents in the database to determine if the query document matches at least one of the one or more documents.
-
-
48. An article of manufacture including one or more computer-readable media that embody a program of instructions to configure a processing system to determine if a query document matches one or more documents in a database, wherein the program of instructions, when executed by one or more processors in the processing system, causes the one or more processors to:
-
generate a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the query document;
compare the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to identify one or more candidate documents;
identify endpoint features in the query document;
generate a set of descriptors for the query document based on locations of the endpoint features; and
compare the set of descriptors for the query document against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents.
-
-
49. A data processing system comprising:
-
a database of document images; and
a computer that includes a processing unit and a memory, the memory having stored therein a program of instructions to configure the computer to determine if a query document matches one or more documents in the database, wherein the program of instructions, when executed by the processing unit of the computer, causes the computer to:
generate a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
compare the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to determine if the query document matches one or more of the first plurality of documents.
-
-
50. A data processing system comprising:
-
a database of document images; and
a computer that includes a processing unit and a memory, the memory having stored therein a program of instructions to configure the computer to determine if a query document matches one or more documents in the database, wherein the program of instructions, when executed by the processing unit of the computer, causes the computer to:
identify up endpoints and down endpoints in the query document, the up endpoints representing tops of features in the query document and the down endpoints representing bottoms of features in the query document;
generate a set of descriptors for the query document based on locations of the up endpoints and the down endpoints; and
compare the set of descriptors for the query document against respective sets of descriptors associated with the one or more documents in the database to determine if the query document matches at least one of the one or more documents.
-
-
51. A data processing system comprising:
-
a database of document images; and
a computer that includes a processing unit and a memory, the memory having stored therein a program of instructions to configure the computer to determine if a query document matches one or more documents in the database, wherein the program of instructions, when executed by the processing unit of the computer, causes the computer to:
generate a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the query document;
compare the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to identify one or more candidate documents;
identify endpoint features in the query document;
generate a set of descriptors for the query document based on locations of the endpoint features; and
compare the set of descriptors for the query document against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents.
-
Specification