Compressed document matching
Abstract
An apparatus and method for determining if a query document matches one or more of a plurality of documents in a database. In a coarse matching stage, a compressed file or other query document is scanned to produce a bit profile. Global statistics such as line spacing and text height are calculated from the bit profile and used to narrow the field of documents to be searched in an image database. The bit profile is cross-correlated with bit profiles of documents in the search space to identify candidates for a detailed matching stage. If multiple candidates are generated in the coarse matching stage, a set of endpoint features is extracted from the query document for detailed matching in the detailed matching stage. Endpoint features contain sufficient information for various levels of processing, including page skew and orientation estimation. In addition, endpoint features are stable, symmetric and easily computable from commonly used compressed files including, but not limited to, CCITT Group 4 compressed files. Endpoint features extracted in the detailed matching stage are used to correctly identify a matching document in a high percentage of cases.
34 Claims
-
1. An apparatus for determining if a query document matches one or more documents in a database, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the query document, the up endpoints representing tops of features in the query document and the down endpoints representing bottoms of features in the query document;
means for generating a set of descriptors for the query document based on locations of the up endpoints and the down endpoints;
means for comparing the set of descriptors for the query document against respective sets of descriptors associated with the one or more documents in the database to determine if the query document matches at least one of the one or more documents;
wherein the means for generating a set of descriptors for the query document based on locations of the up endpoints and the down endpoints comprises means for identifying text lines in the query document based on concentrations of up endpoints and down endpoints along scanlines of the query document; and
means for generating the set of descriptors based on distances between selected up endpoints and selected down endpoints within the text lines in the query document; and
wherein the means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document comprises:
means for determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines; and
means for identifying respective pairs of scanlines that have a local maximum number of up endpoints and a local maximum number of down endpoints as text lines.
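As an illustrative sketch only (it is not taken from the specification), the text-line identification recited in claim 1 might look like the following. The function names, the (x, y) endpoint representation, the convention that y grows downward, and the rule of pairing each up-endpoint peak with the nearest down-endpoint peak below it are all assumptions made for the example.

```python
def count_per_scanline(endpoints, height):
    """Count how many endpoints fall on each scanline (row)."""
    counts = [0] * height
    for _x, y in endpoints:
        counts[y] += 1
    return counts

def local_maxima(counts):
    """Return scanlines whose count exceeds the previous row and is at least the next."""
    peaks = []
    for y, c in enumerate(counts):
        prev = counts[y - 1] if y > 0 else 0
        nxt = counts[y + 1] if y + 1 < len(counts) else 0
        if c > 0 and c > prev and c >= nxt:
            peaks.append(y)
    return peaks

def find_text_lines(up_endpoints, down_endpoints, height):
    """Pair each up-endpoint peak with the first down-endpoint peak below it.

    Assumption: y increases downward, so the top of a text line has the
    smaller y value.
    """
    up_peaks = local_maxima(count_per_scanline(up_endpoints, height))
    down_peaks = local_maxima(count_per_scanline(down_endpoints, height))
    lines = []
    for top in up_peaks:
        below = [y for y in down_peaks if y > top]
        if below:
            lines.append((top, below[0]))
    return lines
```

Each returned pair is a (top scanline, bottom scanline) candidate for one text line.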
-
-
2. An apparatus for determining if a query document matches one or more documents in a database, the apparatus comprising:
-
means for generating a bit profile of the query document based on the number of bits required to encode each of a plurality of rows of pixels in the query document;
means for comparing the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to identify one or more candidate documents;
means for identifying endpoint features in the query document;
means for generating a set of descriptors for the query document based on locations of the endpoint features;
means for comparing the set of descriptors for the query document against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents;
means for performing spectral analysis on the bit profile of the query document to determine global statistics of the query document; and
means for comparing the global statistics of the query document against global statistics associated with a second plurality of documents from the database to identify the first plurality of documents, the first plurality of documents being a subset of the second plurality of documents. - View Dependent Claims (3)
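The coarse comparison of bit profiles in claim 2 could be sketched as below. This is a hedged example, not the patented implementation: it assumes the bit profiles are already available as lists of per-row bit counts, uses a normalized cross-correlation over a small range of vertical shifts (the abstract mentions cross-correlation but the claim does not fix its form), and the names `max_cross_correlation` and `candidates`, the `threshold` value, and the `max_lag` value are hypothetical.

```python
import math

def normalize(profile):
    """Subtract the mean and scale to unit length."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm else centered

def max_cross_correlation(a, b, max_lag=5):
    """Best normalized correlation of two profiles over shifts of up to max_lag rows."""
    a, b = normalize(a), normalize(b)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, v in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += v * b[j]
        best = max(best, total)
    return best

def candidates(query, database, threshold=0.8, max_lag=5):
    """Return ids of stored profiles that correlate strongly with the query profile."""
    return [doc_id for doc_id, profile in database.items()
            if max_cross_correlation(query, profile, max_lag) >= threshold]
```

Documents returned by `candidates` would then go on to the descriptor comparison; the earlier narrowing by global statistics is not shown here.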
-
-
4. An apparatus for generating a set of descriptors for identifying a document, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints;
wherein the means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document comprises:
means for determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines; and
means for identifying respective pairs of scanlines that have a local maximum number of up endpoints and a local maximum number of down endpoints as text lines.
-
-
5. An apparatus for generating a set of descriptors for identifying a document, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints;
wherein the means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document comprises:
means for determining a dominant line spacing in the document;
means for determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines; and
means for identifying as text lines respective scanline pairs in which the constituent scanlines are separated by a distance less than the dominant line spacing and in which the constituent scanlines respectively have a local maximum number of up endpoints and a local maximum number of down endpoints. - View Dependent Claims (6)
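The spacing constraint in claim 5 can be illustrated with a short sketch. It assumes the local maxima of the up-endpoint and down-endpoint counts have already been found and are given as lists of scanline indices, and that y grows downward; the function name `pair_text_lines` is hypothetical.

```python
def pair_text_lines(up_peaks, down_peaks, line_spacing):
    """Pair up/down peak scanlines that lie closer together than the dominant line spacing."""
    lines = []
    for top in sorted(up_peaks):
        for bottom in sorted(down_peaks):
            # Accept the first down peak below `top` that is within one line spacing.
            if 0 < bottom - top < line_spacing:
                lines.append((top, bottom))
                break
    return lines
```

A peak with no partner inside the spacing limit is simply left unpaired, which is one plausible way to discard isolated marks that are not part of a text line.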
-
-
7. An apparatus for generating a set of descriptors for identifying a document, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document;
means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints; and
means for generating a respective endpoint profile for each of the scanlines, the endpoint profile including a count of up endpoints identified on the scanline and a count of down endpoints identified on the scanline, and wherein the means for identifying text lines based on concentrations of up endpoints and down endpoints along scanlines of the document comprises means for reducing all but local maximums of the counts of up endpoints and the counts of down endpoints in respective endpoint profiles.
-
-
8. An apparatus for generating a set of descriptors for identifying a document, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints;
wherein the means for identifying text lines based on concentrations of up endpoints and down endpoints along scanlines of the document comprises:
means for generating a count of up endpoints and a count of down endpoints for each of the scanlines;
means for identifying a first scanline within a locality of scanlines that has the highest count of up endpoints;
means for reducing the count of up endpoints associated with each scanline within the locality of scanlines except the first scanline;
means for identifying a second scanline within the locality of scanlines that has the highest count of down endpoints; and
means for reducing the count of down endpoints associated with each scanline within the locality of scanlines except the second scanline. - View Dependent Claims (9)
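The count reduction in claim 8 resembles non-maximum suppression. The sketch below is one possible reading, under two assumptions the claim does not state: a "locality" is a fixed, non-overlapping window of scanlines, and "reducing" a count means setting it to zero. The name `suppress_non_maxima` is hypothetical; the same function would be applied once to the up-endpoint counts and once to the down-endpoint counts.

```python
def suppress_non_maxima(counts, locality):
    """Keep only the highest count in each window of `locality` scanlines; zero the rest."""
    out = [0] * len(counts)
    for start in range(0, len(counts), locality):
        window = counts[start:start + locality]
        peak = max(window)
        if peak > 0:
            # On ties, the first scanline with the peak count is kept.
            out[start + window.index(peak)] = peak
    return out
```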
-
-
10. An apparatus for generating a set of descriptors for identifying a document, the apparatus comprising:
-
means for identifying up endpoints and down endpoints in the document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
means for identifying text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints;
wherein the means for generating a set of descriptors based on distances between selected up endpoints and selected down endpoints comprises means for defining an ascender zone and a descender zone for each of the text lines, the selected up endpoints being up endpoints in the ascender zone and the selected down endpoints being down endpoints in the descender zone. - View Dependent Claims (11, 12)
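A minimal sketch of the zone-based descriptor generation in claim 10 follows. Everything beyond the claim language is an assumption: the ascender zone is taken to be a band of `margin` scanlines above the top of the text line, the descender zone a band of `margin` scanlines below its bottom, and the descriptor a sequence of quantized horizontal distances between consecutive selected endpoints. The names `line_descriptor`, `margin`, and `quantum` are hypothetical.

```python
def line_descriptor(up_endpoints, down_endpoints, top, bottom, margin, quantum=4):
    """Build a descriptor for one text line from endpoints in its ascender/descender zones.

    Endpoints are (x, y) pairs with y increasing downward. `top` and `bottom`
    are the scanlines bounding the text line.
    """
    # Up endpoints above the top of the line are treated as ascenders.
    ascenders = [(x, "U") for x, y in up_endpoints if top - margin <= y < top]
    # Down endpoints below the bottom of the line are treated as descenders.
    descenders = [(x, "D") for x, y in down_endpoints if bottom < y <= bottom + margin]
    marks = sorted(ascenders + descenders)
    # Each entry records the endpoint kind and the quantized distance from the previous one.
    return [(kind, (x - prev_x) // quantum)
            for (prev_x, _), (x, kind) in zip(marks, marks[1:])]
```

Quantizing the distances is intended to make the descriptor tolerant of small scanning differences; the appropriate `quantum` would depend on image resolution.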
-
-
13. An apparatus for generating information that can be used to identify a document, the apparatus comprising:
-
means for generating a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
means for performing spectral analysis on the bit profile to determine global statistics of the document including means for generating an estimation of a dominant line spacing in the document, wherein the means for generating an estimation of a dominant line spacing comprises means for generating a power spectrum density from the bit profile and means for calculating the estimation of the dominant line spacing from a peak value in the power spectrum density.
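The line-spacing estimate in claim 13 can be sketched with a plain discrete Fourier transform. This is an illustration under stated assumptions, not the specification's method: the power spectrum density is computed as the squared DFT magnitude of the mean-removed bit profile, and the dominant spacing is taken as the profile length divided by the index of the largest non-DC peak. The function names are hypothetical, and the naive O(n^2) transform is used only to keep the example dependency-free.

```python
import cmath
import math

def power_spectrum(profile):
    """Squared DFT magnitude of the mean-removed profile, for frequencies 0..n/2."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def dominant_line_spacing(profile):
    """Estimate line spacing (in rows) from the strongest non-DC spectral peak."""
    psd = power_spectrum(profile)
    k = max(range(1, len(psd)), key=lambda i: psd[i])
    return len(profile) / k
```

For a profile that repeats every 8 rows over 64 rows, the peak falls at frequency index 8 and the estimate is 8 rows.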
-
-
14. An apparatus for generating information that can be used to identify a document, the apparatus comprising:
-
means for generating a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
means for performing spectral analysis on the bit profile to determine global statistics of the document, wherein the means for performing spectral analysis on the bit profile to determine global statistics comprises means for generating an estimation of a proportion of the document that is text, and further wherein the means for generating an estimation of a proportion of the document that is text comprises means for generating a power spectrum density from the bit profile and means for calculating the estimation of the proportion of the document based on an energy under a peak value in the power spectrum density.
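One way to read "energy under a peak value" in claim 14 is as the share of spectral energy concentrated around the dominant peak. The sketch below takes a precomputed power spectrum density (a list indexed by frequency, with index 0 as the DC term) and returns that share. The name `peak_energy_fraction` and the `half_width` parameter are hypothetical, and how this fraction maps to an actual proportion of text is not specified in the claim.

```python
def peak_energy_fraction(psd, half_width=1):
    """Fraction of non-DC spectral energy within `half_width` bins of the largest peak."""
    k = max(range(1, len(psd)), key=lambda i: psd[i])
    lo = max(1, k - half_width)
    hi = min(len(psd) - 1, k + half_width)
    total = sum(psd[1:])
    return sum(psd[lo:hi + 1]) / total if total else 0.0
```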
-
-
15. An apparatus for generating information that can be used to identify a document, the apparatus comprising:
-
means for generating a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
means for performing spectral analysis on the bit profile to determine global statistics of the document, wherein the means for performing spectral analysis on the bit profile to determine global statistics comprises means for generating an estimation of a location of text in the document, and wherein the means for generating an estimation of a location of text in the document comprises means for applying a bandpass filter to the bit profile to generate a text energy profile, and means for determining a centroid of the text energy profile to be the estimation of the location of text in the document. - View Dependent Claims (16)
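The text-location estimate in claim 15 might be sketched as follows. The claim does not say which bandpass filter is used, so this example substitutes a crude stand-in: the difference between a short and a long moving average, squared to give an energy value per row. The window sizes and all function names are assumptions made for illustration.

```python
def moving_average(values, window):
    """Centered moving average; windows are truncated at the ends of the profile."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def text_energy_profile(bit_profile, short_window=3, long_window=15):
    """Crude bandpass: short average minus long average, squared per row."""
    fast = moving_average(bit_profile, short_window)
    slow = moving_average(bit_profile, long_window)
    return [(f - s) ** 2 for f, s in zip(fast, slow)]

def centroid(energy):
    """Energy-weighted mean row index, or None if the profile carries no energy."""
    total = sum(energy)
    return sum(i * e for i, e in enumerate(energy)) / total if total else None
```

Rows where the bit count varies at roughly the text-line rate produce energy, while flat regions produce none, so the centroid lands near the middle of the text block.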
-
-
17. An apparatus for generating information that can be used to identify a document, the apparatus comprising:
-
means for generating a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in the document; and
means for performing spectral analysis on the bit profile to determine global statistics of the document, wherein the means for performing spectral analysis on the bit profile to determine global statistics comprises means for generating an estimation of text concentration in the document, the estimation of text concentration indicating a lengthwise measure of a proportion of the document that is text, and further wherein the means for generating an estimation of text concentration in the document comprises:
means for applying a bandpass filter to the bit profile to generate a text energy profile; and
means for determining the estimation of the text concentration based on a length of the text energy profile.
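The "length of the text energy profile" in claim 17 is not defined in the claim itself. One hedged interpretation is the vertical extent over which the energy stays above a small fraction of its maximum, expressed as a share of the page height. The sketch below takes a precomputed text energy profile as input; the name `text_concentration` and the `rel_threshold` parameter are hypothetical.

```python
def text_concentration(energy, rel_threshold=0.1):
    """Share of rows spanned by the region where energy is at least rel_threshold * max."""
    if not energy or max(energy) <= 0:
        return 0.0
    cutoff = rel_threshold * max(energy)
    rows = [i for i, e in enumerate(energy) if e >= cutoff]
    # Span from the first to the last row carrying significant text energy.
    return (rows[-1] - rows[0] + 1) / len(energy)
```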
-
-
18. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a query document, the up endpoints representing tops of features in the query document and the down endpoints representing bottoms of features in the query document;
generate a set of descriptors for the query document based on locations of the up endpoints and the down endpoints by identifying text lines in the query document based on concentrations of up endpoints and down endpoints along scanlines of the query document by determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines; and
identifying respective pairs of scanlines that have a local maximum number of up endpoints and a local maximum number of down endpoints as text lines; and
generating the set of descriptors based on distances between selected up endpoints and selected down endpoints within the text lines in the query document; and
compare the set of descriptors for the query document against respective sets of descriptors associated with the one or more documents in the database to determine if the query document matches at least one of the one or more documents.
-
-
19. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
generate a bit profile of a query document based on the number of bits required to encode each of a plurality of rows of pixels in the query document;
compare the bit profile of the query document against bit profiles associated with a first plurality of documents from the database to identify one or more candidate documents;
identify endpoint features in the query document;
generate a set of descriptors for the query document based on locations of the endpoint features;
compare the set of descriptors for the query document against respective sets of descriptors for the one or more candidate documents to determine if the query document matches at least one of the one or more candidate documents;
perform spectral analysis on the bit profile of the query document to determine global statistics of the query document; and
compare the global statistics of the query document against global statistics associated with a second plurality of documents from the database to identify the first plurality of documents, the first plurality of documents being a subset of the second plurality of documents. - View Dependent Claims (20)
-
-
21. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identify text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document by determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines, and identifying respective pairs of scanlines that have a local maximum number of up endpoints and a local maximum number of down endpoints as text lines; and
generate a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints.
-
-
22. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identify text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document by determining a dominant line spacing in the document, determining the number of up endpoints and the number of down endpoints that lie on each of the scanlines, and identifying as text lines respective scanline pairs in which the constituent scanlines are separated by a distance less than the dominant line spacing and in which the constituent scanlines respectively have a local maximum number of up endpoints and a local maximum number of down endpoints; and
generate a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints. - View Dependent Claims (23)
-
-
24. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identify text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document;
generate a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints; and
generate a respective endpoint profile for each of the scanlines, the endpoint profile including a count of up endpoints identified on the scanline and a count of down endpoints identified on the scanline, and wherein identifying text lines based on concentrations of up endpoints and down endpoints along scanlines of the document comprises reducing all but local maximums of the counts of up endpoints and the counts of down endpoints in respective endpoint profiles.
-
-
25. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identify text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document by generating a count of up endpoints and a count of down endpoints for each of the scanlines, identifying a first scanline within a locality of scanlines that has the highest count of up endpoints, reducing the count of up endpoints associated with each scanline within the locality of scanlines except the first scanline, identifying a second scanline within the locality of scanlines that has the highest count of down endpoints, and reducing the count of down endpoints associated with each scanline within the locality of scanlines except the second scanline; and
generate a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints. - View Dependent Claims (26)
-
-
27. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
identify up endpoints and down endpoints in a document, the up endpoints representing tops of features in the document and the down endpoints representing bottoms of features in the document;
identify text lines in the document based on concentrations of up endpoints and down endpoints along scanlines of the document; and
generate a set of descriptors based on distances between selected up endpoints and selected down endpoints in the concentrations of up endpoints and down endpoints;
wherein the set of descriptors are generated by defining an ascender zone and a descender zone for each of the text lines, the selected up endpoints being up endpoints in the ascender zone and the selected down endpoints being down endpoints in the descender zone. - View Dependent Claims (28, 29)
-
-
30. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
generate a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in a document; and
perform spectral analysis on the bit profile to determine global statistics of the document;
wherein performing spectral analysis on the bit profile to determine global statistics comprises generating an estimation of a dominant line spacing in the document; and
wherein generating an estimation of a dominant line spacing comprises generating a power spectrum density from the bit profile and calculating the estimation of the dominant line spacing from a peak value in the power spectrum density.
-
-
31. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
generate a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in a document; and
perform spectral analysis on the bit profile to determine global statistics of the document by generating an estimation of a proportion of the document that is text, wherein generating an estimation of a proportion of the document that is text comprises generating a power spectrum density from the bit profile and calculating the estimation of the proportion of the document based on an energy under a peak value in the power spectrum density.
-
-
32. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
generate a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in a document; and
perform spectral analysis on the bit profile to determine global statistics of the document by generating an estimation of a location of text in the document, wherein generating an estimation of a location of text in the document comprises applying a bandpass filter to the bit profile to generate a text energy profile; and
determining a centroid of the text energy profile to be the estimation of the location of text in the document. - View Dependent Claims (33)
-
-
34. An article of manufacture having one or more recordable media with executable instructions stored thereon which, when executed by a system, cause the system to:
-
generate a bit profile based on the number of bits required to encode each of a plurality of rows of pixels in a document; and
perform spectral analysis on the bit profile to determine global statistics of the document by generating an estimation of text concentration in the document, the estimation of text concentration indicating a lengthwise measure of a proportion of the document that is text, wherein generating an estimation of text concentration in the document is performed by:
applying a bandpass filter to the bit profile to generate a text energy profile; and
determining the estimation of the text concentration based on a length of the text energy profile.
-
Specification