Automatic categorization of documents using document signatures
Abstract
A method of quickly and automatically comparing a new document to a large number of previously seen documents and identifying the document type. First, a plurality of document type distributions is provided; each document type distribution describes layout characteristics of an independent document type and may include a plurality of data points. Each document type distribution includes data derived from at least one basis document signature, which may include data defining pixels of a low-resolution image of the independent basis document resolved to between 1 and 75 dots per inch, or may include document segmentation data derived from the independent basis document. Next, a new electronic document is provided, and a new document signature is created from it. Distances between the new document signature and each of the plurality of document type distributions are then calculated using an algorithm based on a Bayesian framework for a Gaussian distribution. The distances calculated may be Euclidean distances or Mahalanobis distances. Calculating the distances may also include weighting the value given to each of a plurality of data points in the document signatures based on how useful each data point is in distinguishing between the document signatures. Finally, at least one candidate document type for the new electronic document is selected from among the independent document types described by the plurality of document type distributions. The selection may include selecting a preselected fixed number of the independent document types, or selecting the independent document types described by those document type distributions whose calculated distances are within a preselected threshold distance of the smallest calculated distance.
In addition, the invention provides for a program storage medium readable by computer, tangibly embodying a program of instructions executable by the computer to perform the method steps described above.
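The abstract does not give a formula for the "Bayesian framework for a Gaussian distribution", so the following is only a minimal sketch of one common reading: each document type is modeled as an independent (diagonal) Gaussian over the signature's data points, and the distance is the negative log posterior. The function names, the `min_var` floor, and the `prior` parameter are illustrative assumptions, not taken from the patent.

```python
import math

def fit_distribution(signatures, min_var=1e-6):
    """Summarize basis signatures as per-data-point mean and variance.

    min_var is an assumed floor that keeps zero-variance points usable.
    """
    n = len(signatures)
    dims = len(signatures[0])
    means = [sum(s[i] for s in signatures) / n for i in range(dims)]
    variances = [
        max(sum((s[i] - means[i]) ** 2 for s in signatures) / n, min_var)
        for i in range(dims)
    ]
    return means, variances

def gaussian_distance(signature, distribution, prior=1.0):
    """Negative log posterior under a diagonal Gaussian (assumed model)."""
    means, variances = distribution
    nll = 0.0
    for x, m, v in zip(signature, means, variances):
        nll += 0.5 * math.log(2 * math.pi * v) + (x - m) ** 2 / (2 * v)
    return nll - math.log(prior)

def classify(signature, distributions):
    """Return the closest document type and all calculated distances."""
    distances = {
        name: gaussian_distance(signature, dist)
        for name, dist in distributions.items()
    }
    return min(distances, key=distances.get), distances

# Hypothetical three-point signatures for two document types.
distributions = {
    "invoice": fit_distribution([[0.9, 0.1, 0.5], [0.8, 0.2, 0.5]]),
    "letter": fit_distribution([[0.1, 0.9, 0.4], [0.2, 0.8, 0.6]]),
}
best_type, all_distances = classify([0.85, 0.15, 0.5], distributions)
```

A smaller value means a closer match, so the result can be fed directly into either selection rule described in the abstract.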
22 Claims
1. A method of automatically classifying electronic documents using document signatures, the method comprising:
(a) providing a plurality of document type distributions, each document type distribution describing layout characteristics of an independent document type and including data derived from at least one basis document signature from an independent basis document of the independent document type;
(b) providing a new electronic document;
(c) creating a new document signature describing layout characteristics of the new electronic document;
(d) calculating distances between the new document signature and each of the plurality of document type distributions; and
(e) selecting, based on the distances calculated in step (d), at least one candidate document type for the new electronic document from among the independent document types described by the plurality of document type distributions.
the at least one basis document signature in step (a) includes data defining pixels of a low-resolution image of the independent basis document; and
the new document signature in step (c) includes data defining pixels of a low-resolution image of the new electronic document.
4. The method of claim 3, in which the data derived from at least one basis document signature in step (a) includes a multiple representative statistic value across each of the at least one basis document signatures of each of the pixels of the low-resolution image.
5. The method of claim 3, in which:
the low-resolution image of the independent basis document is resolved to between 1 and 75 dots per inch; and
the low-resolution image of the new electronic document is resolved to between 1 and 75 dots per inch.
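As an illustration of a pixel-based signature in the 1 to 75 dots-per-inch range, the sketch below block-averages a grayscale page into a coarse grid and flattens it into a list. Block averaging, the nested-list image format, and the name `low_res_signature` are assumptions made for this example; the claims do not specify how the low-resolution image is produced.

```python
def low_res_signature(pixels, source_dpi, target_dpi):
    """Downsample a grayscale image (list of rows) by block averaging.

    The result is a flat list of block means, read row by row.
    """
    if not 1 <= target_dpi <= 75:
        raise ValueError("target_dpi must be between 1 and 75")
    # Number of source pixels per side that collapse into one output pixel.
    block = max(1, round(source_dpi / target_dpi))
    rows, cols = len(pixels), len(pixels[0])
    signature = []
    for top in range(0, rows, block):
        for left in range(0, cols, block):
            cells = [
                pixels[r][c]
                for r in range(top, min(top + block, rows))
                for c in range(left, min(left + block, cols))
            ]
            signature.append(sum(cells) / len(cells))
    return signature

# Hypothetical 4x4 page scanned at 150 dpi, reduced to 75 dpi (2x2 blocks).
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
signature = low_res_signature(page, source_dpi=150, target_dpi=75)
```

The same function would be applied to both basis documents and new documents so that their signatures have matching dimensions.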
6. The method of claim 1, in which:
the at least one basis document signature in step (a) includes document segmentation data derived from the independent basis document of the independent document type; and
the new document signature in step (c) includes document segmentation data derived from the new electronic document.
7. The method of claim 6, in which the data derived from at least one basis document signature in step (a) includes a multiple representative statistic across each of the at least one basis document signature of document segmentation data.
8. The method of claim 2, in which selecting the at least one candidate document type in step (e) includes selecting a preselected fixed number of independent document types described by the preselected fixed number of the plurality of document type distributions calculated in step (d) to have the preselected fixed number of shortest distances.
9. The method of claim 2, in which selecting the at least one candidate document type in step (e) includes selecting the independent document types described by those of the plurality of document type distributions having distances calculated in step (d) within a preselected threshold distance of a minimal distance calculated in step (d).
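The two selection rules in claims 8 and 9 can be sketched as follows, assuming the distances from step (d) are held in a dictionary keyed by document type. The function names and the example values are hypothetical.

```python
def select_top_k(distances, k):
    """Pick the k document types with the shortest distances (claim 8 style)."""
    return sorted(distances, key=distances.get)[:k]

def select_within_threshold(distances, threshold):
    """Pick every type within `threshold` of the minimal distance (claim 9 style)."""
    best = min(distances.values())
    candidates = [name for name, d in distances.items() if d - best <= threshold]
    return sorted(candidates, key=distances.get)

# Hypothetical distances for three document types.
distances = {"invoice": 1.2, "receipt": 1.5, "letter": 4.0}
top_two = select_top_k(distances, 2)
near_best = select_within_threshold(distances, 0.5)
```

The fixed-count rule always returns the same number of candidates, while the threshold rule returns more candidates only when several types are nearly tied with the best match.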
10. The method of claim 2, in which the distances calculated in step (d) are Euclidean distances.
11. The method of claim 2, in which the distances calculated in step (d) are Mahalanobis distances.
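To make the difference between the two distance options in claims 10 and 11 concrete, here is a small sketch that compares a signature against a distribution's per-point means. The Mahalanobis version assumes a diagonal covariance (independent data points with strictly positive variances); a full Mahalanobis distance would use the inverse of the complete covariance matrix. All names are illustrative.

```python
import math

def euclidean_distance(signature, means):
    """Straight-line distance to the distribution's mean signature."""
    return math.sqrt(sum((x - m) ** 2 for x, m in zip(signature, means)))

def mahalanobis_distance(signature, means, variances):
    """Mahalanobis distance under an assumed diagonal covariance.

    Each squared difference is scaled by that data point's variance, so
    points that vary widely within a document type count for less.
    """
    return math.sqrt(
        sum((x - m) ** 2 / v for x, m, v in zip(signature, means, variances))
    )

d_euclid = euclidean_distance([3.0, 4.0], [0.0, 0.0])
d_mahal = mahalanobis_distance([3.0, 4.0], [0.0, 0.0], [9.0, 16.0])
```

With unit variances the two functions return the same value, which is one way to sanity-check an implementation.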
12. The method of claim 2, in which:
each of the plurality of document type distributions provided in step (a) includes a plurality of data points; and
calculating distances in step (d) includes weighting the value given each of the plurality of data points based on a calculated reliability of each of the plurality of data points.
13. The method of claim 11, in which the calculated reliability of each of the plurality of data points includes the ratio of:
a spread of each of the plurality of data points within each of the plurality of document type distributions, respectively, to a spread of each of the plurality of data points across all of the plurality of document type distributions, respectively.
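The spread ratio described in this claim might be computed along the following lines. This is a sketch under several assumptions not stated in the claims: spread is measured as a population standard deviation, the within-type spreads are averaged across types, and the ratio is turned into a weight with `1 / (1 + ratio)` so that data points that are stable within a type but vary across types count for more.

```python
import math
import statistics

def spread_ratios(signatures_by_type):
    """Per data point: within-type spread divided by across-type spread."""
    groups = list(signatures_by_type.values())
    dims = len(groups[0][0])
    ratios = []
    for i in range(dims):
        within = statistics.fmean(
            [statistics.pstdev([s[i] for s in sigs]) for sigs in groups]
        )
        across = statistics.pstdev([s[i] for sigs in groups for s in sigs])
        # A point that never varies cannot discriminate; treat it as ratio 1.
        ratios.append(within / across if across > 0 else 1.0)
    return ratios

def reliability_weights(ratios):
    """Assumed mapping from ratio to weight: a lower ratio gives a higher weight."""
    return [1.0 / (1.0 + r) for r in ratios]

def weighted_distance(signature, means, weights):
    """Euclidean distance with each squared difference scaled by its weight."""
    return math.sqrt(
        sum(w * (x - m) ** 2 for x, m, w in zip(signature, means, weights))
    )

# Hypothetical basis signatures: point 0 separates the types, point 1 does not.
basis = {
    "A": [[0.0, 0.5], [0.0, 0.6]],
    "B": [[1.0, 0.5], [1.0, 0.6]],
}
ratios = spread_ratios(basis)
weights = reliability_weights(ratios)
```

In this example the first data point receives the larger weight because it is constant within each type and differs between types.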
14. A program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform method steps for automatically classifying electronic documents using document signatures, the method steps comprising:
(a) providing a plurality of document type distributions, each document type distribution describing layout characteristics of an independent document type and including data derived from at least one basis document signature from an independent basis document of the independent document type;
(b) providing a new electronic document;
(c) creating a new document signature describing layout characteristics of the new electronic document;
(d) calculating distances between the new document signature and each of the plurality of document type distributions using an algorithm based on a Bayesian framework for a Gaussian distribution; and
(e) selecting, based on the distances calculated in method step (d), at least one candidate document type for the new electronic document from among the independent document types described by the plurality of document type distributions.
15. The program storage medium of claim 14, in which:
the at least one basis document signature in method step (a) includes data defining pixels of a low-resolution image of the independent basis document; and
the new document signature in method step (c) includes data defining pixels of a low-resolution image of the new electronic document.
16. The program storage medium of claim 15, in which:
the low-resolution image of the independent basis document is resolved to between 1 and 75 dots per inch; and
the low-resolution image of the new electronic document is resolved to between 1 and 75 dots per inch.
17. The program storage medium of claim 14, in which:
the at least one basis document signature in method step (a) includes document segmentation data derived from the independent basis document; and
the new document signature in method step (c) includes document segmentation data derived from the new electronic document.
18. The program storage medium of claim 14, in which selecting the at least one candidate document type in method step (e) includes selecting a preselected fixed number of independent document types described by the preselected fixed number of the plurality of document type distributions calculated in method step (d) to have the preselected fixed number of shortest distances.
19. The program storage medium of claim 14, in which selecting the at least one candidate document type in method step (e) includes selecting the independent document types described by those of the plurality of document type distributions having distances calculated in method step (d) within a preselected threshold distance of a minimal distance calculated in method step (d).
20. The program storage medium of claim 14, in which the distances calculated in method step (d) are Euclidean distances.
21. The program storage medium of claim 14, in which:
each of the plurality of document type distributions provided in method step (a) includes a plurality of data points; and
calculating distances in method step (d) includes weighting the value given each of the plurality of data points based on a calculated reliability of each of the plurality of data points.
22. The program storage medium of claim 21, in which the calculated reliability of each of the plurality of data points includes the ratio of:
a spread of each of the plurality of data points within each of the plurality of document type distributions, respectively, to a spread of each of the plurality of data points across all of the plurality of document type distributions, respectively.
Specification