Method and System of Pre-Analysis and Automated Classification of Documents
Abstract
Automatic classification of different types of documents is disclosed. An image of a form or document is captured. The document is assigned to one or more type definitions by identifying one or more objects within the image of the document. A matching model is selected via identification of the document image. In the case of multiple identifications, an in-depth analysis of the document type is performed, either automatically or manually. An automatic classifier may be trained with document samples of each of a plurality of document classes or document types where the types are known in advance, or a system of classes may be formed automatically without a priori information about the types of the samples. An automatic classifier determines possible features and calculates a range of feature values and possible other feature parameters for each type or class of document. A decision tree, based on rules specified by a user, may be used for classifying documents. Processing, such as optical character recognition (OCR), may be used in the classification process.
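The abstract's idea of calculating a range of feature values for each document type can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `learn_feature_ranges` and `candidate_types`, the numeric feature `line_count`, and the rule that a type matches only when every learned range contains the observed value. It is not the classifier described in the specification.

```python
from collections import defaultdict


def learn_feature_ranges(samples):
    """Learn a (min, max) range per feature for each document type.

    samples: iterable of (doc_type, {feature_name: numeric_value}).
    The data layout is hypothetical, chosen only for this sketch.
    """
    ranges = defaultdict(dict)
    for doc_type, features in samples:
        for name, value in features.items():
            lo, hi = ranges[doc_type].get(name, (value, value))
            ranges[doc_type][name] = (min(lo, value), max(hi, value))
    return dict(ranges)


def candidate_types(ranges, features):
    """Return every type whose learned ranges contain all observed values."""
    matches = []
    for doc_type, type_ranges in ranges.items():
        if all(
            name in features and lo <= features[name] <= hi
            for name, (lo, hi) in type_ranges.items()
        ):
            matches.append(doc_type)
    return matches


samples = [
    ("invoice", {"line_count": 10}),
    ("invoice", {"line_count": 14}),
    ("letter", {"line_count": 30}),
    ("letter", {"line_count": 40}),
    ("receipt", {"line_count": 11}),
]
ranges = learn_feature_ranges(samples)
```

When `candidate_types` returns more than one type, that corresponds to the "multiple identifications" case in which the abstract calls for further analysis.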
20 Claims
1. A method for a computer system to perform an analysis of document type, the method comprising:
providing to the computer system a document image;
detecting at least one feature in the document image;
assigning a text to the at least one feature in the document image;
matching the document image to one or more nodes of at least one decision tree based at least in part upon the text assigned to the at least one feature in the document image; and
associating the document image with one or more document types based at least in part upon the matching the document image to the one or more nodes of the at least one decision tree.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
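The steps of claim 1 can be sketched roughly as follows. This is an illustrative reading only, not the claimed implementation: the `Node` class, its `keyword` and `doc_type` fields, and the substring matching rule are all assumptions, and the list of strings passed to `classify` stands in for the text that an OCR step would assign to detected features.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical decision-tree node keyed on text found in a feature."""
    keyword: str                      # lowercase text the node looks for
    doc_type: Optional[str] = None    # type associated with this node, if any
    children: List["Node"] = field(default_factory=list)


def match(node, feature_texts):
    """Return the document types of every matched node under `node`."""
    if not any(node.keyword in text.lower() for text in feature_texts):
        return []                     # this node, and its subtree, do not match
    found = [node.doc_type] if node.doc_type else []
    for child in node.children:
        found.extend(match(child, feature_texts))
    return found


def classify(feature_texts, roots):
    """Associate a document image (via its feature texts) with document types."""
    types = []
    for root in roots:
        types.extend(match(root, feature_texts))
    return types


tree = Node("invoice", "invoice", [Node("vat", "vat_invoice")])
result = classify(["INVOICE No. 17", "VAT 20%"], [tree])
```

A child node is only consulted once its parent has matched, so deeper nodes refine the type chosen higher in the tree.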
11. One or more computer readable media configured to bear a device detectable implementation of a method, the method comprising:
identifying one or more document features in a document image;
correlating one or more of the one or more document features with one or more document classes;
forming a decision tree based at least in part upon the identified one or more document features in the document, wherein forming the decision tree includes creating a node corresponding to each of the one or more document classes; and
associating with one or more of the document classes the document image based in part upon the decision tree and the document image.
12. The one or more computer readable media of claim 11, wherein the identified one or more document features are document features that were previously determined to be one or more of the most reliable document features capable of distinguishing documents, and wherein the one or more most reliable document features were previously identified by analysis of a plurality of training documents each having at least one feature different from at least one of the other training documents.
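One way to picture claims 11 and 12 together is a sketch that first selects distinguishing features from training documents and then creates one node per document class. The reliability rule used here (a feature counts as reliable only if it appears in exactly one class) is an assumption made for the example, as are the function names and the flat, one-level tree stored as a dictionary; the claims do not prescribe any of these.

```python
def reliable_features(training):
    """Map each class-exclusive feature to the one class it appears in.

    training: list of (doc_class, set_of_feature_names); hypothetical layout.
    """
    seen = {}
    for doc_class, feats in training:
        for feat in feats:
            seen.setdefault(feat, set()).add(doc_class)
    return {
        feat: next(iter(classes))
        for feat, classes in seen.items()
        if len(classes) == 1          # assumed meaning of "most reliable"
    }


def form_tree(training):
    """Create one node per document class, holding its reliable features."""
    tree = {doc_class: set() for doc_class, _ in training}
    for feat, doc_class in reliable_features(training).items():
        tree[doc_class].add(feat)
    return tree


def associate(tree, feats):
    """Return the classes whose node shares at least one feature with feats."""
    return sorted(c for c, node_feats in tree.items() if node_feats & feats)


training = [
    ("invoice", {"logo", "total"}),
    ("letter", {"logo", "salutation"}),
]
tree = form_tree(training)
```

In this toy data the feature `logo` occurs in both classes, so it is dropped as non-distinguishing and plays no part in the association step.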
19. A system for classifying an unclassified document, the system comprising:
a decision tree trainer that is configured to receive a plurality of training documents, identify one or more features in the training documents, identify one or more document classes based on the one or more features in the training documents, and create a node or sub-node in the decision tree for each of the one or more document classes; and
a document classifier that is configured to classify an unclassified document based in part on one or more features identified in an image associated with the unclassified document and in part on one or more nodes of the decision tree, in part on one or more sub-nodes of the decision tree, or in part on a combination of one or more nodes of the decision tree and one or more sub-nodes of the decision tree.
Dependent claims: 20
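The two components of claim 19 might be arranged as in the sketch below. It simplifies heavily and rests on stated assumptions: class labels are supplied with the training documents rather than discovered from features, a label such as `invoice/acme` is taken to mean sub-node `acme` under node `invoice`, and the classifier simply picks the node (then the sub-node) sharing the most features with the document. The class and method names are hypothetical.

```python
class DecisionTreeTrainer:
    """Builds a node per class and a sub-node per sub-class (assumed scheme)."""

    def train(self, documents):
        # documents: list of (label, set_of_features)
        tree = {}
        for label, feats in documents:
            node_name, _, sub_name = label.partition("/")
            node = tree.setdefault(node_name, {"features": set(), "subs": {}})
            node["features"].update(feats)
            if sub_name:
                node["subs"].setdefault(sub_name, set()).update(feats)
        return tree


class DocumentClassifier:
    """Classifies a feature set using the nodes and sub-nodes of the tree."""

    def __init__(self, tree):
        self.tree = tree

    def classify(self, feats):
        def overlap(known):
            return len(known & feats)

        best = max(self.tree, key=lambda n: overlap(self.tree[n]["features"]))
        if overlap(self.tree[best]["features"]) == 0:
            return None               # no node shares any feature
        subs = self.tree[best]["subs"]
        if subs:
            sub = max(subs, key=lambda s: overlap(subs[s]))
            if overlap(subs[sub]) > 0:
                return f"{best}/{sub}"
        return best


documents = [
    ("invoice/acme", {"total", "acme logo"}),
    ("invoice/globex", {"total", "globex logo"}),
    ("letter", {"salutation"}),
]
classifier = DocumentClassifier(DecisionTreeTrainer().train(documents))
```

Keeping the trainer and the classifier as separate objects mirrors the claim's split between building the tree and applying it to an unclassified document.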
Specification