Document identification by characteristics matching
First Claim
1. A computer-implemented process for classifying documents comprising the steps of:
preliminarily creating a knowledge base of documents each characterized by a hierarchy of objects that are defined by parameters indicating physical and relational characteristics, the hierarchy being organized from a lowest object level to one or more successively higher object levels and storing said knowledge base in a computer;
scanning a document to form binary light and dark pixels and inputting into said computer data representing the pixels;
performing, in said computer, the following steps:
segmenting the document into primary areas of significance based on the pixels;
calculating parameters that define the segmented primary areas;
comparing the parameters of each segmented primary area with the parameters of the lowest level objects in the hierarchy of objects that characterize each document in the knowledge base;
assigning to each segmented primary area weights of evidence relative to the lowest level objects based on the comparison;
generating a weighted hypothesis of a label for each of the segmented areas based on the weights of evidence relative to the lowest level objects;
grouping the segmented primary areas into areas of significance more relevant than the primary areas;
calculating parameters that define the more relevant areas;
comparing the parameters of each more relevant area with the parameters of the second lowest level objects in the hierarchy;
assigning to each more relevant area weights of evidence relative to the second lowest level objects based on the comparison and reevaluating the weights of evidence assigned to the segmented primary areas;
generating a weighted hypothesis of a label for each of the more relevant areas and revising the weighted hypothesis of the label for each of the segmented primary areas based on the weights of evidence of the second lowest level objects and the lowest level objects; and
classifying the document based on the labels and the weights of evidence developed by the preceding step.
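The steps recited in the claim can be illustrated with a minimal two-level sketch. It is not the patented implementation: the names `KBObject`, `Area`, `match_weight`, `reevaluate_parts`, and `classify`, the tolerance-based comparison, the fixed `boost` value, and all sample parameters are assumptions chosen only to show how weights of evidence at a lower level might be revised once a higher-level area has been labeled.

```python
from dataclasses import dataclass, field

@dataclass
class KBObject:
    """Hypothetical knowledge-base object: a label and (expected, tolerance) per parameter."""
    label: str
    params: dict
    children: tuple = ()  # labels of the lower-level objects this object is built from

@dataclass
class Area:
    params: dict
    parts: list = field(default_factory=list)     # lower-level areas grouped into this one
    evidence: dict = field(default_factory=dict)  # label -> weight of evidence

    def hypothesis(self):
        """Return the (label, weight) pair with the highest weight of evidence."""
        return max(self.evidence.items(), key=lambda kv: kv[1])

def match_weight(params, obj):
    """Assumed comparison: fraction of the object's parameters matched within tolerance."""
    hits = sum(1 for name, (expected, tol) in obj.params.items()
               if name in params and abs(params[name] - expected) <= tol)
    return hits / len(obj.params)

def assign_evidence(area, objects):
    """Assign weights of evidence to an area relative to every object at one level."""
    area.evidence = {o.label: match_weight(area.params, o) for o in objects}

def reevaluate_parts(area, objects, boost=0.25):
    """Raise the evidence of part labels that the parent's best hypothesis expects."""
    best_label, _ = area.hypothesis()
    expected = next(o.children for o in objects if o.label == best_label)
    for part in area.parts:
        for label in expected:
            if label in part.evidence:
                part.evidence[label] = min(1.0, part.evidence[label] + boost)

def classify(areas, doc_types):
    """Score each document type by the mean best evidence for the labels it expects."""
    scores = {}
    for doc, labels in doc_types.items():
        best = [max((a.evidence.get(l, 0.0) for a in areas), default=0.0) for l in labels]
        scores[doc] = sum(best) / len(labels)
    return max(scores.items(), key=lambda kv: kv[1])

# Illustrative knowledge base (all values are made up for the example).
level1 = [KBObject("text_line", {"height": (12, 4), "density": (0.3, 0.1)}),
          KBObject("rule", {"height": (2, 1), "density": (0.9, 0.1)})]
level2 = [KBObject("address_block", {"n_parts": (3, 1), "width": (200, 50)}, ("text_line",)),
          KBObject("table", {"n_parts": (8, 3), "width": (500, 100)}, ("text_line", "rule"))]
doc_types = {"letter": ["address_block"], "invoice": ["table", "address_block"]}

# Primary areas; the third is too tall to match "text_line" fully.
primaries = [Area({"height": 11, "density": 0.3}),
             Area({"height": 13, "density": 0.3}),
             Area({"height": 18, "density": 0.32})]
for p in primaries:
    assign_evidence(p, level1)

# Group the primary areas into one more relevant area and label it.
block = Area({"n_parts": 3, "width": 210}, parts=primaries)
assign_evidence(block, level2)
reevaluate_parts(block, level2)   # revises the weak hypothesis on the third line

result = classify([block], doc_types)
```

In this sketch the third primary area starts with only partial evidence for `text_line`; once its parent is labeled `address_block`, that evidence is raised, which mirrors the reevaluation step in the claim.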
Abstract
This invention relates to an automatic identification method for scanned documents in an electronic document capture and storage system. The invention recognizes global document features and compares them to a knowledge base of known document types. The system first segments the digitized image of a document into physical and logical areas of significance and attempts to label these areas by determining the type of information they contain, without using OCR techniques. The system then attempts to match the segmented areas to objects described in the knowledge base. The system labels the areas successfully matched, then selects the most probable document type based on the areas found within the document. Using computer learning methods, the system can improve its knowledge of the documents it is supposed to recognize by dynamically modifying the characteristics of its knowledge base, thus sharpening its decision-making capability.
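The abstract does not say how the knowledge base is modified during learning. As one hedged illustration, the sketch below assumes each knowledge-base parameter is stored as an (expected value, tolerance) pair and is nudged toward the value observed in a document whose type has been confirmed. The function name `learn`, the `rate` parameter, and the update rule are assumptions, not the method described in the specification.

```python
def learn(entry, observed, rate=0.5):
    """Return an updated copy of a knowledge-base entry (hypothetical update rule).

    entry:    parameter name -> (expected value, tolerance)
    observed: parameter name -> value measured on a confirmed document
    """
    updated = {}
    for name, (expected, tol) in entry.items():
        if name not in observed:
            updated[name] = (expected, tol)  # nothing observed, keep as is
            continue
        value = observed[name]
        # Move the expected value a fraction of the way toward the observation.
        new_expected = expected + rate * (value - expected)
        # Widen the tolerance if the observation still falls outside it.
        new_tol = max(tol, abs(value - new_expected))
        updated[name] = (new_expected, new_tol)
    return updated

# Illustrative values only.
entry = {"height": (12.0, 4.0), "density": (0.3, 0.1)}
revised = learn(entry, {"height": 22.0})
```

A smaller `rate` makes the knowledge base change more slowly, which trades adaptability against stability when individual scans are noisy.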
343 Citations
5 Claims
Specification