Document identification by characteristics matching

US 5,159,667 A
Filed: 05/31/1989
Issued: 10/27/1992
Est. Priority Date: 05/31/1989
Status: Expired due to Term

Abstract

This invention relates to an automatic identification method for scanned documents in an electronic document capture and storage system. The invention recognizes global document features and compares them to a knowledge base of known document types. The system first segments the digitized image of a document into physical and logical areas of significance and attempts to label these areas by determining the type of information they contain, without using OCR techniques. The system then attempts to match the segmented areas to objects described in the knowledge base. The system labels the areas successfully matched, then selects the most probable document type based on the areas found within the document. Using computer learning methods, the system can improve its knowledge of the documents it is supposed to recognize by dynamically modifying the characteristics of its knowledge base, thus sharpening its decision-making capability.
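
The matching step described in the abstract can be illustrated with a small sketch. It is not the patented method: the parameter set (`width`, `height`, `density`), the similarity measure in `evidence`, the `threshold` value, and the sample knowledge base are all assumptions made for illustration. Each document type in the knowledge base is scored by how well its objects are matched by some segmented area, and the highest-scoring type is selected.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Physical characteristics of an area; the field names are hypothetical."""
    width: float    # fraction of page width
    height: float   # fraction of page height
    density: float  # fraction of dark pixels inside the area


def evidence(area: Params, obj: Params) -> float:
    """Assumed similarity in [0, 1]: one minus the mean absolute difference."""
    diffs = (
        abs(area.width - obj.width),
        abs(area.height - obj.height),
        abs(area.density - obj.density),
    )
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def classify(areas, knowledge_base, threshold=0.8):
    """Score each document type by how well its objects are found on the page.

    Assumes every document type lists at least one object.
    """
    scores = {}
    for doc_type, objects in knowledge_base.items():
        total = 0.0
        for obj in objects.values():
            # Best evidence any segmented area gives for this known object.
            best = max((evidence(a, obj) for a in areas), default=0.0)
            if best >= threshold:  # count only successful matches
                total += best
        scores[doc_type] = total / len(objects)
    return max(scores, key=scores.get), scores


# Hypothetical knowledge base: document type -> labeled objects.
knowledge_base = {
    "invoice": {"logo": Params(0.20, 0.10, 0.50), "table": Params(0.90, 0.40, 0.20)},
    "letter": {"address": Params(0.30, 0.10, 0.30), "body": Params(0.80, 0.60, 0.25)},
}
# Two areas segmented from a scanned page (made-up values).
areas = [Params(0.22, 0.10, 0.48), Params(0.88, 0.42, 0.20)]

best_type, scores = classify(areas, knowledge_base)
print(best_type)  # invoice
```

Both document types find matches above the threshold here, so the decision rests on the relative weights rather than on a single yes/no match, which is the role the abstract gives to selecting "the most probable document type".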

343 Citations

5 Claims

1. A computer-implemented process for classifying documents comprising the steps of:
- preliminarily creating a knowledge base of documents each characterized by a hierarchy of objects that are defined by parameters indicating physical and relational characteristics, the hierarchy being organized from a lowest object level to one or more successively higher object levels and storing said knowledge base in a computer;
- scanning a document to form binary light and dark pixels and inputting into said computer data representing the pixels;
- performing, in said computer, the following steps;
- segmenting the document into primary areas of significance based on the pixels;
- calculating parameters that define the segmented primary areas;
- comparing the parameters of each segmented primary area with the parameters of the lowest level objects in the hierarchy of objects that characterize each document in the knowledge base;
- assigning to each segmented primary area weights of evidence relative to the lowest level objects based on the comparison;
- generating a weighted hypothesis of a label for each of the segmented areas based on the weights of evidence relative to the lowest level objects;
- grouping the segmented primary areas into areas of significance more relevant than the primary areas;
- calculating parameters that define the more relevant areas;
- comparing the parameters of each more relevant area with the parameters of the second lowest level objects in the hierarchy;
- assigning to each more relevant area weights of evidence relative to the second lowest level objects based on the comparison and reevaluating the weights of evidence assigned to the segmented primary areas;
- generating a weighted hypothesis of a label for each of the more relevant areas and revising the weighted hypothesis of the label for each of the segmented primary areas based on the weights of evidence of the second lowest level objects and the lowest level objects; and
- classifying the document based on the labels and the weights of evidence developed by the preceding step.
2. The process of claim 1, additionally comprising the steps of:
- grouping the more relevant areas into still more relevant areas of significance;
- calculating parameters that define the still more relevant areas;
- comparing the parameters of each still more relevant area with the parameters of the third lowest level objects in the hierarchy;
- assigning to each still more relevant area weights of evidence relative to the third lowest level objects based on the comparison and reevaluating the weights of evidence assigned to the more relevant areas and the segmented primary areas;
- generating a weighted hypothesis of a label for each of the still more relevant areas and revising the weighted hypothesis of the label for each of the more relevant areas and the segmented primary areas based on the weights of evidence of the third lowest level objects, the second lowest level objects and the lowest level objects; and
- classifying the document based on the labels and the weights of evidence developed by the preceding step.
3. The process of claim 2, in which the recited steps are performed one or more additional times with respect to successively more relevant areas of significance and higher level objects, thereby increasing the evidence that supports the document classification.
4. The process of claim 1, additionally comprising the step of forming from the pixels darkness density histograms, the segmenting step segmenting the document based on the histograms.
5. The process of claim 4, additionally comprising the steps of:
- comparing the density histograms of one side of the individual objects with the density histograms of the other side of the individual objects;
- determining the vertical shift between the histograms of the two sides of the individual objects;
- averaging the determined vertical shift of the individual objects; and
- correcting the skew of an entire edge of the document based on the averaged vertical shift.
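
The histogram steps in claims 4 and 5 can be sketched roughly as follows. This is only one possible reading, under several assumptions not stated in the claims: the histograms are per-row counts of dark pixels, an "object" is a horizontal band of non-empty rows, the two "sides" are the left and right halves of the page, and the vertical shift is found by a simple cross-correlation. The function names and the `margin` and `max_shift` parameters are hypothetical.

```python
def row_histogram(pixels):
    """Darkness density per row: number of dark (1) pixels in each row."""
    return [sum(row) for row in pixels]


def segment_rows(hist, min_dark=1):
    """Split the page into (start, end) row bands with density >= min_dark."""
    bands, start = [], None
    for y, count in enumerate(hist):
        if count >= min_dark and start is None:
            start = y                      # a band begins
        elif count < min_dark and start is not None:
            bands.append((start, y))       # a band ends (end is exclusive)
            start = None
    if start is not None:
        bands.append((start, len(hist)))
    return bands


def vertical_shift(left_hist, right_hist, max_shift=3):
    """Shift s, in rows, that best aligns right_hist[y + s] with left_hist[y]."""
    def overlap(s):
        return sum(
            left_hist[y] * right_hist[y + s]
            for y in range(len(left_hist))
            if 0 <= y + s < len(right_hist)
        )
    return max(range(-max_shift, max_shift + 1), key=overlap)


def average_skew_shift(pixels, bands, margin=1, max_shift=3):
    """Average the left/right vertical shift over all segmented bands."""
    half = len(pixels[0]) // 2
    shifts = []
    for start, end in bands:
        rows = pixels[max(0, start - margin):min(len(pixels), end + margin)]
        left = [sum(r[:half]) for r in rows]    # histogram of the left side
        right = [sum(r[half:]) for r in rows]   # histogram of the right side
        shifts.append(vertical_shift(left, right, max_shift))
    return sum(shifts) / len(shifts) if shifts else 0.0


# Synthetic 10 x 8 page: two "text lines", each one row lower on the right.
pixels = [[0] * 8 for _ in range(10)]
for top in (2, 6):
    for y in (top, top + 1):
        pixels[y][0:4] = [1] * 4        # left half of the line
        pixels[y + 1][4:8] = [1] * 4    # right half, shifted down one row

bands = segment_rows(row_histogram(pixels))
shift = average_skew_shift(pixels, bands)
print(bands)  # [(2, 5), (6, 9)]
print(shift)  # 1.0
```

A positive average shift means the right side sits lower than the left. The correction itself (the last step of claim 5) is left out of the sketch; one option would be to shift or rotate the image by the amount implied by the averaged value.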

Current Assignee: Kofax Incorporated
Original Assignee: Daniel G. Borrey, Roland G. Borrey
Inventors: Borrey, Roland G.; Borrey, Daniel G.
Primary Examiner(s): Hecker, Stuart N.
Assistant Examiner(s): Rudolph, Rebecca L.
Application Number: US07/359,839
Time in Patent Office: 1,245 Days
Field of Search: 364/200 MS File, 364/900 MS File, 364/513, 364/518, 382/46
US Class Current: 715/205
CPC Class Codes:
- G06F 16/93   Document management systems
- G06V 30/40   Document-oriented image-bas...
- Y10S 706/90   Fuzzy logic
