Processing an electronic document for information extraction

  • US 7,672,940 B2
  • Filed: 04/29/2004
  • Issued: 03/02/2010
  • Est. Priority Date: 12/04/2003
  • Status: Active Grant
First Claim

1. A method of identifying features to be used when extracting information from a document, comprising:

    obtaining a set of training documents, the set comprising a plurality of training documents;

    identifying potential classifying keywords indicative of an informational element associated with the set of training documents;

    selecting a number of the potential classifying keywords based on a frequency of the potential classifying keywords in the plurality of training documents;

    identifying potential features of each of the selected classifying keywords in each of the plurality of training documents based on text of the selected classifying keywords, relation of the selected classifying keywords to other words identified from text in each training document, relation of the selected classifying keywords to graphic lines in each training document, and a layout of each training document;

    selecting a number of the potential features of the selected classifying keywords that are indicative of the informational element being associated with a document, wherein selecting comprises:

      assigning a score to each of the potential features; and

      selecting a number of the potential features based on the score assigned to the potential features; and

    utilizing the selected features to develop a classifier using a processor of a computing device, wherein the classifier is developed based on a combination of the selected features that is weighted based on the score assigned to each of the selected features, the classifier being configured to be utilized to extract information from the document.
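
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not say how frequency is counted, how scores are assigned, or how the weighted combination is thresholded. The sketch assumes document frequency for keyword selection, a score equal to the difference between a feature's rate in positive and negative training samples, and a normalized weighted sum with a 0.5 cutoff. All function names, feature names, and sample data are hypothetical.

```python
from collections import Counter


def select_keywords(docs, candidates, top_n):
    """Keep the top_n candidates ranked by how many documents contain them."""
    doc_freq = Counter()
    for words in docs:
        present = set(words)
        doc_freq.update(kw for kw in set(candidates) if kw in present)
    # Sort by descending document frequency, then alphabetically for stable ties.
    ranked = sorted(doc_freq, key=lambda kw: (-doc_freq[kw], kw))
    return ranked[:top_n]


def score_features(samples):
    """Score each feature (assumed rule): rate in positives minus rate in negatives.

    samples is a list of (set_of_feature_names, label) pairs.
    """
    positives = [feats for feats, label in samples if label]
    negatives = [feats for feats, label in samples if not label]
    names = set().union(*(feats for feats, _ in samples))
    scores = {}
    for name in names:
        pos_rate = sum(name in f for f in positives) / max(len(positives), 1)
        neg_rate = sum(name in f for f in negatives) / max(len(negatives), 1)
        scores[name] = pos_rate - neg_rate
    return scores


def select_features(scores, top_n):
    """Keep up to top_n features with the highest positive scores."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: scores[name] for name in ranked[:top_n] if scores[name] > 0}


def build_classifier(weights, threshold=0.5):
    """Return a classifier that combines selected features weighted by score."""
    total = sum(weights.values())

    def classify(features):
        if total == 0:
            return False
        hit = sum(w for name, w in weights.items() if name in features)
        return hit / total >= threshold

    return classify


# Hypothetical training data: tokenized documents and labeled feature sets.
docs = [
    ["invoice", "number", "total"],
    ["invoice", "date", "total"],
    ["total", "amount", "due"],
]
keywords = select_keywords(docs, ["invoice", "total", "due", "date"], top_n=2)

samples = [
    ({"left_of_number", "above_line"}, True),
    ({"left_of_number", "above_line"}, True),
    ({"left_of_number"}, True),
    ({"above_line", "in_footer"}, False),
    ({"in_footer"}, False),
]
weights = select_features(score_features(samples), top_n=2)
classify = build_classifier(weights)
```

Here each feature name stands in for one of the claim's feature sources (keyword text, relation to neighboring words, relation to graphic lines, page layout). A real system would derive these from the document image or text layer rather than from hand-written sets.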
