Processing an electronic document for information extraction

US 7,672,940 B2
Filed: 04/29/2004
Issued: 03/02/2010
Est. Priority Date: 12/04/2003
Status: Active Grant

Abstract

The present invention relates generally to automatically processing electronic documents. In one aspect, features and/or properties of words are identified from a set of training documents to aid in extracting information from documents to be processed. The features and/or properties relate to text of the words, position of the words and the relationship to other words. A classifier is developed to express these features and/or properties. During information extraction, documents are processed and analyzed based on the classifier and information is extracted based on correspondence of the documents and the features/properties expressed by the classifier.
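
The kinds of word features named in the abstract (text, position, relationship to other words) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Word` type, the normalized page coordinates, and the four feature names do not come from the patent.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Word:
    """Hypothetical word record: text plus position on the page."""
    text: str
    x: float  # left edge as a fraction of page width (0..1), assumed convention
    y: float  # top edge as a fraction of page height (0..1), assumed convention


def word_features(word: Word, others: list[Word]) -> dict[str, float]:
    """Describe one word by its text, its position, and its nearest neighbour."""
    neighbours = [o for o in others if o is not word]
    nearest = min(
        (hypot(o.x - word.x, o.y - word.y) for o in neighbours),
        default=float("inf"),  # no other words on the page
    )
    return {
        "ends_with_colon": float(word.text.endswith(":")),   # text of the word
        "is_capitalized": float(word.text[:1].isupper()),    # text of the word
        "in_top_third": float(word.y < 1 / 3),               # position
        "nearest_word_distance": nearest,                    # relation to other words
    }


page = [Word("To:", 0.1, 0.05), Word("Alice", 0.2, 0.05), Word("Regards", 0.1, 0.9)]
features = word_features(page[0], page)
```

A real system would presumably use many more features, including the relation to graphic lines recited in the claims; the point here is only the shape of the data.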

10 Claims

1. A method of identifying features to be used when extracting information from a document, comprising:
- obtaining a set of training documents, the set comprising a plurality of training documents;
- identifying potential classifying keywords indicative of an informational element associated with the set of training documents;
- selecting a number of the potential classifying keywords based on a frequency of the potential classifying keywords in the plurality of training documents;
- identifying potential features of each of the selected classifying keywords in each of the plurality of training documents based on text of the selected classifying keywords, relation of the selected classifying keywords to other words identified from text in each training document, relation of the selected classifying keywords to graphic lines in each training document, and a layout of each training document;
- selecting a number of the potential features of the selected classifying keywords that are indicative of the informational element being associated with a document, wherein selecting comprises:
  - assigning a score to each of the potential features; and
  - selecting a number of the potential features based on the score assigned to the potential features; and
- utilizing the selected features to develop a classifier using a processor of a computing device, wherein the classifier is developed based on a combination of the selected features that is weighted based on the score assigned to each of the selected features, the classifier being configured to be utilized to extract information from the document.
2. The method of claim 1 and further comprising developing the classifier to express the selected features, the classifier including information related to a location of words in a document, a relationship of words to other words in a document, a relationship of graphic lines to other words in a document, and text of words in a document.
3. The method of claim 1 wherein the potential features relate to at least one of text of a name and a distance from a first word to a second word.
4. The method of claim 1 wherein the informational element is at least one of a document type and an informational field.
5. The method of claim 1 wherein the informational element relates to at least one of a sender, a recipient and a subject.
6. The method of claim 1 wherein selecting further comprises using a boosting algorithm to select the best features.
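
The training steps of claim 1, together with the boosting variant of claim 6, can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not name a specific boosting algorithm, so discrete AdaBoost over binary features is swapped in as one plausible choice. The names `select_keywords`, `boost`, and `classify`, the binary feature matrix `X`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def select_keywords(docs: list[set[str]], top_n: int) -> list[str]:
    """Keep the top_n candidate keywords that occur in the most training documents."""
    doc_freq = Counter(word for doc in docs for word in doc)
    # Sort by document frequency; break ties alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:top_n]


def boost(X: list[list[int]], y: list[int], rounds: int) -> list[tuple[int, int, float]]:
    """Discrete AdaBoost with one-feature weak learners.

    X[i][j] is 1 if (assumed) feature j fires in training document i.
    y[i] is +1 if the informational element is present, else -1.
    Returns (feature index, polarity, score) triples; the score doubles
    as the feature's weight in the final combination.
    """
    n = len(X)
    w = [1.0 / n] * n  # start with uniform document weights
    chosen = []
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                preds = [polarity * (1 if row[j] else -1) for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, polarity, preds)
        err, j, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # score assigned to this feature
        chosen.append((j, polarity, alpha))
        # Re-weight: misclassified documents count more in the next round.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return chosen


def classify(chosen: list[tuple[int, int, float]], x: list[int]) -> int:
    """Score-weighted combination of the selected features."""
    total = sum(alpha * pol * (1 if x[j] else -1) for j, pol, alpha in chosen)
    return 1 if total >= 0 else -1


# Toy data, invented for illustration only.
docs = [{"to", "from", "invoice"}, {"to", "subject"}, {"to", "from"}, {"memo"}]
keywords = select_keywords(docs, top_n=2)

X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
y = [1, 1, 1, -1, -1, -1]
chosen = boost(X, y, rounds=2)
```

With this toy data the first round picks feature 0 and the second picks feature 2 with negative polarity, so `classify(chosen, [1, 1, 0])` returns `1`. Each feature's `alpha` plays the role of the score that both ranks the feature and weights it in the classifier.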

7. A method of processing a document, the method comprising:
- identifying keywords in the document indicative of an informational property of the document;
- assigning a score to each of the keywords in the document based on a location of each of the keywords, a relation of each of the keywords to other words identified from text in the document, a relation between graphic lines and each of the keywords, and text of each keyword;
- assigning a combined score to the document based on the score assigned to each of the keywords in the document, wherein assigning the combined score comprises assigning a combined score to the document for each of a plurality of types of document; and
- using a processor of a computing device, classifying the document as being one type of document selected from the plurality of types of document based on the combined score, wherein classifying the document comprises comparing the combined score to a threshold value.
8. The method of claim 7 wherein scores are assigned to words in the document that are indicative of the words being associated with a particular field.
9. The method of claim 7 wherein the informational property is one of a purchase order number, a sender, and a subject.
10. The method of claim 7 wherein the information property is one of a recipient of the document, and wherein the method further comprises routing the document to the recipient.
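
The scoring and thresholding flow of claim 7 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the per-type weight tables, the feature names, the choice of a linear score, and the rule of taking the highest-scoring type before applying the threshold are all assumptions.

```python
from typing import Optional


def keyword_score(weights: dict[str, float], features: dict[str, float]) -> float:
    """Score one keyword as a weighted sum of its (assumed) feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def classify_document(
    keyword_features: list[dict[str, float]],
    type_models: dict[str, dict[str, float]],
    threshold: float,
) -> tuple[Optional[str], dict[str, float]]:
    """Compute one combined score per document type, then apply a threshold.

    keyword_features holds one feature dict per keyword found in the document.
    type_models maps each document type to its hypothetical feature weights.
    Returns the best type (or None if no type reaches the threshold)
    together with all combined scores.
    """
    combined = {
        doc_type: sum(keyword_score(weights, f) for f in keyword_features)
        for doc_type, weights in type_models.items()
    }
    best_type = max(combined, key=combined.get)
    label = best_type if combined[best_type] >= threshold else None
    return label, combined


# Hypothetical weights and features, invented for illustration.
type_models = {
    "fax": {"in_top_third": 1.0, "ends_with_colon": 0.5},
    "invoice": {"near_graphic_line": 1.5, "in_top_third": -0.5},
}
keyword_features = [
    {"in_top_third": 1.0, "ends_with_colon": 1.0},
    {"in_top_third": 1.0, "near_graphic_line": 1.0},
]
label, combined = classify_document(keyword_features, type_models, threshold=2.0)
```

Here the "fax" model scores 2.5 and the "invoice" model scores 0.5, so the document is labelled "fax"; raising the threshold above 2.5 would leave it unclassified.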

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Rinker, James; Law, Hiu Chung; Viola, Paul
Primary Examiner(s): Mofiz, Apu M
Assistant Examiner(s): Nguyen, Cindy

Application Number: US10/835,215
Publication Number: US 20050125402A1
Time in Patent Office: 2,133 Days
Field of Search: 707/6, 707/2, 707/3, 358/1.15
US Class Current: 1/1
CPC Class Codes:
- G06F 16/93   Document management systems
- Y10S 707/99931   Database or file accessing
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99936   Pattern matching access
