Machine learning of document templates for data extraction

US 7,561,734 B1
Filed: 10/23/2006
Issued: 07/14/2009
Est. Priority Date: 03/02/2002
Status: Expired due to Fees

6 Assignments

0 Petitions

Abstract

The present system can perform machine learning of prototypical descriptions of data elements for extraction from machine-readable documents. Document templates are created from sets of training documents that can be used to extract data from form documents, such as: fill-in forms used for taxes; flex-form documents having many variants, such as bills of lading or insurance notifications; and some context-form documents having a description or graphic indicator in proximity to a data element. In response to training documents, the system performs an inductive reasoning process to generalize a document template so that the location of data elements can be predicted for the training examples. The automatically generated document template can then be used to extract data elements from a wide variety of form documents.
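
The generalization step described in the abstract can be illustrated with a small sketch. This is not the patented procedure, only one plausible reading of it: each training document contributes the observed position of a data element, and the template's position ranges are widened until they cover every example. The names `Range`, `FieldTemplate`, and `learn_template` are hypothetical, and the use of position ranges and a generalization counter is an assumption borrowed from the wording of claims 5 and 9.

```python
from dataclasses import dataclass


@dataclass
class Range:
    lo: float
    hi: float

    def contains(self, value: float) -> bool:
        return self.lo <= value <= self.hi

    def widen(self, value: float) -> "Range":
        # Smallest range that covers both the old range and the new value.
        return Range(min(self.lo, value), max(self.hi, value))


@dataclass
class FieldTemplate:
    x: Range                  # range of horizontal positions
    y: Range                  # range of vertical positions
    generalizations: int = 0  # how many times the ranges had to be widened


def learn_template(examples):
    """Generalize one data element's position over training examples.

    `examples` is a list of (x, y) positions, one per training document.
    (Hypothetical input format, chosen for illustration.)
    """
    first_x, first_y = examples[0]
    template = FieldTemplate(Range(first_x, first_x), Range(first_y, first_y))
    for x, y in examples[1:]:
        if template.x.contains(x) and template.y.contains(y):
            continue  # the template already predicts this example
        template.x = template.x.widen(x)
        template.y = template.y.widen(y)
        template.generalizations += 1
    return template


template = learn_template([(100, 40), (104, 42), (98, 41), (101, 41)])
print(template)
```

After training, every example position falls inside the learned ranges, which is the sense in which the template "predicts" the location of the data element for the training set.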

104 Citations

18 Claims

1. A method in a computer system for learning at least one attribute of a data element within a document, comprising:
- receiving from a user by the computer system a boundary of a data element within a document; and
  
  inferring by the computer system at least one attribute of the data element bounded by the boundary, wherein the at least one attribute of the data element is inferred from the boundary of the data element;
  
  wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes; and
  
  wherein each of the one or more contextual attributes comprises:
  
  a total number of words in a context; and
  
  one or more context words, each context word having one or more associated measurements.
- - 2. The method of claim 1, wherein the boundary comprises at least one of one or more bounding boxes, one or more bounding polygons, and one or more circles.
  - 3. The method of claim 1, wherein the at least one attribute of the data element includes the placement of the data element within a document.
  - 4. The method of claim 1, wherein the at least one attribute includes one or more physical attributes.
  - 5. The method of claim 4, wherein each of the one or more physical attributes includes one or more from the group consisting of:
    - a range of vertical positions;
      
      a range of horizontal positions;
      
      a range of widths;
      
      a range of heights;
      
      a maximal height; and
      
      a maximal width.
  - 6. The method of claim 1, wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes.
  - 7. The method of claim 6, wherein each of the one or more lexical attributes includes one or more from the group consisting of:
    - a line level description;
      
      a word level description; and
      
      a character level description.
  - 8. The method of claim 1, wherein each of the associated measurements of each of the contextual attributes includes one or more from the group consisting of:
    - a pixel distance measurement;
      
      a word distance measurement; and
      
      a utility measurement.
  - 9. The method of claim 6, wherein each of the one or more control attributes includes one or more from the group consisting of:
    - a word type;
      
      a data element identifier;
      
      a generalization counter; and
      
      an index of a line, word, or character.
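
As an illustration of claims 1 and 8, the following sketch infers a contextual attribute from a user-supplied bounding box. It is a minimal example under stated assumptions, not the method of the patent: the `Word` record, the `infer_context` function, the center-point containment test, and the `radius` cutoff that decides which words count as context are all hypothetical. The utility measurement named in claim 8 is not modeled because the claims do not define it.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def inside(word, box):
    """True if the word's center lies within box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    cx, cy = word.center
    return left <= cx <= right and top <= cy <= bottom


def infer_context(words, box, radius):
    """Infer a contextual attribute from a boundary.

    `words` are recognized words in reading order. Words outside the box but
    within `radius` pixels of its center are treated as context words
    (an assumption made for this sketch).
    """
    element = [i for i, wd in enumerate(words) if inside(wd, box)]
    if not element:
        raise ValueError("boundary does not enclose any word")
    left, top, right, bottom = box
    bx, by = (left + right) / 2, (top + bottom) / 2
    context_words = []
    for i, wd in enumerate(words):
        if i in element:
            continue
        cx, cy = wd.center
        pixel_distance = hypot(cx - bx, cy - by)
        if pixel_distance <= radius:
            context_words.append({
                "word": wd.text,
                "pixel_distance": pixel_distance,
                # Number of reading-order positions to the nearest element word.
                "word_distance": min(abs(i - j) for j in element),
            })
    return {"total_words": len(context_words), "context_words": context_words}


words = [
    Word("Invoice", 10, 10, 60, 12),
    Word("No:", 80, 10, 30, 12),
    Word("12345", 120, 10, 50, 12),
    Word("Date:", 300, 10, 40, 12),
]
ctx = infer_context(words, box=(115, 5, 175, 25), radius=120)
print(ctx["total_words"], [c["word"] for c in ctx["context_words"]])
```

Here the box drawn around "12345" yields a context of two words, "Invoice" and "No:", each carrying a pixel distance and a word distance; "Date:" falls outside the chosen radius.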

10. A system for learning at least one attribute of a data element within a document, comprising:
- means for receiving a boundary of a data element within a document; and
  
  means for inferring at least one attribute of the data element bounded by the boundary, wherein the at least one attribute of the data element is inferred from the boundary of the data element;
  
  wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes; and
  
  wherein each of the one or more contextual attributes comprises:
  
  a total number of words in a context; and
  
  one or more context words, each context word having one or more associated measurements.
- - 11. The system of claim 10, wherein the boundary comprises at least one of one or more bounding boxes, one or more bounding polygons, and one or more circles.
  - 12. The system of claim 10, wherein the at least one attribute of the data element includes the placement of the data element within a document.
  - 13. The system of claim 10, wherein the at least one attribute includes one or more physical attributes.
  - 14. The system of claim 13, wherein each of the one or more physical attributes includes one or more from the group consisting of:
    - a range of vertical positions;
      
      a range of horizontal positions;
      
      a range of widths;
      
      a range of heights;
      
      a maximal height; and
      
      a maximal width.
  - 15. The system of claim 10, wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes.
  - 16. The system of claim 15, wherein each of the one or more lexical attributes includes one or more from the group consisting of:
    - a line level description;
      
      a word level description; and
      
      a character level description.
  - 17. The system of claim 10, wherein each of the associated measurements of each of the contextual attributes includes one or more from the group consisting of:
    - a pixel distance measurement;
      
      a word distance measurement; and
      
      a utility measurement.
  - 18. The system of claim 15, wherein each of the one or more control attributes includes one or more from the group consisting of:
    - a word type;
      
      a data element identifier;
      
      a generalization counter; and
      
      an index of a line, word, or character.
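
Claims 2 and 11 allow the boundary to be made of bounding boxes, bounding polygons, and circles. The sketch below shows one way such a boundary could be tested against points on a page (for example, word centers). The class names and the `bounded` helper are hypothetical; the polygon test is the standard ray-casting algorithm, chosen here for illustration and not taken from the patent.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass
class Circle:
    cx: float
    cy: float
    r: float

    def contains(self, x, y):
        return hypot(x - self.cx, y - self.cy) <= self.r


@dataclass
class Polygon:
    vertices: list  # [(x, y), ...] in order around the polygon

    def contains(self, x, y):
        # Ray casting: count edge crossings of a horizontal ray going right.
        inside = False
        n = len(self.vertices)
        for i in range(n):
            x1, y1 = self.vertices[i]
            x2, y2 = self.vertices[(i + 1) % n]
            if (y1 > y) != (y2 > y):  # edge straddles the ray's height
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside


def bounded(points, shapes):
    """Return the points enclosed by at least one shape of the boundary."""
    return [p for p in points if any(s.contains(*p) for s in shapes)]


boundary = [Polygon([(0, 0), (10, 0), (0, 10)]), Circle(20, 20, 3)]
selected = bounded([(2, 2), (8, 8), (21, 21), (30, 30)], boundary)
print(selected)
```

Treating the boundary as a union of shapes matches the claim wording "at least one of one or more" of each shape type, though other combinations are conceivable.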

Current Assignee: Leidos, Inc. (Leidos Holdings, Inc.)
Original Assignee: Science Applications International Corporation
Inventors: Wnek, Janusz
Primary Examiner(s): SHERALI, ISHRAT I
Application Number: US11/584,536
Time in Patent Office: 995 Days
Field of Search: 382/159, 382/173, 382/199, 382/177-176, 382/181, 715/209-255, 704/1-10
US Class Current: 382/159
CPC Class Codes:

G06F 40/174   Form filling; Merging

G06V 30/10   Character recognition

G06V 30/1444   Selective acquisition, loca...

G06V 30/412   Layout analysis of document...

G06V 30/416   Extracting the logical stru...
