Machine learning of document templates for data extraction
First Claim
1. A method in a computer system for learning at least one attribute of a data element within a document, comprising:
receiving from a user by the computer system a boundary of a data element within a document; and
inferring by the computer system at least one attribute of the data element bounded by the boundary, wherein the at least one attribute of the data element is inferred from the boundary of the data element;
wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes; and
wherein each of the one or more contextual attributes comprises:
a total number of words in a context; and
one or more context words, each context word having one or more associated measurements.
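The contextual attribute recited in the claim above (a total word count for a context plus context words that each carry measurements) can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the patented method: the names `Word`, `ContextWord`, `ContextualAttribute`, and `infer_contextual_attribute`, the fixed `margin` that defines the context, and the choice of x/y offsets as the "associated measurements" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    """A word on the page with its bounding box (assumed to come from OCR)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass
class ContextWord:
    """A context word with hypothetical measurements: offsets from the boundary."""
    text: str
    dx: float  # word center x minus the boundary's left edge
    dy: float  # word center y minus the boundary's top edge


@dataclass
class ContextualAttribute:
    """Total number of words in the context plus the context words themselves."""
    total_words: int
    context_words: List[ContextWord] = field(default_factory=list)


def infer_contextual_attribute(
    words: List[Word],
    boundary: Tuple[float, float, float, float],  # (left, top, right, bottom)
    margin: float = 100.0,  # assumed size of the context region around the boundary
) -> ContextualAttribute:
    """Collect words near, but not inside, the user-drawn boundary."""
    left, top, right, bottom = boundary
    context: List[ContextWord] = []
    for word in words:
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        inside = left <= cx <= right and top <= cy <= bottom
        near = (left - margin <= cx <= right + margin
                and top - margin <= cy <= bottom + margin)
        if near and not inside:
            context.append(ContextWord(word.text, cx - left, cy - top))
    return ContextualAttribute(total_words=len(context), context_words=context)


# Usage: the user draws a box around "12345"; nearby labels become context words.
page = [
    Word("Invoice", 10, 10, 60, 12),
    Word("No:", 75, 10, 25, 12),
    Word("12345", 110, 10, 45, 12),
    Word("Total", 10, 400, 40, 12),
]
attr = infer_contextual_attribute(page, boundary=(105, 8, 160, 24))
print(attr.total_words, [cw.text for cw in attr.context_words])
```

In this sketch the word inside the boundary is the data element itself, so it is excluded from the context, and a word far down the page falls outside the assumed margin.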
Abstract
The present system can perform machine learning of prototypical descriptions of data elements for extraction from machine-readable documents. Document templates that can be used to extract data from form documents are created from sets of training documents. Such form documents include fill-in forms used for taxes; flex-form documents having many variants, such as bills of lading or insurance notifications; and some context-form documents having a description or graphic indicator in proximity to a data element. Given the training documents, the system performs an inductive reasoning process to generalize a document template so that the location of data elements can be predicted for the training examples. The automatically generated document template can then be used to extract data elements from a wide variety of form documents.
18 Claims
1. A method in a computer system for learning at least one attribute of a data element within a document, comprising:
receiving from a user by the computer system a boundary of a data element within a document; and
inferring by the computer system at least one attribute of the data element bounded by the boundary, wherein the at least one attribute of the data element is inferred from the boundary of the data element;
wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes; and
wherein each of the one or more contextual attributes comprises:
a total number of words in a context; and
one or more context words, each context word having one or more associated measurements.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for learning at least one attribute of a data element within a document, comprising:
means for receiving a boundary of a data element within a document; and
means for inferring at least one attribute of the data element bounded by the boundary, wherein the at least one attribute of the data element is inferred from the boundary of the data element;
wherein the at least one attribute includes at least one of one or more lexical attributes, one or more contextual attributes, and one or more control attributes; and
wherein each of the one or more contextual attributes comprises:
a total number of words in a context; and
one or more context words, each context word having one or more associated measurements.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
Specification