AUTOMATIC EXTRACTION USING MACHINE LEARNING BASED ROBUST STRUCTURAL EXTRACTORS
First Claim
1. A computer-implemented method comprising:
producing a trained machine learning model based at least in part on a plurality of documents;
applying the trained machine learning model to a set of documents;
based at least in part on the applying the trained machine learning model to the set of documents, determining a plurality of locations of a particular attribute in the set of documents;
associating a set of locations with the particular attribute, based at least in part on the plurality of locations; and
based at least in part on the set of locations, extracting, from a particular document, an attribute value corresponding to the particular attribute;
wherein the method is performed by one or more computing devices programmed to be special purpose machines pursuant to program instructions.
Abstract
A method and apparatus for automatically extracting information from a large number of documents through applying machine learning techniques and exploiting structural similarities among documents. A machine learning model is trained to have at least 50% accuracy. The trained machine learning model is used to identify information attributes in a sample of pages from a cluster of structurally similar documents. A structure-specific model of the cluster is created by compiling a list of top-K locations for each attribute identified by the trained machine learning model in the sample. These top-K lists are used to extract information from the pages of the cluster from which the sample of pages was taken.
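The top-K step described in the abstract can be illustrated with a small sketch. It is not the patented implementation. It assumes that a page can be modeled as a mapping from a location (here an XPath-like string) to the text found there, that locations are ranked by how often the model labels them with an attribute across the sample, and that extraction takes the first ranked location present on a page. None of these details are stated in the source. The names `Page`, `build_top_k`, `extract`, and `toy_classifier` are hypothetical, and `toy_classifier` is a deliberately imperfect stand-in for the trained machine learning model.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical page model (an assumption): location path -> text at that location.
Page = Dict[str, str]


def build_top_k(sample: List[Page],
                classify: Callable[[str], Optional[str]],
                k: int = 2) -> Dict[str, List[str]]:
    """Compile, per attribute, the k locations the model labels most often."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for page in sample:
        for location, text in page.items():
            attribute = classify(text)  # the model's (possibly wrong) guess
            if attribute is not None:
                votes[attribute][location] += 1
    return {attr: [loc for loc, _ in counts.most_common(k)]
            for attr, counts in votes.items()}


def extract(page: Page, top_k: Dict[str, List[str]]) -> Dict[str, str]:
    """Read each attribute from the first ranked location present on the page."""
    values: Dict[str, str] = {}
    for attr, locations in top_k.items():
        for loc in locations:
            if loc in page:
                values[attr] = page[loc]
                break
    return values


def toy_classifier(text: str) -> Optional[str]:
    # Stand-in for the trained model: labels any text containing "$" as a price,
    # so it also mislabels promotional text that mentions a dollar amount.
    return "price" if "$" in text else None


# Sample of pages from one cluster of structurally similar documents.
sample = [
    {"/html/body/div[1]/span": "$10.00", "/html/body/div[2]/p": "Ships free"},
    {"/html/body/div[1]/span": "$24.50", "/html/body/div[2]/p": "Save $5 today"},
    {"/html/body/div[1]/span": "$7.25", "/html/body/div[2]/p": "In stock"},
]
top_k = build_top_k(sample, toy_classifier, k=2)

# A page from the same cluster that was not in the sample.
new_page = {"/html/body/div[1]/span": "$99.99", "/html/body/div[2]/p": "Limited offer"}
result = extract(new_page, top_k)
print(top_k)
print(result)
```

In this sketch the classifier mislabels "Save $5 today" as a price, but that location receives only one vote against three for the real price location, so it ranks second. This is one way a model with modest accuracy could still yield a reliable structure-specific extractor when the pages in a cluster share a layout.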
20 Claims
Claims 2-20 depend on claim 1, which is recited in full under First Claim.
Specification