Extracting data from semi-structured text documents
Abstract
The invention is a process, system, and workflow for extracting and warehousing data from semi-structured documents in any language. It includes, but is not limited to, one or more methods for: the automatic building of text mining term models; the optimization or evolution of such text mining term models; the implementation of document-specific (or company-specific) memory; and the linking of the extracted data, or metadata, once placed in a target electronic document, to the machine-readable underlying source document, thus providing verification and provenance. The process preferably incorporates a wizard-based method for producing pattern recognition text mining term models to extract data from text. The invention also includes a system, method, and workflow for handling a subsequent document of similar design and structure, specifically the automatic extraction of target elements and their addition to a database.
315 Citations
136 Claims
-
1-118. (canceled)
-
119. A method for automating the extraction of information from a semi-structured document characterized by a document type that comprises design and structural characteristics of a set of similar documents, the method comprising:
- designing a target extraction template for the terms of the document type;
- supporting the creation of a control set of documents containing the terms manually tagged to the extraction template;
- automatically generating a skeleton of extraction model tree for every term;
- training the models by automatically optimizing selectors of the term extraction models to the best compliance with the control set tagging; and
- using the optimized model to automatically extract information from the document.
Dependent claims: 120, 121, 122, 123, 124, 125, 126
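The training step of claim 119 can be illustrated with a minimal sketch. It assumes, purely for illustration, that a term extraction model reduces to a choice among candidate regular-expression selectors and that "compliance with the control set tagging" is the fraction of control documents on which a selector reproduces the manual tag. The names `CONTROL_SET`, `CANDIDATE_SELECTORS`, `extract`, `score`, and `optimize` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical control set: (document text, manually tagged value) pairs.
CONTROL_SET = [
    ("Invoice No: 1001\nTotal due: 250.00 USD", "250.00"),
    ("Invoice No: 1002\nTotal due: 75.50 USD", "75.50"),
    ("Invoice No: 1003\nAmount: 10.00\nTotal due: 310.25 USD", "310.25"),
]

# Hypothetical candidate selectors for a single term ("total").
CANDIDATE_SELECTORS = [
    r"(\d+\.\d{2})",
    r"Amount:\s*(\d+\.\d{2})",
    r"Total due:\s*(\d+\.\d{2})",
]


def extract(selector, text):
    """Return the first captured group, or None when the selector does not match."""
    match = re.search(selector, text)
    return match.group(1) if match else None


def score(selector, control_set):
    """Fraction of control documents where the selector reproduces the manual tag."""
    hits = sum(extract(selector, text) == tagged for text, tagged in control_set)
    return hits / len(control_set)


def optimize(selectors, control_set):
    """Pick the selector that agrees best with the control set tagging."""
    return max(selectors, key=lambda s: score(s, control_set))


best = optimize(CANDIDATE_SELECTORS, CONTROL_SET)
print(best, score(best, CONTROL_SET))
```

The selected selector can then be applied to a subsequent document of the same type with `extract(best, new_text)`. A real model tree would presumably combine several selectors per term; this sketch only shows the scoring-against-a-control-set idea.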
-
127. A method of manually tagging and extracting terms from a semi-structured document while automatically collecting key indicators for pattern recognition, in which the tagging is the sole generation point of statistics needed for creation and optimization of an extraction model.
-
128. A method of using an extraction template having terms to extract data from a semi-structured document having tagged values, comprising providing at least one of:
- a many-to-many relationship between the tagged values and the terms in the extraction template;
- a many-to-one relationship between the tagged values and a single term; or
- a one-to-many relationship between a single tagged value and a plurality of multiple terms.
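The relationships named in claim 128 can be pictured as a set of links between tagged values and template terms. The sketch below is one possible representation, assuming each link is a `(tag, term)` pair; the identifiers `LINKS`, `index_links`, and the sample tag and term names are hypothetical.

```python
from collections import defaultdict

# Hypothetical links between tagged values and extraction template terms.
LINKS = [
    ("tag1", "revenue"),
    ("tag2", "revenue"),        # many-to-one: two tagged values feed one term
    ("tag3", "net_income"),
    ("tag3", "eps_numerator"),  # one-to-many: one tagged value feeds two terms
]


def index_links(links):
    """Build lookup tables in both directions from (tag, term) pairs."""
    terms_by_tag = defaultdict(set)
    tags_by_term = defaultdict(set)
    for tag, term in links:
        terms_by_tag[tag].add(term)
        tags_by_term[term].add(tag)
    return terms_by_tag, tags_by_term


terms_by_tag, tags_by_term = index_links(LINKS)
print(sorted(tags_by_term["revenue"]))  # tagged values linked to "revenue"
print(sorted(terms_by_tag["tag3"]))     # terms linked to "tag3"
```

Storing links as pairs keeps all three cases (many-to-many, many-to-one, one-to-many) in one structure rather than requiring a separate schema for each.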
-
129. A method of extracting data from a semi-structured document having a source format, comprising providing a generalized spatial and contextual file format that is independent of the source format.
-
133. A method of extracting data from a semi-structured source document, comprising providing source links for extracted data at a term level without modifying the source document, and further in which reference to the source document is provided through an abstraction enabled by a generalized intermediate format.
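Claims 129 and 133 can be read together: extracted data points back to the source through an intermediate representation rather than through edits to the source file. The following is a minimal sketch under that reading, assuming the intermediate format is a list of tokens carrying page and bounding-box coordinates. The classes `Token` and `ExtractedTerm`, the function `resolve_source`, and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token of the hypothetical intermediate format."""
    token_id: int
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    text: str


@dataclass(frozen=True)
class ExtractedTerm:
    """An extracted value plus the ids of the tokens it came from."""
    term: str
    value: str
    source_token_ids: tuple


# Hypothetical intermediate rendering of one line of a source document.
INTERMEDIATE = [
    Token(0, 1, (72, 700, 110, 712), "Total"),
    Token(1, 1, (114, 700, 140, 712), "due:"),
    Token(2, 1, (144, 700, 190, 712), "250.00"),
]


def resolve_source(extracted, tokens):
    """Map an extracted term back to (page, bbox) locations; the source is never modified."""
    by_id = {t.token_id: t for t in tokens}
    return [(by_id[i].page, by_id[i].bbox) for i in extracted.source_token_ids]


total = ExtractedTerm("total", "250.00", (2,))
locations = resolve_source(total, INTERMEDIATE)
print(locations)
```

Because the link is stored as token ids in the intermediate layer, the same lookup works whatever the original file format was, which is the abstraction the claim describes.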
-
134. A method of quality control in a process of collecting data from a semi-structured source document, comprising providing at least one of:
- document-type specific controls;
- system-wide controls;
- automated data cross-checks; and
- manual quality assurance measures.
Dependent claims: 135, 136
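As an illustration of what an automated data cross-check in claim 134 might look like, the sketch below verifies that extracted component values add up to an extracted total. The claim does not specify any particular check; the function `cross_check_sum`, the field names, and the tolerance are assumptions made for this example.

```python
def cross_check_sum(extracted, total_term, part_terms, tolerance=0.01):
    """Return True when the parts add up to the total within the tolerance."""
    expected = sum(extracted[name] for name in part_terms)
    return abs(extracted[total_term] - expected) <= tolerance


# Hypothetical extracted record for one document.
record = {"total_assets": 150.0, "current_assets": 100.0, "noncurrent_assets": 50.0}
ok = cross_check_sum(record, "total_assets", ["current_assets", "noncurrent_assets"])
print(ok)
```

A record that fails such a check could be routed to the manual quality assurance step rather than written straight to the database.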
Specification