Extracting data from semi-structured text documents

US 20060242180A1
Filed: 07/23/2004
Published: 10/26/2006
Est. Priority Date: 07/23/2003
Status: Abandoned Application

2 Assignments

0 Petitions

Abstract

The invention is a process, system, and workflow for extracting and warehousing data from semi-structured documents in any language. This includes, but is not limited to, one or more of methods for: the automatic building of text mining term models; the optimization or evolution of such text mining term models; the implementation of document specific (or company specific) memory; and the tying or linking of the extracted data, or metadata, once placed in a target electronic document, to the machine readable, underlying source document, thus providing verification and provenance. The process preferably incorporates a wizard-based method for producing pattern recognition text mining term models to extract data from text. The invention also includes a system, method and workflow for handling a subsequent document of similar design and structure, specifically the automatic extraction of target elements and addition of the same to a database.

315 Citations

136 Claims

1-118. (canceled)

119. A method for automating the extraction of information from a semi-structured document characterized by a document type that comprises design and structural characteristics of a set of similar documents, the method comprising:
- designing a target extraction template for the terms of the document type;
- supporting the creation of a control set of documents containing the terms manually tagged to the extraction template;
- automatically generating a skeleton of extraction model tree for every term;
- training the models by automatically optimizing selectors of the term extraction models to the best compliance with the control set tagging; and
- using the optimized model to automatically extract information from the document.
- Dependent claims (120, 121, 122, 123, 124, 125, 126):
  - 120. The method of claim 119, further comprising using specialized invariants to select generic components of information from the document.
  - 121. The method of claim 119, further comprising tracking and analyzing changes made to initially extracted information and subsequent re-optimization of models.
  - 122. The method of claim 119, further comprising analyzing an additional semi-structured document and updating the model selectors or its structure if a change in accuracy of the term extraction model exceeds a threshold.
  - 123. The method of claim 119, further comprising:
    - (a) retaining specific information about a set of semi-structured documents to serve as a template for new semi-structured document introduction;
    - (b) comparing any new semi-structured document with a pattern represented by specific information known to be suitable for searching for text based on the retained specific information about the set of semi-structured documents;
    - (c) assessing if the result of (b) is within a threshold of the result of (a).
  - 124. The method of claim 123, as applied to knowledge that a given company employs similar patterns for subsequent versions of similar documents identifying the company to which the documents pertain.
  - 125. The method of claim 119, in which terms can be assigned a term class for at least one of immediate validation, synonym support, and vocabulary management.
  - 126. The method of claim 119, further comprising automatically comparing first and second extracted data to each other to identify extraction errors.
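
The steps of claim 119 can be pictured with a small sketch. It is an illustration under stated assumptions, not the applicants' implementation: the "selector" is reduced to a single label string chosen from a list of candidates, "training" means picking the candidate that agrees most often with the manually tagged control set, and the names `TermModel`, `train`, and `candidate_labels` are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class TermModel:
    """Hypothetical one-selector model for a single template term."""
    term: str
    label: str = ""  # selector: the text label expected before the value

    def extract(self, text):
        """Return the first number that follows the label, or None."""
        if not self.label:
            return None
        pattern = re.escape(self.label) + r"\s*[:\-]?\s*\$?([\d,]+(?:\.\d+)?)"
        match = re.search(pattern, text, re.IGNORECASE)
        return match.group(1) if match else None


def train(term, candidate_labels, control_set):
    """Pick the candidate label that best reproduces the control set tags.

    control_set is a list of (document_text, {term: tagged_value}) pairs.
    """
    best_label, best_score = "", -1
    for label in candidate_labels:
        model = TermModel(term, label)
        score = sum(
            model.extract(text) == tags.get(term) for text, tags in control_set
        )
        if score > best_score:
            best_label, best_score = label, score
    return TermModel(term, best_label)


# Control set: two documents with the "revenue" term tagged by hand.
control_set = [
    ("Total revenue: 1,200\nNet income: 300", {"revenue": "1,200"}),
    ("Revenue 950\nNet income 80", {"revenue": "950"}),
]
model = train("revenue", ["Total revenue", "Revenue", "Net income"], control_set)
extracted = model.extract("Revenue: 1,480\nNet income: 210")
```

A fuller system along the lines of the claim would optimize many selectors per term and re-run the optimization when accuracy on new documents drifts past a threshold (claim 122); the sketch only shows the select-by-agreement idea.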

127. A method of manually tagging and extracting terms from a semi-structured document while automatically collecting key indicators for pattern recognition, in which the tagging is the sole generation point of statistics needed for creation and optimization of an extraction model.

128. A method of using an extraction template having terms to extract data from a semi-structured document having tagged values, comprising providing at least one of:
- a many-to-many relationship between the tagged values and the terms in the extraction template;
- a many-to-one relationship between the tagged values and a single term; or
- a one-to-many relationship between a single tagged value and a plurality of multiple terms.
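
One way to hold the relationships named in claim 128 is a two-way index between tagged values and template terms. The sketch below is an assumption for illustration only; the class name `TagMapping`, the value identifiers (`v1`, `v2`, `v3`), and the term names are hypothetical and do not come from the application.

```python
from collections import defaultdict


class TagMapping:
    """Hypothetical many-to-many index between tagged values and terms."""

    def __init__(self):
        self._terms_by_value = defaultdict(set)
        self._values_by_term = defaultdict(set)

    def link(self, value_id, term):
        """Record that a tagged value contributes to a template term."""
        self._terms_by_value[value_id].add(term)
        self._values_by_term[term].add(value_id)

    def terms_for(self, value_id):
        """All terms fed by one tagged value (one-to-many direction)."""
        return sorted(self._terms_by_value.get(value_id, ()))

    def values_for(self, term):
        """All tagged values feeding one term (many-to-one direction)."""
        return sorted(self._values_by_term.get(term, ()))


mapping = TagMapping()
mapping.link("v1", "total_debt")   # two tagged values feed a single term
mapping.link("v2", "total_debt")
mapping.link("v3", "net_income")   # one tagged value feeds two terms
mapping.link("v3", "earnings")
```

Storing both directions keeps either lookup cheap; the many-to-many case is simply both patterns present in the same index.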

129. A method of extracting data from a semi-structured document having a source format, comprising providing a generalized spatial and contextual file format that is independent of the source format.
- Dependent claims (130, 131, 132):
  - 130. The method of claim 129, in which the generalized spatial and contextual file format specifies at least one of context on the document, page, table, row, column, and offset.
  - 131. The method of claim 129, in which the semi-structured document is an EDGAR electronic filing and the method further comprises providing at least one of access, navigation, selection, downloading, conversion into the generalized format, and insertion into a document repository.
  - 132. The method of claim 129, in which the semi-structured document is in a format selected from the group consisting of PDF, HTML, and text, and the method further comprises providing at least one of access, navigation, selection, downloading, conversion into the generalized format, and insertion into a document repository.
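
The application does not publish the layout of its generalized format here, so the following is only a sketch of what a source-independent record carrying the context listed in claim 130 (document, page, table, row, column, offset) might look like for plain text. The `Cell` type, the `text_to_cells` function, the use of a form feed as page break, and the rule that two or more spaces separate columns are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical record of the generalized format."""
    document: str
    page: int
    table: Optional[int]  # None: plain text carries no table markup
    row: int
    column: int
    offset: int           # character offset into the original text
    text: str


def text_to_cells(document, text):
    """Convert plain text into cells with page, row, column, and offset."""
    cells = []
    page_start = 0
    for page_no, page in enumerate(text.split("\f"), start=1):
        line_start = page_start
        for row_no, line in enumerate(page.split("\n"), start=1):
            # A cell is a run of words separated by single spaces.
            for col_no, m in enumerate(re.finditer(r"\S+(?: \S+)*", line), start=1):
                cells.append(Cell(document, page_no, None, row_no, col_no,
                                  line_start + m.start(), m.group()))
            line_start += len(line) + 1   # skip the "\n"
        page_start += len(page) + 1       # skip the "\f"
    return cells


sample = "Revenue  1,200\nNet income  300\fNotes"
cells = text_to_cells("example-filing", sample)
```

A converter for HTML or PDF would fill the same fields from its own layout information, which is what makes downstream extraction independent of the source format.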

133. A method of extracting data from a semi-structured source document, comprising providing source links for extracted data at a term level without modifying the source document, and further in which reference to the source document is provided through an abstraction enabled by a generalized intermediate format.
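
A term-level source link of the kind claim 133 describes can be sketched as a pointer into the intermediate representation, stored next to the extracted value, so the source document itself is never altered. The types `SourceLink` and `ExtractedValue`, the functions `extract_with_link` and `resolve`, and the document identifier are hypothetical, and a plain string stands in for the generalized intermediate format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLink:
    """Pointer into the intermediate text of a stored document."""
    document: str
    offset: int
    length: int


@dataclass(frozen=True)
class ExtractedValue:
    term: str
    value: str
    source: SourceLink


def extract_with_link(term, needle, document, intermediate_text):
    """Locate a value and keep a link to where it was found."""
    start = intermediate_text.find(needle)
    if start < 0:
        return None
    return ExtractedValue(term, needle, SourceLink(document, start, len(needle)))


def resolve(link, repository):
    """Follow a link back to the text it points at, for verification."""
    text = repository[link.document]
    return text[link.offset:link.offset + link.length]


repository = {"example-filing": "Total assets 5,400\nTotal liabilities 3,100"}
value = extract_with_link(
    "total_liabilities", "3,100", "example-filing", repository["example-filing"]
)
snippet = resolve(value.source, repository)
```

Because the link lives with the extracted record rather than inside the document, the repository copy stays byte-for-byte unchanged.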

134. A method of quality control in a process of collecting data from a semi-structured source document, comprising providing at least one of:
- document-type specific controls;
- system-wide controls;
- automated data cross-checks; and
- manual quality assurance measures.
- Dependent claims (135, 136):
  - 135. The method of claim 134, in which the document-type specific controls are applied to the extracted content and include at least one of validation of specific data types, application of pre-assigned values, referencing of synonym lists, and application of user-defined validation rules.
  - 136. The method of claim 134, in which providing automated data cross-checks comprises automatically cross-checking currently extracted data against previously extracted data to identify potential data extraction errors.
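
The data-type validation of claim 135 and the cross-check of claim 136 can be combined in a short sketch. The claims do not say how a "potential data extraction error" is detected, so the rule used here (flag a value whose relative change from the previous extraction exceeds `max_relative_change`, default 0.5) is an assumption, as are the function names.

```python
def parse_number(raw):
    """Data-type check: return a float, or None if raw is not numeric."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


def cross_check(current, previous, max_relative_change=0.5):
    """Flag current values that are malformed or far from the prior value."""
    flagged = []
    for term, raw in current.items():
        value = parse_number(raw)
        if value is None:
            flagged.append((term, "not a number"))
            continue
        prior = parse_number(previous.get(term, ""))
        # Skip the comparison when there is no usable, non-zero prior value.
        if prior and abs(value - prior) / abs(prior) > max_relative_change:
            flagged.append((term, "large change from prior extraction"))
    return flagged


previous = {"revenue": "1,200", "total_assets": "5,400"}
current = {"revenue": "1,480", "net_income": "21O", "total_assets": "54,000"}
issues = cross_check(current, previous)
```

Flagged items would then go to the manual quality assurance step rather than being rejected outright, since a large change can also be genuine.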

Current Assignee: Mergent Data Technology Incorporated
Original Assignee: Mergent Data Technology Incorporated
Inventors: Wong, Augustinus Y.; Bricker, Elliot I.; Levy, Benjamin D. A.; Graf, James A.; Mikhaylov, Eduard Y.; Koroteyev, Vladimir
Application Number: US10/565,611
Publication Number: US 20060242180A1
US Class Current: 1/1
CPC Class Codes:
- G06F 16/38 Retrieval characterised by ...
- G06F 16/86 Mapping to a database
