Data extraction using templates

US 8,589,366 B1
Filed: 11/01/2007
Issued: 11/19/2013
Est. Priority Date: 11/01/2007
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Systems and techniques for extracting data from unstructured documents are described. One such method involves assigning one or more labels to one or more nodes in a first object model of a first web page; comparing a second object model of a second web page to the first object model; if the first object model matches the second object model to a determined degree, extracting from the second web page data associated with nodes in the second object model that match labeled nodes in the first object model; and providing the extracted data for storage in a structured database in a manner associated with the labels.
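
The method in the abstract can be illustrated with a minimal sketch. It assumes a toy object model (the `Node` class), a path-overlap score standing in for the "determined degree" of match, and a threshold of 0.8; none of these names or values come from the patent, which does not prescribe a particular representation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical object-model node; `label` is set only on annotated nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)
    label: Optional[str] = None


def paths(node, prefix=(), index=0):
    """Yield (path, node) pairs; a path is a tuple of (tag, sibling index) steps."""
    here = prefix + ((node.tag, index),)
    yield here, node
    for i, child in enumerate(node.children):
        yield from paths(child, here, i)


def match_degree(first, second):
    """Assumed similarity measure: share of structural paths common to both models."""
    a = {p for p, _ in paths(first)}
    b = {p for p, _ in paths(second)}
    return len(a & b) / len(a | b)


def extract(first, second, threshold=0.8):
    """Copy text from nodes of `second` that sit where `first` has labeled nodes."""
    if match_degree(first, second) < threshold:
        return None
    by_path = dict(paths(second))
    return {n.label: by_path[p].text
            for p, n in paths(first) if n.label and p in by_path}


# First (annotated) page and a second, un-annotated page with the same layout.
first = Node("html", children=[Node("body", children=[
    Node("h1", "Acme Widget", label="name"),
    Node("span", "$10", label="price"),
])])
second = Node("html", children=[Node("body", children=[
    Node("h1", "Bolt"),
    Node("span", "$3"),
])])
other = Node("html", children=[Node("body", children=[Node("p", "hello")])])

record = extract(first, second)   # labels mapped to text from the second page
no_match = extract(first, other)  # None: the structures differ too much
```

The resulting `record` dictionary is keyed by label, which is what lets the extracted data be stored "in a manner associated with the labels".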

61 Citations

15 Claims

1. A computer-implemented data analysis method, the method comprising:
- for each group of web pages:
  
  assigning one or more labels to one or more nodes in object models of respective web pages to provide multiple annotated object models;
  
  comparing multiple annotated object models; and
  
  determining that data from the respective web pages should be stored in a single database, and, in response, forming a composite object model, the composite object model being based on the multiple annotated object models and reflecting a structure of the respective web pages as a group;
  
  identifying an un-annotated web page;
  
  conducting an initial analysis of the un-annotated web page and, based on the initial analysis, identifying the un-annotated web page as a candidate for comparison;
  
  in response to the identifying of the un-annotated web page as the candidate for the comparison, comparing an object model of the un-annotated web page to each of the composite object models by calculating an edit distance between the object model of the un-annotated webpage and each of the composite object models;
  
  determining that a particular composite object model of the composite object models matches the object model based on the edit distance between the particular composite object model and the object model, and in response to the determining that the particular composite object model matches the object model, extracting, from the un-annotated web page, data associated with nodes in the object model that match labeled nodes in the particular composite object model and labeling nodes of the object model of the un-annotated web page based on the labeled nodes of the particular composite object model; and
  
  providing the extracted data for storage in a structured database in a manner associated with the labels.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein assigning one or more labels comprises manually selecting elements on each of the web pages and manually selecting labels for the selected elements.
  - 3. The method of claim 1, wherein fields in the structured database correspond to the labels.
  - 4. The method of claim 1, wherein each of the object models is provided as a template tree that includes single nodes that comprise repeated or optional data structures from the respective web pages.
  - 5. The method of claim 4, wherein the object models of the web pages comprise Document Object Model (DOM) models.
  - 6. The method of claim 1, further comprising accessing the structured database in response to a search request, and providing one or more search results that include hyperlinks to web pages associated with data in the structured database that is responsive to the search request.
  - 7. The method of claim 1, further comprising identifying the un-annotated web page by crawling a plurality of web pages at a domain corresponding to one of the respective web pages.
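
The comparison step of claim 1 (computing an edit distance between a page's object model and each composite object model, then picking the match) might look like the sketch below. It makes a simplifying assumption that is not in the claims: each tree is flattened to its pre-order tag sequence and compared with ordinary Levenshtein distance, rather than a true tree edit distance. The `(tag, [children])` tuples, the template names, and the `max_distance` cutoff are all hypothetical.

```python
def preorder_tags(tree):
    """Flatten an object model, given as (tag, [children]), into a list of tags."""
    tag, children = tree
    out = [tag]
    for child in children:
        out.extend(preorder_tags(child))
    return out


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


def best_composite(page, composites, max_distance=2):
    """Return (name, distance) of the closest composite, or None if none is close."""
    if not composites:
        return None
    page_tags = preorder_tags(page)
    scored = [(edit_distance(page_tags, preorder_tags(tree)), name)
              for name, tree in composites.items()]
    distance, name = min(scored)
    return (name, distance) if distance <= max_distance else None


# Two hypothetical composite object models and two un-annotated pages.
composites = {
    "product": ("html", [("body", [("h1", []), ("span", []), ("ul", [])])]),
    "article": ("html", [("body", [("h1", []), ("p", []), ("p", []), ("p", [])])]),
}
page = ("html", [("body", [("h1", []), ("span", [])])])
far_page = ("html", [("table", [("tr", []), ("tr", []), ("tr", []), ("tr", [])])])

match = best_composite(page, composites)         # closest template is "product"
no_match = best_composite(far_page, composites)  # nothing within max_distance
```

Flattening discards some nesting information, so a production system would more likely use a tree-aware distance; the selection logic (score every composite, accept the minimum only under a threshold) stays the same.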

8. A computer-implemented system for extracting data from electronic documents, the system comprising:
- one or more processors;
  
  a computer-readable storage medium coupled to the one or more processors and having instructions stored thereon, which, when executed by the one or more processors, provide:
  
  a template generator to create object models of network-accessible documents;
  
  a template labeler to, for each group of network-accessible documents, categorize elements in object models of the network-accessible documents, object models of the network-accessible documents being compared and, a composite object model being formed based on the object models in response to determining that the object models match;
  
  a template comparison module to determine levels of match between a composite labeled template, representative of the composite object model, and unlabeled templates of network-accessible documents;
  
  a document object model (DOM) analyzer to conduct an initial analysis of a network-accessible document and to, based on the initial analysis, identify the network-accessible document as a candidate for comparison, wherein, in response to the identifying of the network-accessible document as the candidate for the comparison, the template comparison module compares a template of the network-accessible document to each of the composite labeled templates by calculating an edit distance between the template of the network-accessible document and each of the composite labeled templates and determines that a particular composite labeled template of the composite object models matches the template of the network-accessible document based on the edit distance between the particular composite labeled template and the template of the network-accessible document, the template labeler labeling the template of the network-accessible document based on labels of the composite labeled template in response to determining that the template of the network-accessible document and the particular composite labeled template match; and
  
  a data extractor that, in response to the determining that the composite labeled template matches the template of the network-accessible document, extracts data from the network-accessible document at locations corresponding to labeled elements in the composite labeled template, and stores the extracted data in a structured database.
- Dependent claims (9, 10, 11, 12, 13):
  - 9. The system of claim 8, further comprising a crawler to identify pages at a domain corresponding to pages having labeled templates.
  - 10. The system of claim 8, further comprising a search engine programmed to search the structured database in response to a search request, and to generate search results that include links to network-accessible documents associated with data entries in the structured database.
  - 11. The system of claim 8, wherein the template labeler is programmed to receive manual user labeling of document elements and to match the user labeling to elements in the object models.
  - 12. The system of claim 8, wherein the object models created by the template generator include single nodes that comprise repeated or optional data structures from the respective network-accessible documents.
  - 13. The system of claim 8, wherein the template comparison module comprises a finite-state transducer.
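
Claims 4 and 12 describe template trees in which a single node stands for a repeated data structure. One way to picture that is the sketch below, which collapses runs of identical adjacent sibling subtrees into one node flagged as repeated. The tuple layout `(tag, children, repeated)` and the `collapse` function are illustrative assumptions, and optional structures are not handled here.

```python
def collapse(tree):
    """Turn (tag, [children]) into a template node (tag, [children], repeated).

    Adjacent sibling subtrees with the same collapsed shape are merged into a
    single child whose `repeated` flag is True.
    """
    tag, children = tree
    merged = []
    for child in children:
        c_tag, c_children, _ = collapse(child)
        if merged and merged[-1][:2] == (c_tag, c_children):
            merged[-1] = (c_tag, c_children, True)   # same shape as previous sibling
        else:
            merged.append((c_tag, c_children, False))
    return (tag, merged, False)


def item():
    """A hypothetical list entry: <li><a></a></li>."""
    return ("li", [("a", [])])


short_list = ("ul", [item(), item()])
long_list = ("ul", [item() for _ in range(5)])

template = collapse(long_list)
same_template = collapse(short_list) == collapse(long_list)
```

Because a two-item list and a five-item list collapse to the same template, pages that differ only in how many records they show can be compared against one composite template.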

14. A system for extracting data from electronic documents, the system comprising:
- one or more processors;
  
  a computer-readable storage medium coupled to the one or more processors and having instructions stored thereon, which, when executed by the one or more processors, provide:
  
  a template generator to create object models of network-accessible documents;
  
  a template labeler to, for each group of network-accessible documents, categorize elements in object models of the network-accessible documents, object models of the network-accessible documents being compared and a composite object model being formed based on the object models in response to determining that the object models match;
  
  means for comparing document templates to determine a degree of match between a composite labeled template, representative of the composite object model, and unlabeled templates of network-accessible documents;
  
  a document object model (DOM) analyzer to conduct an initial analysis of a network-accessible document and to, based on the initial analysis, identify the network-accessible document as a candidate for comparison, wherein, in response to the identifying of the network-accessible document as the candidate for the comparison, the template comparison module compares a template of the network-accessible document to each of the composite labeled templates by calculating an edit distance between the template of the network-accessible document and each of the composite labeled templates and determines that a particular composite labeled template of the composite object models matches the template of the network-accessible document based on the edit distance between the particular composite labeled template and the template of the network-accessible document, the template labeler labeling the template of the network-accessible document based on labels of the composite labeled template in response to determining that the template of the network-accessible document and the particular composite labeled template match; and
  
  a data extractor that, in response to the determining that the composite labeled template matches the template of the network-accessible document, extracts data from the network-accessible document, at locations associated with labels in the composite labeled template.

15. A computer-implemented data analysis method, the method comprising:
- for each group of web pages:
  
  forming a composite object model based on object models corresponding to a plurality of web pages of the group;
  
  assigning one or more labels to one or more nodes in the composite object model;
  
  conducting an initial analysis of an un-annotated web page and, based on the initial analysis, identifying the un-annotated web page as a candidate for comparison;
  
  in response to the identifying of the un-annotated web page as the candidate for the comparison, comparing an object model of the un-annotated web page to each of the composite object models by calculating an edit-distance between the object model of the un-annotated web page and each of the composite object models;
  
  determining that a particular composite object model of the composite object models matches the object model based on the edit distance between the particular composite object model and the object model, and in response to the determining that the particular composite object model matches the object model, extracting, from the un-annotated web page, data associated with nodes in the object model that match labeled nodes in the particular composite object model and labeling nodes of the object model of the un-annotated web page based on the labeled nodes of the particular composite object model; and
  
  providing the extracted data for storage in a structured database in a manner associated with the labels.
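
The final step, storing extracted data "in a manner associated with the labels", can be sketched with a table whose columns correspond to the labels (as in claim 3). The example uses Python's standard `sqlite3` module with an in-memory database; the table name, the `url` key column, and the `store` helper are assumptions made for illustration only.

```python
import sqlite3


def store(conn, table, labels, records):
    """Create a table with one TEXT column per label and upsert one row per page.

    `records` is an iterable of (url, {label: value}) pairs. Table and label
    names are interpolated into SQL, so they are restricted to identifiers.
    """
    for name in [table, *labels]:
        if not name.isidentifier():
            raise ValueError(f"unsafe column or table name: {name!r}")
    columns = ", ".join(f'"{label}" TEXT' for label in labels)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (url TEXT PRIMARY KEY, {columns})')
    names = ", ".join(f'"{label}"' for label in labels)
    marks = ", ".join("?" for _ in labels)
    for url, data in records:
        conn.execute(
            f'INSERT OR REPLACE INTO "{table}" (url, {names}) VALUES (?, {marks})',
            [url] + [data.get(label) for label in labels])
    conn.commit()


conn = sqlite3.connect(":memory:")
store(conn, "products", ["name", "price"],
      [("https://example.com/a", {"name": "Bolt", "price": "$3"}),
       ("https://example.com/b", {"name": "Nut"})])  # missing label becomes NULL

rows = conn.execute(
    "SELECT url, name, price FROM products ORDER BY url").fetchall()
```

Keying rows by page URL also keeps the link back to the source document, which is what a search front end along the lines of claims 6 and 10 would need in order to return hyperlinks with its results.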

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Younes, Haakan; Schafer, Charles F. III
Primary Examiner(s): Lee, Wilson
Assistant Examiner(s): Le, Jessica N
Application Number: US11/933,962
Time in Patent Office: 2,210 Days
Field of Search: 707/101, 707/602, 707/705
US Class Current: 707/705
CPC Class Codes:
G06F 16/289   Object oriented databases
G06F 16/951   Indexing; Web crawling tech...
G06F 40/143   Markup, e.g. Standard Gener...
