
Data extraction using templates

  • US 8,589,366 B1
  • Filed: 11/01/2007
  • Issued: 11/19/2013
  • Est. Priority Date: 11/01/2007
  • Status: Active Grant
First Claim

1. A computer-implemented data analysis method, the method comprising:

  • for each group of web pages;

    assigning one or more labels to one or more nodes in object models of respective web pages to provide multiple annotated object models;

    comparing multiple annotated object models; and

    determining that data from the respective web pages should be stored in a single database, and, in response, forming a composite object model, the composite object model being based on the multiple annotated object models and reflecting a structure of the respective web pages as a group;

    identifying an un-annotated web page;

    conducting an initial analysis of the un-annotated web page and, based on the initial analysis, identifying the un-annotated web page as a candidate for comparison;

    in response to the identifying of the un-annotated web page as the candidate for the comparison, comparing an object model of the un-annotated web page to each of the composite object models by calculating an edit distance between the object model of the un-annotated webpage and each of the composite object models;

    determining that a particular composite object model of the composite object models matches the object model based on the edit distance between the particular composite object model and the object model, and in response to the determining that the particular composite object model matches the object model, extracting, from the un-annotated web page, data associated with nodes in the object model that match labeled nodes in the particular composite object model and labeling nodes of the object model of the un-annotated web page based on the labeled nodes of the particular composite object model; and

    providing the extracted data for storage in a structured database in a manner associated with the labels.
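The matching and extraction steps recited in the claim can be illustrated with a minimal sketch. This is not the patented implementation: it assumes each object model has been flattened into an ordered list of nodes identified by tag path, and it substitutes a plain Levenshtein distance over those paths for whatever edit distance the patent actually contemplates. The names `Node`, `edit_distance`, `best_match`, `extract`, and the `max_ratio` threshold are all hypothetical, as is the sample data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a flattened object model (hypothetical representation)."""
    path: str                    # tag path, e.g. "html/body/h1"
    text: str = ""               # text content of the node
    label: Optional[str] = None  # annotation, e.g. "price"


def edit_distance(a, b):
    """Levenshtein distance between two sequences of node paths."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def best_match(page, composites, max_ratio=0.3):
    """Return the name of the closest composite model, or None.

    max_ratio is an assumed cutoff on distance divided by the longer length.
    """
    page_paths = [n.path for n in page]
    best_name, best_ratio = None, None
    for name, model in composites.items():
        model_paths = [n.path for n in model]
        dist = edit_distance(page_paths, model_paths)
        ratio = dist / max(len(page_paths), len(model_paths), 1)
        if best_ratio is None or ratio < best_ratio:
            best_name, best_ratio = name, ratio
    if best_ratio is not None and best_ratio <= max_ratio:
        return best_name
    return None


def extract(page, composite):
    """Label page nodes that match labeled composite nodes; return the data."""
    labels = {n.path: n.label for n in composite if n.label}
    record = {}
    for node in page:
        if node.path in labels:
            node.label = labels[node.path]
            record[node.label] = node.text
    return record


# Two composite models, each standing in for one group of annotated pages.
composites = {
    "product": [Node("html/body/h1", label="title"),
                Node("html/body/span", label="price"),
                Node("html/body/p")],
    "article": [Node("html/body/h1", label="headline"),
                Node("html/body/div/p", label="body"),
                Node("html/body/div/p"),
                Node("html/body/footer")],
}

# An un-annotated page to classify and extract from.
page = [Node("html/body/h1", "Blue Widget"),
        Node("html/body/span", "$9.99"),
        Node("html/body/p", "In stock"),
        Node("html/body/img")]

match = best_match(page, composites)
record = extract(page, composites[match]) if match else {}
print(match, record)
```

In this sketch the page differs from the "product" composite by one extra node, so that composite is selected, and the resulting `record` is keyed by label, ready to be stored in a structured database in a manner associated with the labels. A true tree edit distance over the object models would respect nesting rather than only node order; the flattened version is used here purely to keep the example short.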
