Data extraction using templates
First Claim
1. A computer-implemented data analysis method, the method comprising:
for each group of web pages:
assigning one or more labels to one or more nodes in object models of respective web pages to provide multiple annotated object models;
comparing multiple annotated object models; and
determining that data from the respective web pages should be stored in a single database, and, in response, forming a composite object model, the composite object model being based on the multiple annotated object models and reflecting a structure of the respective web pages as a group;
identifying an un-annotated web page;
conducting an initial analysis of the un-annotated web page and, based on the initial analysis, identifying the un-annotated web page as a candidate for comparison;
in response to the identifying of the un-annotated web page as the candidate for the comparison, comparing an object model of the un-annotated web page to each of the composite object models by calculating an edit distance between the object model of the un-annotated webpage and each of the composite object models;
determining that a particular composite object model of the composite object models matches the object model based on the edit distance between the particular composite object model and the object model, and in response to the determining that the particular composite object model matches the object model, extracting, from the un-annotated web page, data associated with nodes in the object model that match labeled nodes in the particular composite object model and labeling nodes of the object model of the un-annotated web page based on the labeled nodes of the particular composite object model; and
providing the extracted data for storage in a structured database in a manner associated with the labels.
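The comparison step of claim 1 can be illustrated with a minimal sketch. It assumes each object model has been flattened into a sequence of tag names and uses plain Levenshtein distance; the claim does not specify which edit distance is used, so a tree edit distance could equally apply. The names `edit_distance`, `best_match`, and `max_distance` are hypothetical and do not come from the patent.

```python
from typing import Dict, Optional, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of node tags."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def best_match(page: Sequence[str],
               composites: Dict[str, Sequence[str]],
               max_distance: int) -> Optional[Tuple[str, int]]:
    """Return (name, distance) of the closest composite model,
    or None when no composite lies within max_distance (assumed threshold)."""
    scored = [(edit_distance(page, model), name)
              for name, model in composites.items()]
    if not scored:
        return None
    dist, name = min(scored)
    return (name, dist) if dist <= max_distance else None


# Hypothetical composite object models, flattened to tag sequences.
composites = {
    "product": ["html", "body", "h1", "span", "div", "p"],
    "article": ["html", "body", "h1", "p", "p", "p", "footer"],
}
# Flattened object model of an un-annotated page.
page = ["html", "body", "h1", "span", "div", "p", "p"]
match = best_match(page, composites, max_distance=2)
```

Here the page differs from the "product" composite by a single trailing node, so that composite is selected; with a stricter threshold of zero, no composite would match and no extraction would follow.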
Abstract
Systems and techniques for extracting data from unstructured documents are described. One such method involves assigning one or more labels to one or more nodes in a first object model of a first web page; comparing a second object model of a second web page to the first object model; if the first object model matches the second object model to a determined degree, extracting from the second web page data associated with nodes in the second object model that match labeled nodes in the first object model; and providing the extracted data for storage in a structured database in a manner associated with the labels.
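As a rough illustration of the extraction and storage steps in the abstract, the sketch below labels nodes of a first page by positional path, pulls text from the same paths in a second page, and stores the result in a table whose columns are the labels. Matching nodes by positional path, the `products` table, and the helper names `node_paths` and `extract` are all assumptions made for the example; the check that the two object models match "to a determined degree" is omitted here.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import Dict


def node_paths(root: ET.Element) -> Dict[str, ET.Element]:
    """Map a positional path such as 'div/span[0]' to each element."""
    paths: Dict[str, ET.Element] = {root.tag: root}

    def walk(node: ET.Element, prefix: str) -> None:
        counts: Dict[str, int] = {}
        for child in node:
            idx = counts.get(child.tag, 0)
            counts[child.tag] = idx + 1
            path = f"{prefix}/{child.tag}[{idx}]"
            paths[path] = child
            walk(child, path)

    walk(root, root.tag)
    return paths


def extract(labels: Dict[str, str], page: ET.Element) -> Dict[str, str]:
    """Pull text from nodes of `page` whose path carries a label."""
    found = node_paths(page)
    return {label: (found[path].text or "").strip()
            for path, label in labels.items() if path in found}


# Labels assigned to nodes of the first page's object model (hypothetical).
labels = {"div/h1[0]": "title", "div/span[0]": "price"}

# A second page with the same structure but different content.
second_page = ET.fromstring(
    "<div><h1>Blue Kettle</h1><span> $24.00 </span><p>In stock</p></div>"
)
record = extract(labels, second_page)

# Store the extracted data in a structured database keyed by the labels.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT, price TEXT)")
conn.execute(
    "INSERT INTO products (title, price) VALUES (:title, :price)", record
)
row = conn.execute("SELECT title, price FROM products").fetchone()
```

The unlabeled `p` node is ignored, which mirrors the abstract's point that only data associated with labeled nodes is extracted.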
15 Claims
8. A computer-implemented system for extracting data from electronic documents, the system comprising:
one or more processors;
a computer-readable storage medium coupled to the one or more processors and having instructions stored thereon, which, when executed by the one or more processors, provide:
a template generator to create object models of network-accessible documents;
a template labeler to, for each group of network-accessible documents, categorize elements in object models of the network-accessible documents, object models of the network-accessible documents being compared and a composite object model being formed based on the object models in response to determining that the object models match;
a template comparison module to determine levels of match between a composite labeled template, representative of the composite object model, and unlabeled templates of network-accessible documents;
a document object model (DOM) analyzer to conduct an initial analysis of a network-accessible document and to, based on the initial analysis, identify the network-accessible document as a candidate for comparison, wherein, in response to the identifying of the network-accessible document as the candidate for the comparison, the template comparison module compares a template of the network-accessible document to each of the composite labeled templates by calculating an edit distance between the template of the network-accessible document and each of the composite labeled templates and determines that a particular composite labeled template of the composite object models matches the template of the network-accessible document based on the edit distance between the particular composite labeled template and the template of the network-accessible document, the template labeler labeling the template of the network-accessible document based on labels of the composite labeled template in response to determining that the template of the network-accessible document and the particular composite labeled template match; and
a data extractor that, in response to the determining that the composite labeled template matches the template of the network-accessible document, extracts data from the network-accessible document at locations corresponding to labeled elements in the composite labeled template, and stores the extracted data in a structured database.
14. A system for extracting data from electronic documents, the system comprising:
one or more processors;
a computer-readable storage medium coupled to the one or more processors and having instructions stored thereon, which, when executed by the one or more processors, provide:
a template generator to create object models of network-accessible documents;
a template labeler to, for each group of network-accessible documents, categorize elements in object models of the network-accessible documents, object models of the network-accessible documents being compared and a composite object model being formed based on the object models in response to determining that the object models match;
means for comparing document templates to determine a degree of match between a composite labeled template, representative of the composite object model, and unlabeled templates of network-accessible documents;
a document object model (DOM) analyzer to conduct an initial analysis of a network-accessible document and to, based on the initial analysis, identify the network-accessible document as a candidate for comparison, wherein, in response to the identifying of the network-accessible document as the candidate for the comparison, the template comparison module compares a template of the network-accessible document to each of the composite labeled templates by calculating an edit distance between the template of the network-accessible document and each of the composite labeled templates and determines that a particular composite labeled template of the composite object models matches the template of the network-accessible document based on the edit distance between the particular composite labeled template and the template of the network-accessible document, the template labeler labeling the template of the network-accessible document based on labels of the composite labeled template in response to determining that the template of the network-accessible document and the particular composite labeled template match; and
a data extractor that, in response to the determining that the composite labeled template matches the template of the network-accessible document, extracts data from the network-accessible document, at locations associated with labels in the composite labeled template.
15. A computer-implemented data analysis method, the method comprising:
for each group of web pages:
forming a composite object model based on object models corresponding to a plurality of web pages of the group;
assigning one or more labels to one or more nodes in the composite object model;
conducting an initial analysis of an un-annotated web page and, based on the initial analysis, identifying the un-annotated web page as a candidate for comparison;
in response to the identifying of the un-annotated web page as the candidate for the comparison, comparing an object model of the un-annotated web page to each of the composite object models by calculating an edit-distance between the object model of the un-annotated web page and each of the composite object models;
determining that a particular composite object model of the composite object models matches the object model based on the edit distance between the particular composite object model and the object model, and in response to the determining that the particular composite object model matches the object model, extracting, from the un-annotated web page, data associated with nodes in the object model that match labeled nodes in the particular composite object model and labeling nodes of the object model of the un-annotated web page based on the labeled nodes of the particular composite object model; and
providing the extracted data for storage in a structured database in a manner associated with the labels.
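Claim 15 forms the composite object model first and labels it afterwards. The claim does not say how the composite is derived, so the following is only one plausible sketch: each page's object model is reduced to a set of node paths, and the composite keeps the paths shared by a sufficient share of the group. The function `form_composite`, the `min_share` parameter, and the example paths are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def form_composite(models: List[Set[str]], min_share: float = 0.5) -> Set[str]:
    """Keep node paths that occur in at least min_share of the group's
    object models (an assumed rule, not taken from the claim)."""
    counts = Counter(path for model in models for path in model)
    needed = min_share * len(models)
    return {path for path, n in counts.items() if n >= needed}


# Object models for one group of web pages, as sets of node paths.
group = [
    {"html/body/h1", "html/body/span.price", "html/body/div.reviews"},
    {"html/body/h1", "html/body/span.price"},
    {"html/body/h1", "html/body/span.price", "html/body/div.banner"},
]
composite = form_composite(group, min_share=0.6)

# Labels are then assigned to nodes of the composite, not to single pages.
composite_labels: Dict[str, str] = {
    "html/body/h1": "title",
    "html/body/span.price": "price",
}
```

Nodes that appear on only one page, such as the banner, drop out of the composite, so the labels end up attached to the structure the group has in common.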
Specification