Data extraction using templates
Abstract
Systems and techniques for extracting data from unstructured documents are described. One such method involves assigning one or more labels to one or more nodes in a first object model of a first web page; comparing a second object model of a second web page to the first object model; if the first object model matches the second object model to a determined degree, extracting from the second web page data associated with nodes in the second object model that match labeled nodes in the first object model; and providing the extracted data for storage in a structured database in a manner associated with the labels.
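The pairwise method in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: an object model is represented as a dict from a node path to its text, the "determined degree" of matching is taken to be a Jaccard overlap of node paths against a hypothetical `threshold`, and the names `similarity` and `extract` are invented.

```python
from typing import Dict, Optional

# Assumed representation: an object model maps a node path to the node's text.
ObjectModel = Dict[str, str]


def similarity(a: ObjectModel, b: ObjectModel) -> float:
    """Jaccard overlap of node paths (an assumed measure, not the patent's)."""
    paths_a, paths_b = set(a), set(b)
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


def extract(first: ObjectModel, labels: Dict[str, str],
            second: ObjectModel, threshold: float = 0.8) -> Optional[Dict[str, str]]:
    """Return label -> text from the second page, or None if the models differ too much."""
    if similarity(first, second) < threshold:
        return None
    # Keep only nodes of the second model that match labeled nodes of the first.
    return {label: second[path] for path, label in labels.items() if path in second}


# First page, with labels assigned to two of its nodes.
first = {"html/body/h1": "Blue Widget",
         "html/body/span.price": "$10",
         "html/body/p": "A fine widget."}
labels = {"html/body/h1": "title", "html/body/span.price": "price"}

# Second page with the same structure but different content.
second = {"html/body/h1": "Red Gadget",
          "html/body/span.price": "$25",
          "html/body/p": "A fine gadget."}

record = extract(first, labels, second)
print(record)
```

The returned dict is keyed by label, so it could be written to a structured store one column per label; the storage step itself is left out of the sketch.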
20 Claims
1. A computer-implemented data analysis method, comprising:
assigning one or more labels to one or more nodes in object models of respective web pages to provide a plurality of annotated object models;
comparing the plurality of annotated object models;
based on comparing the plurality of annotated object models, forming a plurality of composite object models, the forming including:
for each composite object model of the plurality of object models, determining that two or more of the plurality of annotated object models have at least a specified level of similarity, and in response, storing data from the respective web pages in a single database to form the composite object model, the composite object model based on the two or more annotated object models and reflecting a structure of the web pages as a group;
comparing an object model of a web page to each of the plurality of composite object models;
based on comparing the object model of the web page to each of the plurality of composite object models, identifying a particular composite object model of the plurality of composite object models based on an edit distance between each of the plurality of composite object models and the object model of the web page;
mapping the object model of the web page to the particular composite object model based on a minimum edit distance between the object model of the web page and the particular composite object model;
extracting, from the web page, data associated with nodes in the object model of the web page that correspond to labeled nodes in the particular composite object model based on the mapping; and
providing the extracted data i) for storage in a structured database in a manner associated with the labels and ii) for display by an application executable by a computing device associated with the web page.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising:
a computing device; and
a non-transitory computer-readable medium coupled to the computing device and having instructions stored thereon which, when executed by the computing device, cause the computing device to perform operations comprising:
assigning one or more labels to one or more nodes in object models of respective web pages to provide a plurality of annotated object models;
comparing the plurality of annotated object models;
based on comparing the plurality of annotated object models, forming a plurality of composite object models, the forming including:
for each composite object model of the plurality of object models, determining that two or more of the plurality of annotated object models have at least a specified level of similarity, and in response, storing data from the respective web pages in a single database to form the composite object model, the composite object model based on the two or more annotated object models and reflecting a structure of the web pages as a group;
comparing an object model of a web page to each of the plurality of composite object models;
based on comparing the object model of the web page to each of the plurality of composite object models, identifying a particular composite object model of the plurality of composite object models based on an edit distance between each of the plurality of composite object models and the object model of the web page;
mapping the object model of the web page to the particular composite object model based on a minimum edit distance between the object model of the web page and the particular composite object model;
extracting, from the web page, data associated with nodes in the object model of the web page that correspond to labeled nodes in the particular composite object model based on the mapping; and
providing the extracted data i) for storage in a structured database in a manner associated with the labels and ii) for display by an application executable by a computing device associated with the web page.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
assigning one or more labels to one or more nodes in object models of respective web pages to provide a plurality of annotated object models;
comparing the plurality of annotated object models;
based on comparing the plurality of annotated object models, forming a plurality of composite object models, the forming including:
for each composite object model of the plurality of object models, determining that two or more of the plurality of annotated object models have at least a specified level of similarity, and in response, storing data from the respective web pages in a single database to form the composite object model, the composite object model based on the two or more annotated object models and reflecting a structure of the web pages as a group;
comparing an object model of a web page to each of the plurality of composite object models;
based on comparing the object model of the web page to each of the plurality of composite object models, identifying a particular composite object model of the plurality of composite object models based on an edit distance between each of the plurality of composite object models and the object model of the web page;
mapping the object model of the web page to the particular composite object model based on a minimum edit distance between the object model of the web page and the particular composite object model;
extracting, from the web page, data associated with nodes in the object model of the web page that correspond to labeled nodes in the particular composite object model based on the mapping; and
providing the extracted data i) for storage in a structured database in a manner associated with the labels and ii) for display by an application executable by a computing device associated with the web page.
Dependent claims: 18, 19, 20
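The steps shared by the independent claims (forming composite object models from similar annotated models, selecting the composite with the minimum edit distance to a new page, mapping, and extracting) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: each object model is flattened to a list of tags in document order, a Levenshtein distance over those tag lists stands in for an edit distance between object models, the similarity `level` and the greedy grouping are invented for the example, and the database storage and display steps are omitted.

```python
from typing import Dict, List, Optional, Tuple

Annotated = List[Tuple[str, Optional[str]]]  # (tag, label or None), document order
Page = List[Tuple[str, str]]                 # (tag, text), document order


def tags(model) -> List[str]:
    return [tag for tag, _ in model]


def align(a: List[str], b: List[str]) -> Tuple[int, Dict[int, int]]:
    """Levenshtein distance between tag lists, plus an a-index -> b-index map of matches."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    # Walk back through the table to recover which positions were matched.
    mapping: Dict[int, int] = {}
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if a[i - 1] == b[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost == 0:
                mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dist[n][m], mapping


def form_composites(models: List[Annotated], level: float = 0.7) -> List[Annotated]:
    """Greedily group models meeting the similarity level; merge labels within each group."""
    groups: List[List[Annotated]] = []
    for model in models:
        for group in groups:
            d, _ = align(tags(group[0]), tags(model))
            if 1 - d / max(len(group[0]), len(model), 1) >= level:
                group.append(model)
                break
        else:
            groups.append([model])
    composites = []
    for group in groups:
        if len(group) < 2:  # a composite needs two or more annotated models
            continue
        merged = list(group[0])
        for other in group[1:]:
            _, mapping = align(tags(merged), tags(other))
            for i, j in mapping.items():
                if merged[i][1] is None and other[j][1] is not None:
                    merged[i] = (merged[i][0], other[j][1])
        composites.append(merged)
    return composites


def extract(composites: List[Annotated], page: Page) -> Dict[str, str]:
    """Pick the composite at minimum edit distance, then read the text of labeled nodes."""
    page_tags = tags(page)
    best = min(composites, key=lambda c: align(tags(c), page_tags)[0])
    _, mapping = align(tags(best), page_tags)
    return {label: page[mapping[i]][1]
            for i, (_, label) in enumerate(best)
            if label is not None and i in mapping}


annotated = [
    [("h1", "title"), ("img", None), ("span", None), ("p", None)],
    [("h1", None), ("span", "price"), ("p", None)],
    [("h2", "headline"), ("p", None), ("p", None), ("p", None)],
    [("h2", None), ("p", "body"), ("p", None)],
]
composites = form_composites(annotated)
page = [("h1", "Red Gadget"), ("span", "$25"), ("p", "A fine gadget.")]
result = extract(composites, page)
print(result)
```

Flattening to tag lists keeps the example short; a tree edit distance over the full node hierarchy would track page structure more closely, at the cost of a longer implementation.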
Specification