EXTRACTING DATA CONTENT ITEMS USING TEMPLATE MATCHING
Abstract
Systems and methods for extracting data content items from a web page are provided. A template is created by labeling data content items of interest associated with a web page and generating a template Document Object Model (DOM) tree based on the labeled web page. DOM trees are also generated for additional web pages that contain data content items for which extraction may be desired. These DOM trees are compared to the template DOM tree to determine alignment there between. The aligned data content items may then be extracted from the additional web pages and indexed, as desired. Labeling the data content items of interest prior to generating a template DOM tree allows for the desired data content items to be specified and more accurately extracted from related and/or similarly structured web pages.
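The flow described above (label items on a first page, build a template DOM tree, align a second page's DOM tree against it, extract the aligned items) can be illustrated with a minimal sketch. It is not the patented implementation: it assumes well-formed markup parsed with Python's `xml.etree.ElementTree`, it identifies a node by its path of child indexes, and it treats two nodes as aligned when their tags match at the same position. The helper names `label_nodes` and `extract`, the sample pages, and the label names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def label_nodes(root, labels):
    """Build a template: map each labeled node's path to its label.

    `labels` maps the text of a data content item to a label name
    (an assumed stand-in for the "indication to label").
    A path is a tuple of child indexes leading from the root to the node.
    """
    template = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text in labels:
            template[path] = labels[text]
        for i, child in enumerate(node):
            walk(child, path + (i,))

    walk(root, ())
    return template


def extract(template_root, template_labels, page_root):
    """Walk both trees in lockstep and extract text at aligned labeled nodes.

    Assumption: nodes align when their tags match at the same position.
    """
    found = {}

    def walk(t_node, p_node, path):
        if t_node.tag != p_node.tag:
            return  # no alignment at or below this node
        if path in template_labels:
            found[template_labels[path]] = (p_node.text or "").strip()
        for i, (t_child, p_child) in enumerate(zip(t_node, p_node)):
            walk(t_child, p_child, path + (i,))

    walk(template_root, page_root, ())
    return found


# Hypothetical, similarly structured pages.
FIRST = "<html><body><h1>Blue Kettle</h1><div><span>$25</span></div></body></html>"
SECOND = "<html><body><h1>Red Toaster</h1><div><span>$40</span></div></body></html>"

first_root = ET.fromstring(FIRST)
labels = label_nodes(first_root, {"Blue Kettle": "title", "$25": "price"})
result = extract(first_root, labels, ET.fromstring(SECOND))
print(result)  # {'title': 'Red Toaster', 'price': '$40'}
```

Real web pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser and a looser alignment measure than exact tag-and-position equality.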
20 Claims
1. One or more computer storage media having computer-executable instructions embodied thereon for performing a method for extracting data content items from web pages, the method comprising:
receiving a first web page having one or more data content items associated therewith;
receiving an indication to label at least one of the data content items associated with the first web page;
generating a Document Object Model (DOM) tree associated with the first web page, the DOM tree having a node associated with each data content item;
labeling the node of the DOM tree associated with the at least one indicated data content item to generate a template DOM tree;
comparing the template DOM tree with a DOM tree associated with a second web page to determine alignment there between; and
if it is determined that a node of the DOM tree associated with the second web page aligns with the labeled node associated with the template DOM tree, extracting a data content item from the second web page that is associated with the aligned node of the DOM tree.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer system embodied on at least one computer storage media having computer-executable instructions embodied thereon for performing a method for extracting data content items from web pages, the system comprising:
a receiving component configured for receiving a plurality of web pages, each web page having at least one data content item associated therewith;
a Document Object Model (DOM) tree generating component configured for generating a DOM tree associated with one or more of the received web pages;
a labeling component configured for labeling at least one node associated with a generated DOM tree in accordance with a received labeling indication;
a comparing component configured for comparing a first DOM tree having at least one labeled node associated therewith with a second DOM tree; and
an extracting component configured for extracting at least one data content item associated with the second DOM tree in accordance with the at least one labeled node associated with the first DOM tree.
Dependent claims: 10, 11, 12, 13, 14, 15
16. A method in a computing environment for extracting data content items from a web page, at least two of the data content items having a repeated pattern, the method comprising:
receiving a first web page having a plurality of data content items associated therewith;
receiving an indication to label at least two of the plurality of data content items, wherein the at least two of the plurality of data content items have a repeated pattern;
generating a Document Object Model (DOM) tree associated with the first web page, the DOM tree having a node associated with each of the plurality of data content items;
labeling the nodes of the DOM tree to create a template DOM tree, wherein a node associated with one of the at least two data content items having a repeated pattern is labeled as a repeat node;
comparing the template DOM tree with a DOM tree associated with a second web page to determine alignment there between; and
if it is determined that a node of the DOM tree associated with the second web page aligns with the labeled node associated with the template DOM tree, extracting one or more data content items from the second web page that are associated with the aligned repeat node of the DOM tree.
Dependent claims: 17, 18, 19, 20
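The repeat-node idea in claim 16 can be sketched as follows: one node of the template stands for a repeated pattern, and every node on the second page that aligns with it yields an extracted item, even when the second page repeats the pattern a different number of times. This is an illustrative sketch only. It assumes well-formed markup, uses tag equality between sibling nodes as the alignment test, and the function name `extract_repeated` and the sample lists are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_repeated(template_list, page_list, repeat_tag):
    """Collect the text of every page node that aligns with the repeat node.

    The first `repeat_tag` child of `template_list` plays the role of the
    labeled repeat node. Assumption: a child of `page_list` aligns with it
    when the parent tags match and the child has the same tag.
    """
    repeat_node = template_list.find(repeat_tag)
    if repeat_node is None or template_list.tag != page_list.tag:
        return []  # nothing to align against
    return [
        (child.text or "").strip()
        for child in page_list
        if child.tag == repeat_node.tag
    ]


# Hypothetical pages: the template shows two repeated items, the second page three.
TEMPLATE = "<ul><li>Review A</li><li>Review B</li></ul>"
PAGE = "<ul><li>Great</li><li>Fine</li><li>Poor</li></ul>"

items = extract_repeated(ET.fromstring(TEMPLATE), ET.fromstring(PAGE), "li")
print(items)  # ['Great', 'Fine', 'Poor']
```

Labeling a single repeat node, rather than each repeated item separately, is what lets the template match pages whose lists differ in length from the first page.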
Specification