Method and apparatus for extracting structured data from HTML pages
First Claim
1. A method for extracting structured data from HTML pages, comprising the steps of:
(a) parsing the input HTML file using a standard HTML parser, thereby creating a parse tree;
(b) annotating the parse tree generated in step (a), thereby creating an annotated parse tree;
(c) creating an array of nodes from the annotated parse tree generated in step (b) and a set of constraints, thereby creating a filtered node array; and
(d) generating an instance tree by instantiating a given structure template with respect to the filtered node array generated by step (c), wherein the step of generating an instance tree further includes the step of instantiating a composite structure element with respect to a subsequence of the nodes in the filtered node array, thereby creating a composite instance, said step of instantiating a composite structure element includes the step of successively instantiating each of the children of said composite structure element with contiguous subsequences of the subsequence of the filtered node array;
wherein step (c) further comprises the steps of:
(i) examining each node of said annotated parse tree;
(ii) comparing each node to at least one constraint of said set of constraints, each said constraint setting forth a data format requirement;
(iii) accepting each node matching at least one constraint of said set of constraints into said filtered node array; and
(iv) excluding each node not matching at least one constraint of said set of constraints from said filtered node array; and
wherein the structure template comprises a plurality of structure elements arranged in a hierarchy;
each structure element in said plurality of structure elements is selected from the group consisting of an ExField structure element, a composite structure element, a repeat structure element, and a choice structure element;
said structure elements include at least one ExField structure element;
said instance tree comprises a plurality of instances arranged in a hierarchy;
each instance in said plurality of instances is selected from the group consisting of an ExField instance, a composite instance, and a repeat instance; and
said instances include at least one ExField instance.
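Sub-steps (i) through (iv) of step (c) and the composite instantiation of step (d) can be sketched as follows. This is only an illustrative reading of the claim, not the patented implementation: the names `Node`, `CONSTRAINTS`, `filter_nodes`, `ExField`, `Composite`, and `instantiate` are hypothetical, and modelling each "data format requirement" as a regular expression is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    text: str  # text carried by one node of the annotated parse tree


# Assumed form of a constraint: a name plus a regex "data format requirement".
CONSTRAINTS = {
    "price": re.compile(r"\$\d+(\.\d{2})?"),
    "title": re.compile(r"[A-Z][A-Za-z ]+"),
}


def filter_nodes(nodes):
    """Step (c): build the filtered node array."""
    filtered = []
    for node in nodes:                                    # (i) examine each node
        names = [name for name, rx in CONSTRAINTS.items()
                 if rx.fullmatch(node.text)]              # (ii) compare to constraints
        if names:
            filtered.append((node, names))                # (iii) accept on a match
        # (iv) a node matching no constraint is simply left out
    return filtered


@dataclass
class ExField:
    constraint: str  # name of the constraint this field must satisfy


@dataclass
class Composite:
    children: list   # child structure elements, in order


def instantiate(element, filtered, start):
    """Return (instance, next_index), or None if the element does not fit."""
    if isinstance(element, ExField):
        if start < len(filtered) and element.constraint in filtered[start][1]:
            node = filtered[start][0]
            return ("ExField", element.constraint, node.text), start + 1
        return None
    # Composite: each child consumes the next contiguous run of nodes.
    pos, parts = start, []
    for child in element.children:
        result = instantiate(child, filtered, pos)
        if result is None:
            return None
        instance, pos = result
        parts.append(instance)
    return ("Composite", parts), pos
```

In this sketch, a template such as `Composite([ExField("title"), ExField("price")])` applied to the filtered form of `[Node("Blue Widget"), Node("|"), Node("$9.99")]` consumes the two surviving nodes in order, because the separator node is dropped during filtering.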
Abstract
A method and apparatus for extracting structured data from HTML pages whereby an HTML file belonging to a pre-determined class of HTML files can be transformed into an instance tree (142). Other than the HTML file, there are two other inputs to the extraction procedure: a set of constraints (134), and a structure template (140). The steps in the process include: parsing the HTML file, thereby creating a parse tree (126); annotating the parse tree, thereby creating an annotated parse tree (130); creating an array of nodes from the annotated parse tree using a set of constraints (134); and generating an instance tree (142) from the array of nodes using the structure template (140). The instance tree (142) encodes, in a form that may be used by other computer programs, all the relevant information in the HTML file as prescribed by the set of constraints (134) and makes explicit the structure of this information.
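The first two steps named in the abstract (parsing, then annotating) might look like the sketch below. It uses Python's standard-library `html.parser.HTMLParser` as a stand-in for the "standard HTML parser". The excerpt does not say what an annotation contains, so recording each node's ancestor tag path is purely an assumption; `TreeNode`, `TreeBuilder`, `parse`, `annotate`, and `text_nodes` are hypothetical names.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class TreeNode:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.annotations = {}


class TreeBuilder(HTMLParser):
    """Builds a simple parse tree from the parser's tag and data events."""

    def __init__(self):
        super().__init__()
        self.root = TreeNode("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TreeNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node          # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = TreeNode("#text", self.current)
            leaf.text = data.strip()
            self.current.children.append(leaf)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def annotate(node, path=()):
    """Assumed annotation: store the slash-joined tags of the node's ancestors."""
    node.annotations["path"] = "/".join(path)
    for child in node.children:
        annotate(child, path + (node.tag,))


def text_nodes(node):
    """Yield text leaves in document order, ready for constraint filtering."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)
```

Flattening the annotated tree in document order, as `text_nodes` does, gives a sequence that the constraint filter and the structure template could then consume.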
72 Citations
18 Claims
1. A method for extracting structured data from HTML pages, comprising the steps of:
(a) parsing the input HTML file using a standard HTML parser, thereby creating a parse tree;
(b) annotating the parse tree generated in step (a), thereby creating an annotated parse tree;
(c) creating an array of nodes from the annotated parse tree generated in step (b) and a set of constraints, thereby creating a filtered node array; and
(d) generating an instance tree by instantiating a given structure template with respect to the filtered node array generated by step (c), wherein the step of generating an instance tree further includes the step of instantiating a composite structure element with respect to a subsequence of the nodes in the filtered node array, thereby creating a composite instance, said step of instantiating a composite structure element includes the step of successively instantiating each of the children of said composite structure element with contiguous subsequences of the subsequence of the filtered node array;
wherein step (c) further comprises the steps of:
(i) examining each node of said annotated parse tree;
(ii) comparing each node to at least one constraint of said set of constraints, each said constraint setting forth a data format requirement;
(iii) accepting each node matching at least one constraint of said set of constraints into said filtered node array; and
(iv) excluding each node not matching at least one constraint of said set of constraints from said filtered node array; and
wherein the structure template comprises a plurality of structure elements arranged in a hierarchy;
each structure element in said plurality of structure elements is selected from the group consisting of an ExField structure element, a composite structure element, a repeat structure element, and a choice structure element;
said structure elements include at least one ExField structure element;
said instance tree comprises a plurality of instances arranged in a hierarchy;
each instance in said plurality of instances is selected from the group consisting of an ExField instance, a composite instance, and a repeat instance; and
said instances include at least one ExField instance.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A method for extracting structured data from HTML pages, comprising the steps of:
(a) parsing the input HTML file using a standard HTML parser, thereby creating a parse tree;
(b) annotating the parse tree generated in step (a), thereby creating an annotated parse tree;
(c) creating an array of nodes from the annotated parse tree generated in step (b) and a set of constraints, thereby creating a filtered node array; and
(d) generating an instance tree by instantiating a given structure template with respect to the filtered node array generated by step (c), wherein the step of generating an instance tree further includes the steps of recursively comparing each structure element of a composite structure element in the given structure template to the nodes in said filtered node array to determine whether each structure element in the composite structure element is matched to a corresponding composite structure in said filtered node array and adding any said corresponding composite structure to the instance tree.
Dependent claims: 13, 14, 15, 16, 17, 18
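The recursive comparison in step (d) of claim 12 can be illustrated with a small recursive matcher over nested structure elements. The sketch also covers the repeat and choice element kinds listed in claim 1. It is a greedy matcher without backtracking, which is a simplification chosen here and not something the claims specify. For brevity the filtered node array is reduced to a list of constraint labels. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExField:
    label: str       # constraint label the node must carry


@dataclass
class Composite:
    children: list   # ordered child structure elements


@dataclass
class Repeat:
    child: object    # element that may occur any number of times in a row


@dataclass
class Choice:
    options: list    # alternatives, tried in order


def match(element, labels, pos):
    """Recursively match element at labels[pos:]; return (instance, next_pos) or None."""
    if isinstance(element, ExField):
        if pos < len(labels) and labels[pos] == element.label:
            return {"ExField": element.label, "index": pos}, pos + 1
        return None
    if isinstance(element, Composite):
        parts = []
        for child in element.children:
            result = match(child, labels, pos)   # recurse into each child
            if result is None:
                return None                      # one failed child fails the composite
            instance, pos = result
            parts.append(instance)
        return {"Composite": parts}, pos
    if isinstance(element, Repeat):
        items = []
        while True:
            result = match(element.child, labels, pos)
            if result is None or result[1] == pos:   # stop on failure or no progress
                break
            instance, pos = result
            items.append(instance)
        return {"Repeat": items}, pos
    if isinstance(element, Choice):
        for option in element.options:
            result = match(option, labels, pos)
            if result is not None:
                return result    # the chosen alternative's own instance is returned
        return None
    raise TypeError(f"unknown structure element: {element!r}")


# Example template: a repeated record of a title followed by a price or a phone number.
template = Repeat(Composite([ExField("title"),
                             Choice([ExField("price"), ExField("phone")])]))
labels = ["title", "price", "title", "phone"]
tree, end = match(template, labels, 0)
```

Returning the matched alternative directly from a choice element is one way to read claim 1's list of instance kinds, which names ExField, composite, and repeat instances but no choice instance. That reading is an interpretation, not something the excerpt states.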
Specification