
Method and apparatus for extracting structured data from HTML pages

  • US 7,073,122 B1
  • Filed: 09/08/2000
  • Issued: 07/04/2006
  • Est. Priority Date: 09/08/2000
  • Status: Expired due to Fees
First Claim

1. A method for extracting structured data from HTML pages, comprising the steps of:

  • (a) parsing the input HTML file using a standard HTML parser, thereby creating a parse tree;

    (b) annotating the parse tree generated in step (a), thereby creating an annotated parse tree;

    (c) creating an array of nodes from the annotated parse tree generated in step (b) and a set of constraints, thereby creating a filtered node array; and

    (d) generating an instance tree by instantiating a given structure template with respect to the filtered node array generated by step (c), wherein the step of generating an instance tree further includes the step of instantiating a composite structure element with respect to a subsequence of the nodes in the filtered node array, thereby creating a composite instance, said step of instantiating a composite structure element includes the step of successively instantiating each of the children of said composite structure element with contiguous subsequences of the subsequence of the filtered node array;

    wherein step (c) further comprises the steps of:

    (i) examining each node of said annotated parse tree;

    (ii) comparing each node to at least one constraint of said set of constraints, each said constraint setting forth a data format requirement;

    (iii) accepting each node matching at least one constraint of said set of constraints into said filtered node array; and

    (iv) excluding each node not matching at least one constraint of said set of constraints from said filtered node array; and

    wherein the structure template comprises a plurality of structure elements arranged in a hierarchy;

    each structure element in said plurality of structure elements is selected from the group consisting of an ExField structure element, a composite structure element, a repeat structure element, and a choice structure element;

    said structure elements include at least one ExField structure element;

    said instance tree comprises a plurality of instances arranged in a hierarchy;

    each instance in said plurality of instances is selected from the group consisting of an ExField instance, a composite instance, and a repeat instance; and

    said instances include at least one ExField instance.
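
For illustration only, the pipeline recited in steps (a) through (d) can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the class and function names (`Node`, `TextCollector`, `annotate`, `filter_nodes`, `instantiate`, `extract`), the two example constraints, and the sample HTML are all hypothetical. The parse tree is flattened to text nodes in document order, and matching is greedy with no backtracking, which the claim does not specify.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

@dataclass
class Node:
    tag: str                                   # enclosing tag of the text node
    text: str
    labels: set = field(default_factory=set)   # annotations added in step (b)

class TextCollector(HTMLParser):
    """Step (a), simplified: collect text nodes in document order."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:     # pop up to and including `tag`
                pass
    def handle_data(self, data):
        if data.strip():
            parent = self.stack[-1] if self.stack else ""
            self.nodes.append(Node(parent, data.strip()))

def annotate(nodes):
    """Step (b): attach labels (hypothetical rules chosen for this example)."""
    for n in nodes:
        if n.tag == "b":
            n.labels.add("name")
        if re.fullmatch(r"\$\d+\.\d{2}", n.text):
            n.labels.add("price")
    return nodes

# Step (c): each constraint is a predicate; a node is kept if it matches any one.
CONSTRAINTS = [lambda n: "name" in n.labels, lambda n: "price" in n.labels]

def filter_nodes(nodes, constraints=CONSTRAINTS):
    return [n for n in nodes if any(c(n) for c in constraints)]

# Structure template elements (hypothetical representations).
@dataclass
class ExField:
    name: str
    label: str
@dataclass
class Composite:
    children: list
@dataclass
class Repeat:
    child: object
@dataclass
class Choice:
    options: list

def instantiate(elem, nodes, i=0):
    """Step (d): return (instance, next_index), or None if elem does not match at i."""
    if isinstance(elem, ExField):
        if i < len(nodes) and elem.label in nodes[i].labels:
            return (elem.name, nodes[i].text), i + 1
        return None
    if isinstance(elem, Composite):
        kids, j = [], i
        for child in elem.children:            # children consume contiguous subsequences
            r = instantiate(child, nodes, j)
            if r is None:
                return None
            kids.append(r[0])
            j = r[1]
        return ("composite", kids), j
    if isinstance(elem, Repeat):
        items = []
        while True:
            r = instantiate(elem.child, nodes, i)
            if r is None or r[1] == i:         # stop on failure or no progress
                break
            items.append(r[0])
            i = r[1]
        return ("repeat", items), i
    if isinstance(elem, Choice):               # yields the chosen option's instance
        for opt in elem.options:
            r = instantiate(opt, nodes, i)
            if r is not None:
                return r
        return None
    raise TypeError(f"unknown structure element: {elem!r}")

def extract(html, template):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    result = instantiate(template, filter_nodes(annotate(parser.nodes)))
    return result[0] if result else None
```

In this sketch, a template such as `Repeat(Composite([ExField("name", "name"), ExField("price", "price")]))` applied to a product list yields a repeat instance containing one composite instance per product. A `Choice` element produces no instance of its own and returns the instance of whichever option matched, which is consistent with the claim listing only ExField, composite, and repeat instances.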
