Systems and methods for extracting information from structured documents

  • US 8,090,678 B1
  • Filed: 07/23/2003
  • Issued: 01/03/2012
  • Est. Priority Date: 07/23/2003
  • Status: Active Grant
First Claim

1. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:

  • presenting a list of web sites to a user;

    receiving one or more of the web sites selected from the user for data extraction;

    collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at a computer comprising a processor;

    forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein:

        each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,

        the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,

        an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,

        the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

        the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;

    displaying a list of clusters;

    displaying the centroid document of a particular cluster selected from the list of clusters;

    marking a data element on the centroid document of the particular cluster;

    identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and

    providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
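
The clustering and extraction steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not name an alignment algorithm, so `difflib.SequenceMatcher` stands in for one, and the tag-path representation of a "structured location", the single-pass assignment of documents to centroids, the 0.8 threshold, and every function and class name here are assumptions made for illustration. Similarity is computed from tag paths only, never from the text stored in the fields, and a field that sits at a different position than in the centroid fails to align and so lowers the score.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.fields.append(("/".join(self.stack), data.strip()))


def fields(html):
    """Return the document's data fields as (structured location, content) pairs."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


def aligner(centroid_html, doc_html):
    """Align the two documents on field locations only, ignoring field content."""
    a = [path for path, _ in fields(centroid_html)]
    b = [path for path, _ in fields(doc_html)]
    return SequenceMatcher(None, a, b, autojunk=False)


def similarity(centroid_html, doc_html):
    """Structural similarity in [0, 1]; unaligned or displaced fields reduce it."""
    return aligner(centroid_html, doc_html).ratio()


def cluster(docs, threshold=0.8):
    """Join each document to its most similar centroid, or make it a new centroid."""
    clusters = []
    for doc in docs:
        scored = [(similarity(c["centroid"], doc), c) for c in clusters]
        best_score, best = max(scored, key=lambda s: s[0], default=(0.0, None))
        if best is not None and best_score >= threshold:
            best["members"].append(doc)
        else:
            clusters.append({"centroid": doc, "members": []})
    return clusters


def extract(centroid_html, marked_index, doc_html):
    """Return the content of the field aligned with the marked centroid field."""
    doc_fields = fields(doc_html)
    for block in aligner(centroid_html, doc_html).get_matching_blocks():
        if block.a <= marked_index < block.a + block.size:
            return doc_fields[block.b + marked_index - block.a][1]
    return None  # no field in this document aligns with the marked one
```

In this sketch, marking a data element on the centroid means choosing its index in `fields(centroid)`; calling `extract` for each member of the cluster then yields the values that a user interface would display.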
