Method and system for automatically extracting data from web sites

  • US 8,843,490 B2
  • Filed: 07/26/2011
  • Issued: 09/23/2014
  • Est. Priority Date: 07/15/2005
  • Status: Active Grant
First Claim

1. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:

  • analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;

    identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;

    clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and

    wherein the clustering text segments comprises:

    finding page clusters;

    determining a set of text segments for each of the found page clusters; and

    clustering text segments of the set of text segments;

    identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;

    extracting the at least some of the identified semi-structured data; and

    transforming the extracted semi-structured data into a relational structured form.
