Method and system for automatically extracting data from web sites
First Claim
1. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:
analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;
identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;
clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and
wherein the clustering text segments comprises:
finding page clusters;
determining a set of text segments for each of the found page clusters; and
clustering text segments of the set of text segments;
identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;
extracting the at least some of the identified semi-structured data; and
transforming the extracted semi-structured data into a relational structured form.
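The final steps of the claim (extracting the identified data and transforming it into a relational structured form) can be illustrated with a minimal sketch. It assumes that page clusters and text-segment clusters have already been found; the function names `to_rows` and `load_table`, the segment ids, and the column labels `title` and `price` are hypothetical and do not come from the patent. Each page cluster becomes one table, each page one row, and each text-segment cluster one column.

```python
import sqlite3

def to_rows(page_cluster, segments_by_page, segment_cluster_of):
    """One row per page in the cluster, one column per text-segment cluster."""
    columns = sorted({segment_cluster_of[seg_id]
                      for page in page_cluster
                      for seg_id in segments_by_page.get(page, {})})
    rows = []
    for page in page_cluster:
        values = dict.fromkeys(columns)  # missing segments stay None (NULL)
        for seg_id, text in segments_by_page.get(page, {}).items():
            values[segment_cluster_of[seg_id]] = text
        rows.append((page, *(values[c] for c in columns)))
    return columns, rows

def load_table(conn, table, columns, rows):
    """Create a table for the page cluster and insert its rows.
    Identifiers are only quoted, not sanitized; this is a sketch."""
    col_defs = ", ".join(['"page" TEXT PRIMARY KEY'] +
                         [f'"{c}" TEXT' for c in columns])
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    marks = ", ".join("?" * (len(columns) + 1))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)

# Hypothetical input: two product pages whose segments were clustered
# into a "title" group and a "price" group.
page_cluster = ["/p/1", "/p/2"]
segments_by_page = {
    "/p/1": {"s1": "Blue Widget", "s2": "$10"},
    "/p/2": {"s3": "Red Widget", "s4": "$12"},
}
segment_cluster_of = {"s1": "title", "s3": "title",
                      "s2": "price", "s4": "price"}

columns, rows = to_rows(page_cluster, segments_by_page, segment_cluster_of)
conn = sqlite3.connect(":memory:")
load_table(conn, "products", columns, rows)
```

Mapping segment clusters to columns is one plausible reading of "relational structured form"; the claim itself does not prescribe a schema.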
7 Assignments
0 Petitions
Abstract
In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
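The idea of experts emitting confidence-weighted hints that drive clustering can be sketched as follows. This is a simplified illustration, not the probabilistic model of the patent: it assumes hints are independent, sums their log-odds per pair of items, and greedily merges every pair whose net evidence is positive. The `Hint` type, the function names, and the example URLs are all hypothetical.

```python
import math
from collections import defaultdict
from typing import NamedTuple

class Hint(NamedTuple):
    a: str             # first item (page URL or text-segment id)
    b: str             # second item
    together: bool     # True if the expert believes a and b share a cluster
    confidence: float  # expert's confidence, strictly between 0 and 1

def pair_scores(hints):
    """Sum log-odds evidence per unordered pair (assumes independent hints)."""
    scores = defaultdict(float)
    for h in hints:
        log_odds = math.log(h.confidence / (1.0 - h.confidence))
        key = tuple(sorted((h.a, h.b)))
        scores[key] += log_odds if h.together else -log_odds
    return scores

def cluster(items, hints):
    """Greedy single-link clustering: merge pairs with positive net evidence.
    Every item named in a hint must also appear in `items`."""
    parent = {item: item for item in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores(hints).items():
        if score > 0:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for item in items:
        groups[find(item)].add(item)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical hints from two experts (e.g. a URL-pattern expert and a
# page-layout expert) about three pages.
pages = ["/p/1", "/p/2", "/about"]
hints = [
    Hint("/p/1", "/p/2", True, 0.9),
    Hint("/p/1", "/p/2", True, 0.7),
    Hint("/p/1", "/about", False, 0.8),
    Hint("/p/2", "/about", True, 0.6),
    Hint("/p/2", "/about", False, 0.9),
]
clusters = cluster(pages, hints)
```

Here the weak "together" hint for `/p/2` and `/about` is outweighed by the stronger "apart" hint, so the two product pages end up in one cluster and `/about` in its own. A full implementation would instead score whole clusterings under a model of how hints are generated, as the abstract describes.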
Citations
20 Claims
1. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:
analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;
identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;
clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and
wherein the clustering text segments comprises:
finding page clusters;
determining a set of text segments for each of the found page clusters; and
clustering text segments of the set of text segments;
identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;
extracting the at least some of the identified semi-structured data; and
transforming the extracted semi-structured data into a relational structured form.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for automatically identifying semi-structured data from a semi-structured web site by executing instructions stored in a computer-readable memory by a computer processor, the system comprising:
means for analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;
means for identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;
means for clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and
wherein the clustering of text segments comprises:
finding page clusters;
determining a set of text segments for each of the found page clusters; and
clustering text segments of the set of text segments;
means for identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;
means for extracting the at least some of the identified semi-structured data; and
means for transforming the extracted semi-structured data into a relational structured form.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification