METHOD AND SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM WEB SITES
Abstract
In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
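As an illustration only, the expert-and-hint idea in the abstract can be sketched in a few lines of Python. Everything named here is hypothetical and not taken from the patent: the in-memory `SITE` dictionary, the two toy experts, and the `min_votes` threshold. The abstract describes a probabilistic model of hint generation for choosing a good clustering; this sketch substitutes a much simpler rule (count how many experts place two pages together, then merge pairs with enough votes), so it shows the data flow rather than the claimed method.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical in-memory site: page URL -> (outgoing links, text segments).
SITE = {
    "/list": (["/item/1", "/item/2", "/about"], ["Products"]),
    "/item/1": (["/list"], ["Name: Widget", "Price: $5"]),
    "/item/2": (["/list"], ["Name: Gadget", "Price: $7"]),
    "/about": (["/list"], ["We sell things."]),
}

def url_pattern_expert(site):
    """Hint: pages whose URLs match after digits are masked belong together."""
    groups = defaultdict(set)
    for url in site:
        key = "".join("#" if c.isdigit() else c for c in url)
        groups[key].add(url)
    return [frozenset(g) for g in groups.values() if len(g) > 1]

def label_expert(site):
    """Hint: pages sharing the same set of 'Label:' prefixes belong together."""
    groups = defaultdict(set)
    for url, (_links, segments) in site.items():
        labels = frozenset(s.split(":")[0] for s in segments if ":" in s)
        if labels:
            groups[labels].add(url)
    return [frozenset(g) for g in groups.values() if len(g) > 1]

def cluster_pages(site, experts, min_votes=1):
    """Merge pages that at least `min_votes` hints place in the same group."""
    votes = defaultdict(int)
    for expert in experts:
        for hint in expert(site):
            for a, b in combinations(sorted(hint), 2):
                votes[(a, b)] += 1

    # Union-find over pages; each page starts in its own cluster.
    parent = {url: url for url in site}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for (a, b), count in votes.items():
        if count >= min_votes:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for url in site:
        clusters[find(url)].add(url)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])

if __name__ == "__main__":
    for cluster in cluster_pages(SITE, [url_pattern_expert, label_expert]):
        print(cluster)
```

Each expert sees the whole site and returns groups of pages it believes share a structure. Keeping the experts independent of the clustering step is what lets heterogeneous experts be added without changing how their hints are combined.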
41 Claims
1. A method for automatically extracting and structuring the data from a semi-structured web site, the method comprising:
developing a set of experts;
analyzing the links and pages on the website by means of the experts;
identifying predetermined types of generic structures by means of the experts;
clustering pages and text segments within the pages based on the identified structures; and
identifying, based on the clustering, the semi-structured data that can be extracted.
Dependent claims: 2-26
27. A method for automatically extracting and structuring the data from a semi-structured web site, the method comprising:
spidering the web site to obtain a subject set of HTML pages on the web site, including the links on each page;
discovering low-level structure of the pages, text segments, and links on the web site by means of a set of heterogeneous experts;
clustering the pages and text segments to determine a consistent global structure; and
determining the relational form of the data from the page and text segment clusters.
Dependent claims: 28-33
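The last step of claim 27, going from clusters to a relational form, can be pictured with a small sketch. It rests on assumptions that are not in the claim: pages are already clustered, each text segment has the form `Label: value`, and segments with the same label are treated as one text segment cluster. The function name `to_relation` and the sample data are hypothetical.

```python
def to_relation(page_cluster, pages):
    """Turn one page cluster into (columns, rows).

    page_cluster: list of page URLs assumed to share a structure.
    pages: mapping of URL -> list of 'Label: value' text segments.
    Each page becomes one row; each distinct label becomes one column.
    """
    columns = []
    records = []
    for url in page_cluster:
        record = {}
        for segment in pages[url]:
            label, sep, value = segment.partition(":")
            if not sep:
                continue  # segment has no label, so it maps to no column
            label = label.strip()
            record[label] = value.strip()
            if label not in columns:
                columns.append(label)
        records.append(record)
    # Missing values become None so every row has the same width.
    rows = [tuple(rec.get(col) for col in columns) for rec in records]
    return columns, rows

if __name__ == "__main__":
    sample = {
        "/item/1": ["Name: Widget", "Price: $5"],
        "/item/2": ["Name: Gadget", "Price: $7"],
    }
    print(to_relation(["/item/1", "/item/2"], sample))
```

Treating a page cluster as a table and a segment cluster as a column is one plausible reading of "relational form"; the claim itself does not fix a particular schema.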
34. A computer program product for use with a computer system for automatically extracting and structuring the data from a semi-structured web site, the computer program product comprising:
a computer-readable medium;
means, provided on the computer-readable medium, for developing a set of experts;
means, provided on the computer-readable medium, for analyzing the links and pages on the web site by means of the experts;
means, provided on the computer-readable medium, for identifying predetermined types of generic structures by means of the experts;
means, provided on the computer-readable medium, for clustering pages and text segments within the pages based on the identified structures; and
means, provided on the computer-readable medium, for identifying, based on the clustering, the semi-structured data that can be extracted.
Dependent claims: 35-36
37. A system for automatically extracting and structuring the data from a semi-structured web site, the system comprising:
means for developing a set of experts;
means for analyzing the links and pages on the website by means of the experts;
means for identifying predetermined types of generic structures by means of the experts;
means for clustering pages and text segments within the pages based on the identified structures; and
means for identifying, based on the clustering, the semi-structured data that can be extracted.
Dependent claims: 38-41
Specification