METHOD AND SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM WEB SITES
Abstract
In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
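The expert-and-hint pipeline described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two toy experts (`url_pattern_expert`, `title_prefix_expert`), the representation of a hint as a set of page URLs that an expert believes share a template, and the `min_votes` threshold. The clustering step is a simple vote-counting stand-in, not the probabilistic model of hint generation that the abstract refers to.

```python
import re
from collections import defaultdict
from itertools import combinations


def url_pattern_expert(pages):
    """Hypothetical expert: pages whose URLs match once digits are masked."""
    groups = defaultdict(set)
    for url in pages:
        groups[re.sub(r"\d+", "#", url)].add(url)
    return [g for g in groups.values() if len(g) > 1]


def title_prefix_expert(pages):
    """Hypothetical expert: pages whose <title> begins with the same word."""
    groups = defaultdict(set)
    for url, html in pages.items():
        match = re.search(r"<title>\s*(\w+)", html)
        if match:
            groups[match.group(1).lower()].add(url)
    return [g for g in groups.values() if len(g) > 1]


def cluster_pages(pages, experts, min_votes=2):
    """Merge pages that at least `min_votes` hints place together.

    Vote counting plus union-find is a simplification; it only shows how
    hints from heterogeneous experts can be combined into one clustering.
    """
    votes = defaultdict(int)
    for expert in experts:
        for hint in expert(pages):
            for a, b in combinations(sorted(hint), 2):
                votes[(a, b)] += 1

    parent = {url: url for url in pages}

    def find(url):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    for (a, b), count in votes.items():
        if count >= min_votes:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for url in pages:
        clusters[find(url)].add(url)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Requiring agreement between experts before merging two pages is one way to keep a single noisy expert from dominating the result; a probabilistic model would instead weigh each hint by how likely that expert is to emit it under a candidate clustering.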
70 Claims
-
1-41. (canceled)
-
42. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:
executing instructions stored in memory by a processor for:
developing a set of experts;
analyzing links and pages on the semi-structured web site by means of the set of experts;
identifying predetermined types of generic structures by means of the set of experts;
clustering pages and text segments within the pages based on the identified structures; and
identifying, based on the clustering, the semi-structured data that can be extracted.
Dependent claims: 43-55.
-
56. A method for determining a relational form of data from a semi-structured web site, the method comprising:
executing instructions stored in memory by a processor for:
spidering the semi-structured web site to obtain a subject set of pages, including links on each of the subject set of pages;
discovering low-level structures of the pages, text segments, and the links on the semi-structured web site by means of a set of experts, the set of experts being heterogeneous;
clustering the pages and text segments to determine a consistent global structure to produce page and text segment clusters; and
determining a relational form of the data from the page and text segment clusters.
Dependent claims: 57-62.
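As an illustration of the final step of claim 56, the sketch below turns one page cluster and its labelled text segments into a table. The function name `to_relation`, the `(page_url, segment_cluster_label, text)` tuple layout, and the rule "one row per page, one column per text-segment cluster" are assumptions made for this example; the claim itself does not prescribe a particular representation.

```python
def to_relation(page_cluster, segments):
    """Build (columns, rows) from a page cluster and clustered text segments.

    page_cluster: set of page URLs assigned to the same page cluster.
    segments: iterable of (page_url, segment_cluster_label, text) tuples.
    Segments from pages outside the cluster are ignored; a page with no
    segment for a given column gets None in that position.
    """
    segments = list(segments)
    columns = sorted({label for url, label, _ in segments if url in page_cluster})
    rows = {url: dict.fromkeys(columns) for url in sorted(page_cluster)}
    for url, label, text in segments:
        if url in rows:
            rows[url][label] = text
    return columns, [tuple(rows[url][c] for c in columns) for url in sorted(rows)]
```

Each page cluster then corresponds to one relation, and each text-segment cluster to one attribute of that relation.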
-
63. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for automatically identifying data from a semi-structured web site, the method comprising:
developing experts;
analyzing links and pages on the web site by means of the experts;
identifying predetermined types of generic structures by means of the experts;
clustering pages and text segments within the pages based on the identified structures; and
identifying, based on the clustering, the data that can be extracted.
Dependent claims: 64, 65.
-
66. A system for automatically identifying data from a semi-structured web site, the system comprising:
a processor for executing instructions for:
developing experts;
analyzing the links and pages on the web site by means of the experts;
identifying predetermined types of generic structures by means of the experts;
clustering pages and text segments within the pages based on the identified structures; and
identifying, based on the clustering, the data that can be extracted.
Dependent claims: 67-70.
Specification