Method and system for automatically extracting data from web sites

US 8,843,490 B2
Filed: 07/26/2011
Issued: 09/23/2014
Est. Priority Date: 07/15/2005
Status: Active Grant

Abstract

In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
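The hint-based scoring described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Hint` record, the two toy experts (`url_expert`, `depth_expert`), their fixed confidence values, and the independence assumption behind `log_likelihood` do not come from the patent text. The sketch only shows the general idea of heterogeneous experts emitting pairwise hints with confidences, and of a probabilistic score being used to compare candidate clusterings.

```python
import math
import re
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hint:
    """Hypothetical pairwise hint: should items a and b share a cluster?"""
    a: str
    b: str
    together: bool
    confidence: float  # assumed to lie strictly between 0 and 1


def url_expert(urls):
    """Toy expert: URLs with the same shape (digits masked) likely share a template."""
    def shape(u):
        return re.sub(r"\d+", "N", u)
    return [Hint(a, b, shape(a) == shape(b), 0.8) for a, b in combinations(urls, 2)]


def depth_expert(urls):
    """Toy expert: compares path depth, treated here as a weaker signal."""
    def depth(u):
        return u.strip("/").count("/")
    return [Hint(a, b, depth(a) == depth(b), 0.6) for a, b in combinations(urls, 2)]


def log_likelihood(clustering, hints):
    """Score a clustering, assuming each hint is generated independently:
    a hint agrees with the clustering with probability `confidence`."""
    label = {item: i for i, cluster in enumerate(clustering) for item in cluster}
    total = 0.0
    for h in hints:
        same = label[h.a] == label[h.b]
        p = h.confidence if same == h.together else 1.0 - h.confidence
        total += math.log(p)
    return total


urls = ["/product/1", "/product/22", "/about", "/contact"]
hints = url_expert(urls) + depth_expert(urls)

# Two alternative clusterings of the same pages.
good = [{"/product/1", "/product/22"}, {"/about"}, {"/contact"}]
bad = [{"/product/1", "/about"}, {"/product/22", "/contact"}]

good_score = log_likelihood(good, hints)
bad_score = log_likelihood(bad, hints)
print(good_score > bad_score)
```

A clustering that contradicts many confident hints receives a low score, so alternative clusterings can be ranked without labeled training data.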

20 Claims

1. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:
- analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;
  
  identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;
  
  clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and
  
  wherein the clustering of text segments comprises:
  
  finding page clusters;
  
  determining a set of text segments for each of the found page clusters; and
  
  clustering text segments of the set of text segments;
  
  identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;
  
  extracting the at least some of the identified semi-structured data; and
  
  transforming the extracted semi-structured data into a relational structured form.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, further comprising evaluating a probability of a clustering based on the hints to determine a quality of the clustering.
  - 3. The method of claim 1, wherein the clustering of pages and text segments provides at least two alternative clusterings.
  - 4. The method of claim 3, further comprising employing probabilistic models to rate the alternative clusterings.
  - 5. The method of claim 1, further comprising employing a generative probabilistic model to enable assignment of probabilities to the hints in view of a clustering.
  - 6. The method of claim 5, wherein all hints are assigned the probabilities.
  - 7. The method of claim 6, wherein probabilities of page hints are determined from page clusters.
  - 8. The method of claim 1, further comprising adding to the hints a binary hint that indicates that a particular pair of items are in the same cluster.
  - 9. The method of claim 8, further comprising extending a constraint language for constraint clustering, wherein constraints for the constraint clustering are defined in a form of must-link or cannot-link pairs.
  - 10. The method of claim 9, further comprising extending the constraint language so that the constraints are assigned confidence scores.
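Claims 8 to 10 describe must-link and cannot-link pair constraints that carry confidence scores. One way such soft constraints could drive a clustering is sketched below. This is not the claimed method: the log-odds weighting and the greedy merge loop are a swapped-in, correlation-clustering-style heuristic chosen for illustration, and the function names and the tuple layout `(a, b, kind, confidence)` are hypothetical.

```python
import math
from collections import defaultdict


def log_odds(conf):
    """Turn a confidence in (0, 1) into a signed-weight magnitude."""
    return math.log(conf / (1.0 - conf))


def pair_weights(constraints):
    """constraints: iterable of (a, b, kind, confidence),
    where kind is "must" (must-link) or "cannot" (cannot-link)."""
    weights = defaultdict(float)
    for a, b, kind, conf in constraints:
        sign = 1.0 if kind == "must" else -1.0
        weights[frozenset((a, b))] += sign * log_odds(conf)
    return weights


def greedy_cluster(items, constraints):
    """Repeatedly merge the two clusters with the largest positive net weight;
    stop when no merge is supported by the constraints."""
    weights = pair_weights(constraints)
    clusters = [frozenset([item]) for item in items]
    while True:
        best, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = sum(
                    weights.get(frozenset((x, y)), 0.0)
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if score > best:
                    best, best_pair = score, (i, j)
        if best_pair is None:
            return set(clusters)
        i, j = best_pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


constraints = [
    ("a", "b", "must", 0.9),
    ("b", "c", "must", 0.6),    # weak must-link
    ("a", "c", "cannot", 0.9),  # strong cannot-link outweighs it
    ("c", "d", "must", 0.8),
]
result = greedy_cluster(["a", "b", "c", "d"], constraints)
print(sorted(sorted(c) for c in result))
```

Because the constraints are weighted rather than absolute, a weak must-link can be overridden by a stronger cannot-link, which a hard-constraint formulation would not allow.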

11. A system for automatically identifying semi-structured data from a semi-structured web site by executing instructions stored in a computer-readable memory by a computer processor, the system comprising:
- means for analyzing links and pages on the semi-structured web site using a set of heterogeneous experts, each of the experts focusing on a respective type of structure included in the semi-structured web site;
  
  means for identifying, by the set of experts, similarities and dissimilarities between the analyzed links and pages;
  
  means for clustering pages and text segments based on the similarities and dissimilarities identified by at least two experts in the set of heterogeneous experts, wherein each of the at least two experts produces hints indicating whether two items should be together in a cluster, the hints containing respective levels of confidence; and
  
  wherein the clustering of text segments comprises:
  
  finding page clusters;
  
  determining a set of text segments for each of the found page clusters; and
  
  clustering text segments of the set of text segments;
  
  means for identifying, based on the clustering of pages and text segments, at least some of the semi-structured data to be extracted from the semi-structured web site;
  
  means for extracting the at least some of the identified semi-structured data; and
  
  means for transforming the extracted semi-structured data into a relational structured form.
- Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The system of claim 11, further comprising means for evaluating a probability of a clustering based on the hints to determine a quality of the clustering.
  - 13. The system of claim 11, wherein the clustering means provides at least two alternative clusterings.
  - 14. The system of claim 13, further comprising means for employing probabilistic models to rate the alternative clusterings.
  - 15. The system of claim 11, further comprising means for employing a generative probabilistic model to enable assignment of probabilities to the hints in view of a clustering.
  - 16. The system of claim 15, wherein all hints are assigned the probabilities.
  - 17. The system of claim 16, wherein probabilities of page hints are determined from page clusters.
  - 18. The system of claim 11, further comprising means for adding to the hints a binary hint that indicates that a particular pair of items are in the same cluster.
  - 19. The system of claim 18, further comprising means for extending a constraint language for constraint clustering, wherein constraints for the constraint clustering are defined in a form of must-link or cannot-link pairs.
  - 20. The system of claim 19, further comprising means for extending the constraint language so that the constraints are assigned confidence scores.
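The final steps of the independent claims (determining text segments per page cluster, clustering those segments, and transforming the result into a relational form) can be pictured with a small sketch. It rests on strong simplifying assumptions that are not in the patent text: pages are taken to be already grouped into one page cluster, each text segment carries a hypothetical `dom_path` key, and segments are grouped by that key alone rather than by expert hints. Segment groups whose text is identical on every page are treated as template text and dropped.

```python
from collections import defaultdict


def to_relation(page_cluster):
    """page_cluster: dict mapping page_id -> list of (dom_path, text) segments.
    Returns (column_names, rows), with one row per page."""
    # Group text segments across pages; dom_path stands in for a segment cluster.
    columns = defaultdict(dict)  # dom_path -> {page_id: text}
    for page_id, segments in page_cluster.items():
        for dom_path, text in segments:
            columns[dom_path].setdefault(page_id, text)

    # Keep only segment groups whose text varies between pages (likely data).
    data_paths = sorted(
        path for path, cells in columns.items() if len(set(cells.values())) > 1
    )

    rows = [
        {"page": page_id, **{path: columns[path].get(page_id) for path in data_paths}}
        for page_id in sorted(page_cluster)
    ]
    return data_paths, rows


product_pages = {
    "/product/1": [("h1", "Acme Store"), ("div.name", "Widget"), ("span.price", "$5")],
    "/product/2": [("h1", "Acme Store"), ("div.name", "Gadget"), ("span.price", "$7")],
}
column_names, table = to_relation(product_pages)
for row in table:
    print(row)
```

Each page in the cluster becomes one row and each varying segment group becomes one column, which is the shape a relational table expects.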

Current Assignee: Import.io Corporation (Import-io Corp.)
Original Assignee: Connotate, Inc. (Import-io Corp.)
Inventors: Gazen, Bora C.; Minton, Steven N.
Primary Examiner(s): Hoang, Son T
Application Number: US13/191,369
Publication Number: US 20110282877A1
Time in Patent Office: 1,155 Days
Field of Search: 707/737, 707/755
US Class Current: 707/737
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
G06F 16/95 Retrieval from the web
