METHOD AND SYSTEM FOR AUTOMATICALLY EXTRACTING DATA FROM WEB SITES

US 20110282877A1
Filed: 07/26/2011
Published: 11/17/2011
Est. Priority Date: 07/15/2005
Status: Active Grant

Abstract

In accordance with an embodiment, data may be automatically extracted from semi-structured web sites. Unsupervised learning may be used to analyze web sites and discover their structure. One method utilizes a set of heterogeneous “experts,” each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as “hints.” Based on these hints, the system may cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering, a probabilistic model of the hint-generation process may be used.
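
As an illustration of the expert-and-hint idea in the abstract, the following is a minimal sketch, not the patented implementation. The `Hint` record and the two experts (`url_pattern_expert`, `path_depth_expert`) are hypothetical names chosen for this example; the source does not say which experts are used or how hints are stored. Each expert looks at one kind of generic structure and emits pairwise hints that two pages belong together.

```python
from dataclasses import dataclass
from itertools import combinations
from urllib.parse import urlparse
import re


@dataclass(frozen=True)
class Hint:
    """A pairwise hint: the named expert believes a and b share a cluster."""
    a: str
    b: str
    expert: str


def url_pattern_expert(urls):
    """Hypothetical expert: URL paths that match once digits are masked."""
    def shape(url):
        return re.sub(r"\d+", "#", urlparse(url).path)
    return [Hint(a, b, "url_pattern")
            for a, b in combinations(sorted(urls), 2)
            if shape(a) == shape(b)]


def path_depth_expert(urls):
    """Hypothetical expert: pages whose URL paths have the same depth."""
    def depth(url):
        return len([part for part in urlparse(url).path.split("/") if part])
    return [Hint(a, b, "path_depth")
            for a, b in combinations(sorted(urls), 2)
            if depth(a) == depth(b)]


def collect_hints(urls, experts):
    """Run every expert independently and pool the hints they emit."""
    hints = []
    for expert in experts:
        hints.extend(expert(urls))
    return hints


urls = [
    "http://example.com/product/12",
    "http://example.com/product/97",
    "http://example.com/about",
]
hints = collect_hints(urls, [url_pattern_expert, path_depth_expert])
```

Here both experts agree that the two product pages belong together and neither links them to the "about" page; a clustering step would then weigh such pooled hints.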

70 Claims

1-41. (canceled)

42. A method for automatically identifying semi-structured data from a semi-structured web site, the method comprising:
- executing instructions stored in memory by a processor for:
  
  developing a set of experts;
  
  analyzing links and pages on the semi-structured web site by means of the set of experts;
  
  identifying predetermined types of generic structures by means of the set of experts;
  
  clustering pages and text segments within the pages based on the identified structures; and
  
  identifying, based on the clustering, the semi-structured data that can be extracted.
- Dependent claims (43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55):
  - 43. The method of claim 42 further including executing instructions for extracting and structuring the semi-structured data on the semi-structured web site to enable the semi-structured data to be transformed into a relational form.
  - 44. The method of claim 42 wherein the set of experts produce hints indicating that two items should be in a same cluster.
  - 45. The method of claim 44 further including executing instructions for evaluating a probability of a clustering based on the hints to determine a quality of the clustering.
  - 46. The method of claim 44 wherein the clustering of pages provides at least two alternative clusterings.
  - 47. The method of claim 46 further including executing instructions for employing probabilistic models to rate the alternative clusterings.
  - 48. The method of claim 44 further including executing instructions for employing a generative probabilistic model to enable assignment of probabilities to the hints in view of a clustering.
  - 49. The method of claim 48 wherein both page hints and token hints are assigned the probabilities.
  - 50. The method of claim 49 wherein the probabilities of page hints are determined from page clusters.
  - 51. The method of claim 49 wherein the probabilities of token hints are determined from token clusters.
  - 52. The method of claim 44 wherein an expert identifies predetermined types of generic structures by adding to the hints a binary hint that indicates that two samples are in the same cluster.
  - 53. The method of claim 52 further including executing instructions for extending a constraint language for constraint clustering, wherein constraints for the constraint clustering are defined in a form of must-link or cannot-link pairs.
  - 54. The method of claim 53 further including executing instructions for extending the constraint language so that the constraints are assigned confidence scores, thereby permitting an expert in the set of experts to output hints with varying levels of confidence.
  - 55. The method of claim 42 further including executing instructions for, after the clustering of pages, performing the clustering of text segments which comprises:
    - finding page clusters;
      
      determining a set of text segments for each of the page clusters; and
      
      clustering text segments of the set of text segments.
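
Claims 45 to 47, 53 and 54 describe rating alternative clusterings against hints that take the form of must-link or cannot-link pairs with confidence scores. The sketch below shows one way such a rating could work. The scoring rule is an assumption made for illustration (each hint is treated as independently correct with probability equal to its confidence); the claims do not specify the probabilistic model, and the tuple layout and the function name `log_score` are hypothetical.

```python
import math


def log_score(clustering, hints):
    """Log-probability of the hints given a clustering, under a toy model.

    clustering: dict mapping item -> cluster id.
    hints: list of (a, b, kind, confidence), kind in {"must", "cannot"},
           confidence strictly between 0 and 1.
    A hint that agrees with the clustering contributes log(confidence);
    a hint that disagrees contributes log(1 - confidence).
    """
    total = 0.0
    for a, b, kind, conf in hints:
        same = clustering[a] == clustering[b]
        agrees = same if kind == "must" else not same
        total += math.log(conf if agrees else 1.0 - conf)
    return total


hints = [
    ("p1", "p2", "must", 0.9),
    ("p2", "p3", "must", 0.6),
    ("p1", "p3", "cannot", 0.8),
]
alternatives = {
    "all_together": {"p1": 0, "p2": 0, "p3": 0},
    "p3_apart": {"p1": 0, "p2": 0, "p3": 1},
}
# Pick the alternative clustering that makes the observed hints most likely.
best = max(alternatives, key=lambda name: log_score(alternatives[name], hints))
```

With these numbers the hints have probability 0.9 × 0.6 × 0.2 = 0.108 under `all_together` and 0.9 × 0.4 × 0.8 = 0.288 under `p3_apart`, so the second clustering is rated higher.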

56. A method for determining a relational form of data from a semi-structured web site, the method comprising:
- executing instructions stored in memory by a processor for:
  
  spidering the semi-structured web site to obtain a subject set of pages, including links on each of the subject set of pages;
  
  discovering low-level structures of the pages, text segments, and the links on the semi-structured web site by means of a set of experts, the set of experts being heterogeneous;
  
  clustering the pages and text segments to determine a consistent global structure to produce page and text segment clusters; and
  
  determining a relational form of the data from the page and text segment clusters.
- Dependent claims (57, 58, 59, 60, 61, 62):
  - 57. The method of claim 56 wherein each expert in the set of experts focuses on a particular type of structure independently from other experts in the set of experts.
  - 58. The method of claim 57 wherein each expert in the set of experts analyzes the pages and the links with respect to its assigned type of structure, and outputs hints to indicate the similarities and dissimilarities between items of structure.
  - 59. The method of claim 58 further including executing instructions for:
    - dividing the pages into individual components based on the hints, the individual components being tokens, the pages and tokens being clustered producing page clusters and token clusters;
      
      providing a relational form of the data by producing tables, each page cluster having tables and each column in each table represented by one token cluster.
  - 60. The method of claim 59 wherein the hints indicate that two items should be in the same cluster.
  - 61. The method of claim 60 wherein the hints describe local structural similarities between pairs of at least one of pages and tokens.
  - 62. The method of claim 61 wherein the hints are page-level and token-level, the page-level hints each including a pair of page references and an indication that the pair of page references should be in a same page cluster, and the token-level hints each including a pair of token sequences and an indication that tokens of the token sequences should be in a same token cluster.
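
Claim 59 maps page clusters to tables and token clusters to columns. The following sketch shows that mapping under simplifying assumptions: one table per page cluster, one row per page, and at most one token per page in each token cluster. The function name `build_tables` and the input layout are hypothetical and are not taken from the specification.

```python
from collections import defaultdict


def build_tables(page_cluster, tokens):
    """Turn page and token clusters into tables.

    page_cluster: dict mapping page -> page cluster id.
    tokens: list of (page, token_cluster_id, text).
    Returns {page_cluster_id: (columns, rows)} with one row per page
    and one column per token cluster seen in that page cluster.
    """
    columns = defaultdict(set)   # page cluster id -> token cluster ids
    cells = defaultdict(dict)    # page -> {token cluster id: text}
    for page, column, text in tokens:
        columns[page_cluster[page]].add(column)
        cells[page][column] = text

    tables = {}
    for cluster_id in sorted(set(page_cluster.values())):
        cols = sorted(columns[cluster_id])
        rows = [[cells[page].get(col) for col in cols]
                for page in sorted(page_cluster)
                if page_cluster[page] == cluster_id]
        tables[cluster_id] = (cols, rows)
    return tables


page_cluster = {"/product/12": "product", "/product/97": "product", "/about": "info"}
tokens = [
    ("/product/12", "name", "Widget"),
    ("/product/12", "price", "$5"),
    ("/product/97", "name", "Gadget"),
]
tables = build_tables(page_cluster, tokens)
```

In this example the "product" table has columns `name` and `price`; the second product page has no price token, so that cell is left as `None`.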

63. A non-transitory computer readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for automatically identifying data from a semi-structured web site, the method comprising:
- developing experts;
  
  analyzing links and pages on the web site by means of the experts;
  
  identifying predetermined types of generic structures by means of the experts;
  
  clustering pages and text segments within the pages based on the identified structures; and
  
  identifying, based on the clustering, the data that can be extracted.
- Dependent claims (64, 65):
  - 64. The computer readable storage medium of claim 63 wherein the method further comprises:
    - extracting and structuring the data on the web site to enable the data to be transformed into a relational form.
  - 65. The computer readable storage medium of claim 64 wherein the method further comprises:
    - producing by means of the experts hints indicating that two items should be in the same cluster resulting in a clustering; and
      
      evaluating the probability of the clustering to determine a quality of the clustering.
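
Claim 65 refers to same-cluster hints "resulting in a clustering". One simple way to derive a candidate clustering from such hints is to take their transitive closure with a union-find structure, as sketched below. This is an illustrative assumption, not the procedure described in the patent, and `cluster_from_hints` is a hypothetical name; a candidate produced this way could then be rated by a probability-based quality measure.

```python
def cluster_from_hints(items, hints):
    """Group items by the transitive closure of same-cluster hints.

    items: iterable of hashable item ids.
    hints: iterable of (a, b) pairs meaning "a and b belong together".
    Returns a sorted list of sorted clusters.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hints:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_from_hints(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
```

Items `a`, `b` and `c` end up in one cluster because the two hints chain together, while `d` stays on its own.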

66. A system for automatically identifying data from a semi-structured web site, the system comprising:
- a processor for executing instructions for:
  
  developing experts;
  
  analyzing the links and pages on the web site by means of the experts;
  
  identifying predetermined types of generic structures by means of the experts;
  
  clustering pages and text segments within the pages based on the identified structures; and
  
  identifying, based on the clustering, the data that can be extracted.
- Dependent claims (67, 68, 69, 70):
  - 67. The system of claim 66 wherein the processor further executes instructions for:
    - extracting and structuring the data on the semi-structured web site to enable the data to be transformed into relational form.
  - 68. The system of claim 67 wherein the experts produce hints indicating that two items should be in the same cluster.
  - 69. The system of claim 68 wherein the processor further executes instructions for:
    - evaluating a probability of the clustering based on the hints, the hints used to determine a quality of the clustering.
  - 70. The system of claim 69 wherein the processor further executes instructions for:
    - employing a generative probabilistic model to enable assignment of a probability to the hints in view of a clustering.

Current Assignee
Import.io Corporation (Import-io Corp.)
Original Assignee
Fetch Technologies, Inc.
Inventors
MINTON, Steven N., GAZEN, Bora C.

Granted Patent

US 8,843,490 B2
US Class Current

707/737
CPC Class Codes

G06F 16/355 Class or cluster creation o...

G06F 16/95 Retrieval from the web
