Systems and methods for extracting information from structured documents

US 8,090,678 B1
Filed: 07/23/2003
Issued: 01/03/2012
Est. Priority Date: 07/23/2003
Status: Active Grant

Abstract

Systems and methods for extracting information from structured documents are provided. The systems and methods relate to selecting a centroid document from a group of structured documents and selecting a subset of the group of structured documents in order to form a cluster of the subset of documents about the centroid document. The subset is preferably selected based on the relative similarity between each document of the subset and the centroid document. Systems and methods according to the invention then include marking a data element on the centroid document. The systems and methods also include identifying, on each document of the subset, the data element that corresponds to the marked data element on the centroid document. Finally, data may be extracted from the subset of documents based on the identifying step.
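The abstract groups documents by structure rather than by content. As a minimal sketch of what a content-blind structural view of an HTML page could look like (an illustration only, not the representation disclosed in the patent), the following Python uses the standard library `html.parser` to reduce a page to the sequence of tag paths of its text-bearing fields. The names `PathCollector` and `structural_signature`, the `VOID_TAGS` set, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records the tag path of every non-empty text node, ignoring the text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.paths = []   # one tag path per text-bearing field

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths.append("/".join(self.stack))


def structural_signature(html):
    """Return the ordered list of tag paths that hold text in `html`."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


page_a = "<html><body><h1>Red Shoes</h1><p>$10</p></body></html>"
page_b = "<html><body><h1>Blue Hat</h1><p>$25</p></body></html>"
print(structural_signature(page_a))
print(structural_signature(page_a) == structural_signature(page_b))
```

Under this assumed representation, two product pages generated from the same template yield the same signature even though every data element differs.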

27 Claims

1. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
- presenting a list of web sites to a user;
  
  receiving one or more of the web sites selected from the user for data extraction;
  
  collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at a computer comprising a processor;
  
  forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein;
  
  each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,

  the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,

  an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,

  the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

  the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;
  
  displaying a list of clusters;
  
  displaying the centroid document of a particular cluster selected from the list of clusters;
  
  marking a data element on the centroid document of the particular cluster;
  
  identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
  
  providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 27):
  - 2. The method of claim 1, further comprising collecting a plurality of HTML structured documents from a merchant web site.
  - 3. The method of claim 1, further comprising displaying a listing of web sites that include clusters of HTML structured documents.
  - 4. The method of claim 1, further comprising automatically marking the data element on the centroid document of the particular cluster.
  - 5. The method of claim 1, wherein the threshold is predetermined.
  - 6. The method of claim 1, wherein the threshold is automatically generated.
  - 7. The method of claim 1, further comprising extracting data from the HTML structured document that is the centroid document of the particular cluster based on the marked data element of the centroid document of the particular cluster.
  - 8. The method of claim 1, further comprising extracting data from the other HTML structured documents of the particular cluster based on the identified data element on each of the other HTML structured documents of the particular cluster.
  - 9. The method of claim 8, further comprising formatting the extracted data in a table.
  - 10. The method of claim 8, further comprising formatting the extracted data in a spreadsheet format.
  - 11. The method of claim 1, further comprising:
    - receiving a new HTML structured document;
      
      selecting the most similar centroid document to the new HTML structured document based on the relative structural similarity of HTML data structure of the new HTML structured document and HTML data structure of each centroid document of each cluster;
      
      adding the new HTML structured document to the cluster that includes the selected centroid document if the new HTML structured document achieves a threshold of similarity with respect to the selected centroid document;
      
      marking a data element on the selected centroid document; and
      
      identifying a data element on the new HTML structured document that corresponds to the marked data element on the selected centroid document.
  - 12. The method of claim 1, further comprising using an alignment algorithm to extract the data element identified on each of the other HTML structured documents that corresponds to the marked data element on the centroid document of the particular cluster.
  - 27. The method of claim 1, wherein the list of clusters comprises a uniform resource locator of each centroid document of each cluster and a number of HTML structured documents associated with each cluster.
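The clustering recited in claim 1 can be illustrated with a small sketch. It assumes each document has already been reduced to an ordered list of field locations (tag paths), uses `difflib.SequenceMatcher` as a stand-in for the unspecified alignment algorithm, and uses a simple "first unmatched document becomes a centroid" rule that the claims do not prescribe. The scoring formula, the `move_penalty` and `threshold` values, and the example URLs are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(doc, centroid, move_penalty=0.5):
    """Structural similarity of two lists of field locations (content is never read).

    Aligned fields raise the score; fields that occur in both documents but
    could not be aligned (they sit in a different location) lower it.
    """
    matcher = SequenceMatcher(a=centroid, b=doc, autojunk=False)
    aligned = sum(block.size for block in matcher.get_matching_blocks())
    shared = sum((Counter(centroid) & Counter(doc)).values())
    moved = shared - aligned            # present in both, but displaced
    longest = max(len(centroid), len(doc)) or 1
    return max(0.0, (aligned - move_penalty * moved) / longest)


def form_clusters(signatures, threshold=0.8):
    """Compare every document to every centroid; open a new cluster if none fits."""
    clusters = []
    for url, sig in signatures.items():
        best, best_score = None, threshold
        for cluster in clusters:
            score = similarity(sig, signatures[cluster["centroid"]])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"centroid": url, "members": []})
        else:
            best["members"].append(url)
    return clusters


# Hypothetical signatures for four pages of one site.
signatures = {
    "https://shop.example/item1": ["html/body/h1", "html/body/p", "html/body/p"],
    "https://shop.example/item2": ["html/body/h1", "html/body/p", "html/body/p"],
    "https://shop.example/about": ["html/body/div/p"],
    "https://shop.example/item3": ["html/body/p", "html/body/h1", "html/body/p"],
}
clusters = form_clusters(signatures)
for cluster in clusters:
    print(cluster["centroid"], cluster["members"])
```

The inner loop of `form_clusters` is also one way to read claim 11: a newly received document is scored against each centroid and joins the cluster of the most similar one only if it meets the threshold.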

13. An apparatus for implementing a data extraction process, the apparatus comprising a workstation storage device, a workstation processor connected to the workstation storage device, the workstation storage device storing a workstation program for controlling the workstation processor, and the workstation processor operative with the workstation program to:
- present a list of web sites to a user;
  
  receive one or more of the web sites selected from the user for data extraction;
  
  form a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein;
  
  each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,

  the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,

  an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,

  the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

  the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;
  
  display a list of clusters;
  
  display the centroid document of a particular cluster selected from the list of clusters;
  
  mark a data element on the centroid document of the particular cluster;
  
  identify a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
  
  provide a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 21, 22):
  - 14. The apparatus of claim 13, further configured to collect a plurality of HTML structured documents from a merchant web site.
  - 15. The apparatus of claim 13, further configured to display a plurality of web sites that include clusters of HTML structured documents.
  - 16. The apparatus of claim 13, further configured to automatically mark the data element on the centroid document of the particular cluster.
  - 17. The apparatus of claim 13, further configured to determine the co-occurring HTML structured documents that achieve a pre-determined threshold.
  - 18. The apparatus of claim 13, further configured to determine the co-occurring HTML structured documents that achieve an automatically-generated threshold.
  - 19. The apparatus of claim 13, further configured to extract data from the HTML structured document that is the centroid document of the particular cluster based on the marked data element of the centroid document of the particular cluster.
  - 20. The apparatus of claim 13, further configured to extract data from the other HTML structured documents of the particular cluster based on the identified data element on each of the other HTML structured documents of the particular cluster.
  - 21. The apparatus of claim 20, further configured to format the extracted data in a table.
  - 22. The apparatus of claim 20, further configured to format the extracted data in a spreadsheet format.

23. An apparatus for implementing a data extraction process, the apparatus comprising a workstation storage device, a workstation processor connected to the workstation storage device, the workstation storage device storing a workstation program for controlling the workstation processor, and the workstation processor operative with the workstation program to:
- present a list of web sites to a user;
  
  receive one or more of the web sites selected from the user for data extraction;
  
  form a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document;
  
  select a first centroid document from a first cluster of the plurality of clusters;
  
  select an HTML structured document from the group of co-occurring HTML structured documents that is not included in the first cluster;
  
  compare the selected HTML structured document to the first centroid document based on relative structural similarity of HTML data structure of the selected HTML structured document with respect to HTML data structure of the first centroid document, wherein;
  
  an alignment algorithm is used to determine whether the selected HTML structured document achieves a threshold of similarity with respect to the first centroid document by comparing structured locations of data fields for storing data elements within the first centroid document and structured locations of corresponding data fields for storing data elements within the selected HTML structured document,

  the selected HTML structured document is compared to the first centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

  the relative structural similarity of the selected HTML structured document with respect to the first centroid document is penalized when the selected HTML structured document includes a data field that is within the first centroid document in a different structured location;
  
  add the selected HTML structured document to the first cluster if the selected HTML structured document achieves the threshold of similarity with respect to the first centroid document;
  
  display a list of clusters;
  
  display the first centroid document in response to selection of the first cluster from the list of clusters;
  
  mark a data element on the first centroid document;
  
  correlate the marked data element in the first centroid document with a corresponding data element in each of the other HTML structured documents of the first cluster when the corresponding data element is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the first centroid document;
  
  extract the corresponding data element in each of the other HTML structured documents of the first cluster; and
  
  provide a user interface displaying content of the corresponding data element of each of the other HTML structured documents of the first cluster on a computer display.
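The mark, correlate, and extract steps of claim 23 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a "structured location" is modeled as a tag path, the marked data element is identified by its position in the centroid's field list, and `difflib.SequenceMatcher` stands in for the alignment algorithm. `FieldCollector`, `fields_of`, `corresponding_text`, and the sample pages are hypothetical names and data.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class FieldCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack, self.fields = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.fields.append(("/".join(self.stack), data.strip()))


def fields_of(html):
    parser = FieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


def corresponding_text(centroid_html, marked_index, other_html):
    """Return the text of the field in `other_html` aligned with the marked centroid field."""
    centroid, other = fields_of(centroid_html), fields_of(other_html)
    matcher = SequenceMatcher(a=[path for path, _ in centroid],
                              b=[path for path, _ in other],
                              autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.a <= marked_index < block.a + block.size:
            return other[block.b + (marked_index - block.a)][1]
    return None  # no field at a corresponding structured location


centroid = "<html><body><h1>Red Shoes</h1><p>$10</p></body></html>"
member = "<html><body><div>Sale!</div><h1>Blue Hat</h1><p>$25</p></body></html>"
# Mark the second field of the centroid (the price) and look it up in the member page.
print(corresponding_text(centroid, 1, member))
```

Only the paths are aligned, so the lookup still finds the price on the member page even though an extra banner field shifts its position.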

24. A computer-readable storage medium storing a computer program comprising instructions that, when executed, cause a computer to perform a computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
- presenting a list of web sites to a user;
  
  receiving one or more of the web sites selected from the user for data extraction;
  
  collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at the computer;
  
  forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein;
  
  each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,

  the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,

  an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,

  the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

  the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the centroid document in a different structured location;
  
  displaying a list of clusters;
  
  displaying the centroid document of a particular cluster selected from the list of clusters;
  
  marking a data element on the centroid document of the particular cluster;
  
  identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
  
  providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.

25. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
- presenting a list of web sites to a user;
  
  receiving one or more of the web sites selected from the user for data extraction;
  
  forming a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document;
  
  selecting a first centroid document from a first cluster of the plurality of clusters;
  
  selecting an HTML structured document from the group of co-occurring HTML structured documents that is not included in the first cluster;
  
  comparing the selected HTML structured document to the first centroid document based on relative structural similarity of HTML data structure of the selected HTML structured document with respect to HTML data structure of the first centroid document, wherein;
  
  an alignment algorithm is used to determine whether the selected HTML structured document achieves a threshold of similarity with respect to the first centroid document by comparing structured locations of data fields for storing data elements within the first centroid document and structured locations of corresponding data fields for storing data elements within the selected HTML structured document,

  the selected HTML structured document is compared to the first centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and

  the relative structural similarity of the selected HTML structured document with respect to the first centroid document is penalized when the selected HTML structured document includes a data field that is within the first centroid document in a different structured location;
  
  adding the selected HTML structured document to the first cluster if the selected HTML structured document achieves the threshold of similarity with respect to the first centroid document;
  
  displaying a list of clusters;
  
  displaying the first centroid document in response to selection of the first cluster from the list of clusters;
  
  marking a data element on the first centroid document;
  
  correlating the marked data element in the first centroid document with a corresponding data element in each of the other HTML structured documents of the first cluster when the corresponding data element is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the first centroid document;
  
  extracting the corresponding data element in each of the other HTML structured documents of the first cluster; and
  
  providing a user interface displaying content of the corresponding data element of each of the other HTML structured documents of the first cluster on a computer display.
- Dependent claims (26):
  - 26. The method of claim 25, wherein the list of clusters comprises a uniform resource locator of each centroid document of each cluster and a number of HTML structured documents associated with each cluster.
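Claims 26 and 27 describe a cluster list that shows each centroid's URL and a document count, and claims 9, 10, 21, and 22 describe formatting extracted data as a table or spreadsheet. A minimal sketch of both outputs follows. The cluster dictionary layout, the function names, the choice to count the centroid itself, and the use of CSV as the spreadsheet format are assumptions made for illustration; the claims do not specify them.

```python
import csv
import io


def cluster_listing(clusters):
    """Return (centroid URL, document count) pairs, counting the centroid itself."""
    return [(c["centroid"], 1 + len(c["members"])) for c in clusters]


def rows_to_csv(rows, header=("url", "value")):
    """Format extracted (url, value) rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cluster and extraction results.
clusters = [{"centroid": "https://shop.example/item1",
             "members": ["https://shop.example/item2", "https://shop.example/item3"]}]
extracted = [("https://shop.example/item2", "$25"),
             ("https://shop.example/item3", "$40")]

for url, count in cluster_listing(clusters):
    print(f"{url}\t{count} documents")
print(rows_to_csv(extracted))
```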

Current Assignee: PayPal, Inc. (PayPal Holdings, Inc.)
Original Assignee: Shopping.com Ltd. (eBay Inc.)
Inventors: Glickman, Oren; Ashkenazi, Amir; Yaar, Ariel
Primary Examiner(s): Leroux, Etienne
Assistant Examiner(s): Bibbee, Jared
Application Number: US10/626,430
Time in Patent Office: 3,086 Days
Field of Search: 707/102, 707/602, 707/728, 707/813
US Class Current: 707/602
CPC Class Codes: G06F 16/35 Clustering; Classification
