Systems and methods for extracting information from structured documents
First Claim
1. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
presenting a list of web sites to a user;
receiving one or more of the web sites selected from the user for data extraction;
collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at a computer comprising a processor;
forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein:
each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,
the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,
an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,
the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;
displaying a list of clusters;
displaying the centroid document of a particular cluster selected from the list of clusters;
marking a data element on the centroid document of the particular cluster;
identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
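The content-blind structural comparison recited in the claim can be illustrated with a minimal Python sketch. It is not the patented alignment algorithm: the names `TagPathCollector`, `structure_of`, and `structural_similarity` are hypothetical, and the standard library's `difflib.SequenceMatcher` is swapped in as a generic sequence aligner. Each document is reduced to the ordered list of its tag paths, so text content plays no part in the score, and a field that appears in a different position lowers the score.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: these tags never have a closing tag, so they are not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Records the tag path of every element; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # one path per element, in document order

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass


def structure_of(html):
    """Return the ordered tag paths of an HTML document."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(doc_html, centroid_html):
    """Score in [0, 1]; 1.0 means the two tag-path sequences are identical."""
    matcher = SequenceMatcher(
        None, structure_of(centroid_html), structure_of(doc_html), autojunk=False
    )
    return matcher.ratio()


centroid = "<html><body><h1>A</h1><table><tr><td>1</td></tr></table></body></html>"
same_layout = "<html><body><h1>B</h1><table><tr><td>2</td></tr></table></body></html>"
moved_field = "<html><body><table><tr><td>2</td></tr></table><h1>B</h1></body></html>"
unrelated = "<html><body><p>x</p><p>y</p></body></html>"

score_same = structural_similarity(same_layout, centroid)
score_moved = structural_similarity(moved_field, centroid)
score_unrelated = structural_similarity(unrelated, centroid)
```

In this sketch the page with different text but the same layout scores 1.0, the page whose heading moved below the table scores lower, and the unrelated page scores lowest. A threshold on such a score would then decide cluster membership.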
Abstract
Systems and methods for extracting information from structured documents are provided. The systems and methods relate to selecting a centroid document from a group of structured documents and selecting a subset of the group in order to form a cluster of documents about the centroid document. The subset is preferably selected based on the relative similarity between each document of the subset and the centroid document. The systems and methods then include marking a data element on the centroid document and identifying, on each document of the subset, the data element that corresponds to the marked data element. Finally, data may be extracted from the subset of documents based on the identifying step.
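The marking and identifying steps can be sketched as follows, under stated assumptions. The "structured location" of a data field is modeled here as a positional tag path such as `html[0]/body[0]/p[1]`, which is only one plausible reading and not a representation the patent prescribes. The names `FieldIndexer`, `index_fields`, `mark`, and `extract` are hypothetical, and the sketch assumes well-formed markup in which every element is explicitly closed.

```python
from html.parser import HTMLParser


class FieldIndexer(HTMLParser):
    """Maps each element's positional path to the text it directly contains."""

    def __init__(self):
        super().__init__()
        self.stack = []      # path segments of the open elements
        self.counts = [{}]   # per open element: how many children of each tag so far
        self.fields = {}     # positional path -> text

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        index = seen.get(tag, 0)
        seen[tag] = index + 1
        self.stack.append(f"{tag}[{index}]")
        self.counts.append({})
        self.fields["/".join(self.stack)] = ""

    def handle_endtag(self, tag):
        # Assumption: markup is well formed, so this closes the innermost element.
        if self.stack:
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.fields["/".join(self.stack)] += data.strip()


def index_fields(html):
    """Return a dict of positional path -> text for one document."""
    indexer = FieldIndexer()
    indexer.feed(html)
    indexer.close()
    return indexer.fields


def mark(centroid_html, text):
    """Return the path of the first centroid field whose text equals `text`."""
    for path, value in index_fields(centroid_html).items():
        if value == text:
            return path
    raise ValueError(f"no field contains {text!r}")


def extract(marked_path, documents):
    """Read the field at the marked path from each document (None if absent)."""
    return [index_fields(doc).get(marked_path) for doc in documents]


centroid = "<html><body><h1>Anvil</h1><p>Price</p><p>$10</p></body></html>"
cluster_members = [
    "<html><body><h1>Skates</h1><p>Price</p><p>$250</p></body></html>",
    "<html><body><h1>Seed</h1><p>Price</p><p>$3</p></body></html>",
]

price_path = mark(centroid, "$10")
prices = extract(price_path, cluster_members)
```

Marking the price once on the centroid yields a path, and reading that same path in the other cluster members returns their prices without inspecting their content in advance.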
27 Claims
1. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
presenting a list of web sites to a user;
receiving one or more of the web sites selected from the user for data extraction;
collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at a computer comprising a processor;
forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein:
each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,
the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,
an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,
the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;
displaying a list of clusters;
displaying the centroid document of a particular cluster selected from the list of clusters;
marking a data element on the centroid document of the particular cluster;
identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 27
13. An apparatus for implementing a data extraction process, the apparatus comprising a workstation storage device, a workstation processor connected to the workstation storage device, the workstation storage device storing a workstation program for controlling the workstation processor, and the workstation processor operative with the workstation program to:
present a list of web sites to a user;
receive one or more of the web sites selected from the user for data extraction;
form a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein:
each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,
the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,
an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,
the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the particular centroid document in a different structured location;
display a list of clusters;
display the centroid document of a particular cluster selected from the list of clusters;
mark a data element on the centroid document of the particular cluster;
identify a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
provide a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22
23. An apparatus for implementing a data extraction process, the apparatus comprising a workstation storage device, a workstation processor connected to the workstation storage device, the workstation storage device storing a workstation program for controlling the workstation processor, and the workstation processor operative with the workstation program to:
present a list of web sites to a user;
receive one or more of the web sites selected from the user for data extraction;
form a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document;
select a first centroid document from a first cluster of the plurality of clusters;
select an HTML structured document from the group of co-occurring HTML structured documents that is not included in the first cluster;
compare the selected HTML structured document to the first centroid document based on relative structural similarity of HTML data structure of the selected HTML structured document with respect to HTML data structure of the first centroid document, wherein:
an alignment algorithm is used to determine whether the selected HTML structured document achieves a threshold of similarity with respect to the first centroid document by comparing structured locations of data fields for storing data elements within the first centroid document and structured locations of corresponding data fields for storing data elements within the selected HTML structured document,
the selected HTML structured document is compared to the first centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of the selected HTML structured document with respect to the first centroid document is penalized when the selected HTML structured document includes a data field that is within the first centroid document in a different structured location;
add the selected HTML structured document to the first cluster if the selected HTML structured document achieves the threshold of similarity with respect to the first centroid document;
display a list of clusters;
display the first centroid document in response to selection of the first cluster from the list of clusters;
mark a data element on the first centroid document;
correlate the marked data element in the first centroid document with a corresponding data element in each of the other HTML structured documents of the first cluster when the corresponding data element is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the first centroid document;
extract the corresponding data element in each of the other HTML structured documents of the first cluster; and
provide a user interface displaying content of the corresponding data element of each of the other HTML structured documents of the first cluster on a computer display.
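The select, compare, and add steps of claim 23 can be illustrated with a small Python sketch. Several choices here are assumptions rather than anything the claim states: the function names `tag_sequence`, `similarity`, and `form_clusters` are hypothetical, the 0.8 threshold is arbitrary, `difflib.SequenceMatcher` over opening-tag names stands in for the alignment algorithm, and a document that fails the threshold against every existing centroid is promoted to the centroid of a new cluster.

```python
import re
from difflib import SequenceMatcher


def tag_sequence(html):
    """Opening-tag names in document order; text content is discarded."""
    return re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", html)


def similarity(doc, centroid):
    """Content-blind score in [0, 1] from aligning the two tag sequences."""
    matcher = SequenceMatcher(
        None, tag_sequence(centroid), tag_sequence(doc), autojunk=False
    )
    return matcher.ratio()


def form_clusters(documents, threshold=0.8):
    """Assign each document to its most similar centroid if it meets the threshold.

    Assumption: a document that matches no centroid starts a new cluster
    and serves as that cluster's centroid.
    """
    clusters = []  # each cluster: {"centroid": html, "members": [html, ...]}
    for doc in documents:
        best = max(
            clusters,
            key=lambda cluster: similarity(doc, cluster["centroid"]),
            default=None,
        )
        if best is not None and similarity(doc, best["centroid"]) >= threshold:
            best["members"].append(doc)
        else:
            clusters.append({"centroid": doc, "members": []})
    return clusters


pages = [
    "<html><body><h1>A</h1><p>1</p></body></html>",
    "<html><body><h1>B</h1><p>2</p></body></html>",
    "<html><body><ul><li>x</li><li>y</li><li>z</li></ul></body></html>",
    "<html><body><ul><li>q</li><li>r</li><li>s</li></ul></body></html>",
]

clusters = form_clusters(pages)
```

With these four toy pages the sketch produces two clusters, one for the heading-and-paragraph layout and one for the list layout, each with the first page of that layout as its centroid.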
24. A computer-readable storage medium storing a computer program comprising instructions that, when executed, cause a computer to perform a computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
presenting a list of web sites to a user;
receiving one or more of the web sites selected from the user for data extraction;
collecting a plurality of co-occurring different HTML structured documents for each of the selected web sites at the computer;
forming a plurality of clusters comprising different subsets of the co-occurring HTML structured documents, wherein:
each cluster comprises a different HTML structured document of the plurality of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the plurality of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document,
the clusters are formed by comparing each co-occurring HTML structured document to each centroid document of each cluster based on relative structural similarity of HTML data structure of each co-occurring HTML structured document with respect to HTML data structure of each centroid document of each cluster,
an alignment algorithm is used to determine the co-occurring HTML structured documents that achieve the threshold of similarity with respect to each centroid document by comparing structured locations of data fields for storing data elements within each centroid document and structured locations of corresponding data fields for storing data elements within each of the co-occurring HTML structured documents,
the co-occurring HTML structured documents are compared to each centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of a particular co-occurring HTML structured document with respect to a particular centroid document is penalized when the co-occurring HTML structured document includes a data field that is within the centroid document in a different structured location;
displaying a list of clusters;
displaying the centroid document of a particular cluster selected from the list of clusters;
marking a data element on the centroid document of the particular cluster;
identifying a data element on each of the other HTML structured documents of the particular cluster that is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the centroid document of the particular cluster; and
providing a user interface displaying content of data elements identified from the other HTML structured documents of the particular cluster on a computer display.
25. A computer-implemented method of extracting information from co-occurring Hyper Text Mark-up Language (HTML) structured documents, the method comprising:
presenting a list of web sites to a user;
receiving one or more of the web sites selected from the user for data extraction;
forming a plurality of clusters comprising different subsets of a group of co-occurring Hyper Text Mark-up Language (HTML) structured documents for each of the selected web sites, wherein each cluster comprises a different HTML structured document of the group of co-occurring HTML structured documents as a centroid document and other HTML structured documents of the group of co-occurring HTML structured documents that achieve a threshold of similarity with respect to the centroid document;
selecting a first centroid document from a first cluster of the plurality of clusters;
selecting an HTML structured document from the group of co-occurring HTML structured documents that is not included in the first cluster;
comparing the selected HTML structured document to the first centroid document based on relative structural similarity of HTML data structure of the selected HTML structured document with respect to HTML data structure of the first centroid document, wherein:
an alignment algorithm is used to determine whether the selected HTML structured document achieves a threshold of similarity with respect to the first centroid document by comparing structured locations of data fields for storing data elements within the first centroid document and structured locations of corresponding data fields for storing data elements within the selected HTML structured document,
the selected HTML structured document is compared to the first centroid document based on similarity of structured locations of corresponding data fields within the HTML data structures without regard to content of data elements stored in the corresponding data fields within the HTML data structures, and
the relative structural similarity of the selected HTML structured document with respect to the first centroid document is penalized when the selected HTML structured document includes a data field that is within the first centroid document in a different structured location;
adding the selected HTML structured document to the first cluster if the selected HTML structured document achieves the threshold of similarity with respect to the first centroid document;
displaying a list of clusters;
displaying the first centroid document in response to selection of the first cluster from the list of clusters;
marking a data element on the first centroid document;
correlating the marked data element in the first centroid document with a corresponding data element in each of the other HTML structured documents of the first cluster when the corresponding data element is stored within a data field having a structured location that corresponds to the structured location of the data field storing the marked data element within the first centroid document;
extracting the corresponding data element in each of the other HTML structured documents of the first cluster; and
providing a user interface displaying content of the corresponding data element of each of the other HTML structured documents of the first cluster on a computer display.
Dependent claims: 26
Specification