Joint optimization of wrapper generation and template detection
Abstract
A method and system for generating wrappers for hierarchically organized documents by jointly optimizing template detection and wrapper generation is provided. A wrapper generation system generates a wrapper for documents with similar templates by identifying a cluster of document trees and generating a wrapper tree for the cluster. A wrapper tree defines the wrapper for documents that match the template of the cluster. The wrapper generation system clusters document trees by generating a wrapper tree for the cluster based on an initial document tree. The wrapper generation system then repeatedly determines whether any other document tree matches or nearly matches the wrapper tree for the cluster and, if so, adds the document tree to the cluster and adjusts the wrapper tree as appropriate so that all the document trees, including the newly added one, match the wrapper tree.
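The clustering loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the `optional` flag, the use of `difflib.SequenceMatcher` to align child tags, and the normalization in `distance` are all hypothetical choices made for the example. In particular, the distance equation referenced in the claims is not reproduced in this text, so the ratio used here (misaligned nodes over total nodes) is only a stand-in.

```python
import copy
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    optional: bool = False  # assumption: only meaningful in wrapper trees


def tree(tag, *children):
    return Node(tag, list(children))


def size(node):
    return 1 + sum(size(c) for c in node.children)


def _ops(w, d):
    # Align the two child sequences by tag (a stand-in alignment strategy).
    a = [c.tag for c in w.children]
    b = [c.tag for c in d.children]
    return SequenceMatcher(None, a, b, autojunk=False).get_opcodes()


def misaligned(w, d):
    """Count nodes of wrapper w and document d that do not align."""
    if w.tag != d.tag:
        return size(w) + size(d)
    count = 0
    for op, i1, i2, j1, j2 in _ops(w, d):
        if op == "equal":
            pairs = zip(w.children[i1:i2], d.children[j1:j2])
            count += sum(misaligned(wc, dc) for wc, dc in pairs)
        else:
            # Optional wrapper nodes may be absent from a document at no cost.
            count += sum(size(c) for c in w.children[i1:i2] if not c.optional)
            count += sum(size(c) for c in d.children[j1:j2])
    return count


def distance(w, d):
    # Assumed normalization; the claimed equation is not given in this text.
    return misaligned(w, d) / (size(w) + size(d))


def adjust(w, d):
    """Generalize wrapper w in place so d aligns; assumes w.tag == d.tag."""
    merged = []
    for op, i1, i2, j1, j2 in _ops(w, d):
        if op == "equal":
            for wc, dc in zip(w.children[i1:i2], d.children[j1:j2]):
                adjust(wc, dc)
                merged.append(wc)
        else:
            for wc in w.children[i1:i2]:
                wc.optional = True  # missing from d, so no longer required
                merged.append(wc)
            for dc in d.children[j1:j2]:
                extra = copy.deepcopy(dc)
                extra.optional = True  # new in d, so optional for the others
                merged.append(extra)
    w.children = merged


def build_wrappers(docs, threshold):
    """Return a list of (wrapper_tree, member_indices), one per cluster."""
    unselected = list(range(len(docs)))
    clusters = []
    while unselected:
        seed = unselected.pop(0)
        wrapper = copy.deepcopy(docs[seed])  # initial wrapper tree
        members = [seed]
        changed = True
        while changed:  # rescan, since adjusting may admit more documents
            changed = False
            for i in list(unselected):
                if distance(wrapper, docs[i]) <= threshold:
                    adjust(wrapper, docs[i])
                    unselected.remove(i)
                    members.append(i)
                    changed = True
        clusters.append((wrapper, members))
    return clusters


d0 = tree("html", tree("body", tree("div"), tree("table")))
d1 = tree("html", tree("body", tree("div"), tree("table"), tree("p")))
d2 = tree("rss", tree("channel", tree("item")))
clusters = build_wrappers([d0, d1, d2], threshold=0.3)
print([members for _, members in clusters])  # [[0, 1], [2]]
```

In this toy run the two HTML pages share a cluster, and the extra `p` node becomes optional in their wrapper tree, so both pages align with it afterwards. The RSS document differs at the root and seeds its own cluster. The threshold value 0.3 is arbitrary.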
11 Claims
1. A method in a computing device with a processor and a memory for generating wrappers for hierarchically organized documents, each document having a document tree with nodes, the method comprising:
- generating by the processor, for each of a plurality of clusters of documents, a wrapper by repeating the following until all the documents have been selected:
  - selecting a document that has not yet been selected for creation of a wrapper tree having nodes;
  - creating the wrapper tree for the document tree of the selected document;
  - for each document whose distance from its document tree to the wrapper tree is within a threshold distance, selecting the document; and
  - adjusting the wrapper tree based on the document tree of the selected document; and
  - establishing the wrapper for the documents selected for creation and adjustment of the wrapper tree based on the adjusted wrapper tree
- wherein a wrapper tree is created and adjusted for each cluster of documents whose document trees are within a threshold distance of the wrapper tree at the time of selection of the document, and
- wherein distance is represented by the following equation;
Dependent claims: 2, 3, 4, 5, 6
7. A computing system with a processor and a memory that determines similarity between a hierarchically organized document and a wrapper tree, the document having a document tree, the system comprising:
- components implemented as instructions stored in memory for execution by the processor that include:
  - a component that aligns nodes of the document tree with nodes of the wrapper tree; and
  - a component that generates a metric from the number of misaligned nodes, the metric indicating similarity between the document tree and the wrapper tree
- wherein the metric is represented by the following equation;
8. A computer-readable storage medium containing instructions for controlling a computing system to generate wrapper trees for document trees, the document trees and wrapper trees having nodes, comprising:
- for each of a plurality of wrapper trees,
  - selecting a document tree that has not been previously selected;
  - creating the wrapper tree for the selected document tree; and
  - when there exists an unselected document tree whose distance from the wrapper tree is less than a threshold distance, selecting the document tree and adjusting the wrapper tree based on the selected document tree
- wherein each wrapper tree represents a wrapper for a cluster of documents whose document tree is within a threshold distance of the wrapper tree before the wrapper tree is adjusted, and
- wherein distance is represented by the following equation;
Dependent claims: 9, 10, 11
Specification