JOINT OPTIMIZATION OF WRAPPER GENERATION AND TEMPLATE DETECTION
Abstract
A method and system for generating wrappers for hierarchically organized documents by jointly optimizing template detection and wrapper generation is provided. A wrapper generation system generates a wrapper for documents with similar templates by identifying a cluster of document trees and generating a wrapper tree for the cluster. A wrapper tree defines the wrapper for documents that match the template of the cluster. The wrapper generation system clusters document trees by generating a wrapper tree for the cluster based on an initial document tree. The wrapper generation system then repeatedly determines whether any other document tree matches or nearly matches the wrapper tree for the cluster and, if so, adds the document tree to the cluster and adjusts the wrapper tree as appropriate so that all the document trees, including the newly added one, match the wrapper tree.
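The clustering loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each document tree is reduced to a set of root-to-node tag paths, the distance is a Jaccard distance over those paths, and adjusting the wrapper is a set union. The names `create_wrapper`, `path_distance`, `adjust_wrapper`, and `cluster_and_wrap` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

# ASSUMPTION: a "tree" is modelled as the set of its root-to-node tag paths.
PathSet = FrozenSet[str]


def create_wrapper(doc: PathSet) -> PathSet:
    """Initial wrapper: a copy of the seed document's paths."""
    return frozenset(doc)


def path_distance(doc: PathSet, wrapper: PathSet) -> float:
    """ASSUMPTION: Jaccard distance stands in for the tree distance."""
    union = doc | wrapper
    return 1.0 - len(doc & wrapper) / len(union) if union else 0.0


def adjust_wrapper(wrapper: PathSet, doc: PathSet) -> PathSet:
    """ASSUMPTION: generalize the wrapper by taking the union of paths."""
    return wrapper | doc


def cluster_and_wrap(
    docs: Iterable[PathSet], threshold: float
) -> List[Tuple[PathSet, List[PathSet]]]:
    """Grow one cluster per seed document, adjusting its wrapper as it grows."""
    remaining = list(docs)
    results = []
    while remaining:
        seed = remaining.pop(0)
        wrapper = create_wrapper(seed)
        members = [seed]
        changed = True
        # Rescan after every adjustment: a changed wrapper may now be
        # close enough to documents that were rejected earlier.
        while changed:
            changed = False
            for doc in list(remaining):
                if path_distance(doc, wrapper) <= threshold:
                    remaining.remove(doc)
                    members.append(doc)
                    wrapper = adjust_wrapper(wrapper, doc)
                    changed = True
        results.append((wrapper, members))
    return results


page_a = frozenset({"html/body/h1", "html/body/p"})
page_b = frozenset({"html/body/h1", "html/body/p", "html/body/img"})
feed = frozenset({"rss/channel/item"})

clusters = cluster_and_wrap([page_a, page_b, feed], threshold=0.5)
for wrapper, members in clusters:
    print(sorted(wrapper), len(members))
```

With these toy inputs the two HTML pages end up in one cluster whose wrapper covers all three of their paths, and the feed forms a cluster of its own. Template detection (which documents belong together) and wrapper generation (what the shared wrapper looks like) happen in the same loop, which is the joint optimization the abstract refers to.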
20 Claims
1. A method in a computing device for generating a wrapper for hierarchically organized documents, each document having a document tree, the method comprising:
creating a wrapper tree for a document tree;
selecting a document tree whose distance to the wrapper tree is within a threshold; and
adjusting the wrapper tree based on the document tree wherein the wrapper is based on the adjusted wrapper tree.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computing system that determines similarity between a hierarchically organized document and a wrapper tree, the document having a document tree, the system comprising:
a component that aligns nodes of the document tree with nodes of the wrapper tree; and
a component that generates a metric from the number of misaligned nodes, the metric indicating similarity between the document tree and the wrapper tree.
Dependent claims: 13, 14, 15
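One way to picture the metric in claim 12 is the sketch below. The claim does not say how nodes are aligned or how the count is normalized, so both choices here are assumptions: sibling nodes are aligned top-down by tag using the matching heuristic in Python's `difflib.SequenceMatcher`, and the misaligned count is divided by the total node count of both trees. `Node`, `size`, `misaligned`, and `distance` are hypothetical names.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def misaligned(doc: Node, wrap: Node) -> int:
    """Count nodes in either tree that have no aligned partner."""
    if doc.tag != wrap.tag:
        # Roots differ: nothing below them can be aligned either.
        return size(doc) + size(wrap)
    doc_tags = [c.tag for c in doc.children]
    wrap_tags = [c.tag for c in wrap.children]
    # ASSUMPTION: difflib's order-preserving matching stands in for alignment.
    matcher = SequenceMatcher(None, doc_tags, wrap_tags, autojunk=False)
    count = 0
    seen_doc, seen_wrap = set(), set()
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = block.a + k, block.b + k
            seen_doc.add(i)
            seen_wrap.add(j)
            count += misaligned(doc.children[i], wrap.children[j])
    # Unmatched children contribute their whole subtrees.
    count += sum(size(c) for i, c in enumerate(doc.children) if i not in seen_doc)
    count += sum(size(c) for j, c in enumerate(wrap.children) if j not in seen_wrap)
    return count


def distance(doc: Node, wrap: Node) -> float:
    """ASSUMPTION: normalize by total nodes; 0.0 is identical, 1.0 is disjoint."""
    return misaligned(doc, wrap) / (size(doc) + size(wrap))


doc_tree = Node("html", [Node("body", [Node("h1"), Node("p"), Node("img")])])
wrapper_tree = Node("html", [Node("body", [Node("h1"), Node("p")])])
print(misaligned(doc_tree, wrapper_tree), distance(doc_tree, wrapper_tree))
```

In this toy case only the `img` node lacks a partner, so one node out of nine is misaligned. A distance of this kind could then be compared against the threshold recited in the method claims.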
16. A computer-readable medium containing instructions for controlling a computing system to generate wrapper trees for document trees, by a method comprising:
for each wrapper tree,
selecting a document tree that has not been previously selected;
creating the wrapper tree for the selected document tree; and
when there exists an unselected document tree whose distance from the wrapper is less than a threshold, selecting the document tree and adjusting the wrapper tree based on the selected document tree.
Dependent claims: 17, 18, 19, 20
Specification