GENERATING DOCUMENT TEMPLATES THAT ARE ROBUST TO STRUCTURAL VARIATIONS
Abstract
A template or wrapper tree for a document such as a web page is generalized from the bottom up (from leaf toward root of a logical tree structure of the template). At a given level in the tree, sub-trees are clustered and the clustered sub-trees are generalized, and the process is repeated at a next higher level in the tree, resulting in a generalized template or wrapper tree. This can be done by generating a nested pattern regular expression based on the sub-tree clusters, merging sub-trees based on the nested pattern regular expression, and then replacing sub-trees in a tree-based regular expression of the template or wrapper at the given level with the merged sub-trees. This process is repeated at a next higher level of the tree (progressing from leaf towards root) until the wrapper or tree-based regular expression that represents the template is fully generalized.
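The bottom-up pass described in the abstract can be sketched as follows. This is a minimal illustration and not the patented implementation. It makes three simplifying assumptions that do not come from the source: sub-trees at a level are clustered simply by tag name rather than by a cost measure, a merged sub-tree is the union of its members' child tags in first-seen order, and the resulting pattern is rendered with the regular-expression-style suffixes `?`, `+` and `*`. The names `Node`, `levels_of`, `merge`, `generalize` and `render` are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    optional: bool = False   # missing from at least one member of its parent's cluster
    repeated: bool = False   # occurs more than once under at least one member


def levels_of(root: Node) -> List[List[Node]]:
    """Group nodes by depth; levels[0] holds only the root."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [c for n in frontier for c in n.children]
    return levels


def merge(members: List[Node]) -> Node:
    """Merge one cluster of sub-trees into a single generalized sub-tree."""
    counts = [Counter(c.tag for c in m.children) for m in members]
    first_seen = {}
    for member in members:
        for child in member.children:
            first_seen.setdefault(child.tag, child)
    merged = Node(members[0].tag)
    for tag, child in first_seen.items():
        merged.children.append(Node(
            tag,
            child.children,
            optional=any(c[tag] == 0 for c in counts),
            repeated=any(c[tag] > 1 for c in counts),
        ))
    return merged


def generalize(root: Node) -> Node:
    """Cluster, merge and replace level by level, from the leaves toward the root.

    Note: the input tree is modified in place while it is being generalized.
    """
    levels = levels_of(root)
    for depth in range(len(levels) - 1, 0, -1):
        clusters = defaultdict(list)
        for node in levels[depth]:
            clusters[node.tag].append(node)      # assumed clustering rule: same tag
        merged = {tag: merge(ms) for tag, ms in clusters.items()}
        for parent in levels[depth - 1]:         # replace each sub-tree with its merged form
            parent.children = [merged[c.tag] for c in parent.children]
    return merge([root])


def render(node: Node) -> str:
    """Print the generalized tree as a nested pattern."""
    suffix = {(False, False): "", (True, False): "?",
              (False, True): "+", (True, True): "*"}[(node.optional, node.repeated)]
    inner = " ".join(render(c) for c in node.children)
    return f"{node.tag}({inner}){suffix}" if inner else f"{node.tag}{suffix}"


page = Node("body", [
    Node("div", [Node("h2"), Node("p"), Node("p")]),
    Node("div", [Node("p")]),
])
print(render(generalize(page)))  # body(div(h2? p+)+)
```

In this toy page the two `div` sub-trees differ structurally, yet they collapse into one generalized `div` whose heading is optional and whose paragraph may repeat, which is the kind of robustness to structural variation the abstract aims for.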
19 Claims
1. A method for managing document templates, comprising:
forming a plurality of clusters for a plurality of sub-trees of a tree, at a first level of the tree, based on a cost measure of adding each sub-tree to each cluster;
generating a separate merged sub-tree for each cluster based on a merging of each sub-tree that corresponds to a particular cluster;
replacing each sub-tree that corresponds to the particular cluster with the corresponding merged sub-tree; and
repeating, for a next higher level of the tree in relation to a root of the tree, the actions of forming another plurality of clusters, generating another separate merged sub-tree for each of the plurality of other clusters, and replacing each sub-tree corresponding to another particular cluster of the plurality of other clusters with another corresponding merged sub-tree.
Dependent claims: 2, 3, 4, 5, 6
7. A network device configured to manage document templates, comprising:
a transceiver to send and receive data over a network; and
a processor that is operative to enable actions for:
forming a plurality of clusters for a plurality of sub-trees of a tree, at a first level of the tree, based on a cost measure of adding each sub-tree to each cluster;
generating a separate merged sub-tree for each cluster based on a merging of each sub-tree that corresponds to a particular cluster;
replacing each sub-tree that corresponds to the particular cluster with the corresponding merged sub-tree; and
repeating, for a next higher level of the tree in relation to a root of the tree, the actions of forming another plurality of clusters, generating another separate merged sub-tree for each of the plurality of other clusters, and replacing each sub-tree corresponding to another particular cluster of the plurality of other clusters with another corresponding merged sub-tree.
Dependent claims: 8, 9, 10, 11, 12, 13
14. A processor readable storage medium that includes data and instructions that if executed by a processor enables actions for managing document templates, comprising:
forming a plurality of clusters for a plurality of sub-trees of a tree, at a first level of the tree, based on a cost measure of adding each sub-tree to each cluster;
generating a separate merged sub-tree for each cluster based on a merging of each sub-tree that corresponds to a particular cluster;
replacing each sub-tree that corresponds to the particular cluster with the corresponding merged sub-tree; and
repeating, for a next higher level of the tree in relation to a root of the tree, the actions of forming another plurality of clusters, generating another separate merged sub-tree for each of the plurality of other clusters, and replacing each sub-tree corresponding to another particular cluster of the plurality of other clusters with another corresponding merged sub-tree.
Dependent claims: 15, 16, 17, 18, 19
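The independent claims all form clusters "based on a cost measure of adding each sub-tree to each cluster" but the text here does not define that measure. The sketch below shows one way such a step could look, under loudly stated assumptions: each sub-tree is reduced to a set of child tags, the cost of adding a sub-tree to a cluster is the number of tags by which it differs from the union of the cluster's tags, and a sub-tree opens a new cluster when the cheapest cost exceeds a threshold. The names `add_cost`, `form_clusters` and `max_cost` are hypothetical and are not taken from the patent.

```python
from typing import FrozenSet, List

Features = FrozenSet[str]


def add_cost(features: Features, cluster: List[Features]) -> int:
    """Assumed cost: tags that differ between the sub-tree and the cluster's union."""
    template = frozenset().union(*cluster)
    return len(template ^ features)


def form_clusters(subtrees: List[Features], max_cost: int) -> List[List[Features]]:
    """Greedily place each sub-tree in its cheapest cluster, or open a new one."""
    clusters: List[List[Features]] = []
    for features in subtrees:
        # Compare the cost of adding this sub-tree to every existing cluster.
        best = min(clusters, key=lambda c: add_cost(features, c), default=None)
        if best is not None and add_cost(features, best) <= max_cost:
            best.append(features)
        else:
            clusters.append([features])
    return clusters


subtrees = [
    frozenset({"h2", "p"}),
    frozenset({"p"}),
    frozenset({"img", "a", "span"}),
    frozenset({"h2", "p", "ul"}),
]
print([len(c) for c in form_clusters(subtrees, max_cost=1)])  # [3, 1]
```

Because the result of a greedy pass depends on the order in which sub-trees are visited and on the chosen threshold, a real system would likely need a more careful cost model than this illustration uses.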
Specification