GENERATING DOCUMENT TEMPLATES THAT ARE ROBUST TO STRUCTURAL VARIATIONS
Abstract
A template or wrapper tree for a document such as a web page is generalized from the bottom up (from leaf toward root of a logical tree structure of the template). At a given level in the tree, sub-trees are clustered and the clustered sub-trees are generalized, and the process is repeated at a next higher level in the tree, resulting in a generalized template or wrapper tree. This can be done by generating a nested pattern regular expression based on the sub-tree clusters, merging sub-trees based on the nested pattern regular expression, and then replacing sub-trees in a tree-based regular expression of the template or wrapper at the given level with the merged sub-trees. This process is repeated at a next higher level of the tree (progressing from leaf toward root) until the wrapper or tree-based regular expression that represents the template is fully generalized.
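The bottom-up procedure described above can be illustrated with a minimal Python sketch. It is not the patented implementation. The `Node` type, the positional `cost` measure, the `max_cost` threshold, the rule that only adjacent same-tag siblings are clustered, and the textual output of `to_regex` are all assumptions introduced here for illustration; the text above does not specify any of them.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class Node:
    """Hypothetical node of a tree-based regular expression."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    repeat: bool = False    # sub-tree may occur more than once
    optional: bool = False  # sub-tree may be absent


def size(n: Node) -> int:
    return 1 + sum(size(c) for c in n.children)


def cost(a: Node, b: Node) -> int:
    """Assumed cost measure: count of nodes that fail to align position by position."""
    if a.tag != b.tag:
        return size(a) + size(b)
    total = 0
    for i in range(max(len(a.children), len(b.children))):
        if i >= len(a.children):
            total += size(b.children[i])
        elif i >= len(b.children):
            total += size(a.children[i])
        else:
            total += cost(a.children[i], b.children[i])
    return total


def merge(a: Node, b: Node) -> Node:
    """Merge two same-tag sub-trees; children that do not align become optional."""
    kids: List[Node] = []
    for i in range(max(len(a.children), len(b.children))):
        x = a.children[i] if i < len(a.children) else None
        y = b.children[i] if i < len(b.children) else None
        if x is not None and y is not None and x.tag == y.tag:
            kids.append(merge(x, y))
        else:
            kids.extend(replace(n, optional=True) for n in (x, y) if n is not None)
    return Node(a.tag, kids, a.repeat or b.repeat, a.optional or b.optional)


def generalize(node: Node, max_cost: int = 1) -> Node:
    """Generalize children first (leaf toward root), then cluster and merge siblings."""
    kids = [generalize(c, max_cost) for c in node.children]
    merged: List[Node] = []
    for kid in kids:
        last = merged[-1] if merged else None
        if last is not None and last.tag == kid.tag and cost(last, kid) <= max_cost:
            combined = merge(last, kid)  # cluster of similar adjacent sub-trees
            combined.repeat = True       # replaced by a single repeating sub-tree
            merged[-1] = combined
        else:
            merged.append(kid)
    return Node(node.tag, merged, node.repeat, node.optional)


def to_regex(n: Node) -> str:
    """Render the tree as a nested pattern string (assumed notation)."""
    s = n.tag
    if n.children:
        s += "(" + " ".join(to_regex(c) for c in n.children) + ")"
    if n.repeat and n.optional:
        return s + "*"
    return s + ("+" if n.repeat else "?" if n.optional else "")


# Hypothetical template: a list whose items differ slightly in structure.
page = Node("ul", [
    Node("li", [Node("a")]),
    Node("li", [Node("a"), Node("span")]),
    Node("li", [Node("a")]),
])
print(to_regex(generalize(page)))  # ul(li(a span?)+)
```

In this sketch the three `li` sub-trees fall into one cluster because each pair differs by at most one node, so they collapse into a single repeating item with an optional `span`. A stricter threshold such as `max_cost=0` would merge only structurally identical siblings and leave the variant item as a separate sub-tree.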
20 Claims
1. A network device configured to manage document templates, comprising:
  a transceiver to send and receive data over a network; and
  a processor that is operative to enable actions for:
    receiving a tree-based regular expression that represents the template;
    below a given level in the tree-based regular expression, performing:
      forming clusters of sub-trees of the tree-based regular expression via a cost measure;
      generating a nested pattern regular expression based on the clusters;
      merging sub-trees based on the nested pattern regular expression;
      replacing sub-trees in the tree-based regular expression at the given level with the merged sub-trees; and
    repeating, for a next higher level of the tree-based regular expression that is closer to a root of the corresponding tree, the actions of forming clusters, generating a nested pattern regular expression, merging sub-trees, and replacing sub-trees in the tree-based regular expression.
8. A method for generalizing a structural template for an electronic document, comprising:
  receiving a tree-based regular expression that represents the template;
  below a given level in the tree-based regular expression, performing:
    forming clusters of sub-trees of the tree-based regular expression via a cost measure;
    generating a nested pattern regular expression based on the clusters;
    merging sub-trees based on the nested pattern regular expression;
    replacing sub-trees in the tree-based regular expression at the given level with the merged sub-trees; and
  repeating, for a next higher level of the tree-based regular expression that is closer to a root of the corresponding tree, the actions of forming clusters, generating a nested pattern regular expression, merging sub-trees, and replacing sub-trees in the tree-based regular expression.
12. A processor readable medium that includes data and instructions, wherein the execution of the instructions provides for managing a document template by enabling actions, comprising:
  receiving a tree-based regular expression that represents the template;
  below a given level in the tree-based regular expression, performing:
    forming clusters of sub-trees of the tree-based regular expression via a cost measure;
    generating a nested pattern regular expression based on the clusters;
    merging sub-trees based on the nested pattern regular expression;
    replacing sub-trees in the tree-based regular expression at the given level with the merged sub-trees; and
  repeating, for a next higher level of the tree-based regular expression that is closer to a root of the corresponding tree, the actions of forming clusters, generating a nested pattern regular expression, merging sub-trees, and replacing sub-trees in the tree-based regular expression.
16. A system that manages document templates, comprising:
  a network device that includes:
    a transceiver for communicating with at least one mobile device over a network; and
    processor for enabling actions, comprising:
      receiving a tree-based regular expression that represents a document template;
      below a given level in the tree-based regular expression, performing:
        forming clusters of sub-trees of the tree-based regular expression via a cost measure;
        generating a nested pattern regular expression based on the clusters;
        merging sub-trees based on the nested pattern regular expression;
        replacing sub-trees in the tree-based regular expression at the given level with the merged sub-trees;
      repeating, for a next higher level of the tree-based regular expression that is closer to a root of the corresponding tree, the actions of forming clusters, generating a nested pattern regular expression, merging sub-trees, and replacing sub-trees in the tree-based regular expression; and
      extracting information from a web page based on a document template corresponding to the tree-based regular expression; and
  the at least one mobile device that further includes:
    a transceiver for communicating with at least the network device over the network; and
    a processor for enabling actions, comprising:
      receiving the extracted information.
19. A mobile device configured to manage document templates, comprising:
  a transceiver to send and receive data over a network; and
  a processor that is operative to enable actions for:
    receiving a tree-based regular expression that represents a document template;
    below a given level in the tree-based regular expression, performing:
      forming clusters of sub-trees of the tree-based regular expression via a cost measure;
      generating a nested pattern regular expression based on the clusters;
      merging sub-trees based on the nested pattern regular expression; and
      replacing sub-trees in the tree-based regular expression at the given level with the merged sub-trees;
    repeating, for a next higher level of the tree-based regular expression that is closer to a root of the corresponding tree, the actions of forming clusters, generating a nested pattern regular expression, merging sub-trees, and replacing sub-trees in the tree-based regular expression;
    extracting information from a web page based on the document template corresponding to the tree-based regular expression; and
    displaying the extracted information to a user.
Specification