Techniques for inducing high quality structural templates for electronic documents
First Claim
1. A method comprising:
- comparing, one document at a time, a structure of documents in a training set with a structure of an initial template;
selecting at least one of the documents based on the comparing;
generalizing the initial template to create a generalized template that has a structure that matches each of the selected documents;
wherein generalizing the initial template to create the generalized template includes adding one or more operators to the initial template from a set of operators to create the generalized template, wherein the one or more operators includes a first operator that indicates that only one of a plurality of subtrees below the operator is allowed to occur at a position in the selected documents that corresponds to the position of the first operator in the generalized template.
9 Assignments
0 Petitions
Accused Products
Abstract
Techniques are disclosed herein to automatically learn a template that describes a common structure present in documents in a training set. The structure of the template is compared to the structure of the documents (or at least a part of each document) in the training set, one-by-one, and generalized in response to differences between the template and the document to which the template is currently being compared. If the structure of any particular document is considered too dissimilar from the structure of the template, then the template is not modified. Various generalization operators are added to the template to generalize the template. One such generalization operator is an “OR”, which indicates that only one of “n” sub-trees below the “OR” operator in the template is allowed at the corresponding position in a document.
97 Citations
21 Claims
-
1. A method comprising:
-
comparing, one document at a time, a structure of documents in a training set with a structure of an initial template; selecting at least one of the documents based on the comparing; generalizing the initial template to create a generalized template that has a structure that matches each of the selected documents; wherein generalizing the initial template to create the generalized template includes adding one or more operators to the initial template from a set of operators to create the generalized template, wherein the one or more operators includes a first operator that indicates that only one of a plurality of subtrees below the operator is allowed to occur at a position in the selected documents that corresponds to the position of the first operator in the generalized template. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method of generalizing a template having a plurality of sub-trees, said method comprising:
-
comparing a structure of a first sub-tree of the plurality of sub-trees with a structure of a second sub-tree of the plurality of sub-trees, wherein the first sub-tree and the second sub-tree are not structurally identical; determining a degree of structural similarity between the first sub-tree and the second sub-tree; and in response to determining that the degree of structural similarity is more than a specified threshold, generalizing the structure of the first sub-tree to generate a modified first sub-tree in the template, wherein the structure of the modified first sub-tree matches the structure of first sub-tree and the structure of the second sub-tree. - View Dependent Claims (9)
-
-
10. A system comprising:
-
template storage; and template induction logic that is operable to; compare, one document at a time, a structure of documents in a training set with a structure of an initial template; select at least one of the documents based on the comparing; generalize the initial template to create a generalized template that has a structure that matches each of the selected documents; wherein to generalize the initial template to create the generalized template the logic is operable to add one or more operators to the initial template from a set of operators to create the generalized template, wherein the one or more operators includes a first operator that indicates that only one of a plurality of subtrees below the operator is allowed to occur at a position in the selected documents that corresponds to the position of the first operator in the generalized template; and store the generalized template in the template storage. - View Dependent Claims (11, 12)
-
-
13. A non-transitory computer-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform:
-
comparing, one document at a time, a structure of documents in a training set with a structure of an initial template; selecting at least one of the documents based on the comparing; generalizing the initial template to create a generalized template that has a structure that matches each of the selected documents; wherein generalizing the initial template to create the generalized template includes adding one or more operators to the initial template from a set of operators to create the generalized template, wherein the one or more operators includes a first operator that indicates that only one of a plurality of subtrees below the operator is allowed to occur at a position in the selected documents that corresponds to the position of the first operator in the generalized template. - View Dependent Claims (14, 15, 16, 17, 18, 19)
-
-
20. A non-transitory computer-readable storage medium storing one or more sequences of instructions which, when executed by one or more processors, cause the one or more processors to perform:
-
comparing a structure of a first sub-tree, of a plurality of sub-trees associated with a template, with a structure of a second sub-tree of the plurality of sub-trees, wherein the first sub-tree and the second sub-tree are not structurally identical; determining a degree of structural similarity between the first sub-tree and the second sub-tree; and in response to determining that the degree of structural similarity is more than a specified threshold, generalizing the structure of the first sub-tree to generate a modified first sub-tree in the template, wherein the structure of the modified first sub-tree matches the structure of first sub-tree and the structure of the second sub-tree. - View Dependent Claims (21)
-
Specification