APPROACHES FOR THE UNSUPERVISED CREATION OF STRUCTURAL TEMPLATES FOR ELECTRONIC DOCUMENTS
Abstract
A method and apparatus for creating templates for electronic documents are provided. One or more attributes are extracted, using a seed template, from a first document, such as a web page. A second document that contains a particular attribute, extracted from the first document, is identified. The second document may be in a different cluster than the first document. The second document is annotated, using an extracted attribute, to create an annotated document. The second document is annotated without human intervention. A new template for the annotated document is generated. The new template facilitates extraction of information from the annotated document. The new template may be used to extract additional attributes from all documents in the cluster of documents of which the second document is a member. The process may continue over numerous iterations to generate a large number of templates in an automated fashion.
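The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Template` class, the regex-based template representation, and the helper names `annotate`, `generate_template`, and `bootstrap` are all assumptions introduced here for illustration, and the sample documents are invented. The sketch covers a single iteration (extract with a seed template, find documents containing an extracted value, annotate them automatically, and induce a new template), keeps the new templates in an in-memory list, and only re-extracts the annotated attribute rather than discovering additional ones.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical template: a regex with one named group per attribute."""
    pattern: str

    def extract(self, document: str) -> dict:
        match = re.search(self.pattern, document)
        return match.groupdict() if match else {}


def annotate(document: str, attributes: dict) -> list:
    """Return (name, start, end) spans where known attribute values occur."""
    spans = []
    for name, value in attributes.items():
        start = document.find(value)
        if start != -1:
            spans.append((name, start, start + len(value)))
    return sorted(spans, key=lambda span: span[1])


def generate_template(document: str, spans: list) -> Template:
    """Keep the text around each annotated span as a literal and turn the
    span itself into a named capture group (a deliberately brittle choice)."""
    parts, cursor = [], 0
    for name, start, end in spans:
        if start < cursor:  # skip spans that overlap an earlier one
            continue
        parts.append(re.escape(document[cursor:start]))
        parts.append(f"(?P<{name}>.+?)")
        cursor = end
    parts.append(re.escape(document[cursor:]))
    return Template("^" + "".join(parts) + "$")


def bootstrap(seed: Template, first_doc: str, candidates: list) -> list:
    """One iteration: extract, identify, annotate, generate, keep in memory."""
    attributes = seed.extract(first_doc)
    new_templates = []
    for doc in candidates:
        spans = annotate(doc, attributes)  # no human intervention
        if spans:  # the document contains an extracted attribute value
            new_templates.append(generate_template(doc, spans))
    return new_templates


# Invented sample data: two page layouts that mention the same item.
seed = Template(r"<h1>(?P<title>.+?)</h1>")
first_doc = "<h1>Blue Widget</h1><p>$10</p>"
second_doc = "Item: Blue Widget | In stock"
third_doc = "Item: Red Gadget | In stock"  # same layout as second_doc

templates = bootstrap(seed, first_doc, [second_doc, "unrelated text"])
print(templates[0].extract(third_doc))  # {'title': 'Red Gadget'}
```

Because the generated regex treats all surrounding text as a literal, it only matches documents whose layout is identical to the annotated one. A more realistic template would presumably generalize over page structure, but the abstract does not say how, so the sketch leaves that open.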
24 Claims
1. A method for creating templates for electronic documents, comprising:
extracting, using a first template, one or more attributes from a first document;
identifying a second document that contains a particular attribute of said one or more attributes;
annotating said second document, using said first template, to create an annotated document;
generating a new template for said annotated document, wherein said new template facilitates extraction of information from said annotated document; and
storing said new template on a volatile or non-volatile computer-readable medium.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

13. A machine-readable storage medium storing one or more sets of instructions, which when executed, cause:
extracting, using a first template, one or more attributes from a first document;
identifying a second document that contains a particular attribute of said one or more attributes;
annotating said second document, using said first template, to create an annotated document; and
generating a new template for said annotated document, wherein said new template facilitates extraction of information from said annotated document.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
Specification