System and method of automatic wrapper grammar generation
Abstract
A method for generating a wrapper grammar for a file having a structure of a particular format includes providing at least one sample file of the particular format, where the particular format comprises a plurality of string tokens. Each sample file includes a plurality of tokens (data strings), each of which may be actual data from the document, an HTML tag, or some other grammatical separator. The sample file is then processed by annotating attributable tokens with user-defined attributes, such as Author or Title, drawn from a set of attributes to form an annotated sample set. The annotated sample set is then evaluated to determine whether wrapper grammar generation is possible and, if so, a wrapper grammar for files having a structure of the particular format is generated. Preferably, the annotated sample set is evaluated by determining whether all attributes in the annotated sample set are distinguishable from one another.
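The tokenizing and annotating steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `tokenize` and `annotate`, the regular expression, the sample markup, and the `labels` dictionary standing in for user annotations are all assumptions made for illustration.

```python
import re

# Assumption: a token is either one markup tag or one run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(markup):
    """Split markup into tag tokens and text tokens, dropping blank text."""
    return [t.strip() for t in TOKEN_RE.findall(markup) if t.strip()]

def annotate(tokens, labels):
    """Pair each token with its attribute, or None if it is not attributable."""
    return [(tok, labels.get(tok)) for tok in tokens]

# Hypothetical sample file and user-supplied annotations.
sample = "<li><b>Knuth</b><i>TAOCP</i></li>"
labels = {"Knuth": "Author", "TAOCP": "Title"}

annotated = annotate(tokenize(sample), labels)
for token, attribute in annotated:
    print(token, attribute)
```

Here the tags act as grammatical separators and carry no attribute, while the two text tokens receive the attributes Author and Title.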
18 Claims
1. A method for generating a wrapper grammar for a file having a structure of a particular format, comprising:
providing at least one sample file of the particular format, wherein the particular format comprises a plurality of string tokens;
processing the at least one sample file of the particular format by annotating attributable tokens with attributes from a set of attributes, wherein a token is attributable if it can be assigned to an attribute, to generate an annotated sample set;
evaluating the annotated sample set to determine if automatic wrapper grammar generation is possible by determining if all attributes in the annotated sample set are distinguishable from one another; and
if automatic wrapper grammar generation is possible, generating a wrapper grammar for the files having a structure of the particular format.
2. The method of claim 1, wherein determining if all attributes in the annotated sample set are distinguishable from one another comprises:
generating a set of reverse prefixes for each attribute ti;
partitioning the attribute set into equivalence classes di, wherein no two equivalence classes have common reverse prefixes; and
if the equivalence classes are equal to the attributes, di=ti, then all attributes ti are distinguishable.
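The distinguishability test recited above can be sketched in Python as follows. The claims do not define a reverse prefix at this point, so the sketch makes assumptions of its own: a reverse prefix is taken to be the (up to) `k` tokens preceding an annotated token, read right to left, and two attributes "share" a reverse prefix only on an exact match. The function names and sample data are hypothetical.

```python
from collections import defaultdict

def reverse_prefixes(annotated, k=1):
    """Map each attribute to its set of reverse prefixes.

    Assumption: a reverse prefix is the (up to) k tokens that precede an
    annotated token, read right to left.
    """
    prefixes = defaultdict(set)
    for i, (_, attr) in enumerate(annotated):
        if attr is not None:
            before = annotated[max(0, i - k):i][::-1]
            prefixes[attr].add(tuple(tok for tok, _ in before))
    return dict(prefixes)

def partition(prefixes):
    """Group attributes so that no two groups share a reverse prefix."""
    classes = []  # list of (attribute set, reverse-prefix set) pairs
    for attr, ctx in prefixes.items():
        attrs, ctxs = {attr}, set(ctx)
        kept = []
        for other_attrs, other_ctxs in classes:
            if other_ctxs & ctxs:      # common reverse prefix: merge classes
                attrs |= other_attrs
                ctxs |= other_ctxs
            else:
                kept.append((other_attrs, other_ctxs))
        classes = kept + [(attrs, ctxs)]
    return [a for a, _ in classes]

def all_distinguishable(prefixes):
    """True when every equivalence class holds exactly one attribute."""
    return all(len(c) == 1 for c in partition(prefixes))

# Hypothetical annotated samples: (token, attribute or None).
clear = [("<b>", None), ("Knuth", "Author"), ("</b>", None),
         ("<i>", None), ("TAOCP", "Title"), ("</i>", None)]
ambiguous = [("<td>", None), ("Knuth", "Author"), ("</td>", None),
             ("<td>", None), ("TAOCP", "Title"), ("</td>", None)]

print(all_distinguishable(reverse_prefixes(clear)))      # Author and Title differ
print(all_distinguishable(reverse_prefixes(ambiguous)))  # both follow <td>
```

In the first sample each attribute is preceded by a different tag, so every equivalence class contains a single attribute. In the second both attributes follow `<td>`, so under the one-token assumption they fall into the same class and cannot be told apart.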
3. The method of claim 2, wherein if there is only one equivalence class d1 and d1 is equal to the set of attributes, then the wrapper grammar cannot be generated.
4. The method of claim 1, wherein the file having a structure of a particular format comprises an HTML file.
5. The method of claim 1, wherein the file having a structure of a particular format comprises an ODBC-compliant file.
6. The method of claim 1, wherein the file having a structure of a particular format comprises a DMA-compliant file.
7. The method of claim 1, further comprising:
providing a second sample of the particular format;
processing the second sample file by annotating attributable tokens with new attributes from the set of attributes to generate a second annotated sample set;
evaluating the second annotated sample set to determine if wrapper generation is possible; and
if wrapper generation is possible, generating an incremental wrapper grammar.
8. The method of claim 2, wherein if an equivalence class di contains an attribute ti and t0=void, attribute ti is consumed by void and cannot be recognized in response files.
9. The method of claim 3, wherein if some of the equivalence classes are equal to the attributes, di=ti, then some attributes ti are distinguishable and the wrapper grammar may be partially generated.
10. The method of claim 9, further comprising normalizing the set of distinguishable attributes and generating a wrapper grammar based on the normalized set of attributes.
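The outcomes described in claims 2, 3 and 9 (full generation, no generation, partial generation) can be summarized in a short Python sketch. The function name `generation_outcome` and the outcome labels are hypothetical, and treating every partition without a single-attribute class as "none" is an assumption that goes slightly beyond the single-class case recited in claim 3. The normalizing step of claim 10 is not modeled.

```python
def generation_outcome(classes):
    """Classify a partition of attributes, given as a list of sets.

    Returns a label and the set of attributes that sit alone in their
    equivalence class, i.e. the distinguishable ones.
    """
    attributes = set().union(*classes) if classes else set()
    alone = {next(iter(c)) for c in classes if len(c) == 1}
    if alone == attributes:
        return "full", alone      # every class equals one attribute
    if not alone:
        return "none", alone      # no attribute can be told apart
    return "partial", alone       # only some attributes are distinguishable

print(generation_outcome([{"Author"}, {"Title"}]))
print(generation_outcome([{"Author", "Title"}]))
print(generation_outcome([{"Author"}, {"Title", "Year"}]))
```

In the third call only Author stands alone in its class, so a wrapper grammar could at most be generated for that attribute.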
11. A method for generating a wrapper for an HTML file having a structure of a particular format, comprising:
providing at least one sample HTML file of the particular format, wherein the particular format comprises a plurality of string tokens;
processing the at least one sample HTML file of the particular format by annotating attributable tokens with attributes from a set of attributes, wherein a token is attributable if it can be assigned to an attribute, to generate an annotated sample set;
evaluating the annotated sample set to determine if wrapper generation is possible by determining if all attributes in the annotated sample set are distinguishable from one another; and
if wrapper generation is possible, generating a wrapper for the HTML files having a structure of the particular format.
12. A system for generating a wrapper for a file having a structure of a particular format, comprising:
a memory storing at least one sample file of the particular format, wherein the particular format comprises a plurality of string tokens; and
a processor for processing the at least one sample file of the particular format by annotating attributable tokens with attributes from a set of attributes, wherein a token is attributable if it can be assigned to an attribute, to generate an annotated sample set;
for evaluating the annotated sample set to determine if wrapper generation is possible by determining if all attributes in the annotated sample set are distinguishable from one another; and
if wrapper generation is possible, generating a wrapper for the files having a structure of the particular format.
13. The system of claim 12, wherein the processor determines if all attributes in the annotated sample set are distinguishable from one another by:
generating a set of reverse prefixes for each attribute ti;
partitioning the attribute set into equivalence classes di, wherein no two equivalence classes have common reverse prefixes; and
if the equivalence classes are equal to the attributes, di=ti, then all attributes ti are distinguishable.
14. The system of claim 13, wherein if there is only one equivalence class d1 and d1 is equal to the set of attributes, then the wrapper grammar cannot be generated.
15. The system of claim 13, wherein if an equivalence class di contains an attribute ti and t0=void, attribute ti is consumed by void and cannot be recognized in response files.
16. The system of claim 13, wherein if some of the equivalence classes are equal to the attributes, di=ti, then some attributes ti are distinguishable and the wrapper grammar may be partially generated.
17. The system of claim 16, wherein the processor normalizes the set of distinguishable attributes and generates a wrapper grammar based on the normalized set of attributes.
18. An article of manufacture for use in a system that includes a processor for accessing data on a storage medium using a storage medium access device, comprising:
a storage medium; and
instruction data stored on the storage medium, the instruction data defining a sequence of instructions for access by the processor using the storage medium access device, wherein the sequence of instructions comprises:
providing at least one sample file of the particular format, wherein the particular format comprises a plurality of string tokens;
processing the at least one sample file of the particular format by annotating attributable tokens with attributes from a set of attributes, wherein a token is attributable if it can be assigned to an attribute, to generate an annotated sample set;
evaluating the annotated sample set to determine if wrapper grammar generation is possible by determining if all attributes in the annotated sample set are distinguishable from one another; and
if wrapper grammar generation is possible, generating a wrapper grammar for the files having a structure of the particular format.
Specification