Method and apparatus of automatically generating a procedure for extracting information from textual information sources
Abstract
A procedure is disclosed for automatically constructing wrappers that perform information extraction from sites, such as Internet resources, that display relevant information interspersed with extraneous text fragments such as HTML formatting commands or advertisements. The procedure has three basic steps. First, a set of example pages is collected by a subroutine named GatherExamples, which is provided with information describing how to pose example queries to the site whose wrapper is to be learned. Second, these example pages are labeled by a subroutine named LabelExamples; that is, the information to be extracted from each example is identified for use in the third step. The LabelExamples subroutine uses a general framework for labeling pages with site-specific heuristics called recognizers, and allows users to correct and modify the recognized instances. Finally, the labeled example pages are passed to a BuildWrapper subroutine, which constructs a wrapper.
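The first two steps of the abstract can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names mirror GatherExamples, LabelExamples, and BuildWrapper, but their signatures, the regex-based recognizers, the in-memory `fake_site`, and the sample pages are all assumptions made for the example. BuildWrapper is left as a caller-supplied function.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]                    # (start, end) character offsets in a page
Recognizer = Callable[[str], List[Span]]  # hypothetical recognizer signature

def gather_examples(site: Callable[[str], str], queries: List[str]) -> List[str]:
    """Step 1: pose each example query to the site and keep the returned page."""
    return [site(query) for query in queries]

def label_examples(pages: List[str],
                   recognizers: Dict[str, Recognizer],
                   corrections: Optional[Dict[int, Dict[str, List[Span]]]] = None):
    """Step 2: run one recognizer per attribute, then apply user corrections."""
    corrections = corrections or {}
    labeled = []
    for index, page in enumerate(pages):
        labels = {attr: recognize(page) for attr, recognize in recognizers.items()}
        labels.update(corrections.get(index, {}))  # manual overrides win
        labeled.append((page, labels))
    return labeled

def learn_wrapper(site, queries, recognizers, build_wrapper, corrections=None):
    """Chain the three steps; build_wrapper (step 3) is supplied by the caller."""
    pages = gather_examples(site, queries)
    labeled = label_examples(pages, recognizers, corrections)
    return build_wrapper(labeled)

def regex_recognizer(pattern: str) -> Recognizer:
    """Assumed recognizer style: every regex match is a candidate fragment."""
    compiled = re.compile(pattern)
    return lambda page: [match.span() for match in compiled.finditer(page)]

# Made-up stand-in for a queryable site (no network access involved).
FAKE_PAGES = {
    "africa": "<HTML><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR></HTML>",
    "europe": "<HTML><B>Spain</B> <I>34</I><BR></HTML>",
}

def fake_site(query: str) -> str:
    return FAKE_PAGES[query]

RECOGNIZERS = {
    "country": regex_recognizer(r"[A-Z][a-z]+"),
    "code": regex_recognizer(r"\d+"),
}

def echo_fragments(labeled):
    """Placeholder for BuildWrapper: only returns the labeled fragment texts."""
    return [{attr: [page[s:e] for s, e in spans] for attr, spans in labels.items()}
            for page, labels in labeled]

if __name__ == "__main__":
    print(learn_wrapper(fake_site, ["africa", "europe"], RECOGNIZERS, echo_fragments))
```

Keeping the recognizers and the corrections as separate inputs reflects the abstract's split between heuristic labeling and user review; a real recognizer would likely be more elaborate than a single regular expression.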
24 Claims
1. A method of automatically generating a procedure for extracting information from textual information sources comprising the steps of:
collecting a set of example pages from the information sources;
identifying text fragments of interest from the example pages, and constructing the procedure using delimiters of the text fragments by:
identifying possible delimiters of the text fragment;
considering different combinations of possible delimiters; and
using the combination which best matches the example pages to generate the procedure.
extracting a potential text fragment of interest from the set of example pages; and
correcting the potential text fragment of interest to obtain a text fragment of interest.
12. The method of claim 11, further comprising the step of creating a tuple from the text fragment of interest.
13. The method of claim 12, wherein the tuple comprises a row of a table data structure.
14. The method of claim 11, further comprising the step of applying heuristic techniques to identify the potential text fragment of interest.
15. The method of claim 11, wherein the extracting step employs recognizer algorithms to identify the potential text fragment of interest.
16. The method of claim 11, wherein the correcting step includes displaying the potential text fragment of interest for manual review.
17. The method of claim 1, wherein the identifying step comprises:
extracting a text fragment of interest from a label created in the labeling step; and
parsing an example page from the set of example pages to extract potential delimiters by reference to the text fragment of interest.
18. The method of claim 1, wherein the step of considering different combinations of possible delimiters includes:
applying a set of requirements to a potential delimiter; and
eliminating a potential delimiter that does not adhere to the set of requirements.
19. The method of claim 1, further comprising the step of selecting a procedure class, the procedure class governing the identifying, testing, and using steps.
20. The method of claim 19, wherein the procedure class is an LR procedure class.
21. The method of claim 19, wherein the procedure class is an HLRT procedure class.
22. The method of claim 19, wherein the procedure class is an OCLR procedure class.
23. A method of automatically generating a procedure for extracting information from a textual information source comprising the steps of:
collecting a set of example pages from the information sources;
identifying text fragments of interest from the example pages, and constructing the procedure using delimiters of the text fragments by:
defining a set of delimiters for a wrapper class, defining a generic procedure specifying how the delimiters are used for information extraction, determining a set of candidate values for each of the wrapper class's delimiters, defining a set of constraints that must be satisfied for a wrapper in the wrapper class to be correct and finding a combination of candidate values for each delimiter that satisfies the set of constraints.
identifying delimiters of the text fragments and using such delimiters to generate the procedure.
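The delimiter steps recited in claims 1, 18, and 23 can be illustrated with a minimal Python sketch. It assumes an LR-style procedure class in which each attribute has one left and one right delimiter. The function names, the single requirement applied to candidates, the brute-force search over combinations, and the sample pages are all assumptions made for illustration; none of this is taken from the specification.

```python
from itertools import product
from os.path import commonprefix  # character-wise common prefix of strings

def run_lr(page, delims):
    """Generic LR procedure: delims holds one (left, right) pair per attribute."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start < 0:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return rows
            row.append(page[start:end])
            pos = end
        rows.append(tuple(row))

def texts(page, rows):
    """Turn labeled spans into the fragment strings they cover."""
    return [tuple(page[s:e] for s, e in row) for row in rows]

def candidates(labeled, k):
    """Candidate left and right delimiters for attribute k."""
    before, after, values = [], [], []
    for page, rows in labeled:
        for row in rows:
            s, e = row[k]
            before.append(page[:s][::-1])  # reversed: suffixes become prefixes
            after.append(page[e:])
            values.append(page[s:e])
    suffix = commonprefix(before)[::-1]
    prefix = commonprefix(after)
    lefts = [suffix[i:] for i in range(len(suffix))]
    rights = [prefix[:i] for i in range(1, len(prefix) + 1)]
    # Assumed requirement: a right delimiter must not occur inside any value.
    rights = [r for r in rights if all(r not in v for v in values)]
    return lefts, rights

def learn_lr(labeled, n_attrs):
    """Try every combination of candidates; keep the one matching most pages."""
    per_delim = []
    for k in range(n_attrs):
        per_delim.extend(candidates(labeled, k))

    def score(combo):
        delims = list(zip(combo[0::2], combo[1::2]))
        return sum(run_lr(page, delims) == texts(page, rows)
                   for page, rows in labeled)

    best = max(product(*per_delim), key=score)  # ValueError if no candidates
    return list(zip(best[0::2], best[1::2]))

def label(page, rows):
    """Hypothetical labeling helper: locate each value left to right."""
    pos, out = 0, []
    for row in rows:
        spans = []
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            spans.append((start, pos))
        out.append(tuple(spans))
    return page, out

# Made-up example pages with their labeled (country, code) rows.
EXAMPLES = [
    label("<HTML><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR></HTML>",
          [("Congo", "242"), ("Egypt", "20")]),
    label("<HTML><B>Spain</B> <I>34</I><BR></HTML>", [("Spain", "34")]),
]

wrapper = learn_lr(EXAMPLES, 2)

if __name__ == "__main__":
    print(run_lr("<HTML><B>Belize</B> <I>501</I><BR></HTML>", wrapper))
```

Exhaustive enumeration is used here only because it maps directly onto "considering different combinations"; a practical learner would probably prune candidates more aggressively before searching.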
Specification