Visual and interactive wrapper generation, automated information extraction from Web pages, and translation into XML
Abstract
A method and a system for information extraction from Web pages formatted with markup languages such as HTML [8]. A method and system for interactively and visually describing information patterns of interest based on visualized sample Web pages [5,6,16-29]. A method and data structure for representing and storing these patterns [1]. A method and system for extracting information corresponding to a set of previously defined patterns from Web pages [2], and a method for transforming the extracted data into XML, are described. Each pattern is defined via the (interactive) specification of one or more filters. Two or more filters for the same pattern contribute disjunctively to the pattern definition [3]; that is, a pattern describes the set of all targets specified by any of its filters. A method and system for extracting relevant elements from Web pages by interpreting and executing a previously defined wrapper program of the above form on an input Web page [9-14] and producing as output the extracted elements represented in a suitable data structure. A method and system for automatically translating said output into XML format by exploiting the hierarchical structure of the patterns and by using pattern names as XML tags is described.
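The disjunctive filter semantics and the use of pattern names as XML tags can be illustrated with a minimal Python sketch. This is not the wrapper language of the patent: the `Pattern` class, the `has_class` helper, the sample page, and the `document` root tag are all hypothetical, and `xml.etree.ElementTree` stands in for an HTML parser on a well-formed sample page.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: a filter is modelled as a predicate over a parse-tree node.
Filter = Callable[[ET.Element], bool]


@dataclass
class Pattern:
    name: str                      # reused as the XML tag of each match
    filters: List[Filter]          # disjunctive: any one filter may accept a node
    children: List["Pattern"] = field(default_factory=list)

    def matches(self, node: ET.Element) -> bool:
        return any(f(node) for f in self.filters)


def extract(pattern: Pattern, scope: ET.Element) -> List[ET.Element]:
    """Return one output element per node below `scope` that matches `pattern`."""
    results = []
    for node in scope.iter():      # iter() visits nodes in document order
        if node is scope or not pattern.matches(node):
            continue
        out = ET.Element(pattern.name)
        if pattern.children:
            # The pattern hierarchy becomes the XML nesting.
            for child in pattern.children:
                out.extend(extract(child, node))
        else:
            out.text = "".join(node.itertext()).strip()
        results.append(out)
    return results


def has_class(tag: str, cls: str) -> Filter:
    """Hypothetical filter builder: match on tag name and class attribute."""
    return lambda n: n.tag == tag and n.get("class") == cls


# Hypothetical sample page (well-formed so that ElementTree can parse it).
PAGE = """<html><body><table>
<tr class="item"><td class="title">Lamp</td><td class="price">12 EUR</td></tr>
<tr class="offer"><td class="title">Desk</td><td class="price">80 EUR</td></tr>
</table></body></html>"""

record = Pattern(
    "record",
    [has_class("tr", "item"), has_class("tr", "offer")],  # two filters, one pattern
    [Pattern("title", [has_class("td", "title")]),
     Pattern("price", [has_class("td", "price")])],
)


def to_xml(pattern: Pattern, page_source: str) -> str:
    root = ET.Element("document")  # assumed name for the output root
    root.extend(extract(pattern, ET.fromstring(page_source)))
    return ET.tostring(root, encoding="unicode")


xml_out = to_xml(record, PAGE)
print(xml_out)
```

Both table rows end up as `record` elements even though each is accepted by a different filter, and the `title` and `price` sub-patterns appear as nested tags inside each `record`.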
125 Claims
1. A wrapper generation system comprising:
a network including at least one example document and at least one production document; and
a visual builder that is adapted to interactively generate a wrapper program by letting a user visually and interactively declare at least one desired property of example-elements to be extracted from the example document thereby creating user declarations;
a program evaluator adapted to execute a wrapper program over the production document and to extract desired production elements from the production document and to translate the production elements into XML yielding an XML companion of the production document.
Dependent claims: 2-55.
56. A method for visual and interactive generation of a wrapper for documents and for automated information extraction comprising:
letting a user visually and interactively declare at least one desired property of elements to be extracted from the document thereby creating user declarations;
translating the user declarations into a wrapper;
executing a wrapper over the document; and
extracting elements from said documents that match said user declarations.
57. A method for visual and interactive generation of wrappers for documents, and for automated information extraction comprising:
defining extraction patterns on at least one example page, by visually and interactively selecting example-elements occurring on the example-page;
visually and interactively declaring properties of the extraction patterns;
generating a wrapper;
applying the wrapper to at least one production document; and
automatically extracting matching instances of the extraction patterns from the production documents.
Dependent claims: 58-121.
122. A method for interpreting an extended logic program over Web pages, comprising:
setting variables and terms of said logic program so that they form a range over nodes of parsing trees of said Web pages, setting an identification of each Web page to an extensional database whose data elements are nodes of the parsing tree of said Web page; and
establishing parent-child relationship between the nodes of said parsing tree to be a binary relation, and whose data elements are ordered according to document order.
Dependent claims: 123-125.
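One way to read claim 122 is that each Web page contributes a set of facts: its parse-tree nodes, plus a binary parent-child relation listed in document order. The Python sketch below illustrates that reading only. It is not the claimed interpreter: `build_edb`, `children_of_tag`, and the sample page are hypothetical, and `xml.etree.ElementTree` is used in place of an HTML parser on a well-formed input.

```python
import xml.etree.ElementTree as ET


def build_edb(page_id, source):
    """Return (nodes, child) facts for one page.

    nodes: list of (page_id, node_id, tag); node_id is the document-order index
    child: list of (parent_id, child_id) pairs, sorted by the child's document order
    """
    root = ET.fromstring(source)
    ids = {}
    nodes = []
    for i, el in enumerate(root.iter()):   # iter() walks the tree in document order
        ids[el] = i
        nodes.append((page_id, i, el.tag))
    child = sorted(
        ((ids[p], ids[c]) for p in root.iter() for c in p),
        key=lambda pair: pair[1],
    )
    return nodes, child


def children_of_tag(nodes, child, parent_tag):
    """Evaluate a rule of the form: result(X) :- child(P, X), tag(P, parent_tag)."""
    tag = {node_id: t for _, node_id, t in nodes}
    return [c for p, c in child if tag[p] == parent_tag]


# Hypothetical sample page; the URL only serves as the page identifier.
PAGE = "<html><body><p>a</p><p>b</p></body></html>"
nodes, child = build_edb("http://example.org/page", PAGE)
paragraph_ids = children_of_tag(nodes, child, "body")
print(child)          # [(0, 1), (1, 2), (1, 3)]
print(paragraph_ids)  # [2, 3]
```

Variables of a rule then range over node identifiers, and the answers come back in document order because the `child` facts are stored that way.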
Specification