Method and system for extracting web query interfaces
First Claim
1. A computer readable storage medium encoded with a computer program to be executed by a computer for extracting semantic information about a plurality of documents autonomously created by different sources and being accessible via a computer network, said computer readable storage medium comprising:
- a tokenizer for causing the computer to generate a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created documents;
- a grammar mechanism for causing the computer to derive a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
- a best-effort parser for causing the computer to apply the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents; and
said grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
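As a reading aid, the five tuple <Σ, N, s, Pd, Pf> recited in the claim can be pictured as a plain data structure. The following is only a minimal sketch, not the patented implementation: the class names (`Production`, `Preference`, `VisualGrammar`), the `relation` field, and the example symbols ("text", "textbox", "Condition", "Label", "QI") are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Production:
    """One rule in Pd: a visual pattern 'head -> body' (hypothetical layout)."""
    head: str
    body: Tuple[str, ...]
    relation: str  # assumed name of a spatial constraint, e.g. "left_of"


@dataclass(frozen=True)
class Preference:
    """One rule in Pf: 'winner' takes precedence over 'loser' on conflict."""
    winner: str
    loser: str


@dataclass(frozen=True)
class VisualGrammar:
    terminals: FrozenSet[str]             # Σ
    nonterminals: FrozenSet[str]          # N
    start: str                            # s, which must be a member of N
    productions: Tuple[Production, ...]   # Pd
    preferences: Tuple[Preference, ...]   # Pf

    def __post_init__(self):
        # Check the structural conditions stated in the definition.
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a nonterminal")
        symbols = self.terminals | self.nonterminals
        for rule in self.productions:
            if rule.head not in self.nonterminals:
                raise ValueError(f"head {rule.head!r} is not a nonterminal")
            if any(sym not in symbols for sym in rule.body):
                raise ValueError(f"unknown symbol in body of {rule.head!r}")


grammar = VisualGrammar(
    terminals=frozenset({"text", "textbox"}),
    nonterminals=frozenset({"QI", "Condition", "Label"}),
    start="QI",
    productions=(
        Production("Condition", ("text", "textbox"), "left_of"),
        Production("Label", ("text",), "any"),
        Production("QI", ("Condition",), "any"),
    ),
    # If a text token could be read either way, prefer Condition over Label.
    preferences=(Preference("Condition", "Label"),),
)
```

Keeping Pd and Pf as separate fields mirrors the claim language, which distinguishes rules that describe visual patterns from rules that rank competing patterns.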
Abstract
A computer program product being embodied on a computer readable medium for extracting semantic information about a plurality of documents being accessible via a computer network, the computer program product including computer-executable instructions for: generating a plurality of tokens from at least one of the documents, each token being indicative of a displayed item and a corresponding position; and, constructing at least one parse tree indicative of a semantic structure of the at least one document from the tokens dependently upon a grammar being indicative of presentation conventions.
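To illustrate what "each token being indicative of a displayed item and a corresponding position" might look like in practice, here is a minimal sketch. It assumes the rendered items are already available as (kind, label, bounding box) triples; the `Token` fields, the `tokenize` helper, and the sample values are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str    # displayed item type, e.g. "text" or "textbox" (assumed names)
    label: str   # visible text, if any
    left: int    # bounding box of the item in the displayed page, in pixels
    top: int
    right: int
    bottom: int


def tokenize(rendered_items):
    """Build tokens from (kind, label, box) triples, sorted in reading order."""
    tokens = [Token(kind, label, *box) for kind, label, box in rendered_items]
    # Reading order here is simply top-to-bottom, then left-to-right.
    return sorted(tokens, key=lambda t: (t.top, t.left))


# Hypothetical rendering of a one-field form: a caption next to an input box.
items = [
    ("textbox", "", (60, 10, 160, 30)),
    ("text", "Author", (0, 10, 50, 30)),
]
tokens = tokenize(items)
```

In this sketch the position is carried as a bounding box so that later stages can test spatial relations such as "left of" or "above" between tokens.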
31 Claims
1. A computer readable storage medium encoded with a computer program to be executed by a computer for extracting semantic information about a plurality of documents autonomously created by different sources and being accessible via a computer network, said computer readable storage medium comprising:
- a tokenizer for causing the computer to generate a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created documents;
- a grammar mechanism for causing the computer to derive a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
- a best-effort parser for causing the computer to apply the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents; and
said grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A method for extracting semantic information about a plurality of electronic documents autonomously created by different sources and being accessible via a computer network, comprising:
- accessing an electronic document via the computer network;
- generating a set of tokens by a computer, the tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image of the electronic document;
- deriving a non-prescribed visual grammar from the set of tokens by the computer, to represent a hidden syntax convention of visual presentation in the displayed document image; and
- applying said derived visual grammar by the computer to construct multiple parse trees that represent semantic structure of the electronic document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from autonomous or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation, and
said derived non-prescribed visual grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
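The applying step of the method claim (build multiple parse trees and interpret as many tokens as possible) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed parser: tokens are reduced to their kinds in reading order, the only spatial constraint is one-dimensional adjacency, production bodies are non-empty, and preference rules are applied after all candidates are built. The names `Node`, `parse_best_effort`, and the example symbols are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Node:
    symbol: str
    covers: frozenset  # indices of the tokens this (partial) tree interprets
    children: tuple = field(default=(), compare=False)


def _consecutive(parts):
    # Simplified spatial constraint: each part starts right after the previous.
    return all(min(b.covers) == max(a.covers) + 1
               for a, b in zip(parts, parts[1:]))


def parse_best_effort(kinds, productions, preferences):
    """Return maximal partial parse trees, ordered by first covered token."""
    nodes = {Node(kind, frozenset({i})) for i, kind in enumerate(kinds)}
    changed = True
    while changed:  # bottom-up fix point over the production rules (Pd)
        changed = False
        for head, body in productions:
            pools = [[n for n in nodes if n.symbol == sym] for sym in body]
            for parts in product(*pools):
                if not _consecutive(parts):
                    continue
                covers = frozenset().union(*(p.covers for p in parts))
                node = Node(head, covers, parts)
                if node not in nodes:
                    nodes.add(node)
                    changed = True
    # Preference rules (Pf): a loser is dropped if a winner claims its tokens.
    pruned = {
        n for n in nodes for winner, loser in preferences
        if n.symbol == loser
        and any(m.symbol == winner and m.covers & n.covers for m in nodes)
    }

    def valid(n):
        return n not in pruned and all(valid(c) for c in n.children)

    trees = [n for n in nodes if n.children and valid(n)]
    # Keep only trees not strictly contained in a larger surviving tree.
    maximal = [t for t in trees if not any(t.covers < o.covers for o in trees)]
    return sorted(maximal, key=lambda t: min(t.covers))


kinds = ["text", "textbox", "image", "text", "select"]
productions = [
    ("Condition", ("text", "textbox")),
    ("Condition", ("text", "select")),
    ("Label", ("text",)),
]
preferences = [("Condition", "Label")]
trees = parse_best_effort(kinds, productions, preferences)
```

With these inputs the sketch yields two separate "Condition" trees and leaves the "image" token uninterpreted, which is the sense in which the parse is best-effort: no single tree has to span the whole token set.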
23. A computer implemented system for extracting semantic information about a plurality of electronic documents autonomously created by different sources and being accessible via a computer network, comprising:
- a programmed computer including a tokenizer for generating a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created electronic documents accessed via the computer network;
- the programmed computer including a grammar mechanism for deriving a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
- the programmed computer including a best-effort parser for applying the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents, and
said grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
Dependent claims: 24, 25, 26, 27, 28, 29, 30, 31
Specification