Method and system for extracting web query interfaces

US 7,552,116 B2
Filed: 08/06/2004
Issued: 06/23/2009
Est. Priority Date: 08/06/2004
Status: Expired due to Fees

Abstract

A computer program product being embodied on a computer readable medium for extracting semantic information about a plurality of documents being accessible via a computer network, the computer program product including computer-executable instructions for: generating a plurality of tokens from at least one of the documents, each token being indicative of a displayed item and a corresponding position; and, constructing at least one parse tree indicative of a semantic structure of the at least one document from the tokens dependently upon a grammar being indicative of presentation conventions.
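
As an illustration only, a token "indicative of a displayed item and a corresponding position" could be modeled as below. The field names follow the visual properties listed in the claims (coordinates, value, DOM path, type, name); the `is_left_of` helper, its `max_gap` threshold, and the sample DOM paths are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One rendered DOM node together with its visual properties."""
    name: str
    type: str       # e.g. "text", "textbox", "radio" (assumed type labels)
    value: str
    dom_path: str
    left: int       # bounding box in the rendered document image
    top: int
    right: int
    bottom: int


def is_left_of(a: Token, b: Token, max_gap: int = 20) -> bool:
    """True if a ends just to the left of b and both sit on the same row."""
    same_row = a.top < b.bottom and b.top < a.bottom  # vertical overlap
    gap = b.left - a.right                            # horizontal distance
    return same_row and 0 <= gap <= max_gap


# Hypothetical tokens: a text label followed by a text box on the same row.
label = Token("t1", "text", "Author", "/html/body/form/table/tr[1]/td[1]",
              10, 10, 60, 30)
box = Token("t2", "textbox", "", "/html/body/form/table/tr[1]/td[2]/input",
            70, 8, 220, 32)
```

A spatial predicate of this kind is one plausible building block for the visual patterns that the grammar's production rules are said to represent.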

31 Claims

1. A computer readable storage medium encoded with a computer program to be executed by a computer for extracting semantic information about a plurality of documents autonomously created by different sources and being accessible via a computer network, said computer readable storage medium comprising:
- a tokenizer for causing the computer to generate a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created documents;
  
  a grammar mechanism for causing the computer to derive a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
  
  a best-effort parser for causing the computer to apply the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
  
  wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents; and
  
  said grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
- - 2. The computer readable storage medium of claim 1, wherein said tokenizer for causing the computer to generate said set of tokens comprises computer executable instructions for:
    - receiving a document in a mark-up language;
      
      rendering said document by a layout engine into a document image;
      
      extracting tokens indicative of DOM nodes with visual properties in the rendered document image; and
      
      storing said tokens in a memory with visual properties.
  - 3. The computer readable storage medium of claim 2, wherein said visual properties include, for each token, the coordinates of the token in the displayed document image, the value of the token, the DOM path of the token in the DOM tree, the type of the token, and the name of the token.
  - 4. The computer readable storage medium of claim 1 wherein said hidden syntax convention of visual presentation is derived by performing the following steps:
    - observing the visual relationship of tokens on how they form semantic units, and deriving visual patterns to represent the semantic unit; and
      
      observing precedence between different conflicting patterns and deriving pattern preference to represent their precedence.
  - 5. The computer readable storage medium of claim 1, wherein said production rules are a four tuple (H, M, C, F), where H ∈ N is the head of the production, M ⊂ Σ ∪ N is a multiset of symbols, C is a boolean constraint defined on M and F is a constructor defined on M.
  - 6. The computer readable storage medium of claim 1, wherein said preference rules are a three tuple <I, U, W>, where I identifies the types of conflicting instances, U defines a conflicting condition which said rule will handle, and W specifies the winning criteria to solve the conflict by picking one instance from the conflicting ones.
  - 7. The computer readable storage medium of claim 6 wherein said best-effort parser causes the computer to perform a procedure comprising the steps of:
    - building a schedule graph for determining the order of applying production rules;
      
      building multiple parse trees simultaneously by grouping tokens using production rules in determined orders;
      
      pruning useless parse trees by checking preference rules; and
      
      outputting multiple potential useful parse trees that maximally cover the tokens in the document.
  - 8. The computer readable storage medium of claim 7 wherein the step of building a schedule graph comprises:
    - adding dependency edges from head symbol to each component symbol for each production rule;
      
      adding restriction edges between two symbols for each preference rule;
      
      transforming restriction edges when graph is acyclic; and
      
      removing restriction edges if graph is still acyclic after said transformation.
  - 9. The computer readable storage medium of claim 7 wherein the step of building multiple parse trees comprises the steps of:
    - getting productions in the order determined by the schedule graph; and
      
      applying each production (H, M, C, F), in the identified order, to generate new instances of the head H, from the instances of the components M, by using the construction function F, if the components M satisfy constraint C.
  - 10. The computer readable storage medium of claim 7, wherein the step of pruning useless parse trees comprises the steps of:
    - for each newly generated instance, checking the conflicting condition of all preference rules to find all conflicts with existing instances; and
      
      applying the winning criteria for conflict resolution to remove false instances.
  - 11. The computer readable storage medium of claim 1, wherein said Web documents are Web query forms.
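
The production rules (H, M, C, F) and preference rules <I, U, W> recited above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Instance` class, the symbol names `text`, `textbox` and `Pair`, the distance threshold, and the "leftmost label wins" criterion are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Tuple


@dataclass(frozen=True)
class Instance:
    symbol: str           # terminal or nonterminal symbol
    covers: frozenset     # ids of the tokens this instance interprets
    x: int = 0            # left edge, used by the toy rules below


@dataclass(frozen=True)
class Production:         # the four tuple (H, M, C, F)
    head: str
    components: Tuple[str, ...]
    constraint: Callable[..., bool]
    constructor: Callable[..., Instance]


@dataclass(frozen=True)
class Preference:         # the three tuple <I, U, W>
    types: Tuple[str, str]
    conflict: Callable[[Instance, Instance], bool]
    winner: Callable[[Instance, Instance], Instance]


def apply_production(p, instances):
    """Generate new instances of p.head from instances of its components."""
    pools = [[i for i in instances if i.symbol == s] for s in p.components]
    made = []
    for combo in product(*pools):
        ids = [t for c in combo for t in c.covers]
        if len(ids) != len(set(ids)):    # components must not share tokens
            continue
        if p.constraint(*combo):
            made.append(p.constructor(*combo))
    return made


def prune(instances, preferences):
    """Drop the losing instance of every conflict a preference rule flags."""
    losers = set()
    for r in preferences:
        for a in instances:
            for b in instances:
                if a is b or (a.symbol, b.symbol) != r.types:
                    continue
                if r.conflict(a, b):
                    losers.add(b if r.winner(a, b) is a else a)
    return [i for i in instances if i not in losers]


# Toy input: two text labels competing for the same text box.
tokens = [Instance("text", frozenset({1}), 0),
          Instance("textbox", frozenset({2}), 50),
          Instance("text", frozenset({3}), 200)]
rule = Production("Pair", ("text", "textbox"),
                  lambda t, b: abs(t.x - b.x) <= 150,
                  lambda t, b: Instance("Pair", t.covers | b.covers, t.x))
pref = Preference(("Pair", "Pair"),
                  lambda a, b: bool(a.covers & b.covers),
                  lambda a, b: a if a.x <= b.x else b)

pairs = apply_production(rule, tokens)   # both labels pair with the box
kept = prune(pairs, [pref])              # only the left-label pairing survives
```

Here the production over-generates on purpose and the preference rule resolves the conflict afterwards, which mirrors the "build, then prune" ordering of the recited steps.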

12. A method for extracting semantic information about a plurality of electronic documents autonomously created by different sources and being accessible via a computer network, comprising:
- accessing an electronic document via the computer network;
  
  generating a set of tokens by a computer, the tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image of the electronic document;
  
  deriving a non-prescribed visual grammar from the set of tokens by the computer, to represent a hidden syntax convention of visual presentation in the displayed document image; and
  
  applying said derived visual grammar by the computer to construct multiple parse trees that represent semantic structure of the electronic document and interpret a maximum subset of the set of tokens,
  
  wherein said non-prescribed visual grammar is derived from autonomous or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation, and
  
  said derived non-prescribed visual grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
- - 13. The method of claim 12, wherein generating said set of tokens comprises:
    - receiving data representing a document in a mark-up language;
      
      rendering said document by a layout engine into a document image;
      
      extracting tokens indicative of DOM nodes with visual properties in the rendered document image; and
      
      storing said tokens in a memory with visual properties.
  - 14. The method of claim 13, wherein said visual properties include for each token, the coordinates of the token in the displayed document image, the value of the token, the DOM path of the token in the DOM tree, the type of the token, and the name of the token.
  - 15. The method of claim 12 wherein said hidden syntax convention of visual presentation is derived by performing the following steps:
    - observing the visual relationship of tokens on how they form semantic units, and deriving visual patterns to represent the semantic unit; and
      
      observing precedence between different conflicting patterns and deriving pattern preference to represent their precedence.
  - 16. The method of claim 12, wherein said production rules are a four tuple (H, M, C, F), where H ∈ N is the head of the production, M ⊂ Σ ∪ N is a multiset of symbols, C is a boolean constraint defined on M and F is a constructor defined on M.
  - 17. The method of claim 12, wherein said preference rules are a three tuple <I, U, W>, where I identifies the types of conflicting instances, U defines a conflicting condition, which said rule will handle, and W specifies the winning criteria to solve the conflict by picking one instance from the conflicting ones.
  - 18. The method of claim 17 wherein said step of applying said derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret as many tokens as said derived visual grammar can comprises the additional steps of:
    - building a schedule graph for determining the order of applying production rules;
      
      building multiple parse trees simultaneously by grouping tokens using production rules in determined orders;
      
      pruning useless parse trees by checking preference rules; and
      
      outputting multiple potential useful parse trees that maximally cover the tokens in the document.
  - 19. The method of claim 18 wherein the step of building a schedule graph comprises:
    - adding dependency edges from head symbol to each component symbol for each production rule;
      
      adding restriction edges between two symbols for each preference rule;
      
      transforming restriction edges when graph is acyclic; and
      
      removing restriction edges if graph is still acyclic after said transformation.
  - 20. The method of claim 18 wherein the step of building multiple parse trees comprises the steps of:
    - getting productions in the order determined by the schedule graph; and
      
      applying each production (H, M, C, F), in the identified order, to generate new instances of the head H, from the instances of the components M, by using the construction function F, if the components M satisfy constraint C.
  - 21. The method of claim 18, wherein the step of pruning useless parse trees comprises the additional steps of:
    - for each newly generated instance, checking the conflicting condition of all preference rules to find all conflicts with existing instances; and
      
      applying the winning criteria for conflict resolution to remove false instances.
  - 22. The method of claim 12, wherein said Web documents are Web query forms.
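
The schedule-graph step recited above (dependency edges from each head to its components, plus restriction edges from preference rules) can be approximated with a topological sort. The sketch below is a simplification under stated assumptions: it uses Python's standard `graphlib` module, ignores direct self-recursion, and, instead of the recited edge transformation, simply drops all restriction edges when they introduce a cycle. It would still fail on mutually recursive productions. The symbol names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def schedule(productions, restrictions):
    """Order symbols so that each head comes after its components.

    productions:  dict mapping head -> list of component-symbol lists
    restrictions: list of (earlier, later) symbol pairs, one per preference
                  rule, meaning `earlier` should be instantiated first
    """
    deps = {}
    for head, bodies in productions.items():
        deps.setdefault(head, set())
        for body in bodies:
            for sym in body:
                if sym != head:          # ignore direct self-recursion here
                    deps[head].add(sym)
                    deps.setdefault(sym, set())

    # Copy the dependency graph, then layer the restriction edges on top.
    with_restrictions = {s: set(p) for s, p in deps.items()}
    for earlier, later in restrictions:
        with_restrictions.setdefault(later, set()).add(earlier)
        with_restrictions.setdefault(earlier, set())

    try:
        return list(TopologicalSorter(with_restrictions).static_order())
    except CycleError:
        # Simplification: fall back to the dependency edges alone.
        return list(TopologicalSorter(deps).static_order())


# Hypothetical grammar fragment.
productions = {
    "Form": [["Cond"]],
    "Cond": [["text", "textbox"]],
    "RadioList": [["radio", "text"]],
}
order = schedule(productions, [("RadioList", "Cond")])
fallback = schedule(productions, [("Form", "Cond")])  # restriction makes a cycle
```

Processing productions in such an order means every component instance exists before the heads that consume it are built, and a preferred pattern can be instantiated before the pattern it is allowed to override.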

23. A computer implemented system for extracting semantic information about a plurality of electronic documents autonomously created by different sources and being accessible via a computer network, comprising:
- a programmed computer including a tokenizer for generating a set of tokens indicative of document object model (DOM) nodes associated with visual information in a displayed document image from one of the plurality of autonomously created electronic documents accessed via the computer network;
  
  the programmed computer including a grammar mechanism for deriving a non-prescribed visual grammar from the set of tokens to represent a hidden syntax convention of a visual presentation; and
  
  the programmed computer including a best-effort parser for applying the derived visual grammar to construct multiple parse trees that represent semantic structure of the document and interpret a maximum subset of the set of tokens,
  
  wherein said non-prescribed visual grammar is derived from a plurality of autonomously created or heterogeneous Web documents to represent the hidden syntax convention of the visual presentation common among the plurality of autonomously created or heterogeneous Web documents, and
  
  said grammar is a five tuple <Σ, N, s, Pd, Pf>, where Σ is a set of terminal symbols, N is a set of nonterminal symbols, s ∈ N is a start symbol, Pd is a set of production rules that represent visual patterns and Pf is a set of preference rules that represent pattern precedence.
- - 24. The system of claim 23, wherein said visual properties include, for each token, the coordinates of the token in the displayed document image, the value of the token, the DOM path of the token in the DOM tree, the type of the token, and the name of the token.
  - 25. The system of claim 23 wherein said hidden syntax convention of visual presentation is derived by performing the following steps:
    - observing the visual relationship of tokens on how they form semantic units, and deriving visual patterns to represent the semantic unit; and
      
      observing precedence between different conflicting patterns and deriving pattern preference to represent their precedence.
  - 26. The system of claim 23, wherein said production rules are a four tuple (H, M, C, F), where H ∈ N is the head of the production, M ⊂ Σ ∪ N is a multiset of symbols, C is a Boolean constraint defined on M and F is a constructor defined on M.
  - 27. The system of claim 23, wherein said preference rules are a three tuple <I, U, W>, where I identifies the types of conflicting instances, U defines a conflicting condition which said rule will handle, and W specifies a winning criteria to solve the conflict by picking one instance from the conflicting ones.
  - 28. The system of claim 27 wherein said best-effort parser performs a procedure comprising the steps of:
    - building a schedule graph for determining the order of applying production rules;
      
      building multiple parse trees simultaneously by grouping tokens using production rules in determined orders;
      
      pruning useless parse trees by checking preference rules; and
      
      outputting multiple potential useful parse trees that maximally cover the tokens in the document.
  - 29. The system of claim 28, wherein the step of building a schedule graph comprises:
    - adding dependency edges from head symbol to each component symbol for each production rule;
      
      adding restriction edges between two symbols for each preference rule;
      
      transforming restriction edges when graph is acyclic; and
      
      removing restriction edges if graph is still acyclic after said transformation.
  - 30. The system of claim 29, wherein the step of building multiple parse trees comprises the steps of:
    - getting productions in the order determined by the schedule graph; and
      
      applying each production (H, M, C, F), in the identified order, to generate new instances of the head H, from the instances of the components M, by using the construction function F, if the components M satisfy constraint C.
  - 31. The system of claim 29, wherein the step of pruning useless parse trees comprises the steps of:
    - for each newly generated instance, checking the conflicting condition of all preference rules to find all conflicts with existing instances; and
      
      applying the winning criteria for conflict resolution to remove false instances.

Current Assignee: Board of Trustees of The University of Illinois
Original Assignee: Board of Trustees of The University of Illinois
Inventors: Zhang, Zhen; He, Bin; Chang, Kevin Chen-Chuan
Primary Examiner(s): Cottingham; John R.
Assistant Examiner(s): Khakhar; Nirav K
Application Number: US10/913,721
Publication Number: US 20060031202A1
Time in Patent Office: 1,782 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 40/174   Form filling; Merging
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/30   Semantic analysis
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
