Automated screen scraping via grammar induction

US 8,838,625 B2
Filed: 04/03/2009
Issued: 09/16/2014
Est. Priority Date: 04/03/2009
Status: Expired due to Fees

Abstract

A method and a computer-readable medium are provided which perform screen scraping via grammar induction. The computer-readable medium stores instructions of the method, the instructions directing a computer processor to intercept display information transmitted to a computer-implemented display device representing information stored in a data source; induce a grammar via statistical analysis of the intercepted display information; provide the grammar to a parser-generator to generate a parser corresponding to the induced grammar; and perform screen scraping using the generated parser.

16 Claims

1. A computer-implemented method of extracting information from a data source, comprising:
- intercepting display information transmitted to a computer-implemented display device;
  
  wherein the display information is from the data source;
  
  wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device;
  
  inducing a grammar via statistical analysis of the intercepted display information;
  
  wherein inducing a grammar includes determining how to break up the particular visual content into component parts;
  
  wherein determining how to break up the particular visual content into component parts includes:
  
  identifying a plurality of tokens in the particular visual content;
  
  for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
  
  determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
  
  generating a parser corresponding to the induced grammar; and
  
  performing screen scraping using the generated parser to produce a sequence of return values representing the extracted information;
  
  wherein the method is performed by one or more computing devices.
- - 2. The method of claim 1, wherein inducing a grammar further comprises:
    - forming a histogram of the tokens according to the frequency of each token; and
      
      segmenting the text into records starting with tokens with lower frequency.
  - 3. The method of claim 1, wherein generating the parser further comprises:
    - receiving the induced grammar;
      
      representing the induced grammar using a regular language; and
      
      programming state transitions of a finite state machine to correspond to relationships represented in the regular language.
  - 4. The method of claim 3, wherein:
    - the particular visual content includes fields; and
      
      the generated parser produces the sequence of return values based, at least in part, on annotations which specify which fields to extract from the particular visual content, and how the fields map to the return values.
  - 5. The method of claim 3, wherein screen scraping further comprises:
    - receiving the intercepted display information as an input to the finite state machine; and
      
      producing the return values representing the extracted information.
  - 10. The method of claim 1 wherein the particular visual content includes text organized in an underlying hierarchical structure, and the method further comprises:
    - identifying the underlying hierarchical structure by recursively segmenting the text into records, starting with tokens, of the plurality of tokens, that have lower frequencies.
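
The frequency histogram of claim 2 and the recursive segmentation of claim 10 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that the tokens of interest are single delimiter characters chosen from a hypothetical candidate set (`"\n|,"`), and the names `token_histogram` and `segment` are invented for this example. At each level the text is split on the candidate token that occurs least often, and each resulting record is segmented again with the remaining candidates.

```python
from collections import Counter


def token_histogram(text, candidates):
    """Count how often each candidate delimiter token occurs in the text."""
    return Counter(ch for ch in text if ch in candidates)


def segment(text, candidates="\n|,"):
    """Recursively split text into nested records.

    Assumption for this sketch: the least frequent delimiter marks the
    outermost record boundary, so segmentation starts with it and then
    recurses into each record using the remaining delimiters.
    """
    hist = token_histogram(text, candidates)
    if not hist:
        # No delimiter left: this is a leaf field.
        return text.strip()
    # Lowest frequency first; ties are broken by position in `candidates`.
    delimiter = min(hist, key=lambda tok: (hist[tok], candidates.index(tok)))
    parts = [part for part in text.split(delimiter) if part.strip()]
    remaining = candidates.replace(delimiter, "")
    return [segment(part, remaining) for part in parts]


# Made-up display text: two lines, three records per line, two fields each.
screen = "alice,30|bob,25|carol,41\ndave,52|erin,19|frank,33"
nested = segment(screen)
```

Here the newline occurs once, the pipe four times and the comma six times, so the sketch splits on lines first, then on pipes, then on commas, which recovers a three-level hierarchy from the frequencies alone.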

6. An apparatus for extracting information from a computer-based data source, comprising:
- means for intercepting display information transmitted to a computer-implemented display device representing information stored in a data source;
  
  wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device; and
  
  a computer processor executing a sequence of instructions configuring the computer processor as:
  
  a grammar inducer producing a representation of a hierarchical structure that underlies the display information, wherein the hierarchical structure produced by the grammar inducer is described by a regular language;
  
  wherein the grammar inducer determines how to break up the particular visual content into component parts;
  
  wherein the grammar inducer determines how to break up the particular visual content into component parts by:
  
  identifying a plurality of tokens in the particular visual content;
  
  for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
  
  determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
  
  a parser-generator receiving the representation and configured to generate a parser corresponding thereto; and
  
  a screen scraper configured to extract the information from the intercepted display information using the generated parser.
- - 7. The apparatus of claim 6, wherein the sequence of instructions for the grammar inducer further comprise instructions configuring the processor to perform:
    - forming a histogram of tokens according to the frequency of each token; and
      
      segmenting the text into records starting with tokens with lower frequency.
  - 8. The apparatus of claim 7, wherein the sequence of instructions for the grammar inducer further comprise instructions configuring the processor to perform:
    - recursively segmenting the records to identify the underlying hierarchical structure.
  - 9. The apparatus of claim 6, wherein:
    - the particular visual content includes fields; and
      
      the sequence of instructions for the parser-generator further comprise instructions configuring the processor to perform:
      
      producing the sequence of return values based, at least in part, on annotations which specify fields to extract from the particular visual content, and how the fields map to the return values.
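
The role of annotations in claim 9 can be sketched as follows. This is an illustration under stated assumptions, not the patent's parser-generator: the induced regular language is stood in for by a hand-written regular expression (`RECORD_PATTERN`), and the annotation format (a mapping from field position to return-value name, `ANNOTATIONS`) is hypothetical. The sample lines are made up.

```python
import re

# Assumed stand-in for an induced grammar: one record is three fields
# separated by "|" with optional surrounding whitespace.
RECORD_PATTERN = re.compile(r"^([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*$")

# Hypothetical annotations: which fields to extract (by 1-based position)
# and which return-value name each extracted field maps to.
ANNOTATIONS = {1: "symbol", 3: "price"}


def scrape(lines, pattern=RECORD_PATTERN, annotations=ANNOTATIONS):
    """Return one dict of annotated fields per line that matches the grammar."""
    results = []
    for line in lines:
        match = pattern.match(line)
        if match is None:
            # Lines outside the grammar (headers, rulers) are skipped.
            continue
        results.append(
            {name: match.group(index) for index, name in annotations.items()}
        )
    return results


rows = scrape([
    "AAA | Alpha Corp | 10.00",
    "-- page header --",
    "BBB | Beta Ltd | 3.25",
])
```

Keeping the annotations separate from the pattern means the same grammar can serve several extraction tasks: changing `ANNOTATIONS` changes which fields come back without touching the parser.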

11. A non-transitory computer-readable storage medium storing instructions for extracting information from a data source, wherein the instructions include instructions which, when executed by one or more processors, cause the one or more processors to perform a method comprising the steps of:
- intercepting display information transmitted to a computer-implemented display device;
  
  wherein the display information is from the data source;
  
  wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device;
  
  inducing a grammar via statistical analysis of the intercepted display information;
  
  wherein inducing a grammar includes determining how to break up the particular visual content into component parts;
  
  wherein determining how to break up the particular visual content into component parts includes:
  
  identifying a plurality of tokens in the particular visual content;
  
  for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
  
  determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
  
  generating a parser corresponding to the induced grammar; and
  
  performing screen scraping using the generated parser to produce a sequence of return values representing the extracted information.
- - 12. The non-transitory computer-readable storage medium of claim 11, wherein inducing a grammar further comprises:
    - forming a histogram of the tokens according to the frequency of each token; and
      
      segmenting the text into records starting with tokens with lower frequency.
  - 13. The non-transitory computer-readable storage medium of claim 11, wherein generating the parser further comprises:
    - receiving the induced grammar;
      
      representing the induced grammar using a regular language; and
      
      programming state transitions of a finite state machine to correspond to relationships represented in the regular language.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein:
    - the particular visual content includes fields; and
      
      the generated parser produces the sequence of return values based, at least in part, on annotations which specify which fields to extract from the particular visual content, and how the fields map to the return values.
  - 15. The non-transitory computer-readable storage medium of claim 13, wherein screen scraping further comprises:
    - receiving the intercepted display information as an input to the finite state machine; and
      
      producing the return values representing the extracted information.
  - 16. The non-transitory computer-readable storage medium of claim 11, wherein the particular visual content includes text organized in an underlying hierarchical structure, and the method further comprises:
    - identifying the underlying hierarchical structure by recursively segmenting the text into records, starting with tokens, of the plurality of tokens, that have lower frequencies.
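
Claims 13 and 15 describe programming the state transitions of a finite state machine from a regular language and then feeding the intercepted display information to that machine. A minimal sketch of the idea follows. The regular language used here (a record is one WORD, then one or more NUM tokens, then an end of line), the token classes, and the names `TRANSITIONS`, `classify`, `tokenize` and `run_fsm` are all assumptions made for the example and do not come from the patent.

```python
# Transition table for the assumed regular language: (WORD NUM+ EOL)*
TRANSITIONS = {
    ("start", "WORD"): "name",
    ("name", "NUM"): "values",
    ("values", "NUM"): "values",
    ("values", "EOL"): "start",
}


def classify(token):
    """Map a raw token to one of the token classes used by the table."""
    if token == "\n":
        return "EOL"
    if token.replace(".", "", 1).isdigit():
        return "NUM"
    return "WORD"


def tokenize(text):
    """Split text on whitespace, emitting an explicit end-of-line token."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split())
        tokens.append("\n")
    return tokens


def run_fsm(text):
    """Drive the state machine over the text and collect return values."""
    state = "start"
    records, current = [], None
    for token in tokenize(text):
        kind = classify(token)
        next_state = TRANSITIONS.get((state, kind))
        if next_state is None:
            raise ValueError(f"unexpected {kind} token {token!r} in state {state!r}")
        if kind == "WORD":
            current = {"name": token, "values": []}
        elif kind == "NUM":
            current["values"].append(float(token))
        else:
            # EOL closes the current record.
            records.append(current)
            current = None
        state = next_state
    if state != "start":
        raise ValueError("input ended in the middle of a record")
    return records


records = run_fsm("cpu 0.5 1.5\nmem 2\n")
```

Input that falls outside the assumed language, such as a line that starts with a number, has no entry in the table and is rejected with a `ValueError` rather than silently misparsed.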

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Xu, Zhichen; Fu, Yun; Yen, Peter; Song, Ning
Primary Examiner(s): Bhatia, Ajay
Assistant Examiner(s): Mueller, Kurt A
Application Number: US12/417,773
Publication Number: US 20100256974A1
Time in Patent Office: 1,992 Days
Field of Search: 707/741, 707/755
US Class Current: 707/755
CPC Class Codes:
G06F 40/186 Templates
G06F 40/216 using statistical methods
