Automated screen scraping via grammar induction
Abstract
A method and a computer-readable medium are provided which perform screen scraping via grammar induction. The computer-readable medium stores instructions of the method, the instructions directing a computer processor to intercept display information transmitted to a computer-implemented display device representing information stored in a data source; induce a grammar via statistical analysis of the intercepted display information; provide the grammar to a parser-generator to generate a parser corresponding to the induced grammar; and perform screen scraping using the generated parser.
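The interception step can be pictured as a pass-through wrapper around whatever the display device reads from. The following is only a minimal sketch under stated assumptions: the class name `InterceptingDisplay`, the use of an in-memory text stream as a stand-in for the display device, and the sample rows are all hypothetical and do not come from the patent.

```python
import io


class InterceptingDisplay:
    """Hypothetical wrapper: forwards text to a display stream and keeps a copy."""

    def __init__(self, device):
        self._device = device   # the stream standing in for the display device
        self.captured = []      # copies of everything sent to the display

    def write(self, text):
        self.captured.append(text)        # record the display information
        return self._device.write(text)   # pass it through unchanged


# Assumed stand-in for a terminal or screen buffer.
screen = io.StringIO()
display = InterceptingDisplay(screen)

# Assumed sample output from a data source.
for row in ["Name: Ada Balance: 120", "Name: Bob Balance: 75"]:
    display.write(row + "\n")
```

After the loop, `display.captured` holds the intercepted rows for later analysis, while `screen` has received the same text it would have received without the wrapper.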
16 Claims
1. A computer-implemented method of extracting information from a data source, comprising:
intercepting display information transmitted to a computer-implemented display device;
wherein the display information is from the data source;
wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device;
inducing a grammar via statistical analysis of the intercepted display information;
wherein inducing a grammar includes determining how to break up the particular visual content into component parts;
wherein determining how to break up the particular visual content into component parts includes:
identifying a plurality of tokens in the particular visual content;
for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
generating a parser corresponding to the induced grammar; and
performing screen scraping using the generated parser to produce a sequence of return values representing the extracted information;
wherein the method is performed by one or more computing devices.
Dependent claims: 2, 3, 4, 5, 10
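One way to read the token-frequency steps of claim 1 is sketched below. This is an illustrative assumption, not the patented procedure: the tokenizer, the helper names (`tokenize`, `token_frequencies`, `split_components`), the threshold rule, and the sample screens are all hypothetical. The heuristic treats tokens that appear on every captured screen as fixed layout text and rarer tokens as variable data fields.

```python
import re
from collections import Counter


def tokenize(screen):
    """Split a screen into word tokens and single punctuation tokens (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", screen)


def token_frequencies(screens):
    """Fraction of captured screens in which each token appears at least once."""
    counts = Counter()
    for screen in screens:
        counts.update(set(tokenize(screen)))
    return {tok: n / len(screens) for tok, n in counts.items()}


def split_components(screen, freqs, threshold=1.0):
    """Label each token 'literal' (frequent layout text) or 'field' (variable data)."""
    return [
        ("literal" if freqs.get(tok, 0.0) >= threshold else "field", tok)
        for tok in tokenize(screen)
    ]


# Assumed sample of intercepted display information.
screens = [
    "Name: Ada Balance: 120",
    "Name: Bob Balance: 75",
    "Name: Cy Balance: 9",
]
freqs = token_frequencies(screens)
parts = split_components(screens[0], freqs)
```

With these sample screens, `Name`, `:` and `Balance` occur on every screen and are labelled as literals, while `Ada` and `120` occur on one screen in three and are labelled as fields. A real system would likely need a softer threshold and a richer tokenizer.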
6. An apparatus for extracting information from a computer-based data source, comprising:
means for intercepting display information transmitted to a computer-implemented display device representing information stored in a data source;
wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device; and
a computer processor executing a sequence of instructions configuring the computer processor as:
a grammar inducer producing a representation of a hierarchical structure that underlies the display information, wherein the hierarchical structure produced by the grammar inducer is described by a regular language;
wherein the grammar inducer determines how to break up the particular visual content into component parts;
wherein the grammar inducer determines how to break up the particular visual content into component parts by:
identifying a plurality of tokens in the particular visual content;
for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
a parser-generator receiving the representation and configured to generate a parser corresponding thereto; and
a screen scraper configured to extract the information from the intercepted display information using the generated parser.
Dependent claims: 7, 8, 9
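Because claim 6 describes the induced structure as a regular language, the three components can be approximated with ordinary regular expressions. The sketch below is a simplification under stated assumptions: the function names, the `(kind, token)` input format, and the sample text are hypothetical, and Python's `re.compile` merely stands in for a parser-generator.

```python
import re


def induce_regular_grammar(parts):
    """Build a regex source string: literals are escaped, fields become capture groups."""
    pieces = [
        re.escape(tok) if kind == "literal" else r"(\S+)"
        for kind, tok in parts
    ]
    return r"\s*".join(pieces)   # allow optional whitespace between components


def generate_parser(grammar):
    """Stand-in for a parser-generator: compile the grammar into a matcher."""
    return re.compile(grammar)


def scrape(parser, display_text):
    """Return the sequence of field values found in the intercepted text."""
    return [match.groups() for match in parser.finditer(display_text)]


# Assumed labelled breakdown of one screen into layout text and data fields.
parts = [
    ("literal", "Name"), ("literal", ":"), ("field", "Ada"),
    ("literal", "Balance"), ("literal", ":"), ("field", "120"),
]
grammar = induce_regular_grammar(parts)
parser = generate_parser(grammar)
rows = scrape(parser, "Name: Ada Balance: 120\nName: Bob Balance: 75\n")
```

Here `rows` contains one tuple of extracted values per matching record. Keeping grammar construction, parser generation, and scraping as separate functions mirrors the separation of grammar inducer, parser-generator, and screen scraper in the claim.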
11. A non-transitory computer-readable storage medium storing instructions for extracting information from a data source, wherein the instructions include instructions which, when executed by one or more processors, cause the one or more processors to perform a method comprising the steps of:
intercepting display information transmitted to a computer-implemented display device;
wherein the display information is from the data source;
wherein the display information includes information to cause particular visual content to be displayed on the computer-implemented display device;
inducing a grammar via statistical analysis of the intercepted display information;
wherein inducing a grammar includes determining how to break up the particular visual content into component parts;
wherein determining how to break up the particular visual content into component parts includes:
identifying a plurality of tokens in the particular visual content;
for each token of the plurality of tokens, determining a frequency at which the token appears within the display information from the data source; and
determining how to break up the particular visual content into component parts based, at least in part, on the frequency determined for each token of the plurality of tokens;
generating a parser corresponding to the induced grammar; and
performing screen scraping using the generated parser to produce a sequence of return values representing the extracted information.
Dependent claims: 12, 13, 14, 15, 16
Specification