Method and apparatus for defining data of interest
Abstract
Some embodiments of the invention include tools for extracting data of interest from the World Wide Web (WWW) using descriptions of data of interest. The descriptions of data of interest can include computer programs comprising a sequence of instructions and extractor patterns. The extractor patterns can be developed interactively using a web browser integrated into the graphical development environment for creating the descriptions. The instructions can be selected from a predetermined list of instructions designed for extracting information from the WWW. The descriptions of data of interest can be grouped into categories sharing common query elements. Multiple descriptions in the same category can be executed simultaneously using the same query. The descriptions can be accessed by a client computer using a web browser to initiate a query. In some embodiments, the descriptions of data of interest are used to provide information about products available for sale over the WWW.
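As a rough illustration of the abstract's idea of grouping descriptions into categories and running them with one shared query, the following is a minimal sketch rather than the patented implementation. The `Description` class, the `FAKE_WEB` canned pages, the `fetch` stub, the example site names, and the URL templates are all assumptions introduced here; no real web site is contacted.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """Hypothetical 'description of data of interest' for one web site."""
    site: str            # identifies the web site
    category: str        # descriptions in one category share query elements
    url_template: str    # the query value is substituted into this template
    pattern: re.Pattern  # extractor pattern applied to the site's output


# Canned pages stand in for real HTTP responses so the sketch runs offline.
FAKE_WEB = {
    "https://books-a.example/search?q=dune": "<b>Dune</b> <i>$9.99</i>",
    "https://books-b.example/find/dune": "Dune | price: 8.50 USD",
}


def fetch(url):
    """Stub for an HTTP GET; returns an empty page for unknown URLs."""
    return FAKE_WEB.get(url, "")


def run(desc, value):
    """Query one site with the value and apply its extractor pattern."""
    page = fetch(desc.url_template.format(value=value))
    match = desc.pattern.search(page)
    return desc.site, (match.group(1) if match else None)


def query_category(descriptions, category, value):
    """Run every description in the category concurrently with the same value."""
    selected = [d for d in descriptions if d.category == category]
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(lambda d: run(d, value), selected))


DESCRIPTIONS = [
    Description("books-a", "books", "https://books-a.example/search?q={value}",
                re.compile(r"<i>\$([\d.]+)</i>")),
    Description("books-b", "books", "https://books-b.example/find/{value}",
                re.compile(r"price: ([\d.]+) USD")),
    Description("cds-a", "music", "https://cds-a.example/?album={value}",
                re.compile(r"(\d+\.\d\d)")),
]

results = query_category(DESCRIPTIONS, "books", "dune")
```

Here a thread pool is one possible reading of "executed simultaneously"; the source does not say how concurrency is achieved.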
35 Claims
1. A method of extracting data of interest from at least one web site of a plurality of web sites, wherein the data of interest is information associated with a product, the method comprising:
(A) for each respective web site W in said plurality of web sites,
(i) creating a respective description of data of interest that identifies the web site W;
(ii) developing an extraction pattern from a web page output from the respective web site W using a graphical user interface tool, the extraction pattern being adapted to identify at least a portion of an output of a web site and to extract information from a plurality of web pages of the respective web site W, wherein the extraction pattern comprises a regular expression; and
(iii) associating the developed extraction pattern with the respective description of data of interest for the respective web site W;
(B) receiving a value for use as an extraction parameter for the developed extraction patterns; and
(C) obtaining said data of interest by querying the at least one web site of the plurality of web sites using the value and the extraction patterns associated with the respective descriptions of data of interest; and
(D) extracting said data of interest from the at least one web site of the plurality of web sites and storing said extracted data of interest.
Dependent claims: 2-19.
20. An apparatus for extracting information of interest from at least one web site of a plurality of web sites, the apparatus comprising:
(A) for each respective web site W in the plurality of web sites,
(i) means for creating a respective description of data of interest that identifies the web site W;
(ii) means for developing an extraction pattern from a web page output from the respective web site W using a graphical user interface tool, the extraction pattern being adapted to extract data from a plurality of web pages of the web site, wherein the extraction pattern comprises a regular expression; and
(iii) means for associating the developed extraction pattern with the respective description of data of interest for the respective web site W;
(B) means for receiving a value for use as an extraction parameter in the developed extraction patterns; and
(C) means for obtaining said data of interest by querying the at least one web site of the plurality of web sites using the value and the developed extraction patterns associated with the respective descriptions of data of interest; and
(D) means for extracting said data of interest from the at least one web site of the plurality of web sites and storing said extracted data of interest.
Dependent claims: 21-24.
25. A computer data signal embodied in a carrier wave comprising:
(A) a software module for creating a description of data of interest, the software module including;
(i) a set of operations for interactively developing an extraction pattern from a web page output of a target web site using a graphical user interface tool, the developed extraction pattern being adapted to extract data of interest from a plurality of web pages of the target web site, wherein the extraction pattern comprises a regular expression;
(ii) a set of operations for receiving a selection of an instruction from a predefined set of instructions for inclusion in the description of data of interest;
(iii) a set of operations for associating the extraction pattern with the instruction;
(iv) a set of operations for testing the instruction using the extraction pattern and the contents of a buffer, wherein the buffer includes a portion of the web page output of the web site associated with the description of data of interest; and
(B) a software module for using the description of data of interest to obtain data of interest from the target web site when a value for use as an extraction parameter for the developed extraction pattern is provided.
Dependent claims: 26.
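Elements (ii) through (iv) of claim 25 (selecting an instruction from a predefined set, associating a pattern with it, and testing it against a buffer) could be pictured roughly as follows. This is only a sketch under stated assumptions: the instruction names `MATCH` and `EXTRACT`, the `try_instruction` helper, and the sample buffer are hypothetical and are not taken from the patent, which does not enumerate its instruction set in this section.

```python
import re


# Hypothetical predefined instruction set (names are assumptions).
def op_match(pattern, buffer):
    """Report whether the pattern occurs anywhere in the buffer."""
    return pattern.search(buffer) is not None


def op_extract(pattern, buffer):
    """Return the first capture group of every occurrence in the buffer."""
    return [m.group(1) for m in pattern.finditer(buffer)]


INSTRUCTIONS = {"MATCH": op_match, "EXTRACT": op_extract}


def try_instruction(name, pattern_text, buffer):
    """Test one instruction with its associated pattern against a buffer.

    The buffer is assumed to hold a portion of a web page's output.
    """
    if name not in INSTRUCTIONS:
        raise ValueError(f"unknown instruction: {name}")
    return INSTRUCTIONS[name](re.compile(pattern_text), buffer)


# A fragment of page output standing in for the buffer contents.
buffer = "<li>ISBN: 0441172717</li><li>ISBN: 0553293354</li>"

found = try_instruction("EXTRACT", r"ISBN: (\d{10})", buffer)
present = try_instruction("MATCH", r"ISBN", buffer)
```

Testing against a saved buffer rather than a live site lets a pattern be tried repeatedly without re-querying the target web site.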
27. A method of extracting data of interest from at least one of a plurality of web sites, wherein the data of interest is information associated with a product, the method comprising:
(A) for each respective web site W in said plurality of web sites,
(i) creating a respective description of data of interest that identifies the web site W;
(ii) developing an extraction pattern from a web page output from the respective web site W using a graphical user interface tool, the extraction pattern being adapted to identify at least a portion of an output of a web site and to extract information from a plurality of web pages of the respective web site W, wherein the extraction pattern comprises a pre-condition regular expression, a portion of data of interest regular expression, and a post-condition regular expression, said developing an extraction pattern comprising refining at least one of said pre-condition regular expression, said portion of data of interest regular expression, and said post-condition regular expression; and
(iii) associating the developed extraction pattern with the respective description of data of interest for the respective web site W;
(B) receiving a value for use as an extraction parameter for the developed extraction patterns; and
(C) obtaining the data of interest by querying the at least one web site of the plurality of web sites using the value and the extraction patterns associated with the respective descriptions of data of interest; and
(D) extracting said data of interest from the at least one web site of the plurality of web sites and storing said extracted data of interest.
Dependent claims: 28, 29.
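The three-part extraction pattern recited in claim 27 (a pre-condition expression, a data-of-interest expression, and a post-condition expression) and the step of refining one of the parts can be sketched as below. The `ExtractionPattern` class, the way the three parts are joined into one regular expression, and the sample `PAGE` are assumptions made for illustration; the claim does not specify how the parts are combined.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractionPattern:
    pre: str   # pre-condition: text that must precede the data
    data: str  # expression for the portion of data of interest
    post: str  # post-condition: text that must follow the data

    def compile(self):
        # Only the middle part is captured; pre and post act as context.
        return re.compile(
            f"(?:{self.pre})(?P<data>{self.data})(?:{self.post})", re.DOTALL
        )

    def extract(self, page):
        """Return every captured data portion found in the page."""
        return [m.group("data") for m in self.compile().finditer(page)]


PAGE = (
    '<tr><td class="name">Widget</td><td class="price"> $19.99 </td></tr>'
    '<tr><td class="name">Gadget</td><td class="price"> $5.00 </td></tr>'
)

# First attempt: the data expression is too loose and swallows markup.
draft = ExtractionPattern(pre=r'class="price">', data=r".+", post=r"</td>")

# Refined: tighter pre-condition, data expression, and post-condition.
refined = ExtractionPattern(
    pre=r'class="price">\s*\$', data=r"\d+\.\d{2}", post=r"\s*</td>"
)

prices = refined.extract(PAGE)
```

With the draft pattern the greedy `.+` runs from the first price cell to the last closing tag, so a single oversized match comes back; the refined pattern yields one clean value per row.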
30. An apparatus for extracting information of interest from at least one of a plurality of web sites, the apparatus comprising:
(A) for each respective web site W in the plurality of web sites,
(i) means for creating a respective description of data of interest that identifies the web site W;
(ii) means for developing an extraction pattern from a web page output from the respective web site W using a graphical user interface tool, the extraction pattern being adapted to extract data from a plurality of web pages of the web site, wherein said extraction pattern comprises a pre-condition regular expression, a portion of data of interest regular expression, and a post-condition regular expression, said means for developing comprising refining at least one of said pre-condition regular expression, said portion of data of interest regular expression, and said post-condition regular expression; and
(iii) means for associating the developed extraction pattern with the respective description of data of interest for the respective web site W;
(B) means for receiving a value for use as an extraction parameter in the developed extraction patterns; and
(C) means for obtaining said data of interest by querying the at least one web site of the plurality of web sites using the value and the developed extraction patterns associated with the respective descriptions of data of interest,
(D) means for extracting said data of interest from the at least one web site of the plurality of web sites and storing said extracted data of interest.
Dependent claims: 31.
32. A computer data signal embodied in a carrier wave comprising:
(A) a software module for creating a description of data of interest, the software module including;
(i) a set of operations for interactively developing an extraction pattern from a web page output of a target web site using a graphical user interface tool, the developed extraction pattern being adapted to extract data of interest from a plurality of web pages of the target web site, wherein said extraction pattern comprises a pre-condition regular expression, a portion of data of interest regular expression, and a post-condition regular expression, said operations for interactively developing comprising refining at least one of said pre-condition regular expression, said portion of data of interest regular expression, and said post-condition regular expression;
(ii) a set of operations for receiving a selection of an instruction from a predefined set of instructions for inclusion in the description of data of interest;
(iii) a set of operations for associating the extraction pattern with the instruction;
(iv) a set of operations for testing the instruction using the extraction pattern and the contents of a buffer, wherein the buffer includes a portion of the web page output of the web site associated with the description of data of interest; and
(B) a software module for using the description of data of interest to obtain data of interest from the target web site when a value for use as an extraction parameter for the developed extraction pattern is provided.
33. A computer implemented method of obtaining data of interest from at least one web site of a plurality of web sites comprising:
(A) developing a description of data of interest for each web site in said plurality of web sites from web page output from the plurality of web sites using a graphical user interface tool that includes a web browser, each respective description of data of interest specifying an address for a corresponding web site in the plurality of web sites and each respective description of data of interest including an extraction pattern adapted to identify at least a portion of the output of a web site and to extract user specified information from a plurality of web pages of the corresponding web site, wherein the extraction pattern comprises a regular expression;
(B) receiving a value for use as an extraction parameter for the developed extraction patterns; and
(C) obtaining said data of interest by querying the at least one web site of the plurality of web sites using the value and the extraction patterns in the respective descriptions of data of interest.
Dependent claims: 34.
35. A computer implemented method of obtaining data of interest from at least one web site of a plurality of web sites comprising:
(A) developing a description of data of interest for each web site in said plurality of web sites from web page output from the plurality of web sites using a graphical user interface tool that includes a web browser, each respective description of data of interest specifying an address for a corresponding web site in the plurality of web sites and each respective description of data of interest including an extraction pattern adapted to identify at least a portion of the output of a web site and to extract user specified information from a plurality of web pages of the corresponding web site, wherein each said extraction pattern comprises a pre-condition regular expression, a portion of data of interest regular expression, and a post-condition regular expression, said developing a description of data of interest comprising refining at least one of said pre-condition regular expression, said portion of data of interest regular expression, and said post-condition regular expression;
(B) receiving a value for use as an extraction parameter for the developed extraction patterns; and
(C) obtaining said data of interest by querying the at least one web site of the plurality of web sites using the value and the extraction patterns in the respective descriptions of data of interest.
Specification