Method, apparatus and computer program product to crawl a web site
Abstract
In one embodiment, an improved method for crawling a web site is provided. At least one page of the web site has a reference that a browser executes to produce an address for a next page. A crawler program crawls the web site, which includes querying the web site server. The crawler parses such a reference from one of the web pages and sends the reference to an applet running in the browser. The browser determines the address for the next page responsive to the reference, and the address is then sent to the crawler. In one application of the improved crawler, the crawler is used to reduce dynamic data generation on the web site server. In this application, at least some of the web pages are dynamically generated responsive to the crawler queries. The server-generated web pages are processed to generate corresponding processed versions, so that the processed versions can be served in response to future queries, reducing dynamic generation of web pages by the server.
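The crawl described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `crawl`, `fetch`, `resolve_in_browser`, `fake_browser`, and the in-memory `SITE` are all assumptions made for the example, and the "browser" is a stand-in function rather than a real browser with an applet. A plain `href` value is treated as an address, while a `javascript:` value is treated as a reference that only the browser side can turn into an address.

```python
import re
from collections import deque
from typing import Callable, List

# Hypothetical pattern: collects every href value found on a page.
HREF_RE = re.compile(r'href="([^"]*)"')


def crawl(start: str,
          fetch: Callable[[str], str],
          resolve_in_browser: Callable[[str], str]) -> List[str]:
    """Breadth-first crawl that defers script references to a browser-side resolver."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        page = fetch(url)  # query the web site server
        for ref in HREF_RE.findall(page):  # parse references from the page
            if ref.startswith("javascript:"):
                # Send the reference to the browser side and get an address back.
                target = resolve_in_browser(ref)
            else:
                target = ref
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical three-page site held in memory instead of a real server.
SITE = {
    "/index": '<a href="/about">About</a> <a href="javascript:go(2)">Next</a>',
    "/about": "<p>About</p>",
    "/page2": '<a href="/index">Home</a>',
}


def fake_browser(ref: str) -> str:
    """Stand-in for the browser: pretends to execute go(n) and returns /page<n>."""
    match = re.fullmatch(r"javascript:go\((\d+)\)", ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref}")
    return f"/page{match.group(1)}"


visited = crawl("/index", SITE.__getitem__, fake_browser)
```

Keeping the resolver as a separate callable mirrors the split in the abstract: the crawler only parses and forwards references, and whatever executes them can be swapped without touching the crawl loop.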
101 Citations
24 Claims
-
1. A method for crawling a web site, the method comprising the steps of:
-
a) querying a web site server by a crawler program, wherein at least one page of the web site has a reference for executing by a browser to produce an address for a next page;
b) parsing such a reference from one of the web pages by the crawler program and sending the reference to an applet running in the browser; and
c) determining the address for the next page by the browser responsive to the reference and sending the address to the crawler.
-
2. The method of claim 2, the browser being configured to use a certain proxy, and refer to a resolver file for hostname-to-IP-address resolution, and wherein the web site server has an IP address, the proxy for the browser has a certain IP address, and the resolver file indicates the certain IP address as the IP address for the web site server.
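One way to picture the resolver file arrangement in this claim is a hosts-style file that maps the web site's hostname to the proxy's IP address. The sketch below assumes that format (an IP address followed by one or more hostnames, with `#` comments), which the claim does not specify; `hosts_entry`, `resolve`, the hostname, and the address `192.0.2.10` are hypothetical names and values chosen for illustration.

```python
from typing import Iterable, Optional


def hosts_entry(site_hostname: str, proxy_ip: str) -> str:
    """Build a hosts-style line that points the site hostname at the proxy's IP."""
    return f"{proxy_ip}\t{site_hostname}"


def resolve(hostname: str, resolver_lines: Iterable[str]) -> Optional[str]:
    """Return the IP address listed for hostname, or None if no line names it."""
    for line in resolver_lines:
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and hostname in fields[1:]:
            return fields[0]
    return None


# Assumed example values: a documentation-range IP standing in for the proxy.
resolver_file = [
    "# resolver file consulted by the browser",
    hosts_entry("www.example.com", "192.0.2.10"),
]
resolved = resolve("www.example.com", resolver_file)
```

With such an entry in place, a lookup of the site's hostname yields the proxy's address, so the browser's requests for the site arrive at the proxy.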
-
7. A method for reducing dynamic data generation on a web site server, the method comprising the steps of:
-
a) querying a web site server by a crawler program responsive to references from one web page to another in the web site, wherein the queries are for causing the server to generate web pages, at least some of the web pages being dynamically generated; and
b) processing the server generated web pages to generate corresponding processed versions of the web pages, so that the processed versions can be served in response to future queries, reducing dynamic generation of web pages by the server.
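Steps a) and b) of claim 7 can be illustrated with a small sketch. It assumes the processed versions are kept in an in-memory dictionary and that "processing" simply collapses whitespace; the claim prescribes neither. The names `pregenerate`, `serve`, `generate`, and `process` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def pregenerate(urls: Iterable[str],
                generate: Callable[[str], str],
                process: Callable[[str], str]) -> Dict[str, str]:
    """Query each page once and keep a processed version for later requests."""
    return {url: process(generate(url)) for url in urls}


def serve(url: str, store: Dict[str, str], generate: Callable[[str], str]) -> str:
    """Serve the stored processed version when present; otherwise generate dynamically."""
    if url in store:
        return store[url]
    return generate(url)


calls: List[str] = []  # records every dynamic generation, for illustration only


def generate(url: str) -> str:
    """Stand-in for the server's dynamic page generation."""
    calls.append(url)
    return f"<html>  dynamic content for {url}  </html>"


def process(page: str) -> str:
    """Assumed processing step: collapse runs of whitespace."""
    return " ".join(page.split())


store = pregenerate(["/a", "/b"], generate, process)
generated_during_crawl = len(calls)
body = serve("/a", store, generate)       # answered from the stored version
generated_after_hit = len(calls)
fallback = serve("/c", store, generate)   # not stored, so generated on demand
```

In this sketch the crawl triggers one dynamic generation per page, and later requests for those pages do not reach `generate` at all, which is the reduction the claim describes.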
-
-
9. A computer program product for crawling a web site, the computer program product comprising:
-
a) first instructions for querying a web site server by a crawler program, wherein at least one page of the web site has a reference for executing by a browser to produce an address for a next page;
b) second instructions for parsing such a reference from one of the web pages by the crawler program and sending the reference to an applet running in the browser; and
c) third instructions for determining the address for the next page by the browser responsive to the reference and sending the address to the crawler.
-
-
15. A computer program product for reducing dynamic data generation on a web site server, the computer program product comprising:
-
first instructions for querying a web site server by a crawler program responsive to references from one web page to another in the web site, wherein the queries are for causing the server to generate web pages, at least some of the web pages being dynamically generated; and
second instructions for processing the server generated web pages to generate corresponding processed versions of the web pages, so that the processed versions can be served in response to future queries, reducing dynamic generation of web pages by the server.
-
-
17. An apparatus for crawling a web site, the apparatus comprising:
-
a processor connected to a network, a storage device connected to the processor and the network, wherein the storage device is for storing a program for controlling the processor, and wherein the processor is operative with the program to execute a crawler program and a browser program for performing the steps of:
a) querying a web site server by the crawler, wherein at least one page of the web site has a reference for executing by the browser to produce an address for a next page;
b) parsing such a reference from one of the web pages and sending the reference to an applet running in the browser; and
c) determining the address for the next page by the browser responsive to the reference and sending the address to the crawler.
-
-
23. An apparatus for reducing dynamic data generation on a web site server, the apparatus comprising:
-
a processor connected to a network, a storage device connected to the processor and the network, wherein the storage device is for storing a program for controlling the processor, and wherein the processor is operative with the program to execute a crawler program and a browser program for performing the steps of:
a) querying a web site server by the crawler responsive to references from one web page to another in the web site, wherein the queries are for causing the server to generate web pages, at least some of the web pages being dynamically generated; and
b) processing the server generated web pages to generate corresponding processed versions of the web pages, so that the processed versions can be served in response to future queries, reducing dynamic generation of web pages by the server.
-
Specification