Search engine with multiple crawlers sharing cookies
Abstract
A web-crawler system includes a plurality of network crawlers configured to fetch documents from hosts on a network and a cookie database shared by the plurality of network crawlers. The cookie database stores cookies and associated information for use by the plurality of network crawlers. Each network crawler is configured to retrieve one or more cookies from the cookie database so as to enable access to documents on at least one of the hosts on the network. In some embodiments, each of the network crawlers may be configured to detect any of a plurality of predefined cookie errors associated with fetching a document. In some embodiments, each of the network crawlers may also be configured to detect when a cookie in the cookie database has expired and to obtain a replacement cookie.
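The arrangement summarized in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation: the names `CookieRecord`, `SharedCookieDB`, `Crawler`, and the `acquire` callback are hypothetical, the database is a lock-guarded in-memory dictionary rather than a real shared store, and expiry is modeled as a simple timestamp comparison.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CookieRecord:
    """A cookie plus its associated information (field names are assumed)."""
    value: str
    expires_at: float      # Unix time after which the cookie is treated as expired
    acquisition_url: str   # where a replacement cookie can be obtained


class SharedCookieDB:
    """In-memory stand-in for the cookie database shared by all crawlers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._by_host: Dict[str, CookieRecord] = {}

    def put(self, host: str, record: CookieRecord) -> None:
        with self._lock:
            self._by_host[host] = record

    def get(self, host: str) -> Optional[CookieRecord]:
        with self._lock:
            return self._by_host.get(host)


class Crawler:
    """One of several crawlers that all read from the same SharedCookieDB."""

    def __init__(self, name: str, db: SharedCookieDB,
                 acquire: Callable[[str], CookieRecord]) -> None:
        self.name = name
        self.db = db
        self.acquire = acquire  # hypothetical hook that fetches a fresh cookie

    def cookie_for(self, host: str, now: Optional[float] = None) -> Optional[str]:
        """Return a usable cookie for host, replacing it first if it has expired."""
        now = time.time() if now is None else now
        record = self.db.get(host)
        if record is None:
            return None  # no cookie is stored for this host
        if record.expires_at <= now:
            # Expired: obtain a replacement and publish it for the other crawlers.
            # (The get-then-put here is not atomic; a real system would need to
            # guard against two crawlers refreshing the same cookie at once.)
            record = self.acquire(record.acquisition_url)
            self.db.put(host, record)
        return record.value
```

Because the replacement is written back to the shared database, a second crawler asking for the same host afterwards receives the fresh cookie without contacting the host's acquisition URL again.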
11 Claims
1. A web crawler system, comprising:
a plurality of network crawlers each including, one or more processors and memory storing one or more modules to be executed by the one or more processors, the one or more modules having instructions for fetching documents from hosts on a network; and
a cookie database shared by the plurality of network crawlers, the cookie database storing cookies and associated information for use by the plurality of network crawlers;
wherein each network crawler of the plurality of network crawlers further includes instructions for retrieving one or more cookies from the cookie database so as to enable access to documents on at least one of the hosts on the network and each of the network crawlers includes instructions for detecting any of a plurality of predefined cookie errors associated with fetching a document by comparing a fetched document with a plurality of predefined cookie error patterns; and
wherein the cookie database includes cookie acquisition information corresponding to each of at least a plurality of the cookies in the cookie database;
the cookie acquisition information for a respective cookie enabling a respective network crawler to acquire the cookie from an acquisition URL specified by the cookie acquisition information;
wherein the acquisition URL is distinct from a target URL to be accessed using the respective cookie.
(Dependent claims: 2, 3, 4, 5, 6)
7. A method of crawling documents on a network, comprising:
providing a plurality of network crawlers configured to fetch documents from hosts on the network; and
at each of the network crawlers, retrieving a respective cookie for a respective host from a shared cookie database that is shared by the plurality of network crawlers, so as to enable the plurality of network crawlers to have access to one or more documents on the respective host;
wherein the cookie database includes cookie acquisition information corresponding to each of at least a plurality of the cookies in the cookie database;
the cookie acquisition information for a respective cookie including an acquisition URL;
the method including, at a respective network crawler, acquiring a respective cookie from the acquisition URL specified for the cookie in the cookie database, and then accessing a respective target URL from a host on the network, the respective target URL corresponding to the acquired cookie; and
at each of the network crawlers, detecting any of a plurality of predefined cookie errors by comparing a fetched document with a plurality of predefined cookie error patterns.
(Dependent claims: 8, 9, 10)
11. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a network crawler module to be executed by a plurality of computer network crawlers in parallel, the network crawler module including instructions for:
retrieving a respective cookie for a respective host from a shared cookie database that is shared by the plurality of network crawlers, so as to enable the plurality of network crawlers to have access to one or more documents on the respective host; and
retrieving a document from the respective host, including sending the respective cookie to the respective host;
wherein the shared cookie database includes cookie acquisition information corresponding to each of at least a plurality of the cookies in the cookie database;
the cookie acquisition information for a respective cookie including an acquisition URL;
the network crawler module further including instructions for acquiring a respective cookie from the acquisition URL specified for the cookie in the cookie database, and then accessing a respective target URL from a host on the network, the respective target URL corresponding to the acquired cookie; and
instructions for detecting any of a plurality of predefined cookie errors by comparing a fetched document with a plurality of predefined cookie error patterns.
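Two steps recited in the independent claims can be made concrete with a small Python sketch: acquiring a cookie from an acquisition URL before accessing a separate target URL, and detecting cookie errors by comparing the fetched document with predefined patterns. This is a sketch under stated assumptions, not the claimed implementation. The pattern list, the error names, the function names, and the `fetch` callable (which stands in for a real HTTP client and returns a body plus any cookie the host set) are all hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical examples of "predefined cookie error patterns"; the claims do
# not enumerate the actual patterns.
COOKIE_ERROR_PATTERNS = [
    ("cookies_disabled", re.compile(r"please enable cookies", re.IGNORECASE)),
    ("session_expired", re.compile(r"session (has )?expired", re.IGNORECASE)),
    ("login_required", re.compile(r"<title>\s*(sign|log) ?in\b", re.IGNORECASE)),
]


def detect_cookie_error(document: str) -> Optional[str]:
    """Return the name of the first error pattern found in the document, if any."""
    for name, pattern in COOKIE_ERROR_PATTERNS:
        if pattern.search(document):
            return name
    return None


# Assumed fetch signature: (url, cookie or None) -> (body, cookie set by host or None).
Fetch = Callable[[str, Optional[str]], Tuple[str, Optional[str]]]


def crawl_with_cookie(target_url: str, acquisition_url: str,
                      fetch: Fetch) -> Tuple[str, Optional[str]]:
    """Acquire a cookie from acquisition_url, then fetch target_url with it.

    Returns the fetched body and the detected cookie error name (or None).
    """
    # Step 1: visit the acquisition URL, which is distinct from the target URL.
    _, cookie = fetch(acquisition_url, None)
    # Step 2: access the target URL, sending the acquired cookie to the host.
    body, _ = fetch(target_url, cookie)
    # Step 3: compare the fetched document with the predefined error patterns.
    return body, detect_cookie_error(body)
```

A crawler that receives a non-`None` error name could, for example, discard the stored cookie and repeat the acquisition step; that recovery policy is an assumption and is not spelled out in the claims above.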
Specification