System and method for enforcing politeness while scheduling downloads in a web crawler
First Claim
1. A method of downloading data sets from among a plurality of host computers, comprising:
- (a) obtaining at least one referring data set that includes addresses of one or more referred data sets;
each referred data set address including a host address;
(b) enqueuing the referred data set addresses in a plurality of queues, including enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the queues;
(c) assigning a next download time to each of the queues that has enqueued therein at least one referred data set address;
(d) substantially concurrently operating a plurality of threads, wherein the number of queues is at least as great as the number of threads;
(e) while operating each thread, repeatedly performing steps of;
(e1) selecting one of the queues not selected by any of the other threads, in accordance with the next download times assigned to the queues not selected by any of the other threads;
(e2) downloading a referred data set corresponding to a referred data set address in the selected queue, processing the downloaded referred data set, dequeuing the referred data set address from the selected queue;
(e3) when the selected queue is not empty after the dequeuing step, assigning an updated next download time to the selected queue; and
(e4) deselecting the selected queue;
wherein the enqueuing of referred data set addresses sharing a respective common host address to a respective common one of the queues in step (b) ensures that the downloading in step (e2) by the plurality of threads does not simultaneously download more than one referred data set from any of the host computers.
7 Assignments
0 Petitions
Accused Products
Abstract
A web crawler downloads data sets from among a plurality of host computers. The web crawler enqueues data set addresses in a set of queues, with all data set addresses sharing a respective common host address being stored in a respective common one of the queues. Each non-empty queue is assigned a next download time. Multiple threads substantially concurrently process the data set addresses in the queues. The number of queues is at least as great as the number of threads, and the threads are dynamically assigned to the queues. In particular, each thread selects a queue not being serviced by any of the other threads. The queue is selected in accordance with the next download times assigned to the queues. The data set corresponding to a data set address in the selected queue is downloaded and processed, and the data set address is dequeued from the selected queue. When the selected queue is not empty after the dequeuing step, it is assigned an updated download time. Then the thread deselects the selected queue, and the process of selecting a queue and processing a data set repeats. The next download time assigned to each queue is preferably a function of the length of time it took to download a previous document whose address was stored in the queue. For instance, the next download time may be set equal to the current time plus the a scaling constant multiplied by the download time of the previous document.
143 Citations
42 Claims
-
1. A method of downloading data sets from among a plurality of host computers, comprising:
-
(a) obtaining at least one referring data set that includes addresses of one or more referred data sets;
each referred data set address including a host address;
(b) enqueuing the referred data set addresses in a plurality of queues, including enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the queues;
(c) assigning a next download time to each of the queues that has enqueued therein at least one referred data set address;
(d) substantially concurrently operating a plurality of threads, wherein the number of queues is at least as great as the number of threads;
(e) while operating each thread, repeatedly performing steps of;
(e1) selecting one of the queues not selected by any of the other threads, in accordance with the next download times assigned to the queues not selected by any of the other threads;
(e2) downloading a referred data set corresponding to a referred data set address in the selected queue, processing the downloaded referred data set, dequeuing the referred data set address from the selected queue;
(e3) when the selected queue is not empty after the dequeuing step, assigning an updated next download time to the selected queue; and
(e4) deselecting the selected queue;
wherein the enqueuing of referred data set addresses sharing a respective common host address to a respective common one of the queues in step (b) ensures that the downloading in step (e2) by the plurality of threads does not simultaneously download more than one referred data set from any of the host computers. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
(i) using at least one of the downloaded referred data sets of step e2 as a new referring data set for step a; and
(ii) repeating steps a through e.
-
-
4. The method of claim 1, wherein the data sets include web pages and the data set addresses include uniform resource locators.
-
5. The method of claim 1, wherein each of the queues is a first-in-first-out queue.
-
6. The method of claim 1, wherein
the queues not selected by any of the other threads are stored as an ordered set, ordered with respect to the next download times assigned to the queues in the ordered set; -
said stepe 1 includes removing the selected queue from the ordered set; and
said step e4 includes returning the selected queue to the ordered set.
-
-
7. The method of claim 6, including delaying the queue selection step when the next download times assigned to all the queues in the ordered set are later than a current time.
-
8. The method of claim 1, including delaying the queue selection step when the next download times assigned to all the queues in the ordered set are later than a current time.
-
9. The method of claim 1, wherein
step e2 includes determining a download time for the downloading of the referred data set; the updated next download time assigned by step e3 to the selected queue is a function of the determined download time.
-
10. The method of claim 9, wherein the updated next download time assigned by step e3 is equal to a current time plus a scaling constant multiplied by the determined download time.
-
11. The method of claim 1, wherein said step (b) of enqueuing the referred data set addresses includes:
-
(b1) calculating a fingerprint for each referred data set address based on at least part of the host address included in the referred data set address; and
(b2) allocating the address to one of the queues based on the fingerprint.
-
-
12. The method of claim 11, wherein:
-
(i) the plurality of queues comprises N queues, each of the queues having an associated numerical identifier; and
(ii) step (b2) includes assigning each referred data set address to the queue having a numerical identifier equal to the referred data set address fingerprint modulo N.
-
-
13. The method of claim 1, where step (b) includes:
-
(b1) enqueuing the referred data set addresses into a main queue;
(b2) assigning a host to each of said plurality of queues;
(b3) enqueuing said referred data set addresses from said main queue into said queues according to said assignment; and
(b4) assigning a new host any one of said plurality of queues when said one queue becomes empty.
-
-
14. The method of claim 1, wherein there are at least twice as many queues as threads.
-
15. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
an enqueuing module that, when executed by the computer system, obtains at least one referring data set that includes addresses of one or more referred data sets, each referred data set address including a host address corresponding to a host computer, and enqueues the referred data set addresses in a plurality of queues, including enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the queues;
a thread module for launching execution of a plurality of threads, wherein there are at least as many queues as threads;
a dequeuing module that is substantially concurrently executed by each of the plurality of threads so as to sequentially processes the referred data set addresses in the queues;
the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of(a1) selecting one of the queues not selected by any of the other threads;
(a2) downloading a referred data set corresponding to a referred data set address in the selected queue, processing the downloaded referred data set, dequeuing the referred data set address from the selected queue;
(a3) when the selected queue is not empty after the dequeuing step, assigning an updated next download time to the selected queue; and
(a4) deselecting the selected queue;
wherein the dequeuing module selects a queue in accordance with the next download times assigned to the queues not selected by any of the other threads;
wherein the enqueuing module enqueues all referred data set addresses sharing a respective common host address to a respective common one of the queues, and the dequeuing module downloads at most one referred data set from any one host computer at any one time. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
the dequeuing module includes instructions for determining a download time for the downloading of the referred data set; - and
the updated next download time assigned by the dequeuing module to the selected queue is a function of the determined download time.
-
-
24. The computer program product of claim 23, wherein the updated next download time assigned by the dequeuing module is equal to a current time plus a scaling constant multiplied by the determined download time.
-
25. The computer program product of claim 15, wherein said enqueuing module includes instructions for calculating a fingerprint for each referred data set address based on at least part of the host address included in the referred data set address, and allocating the address to one of the queues based on the fingerprint.
-
26. The computer program product of claim 15, wherein said enqueuing module addresses includes instructions for calculating a fingerprint for each referred data set address based on at least part of the host address included in the referred data set address, and instructions for allocating the address to one of the queues based on the fingerprint.
-
27. The computer program product of claim 26, wherein:
- the plurality of queues comprises N queues, each of the queues having an associated numerical identifier; and
the instructions for allocating assign each referred data set address to the queue having a numerical identifier equal to the referred data set address fingerprint modulo N.
- the plurality of queues comprises N queues, each of the queues having an associated numerical identifier; and
-
28. The computer program product of claim 15, wherein there are at least twice as many queues as threads.
-
29. A web crawler for downloading data sets from among a plurality of host computers, comprising:
-
a plurality of threads of execution;
an enqueuing module, executed by each of the plurality of threads, that obtains at least one referring data set that includes addresses of one or more referred data sets, each referred data set address including a host address corresponding to a host computer, and enqueues the referred data set addresses in a plurality of queues, including enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the queues; and
a dequeuing module that is substantially concurrently executed by each of the plurality of threads so as to sequentially processes the referred data set addresses in the queues;
the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of(a1) selecting one of the queues not selected by any of the other threads;
(a2) downloading a referred data set corresponding to a referred data set address in the selected queue, processing the downloaded referred data set, dequeuing the referred data set address from the selected queue;
(a3) when the selected queue is not empty after the dequeuing step, assigning an updated download time to the selected queue; and
(a4) deselecting the selected queue;
wherein the dequeuing module selects a queue in accordance with the next download times assigned to the queues not selected by any of the other threads;
wherein the enqueuing module enqueues all referred data set addresses sharing a respective common host address to a respective common one of the queues, and the dequeuing module downloads at most one referred data set from any one host computer at any one time. - View Dependent Claims (30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42)
the dequeuing module includes instructions for determining a download time for the downloading of the referred data set; - and
the updated next download time assigned by the dequeuing module to the selected queue is a function of the determined download time.
-
-
38. The web crawler of claim 37, wherein the updated next download time assigned by dequeuing module is equal to a current time plus a scaling constant multiplied by the determined download time.
-
39. The web crawler of claim 29, wherein said enqueuing module includes instructions for calculating a fingerprint for each referred data set address based on at least part of the host address included in the referred data set address, and allocating the address to one of the queues based on the fingerprint.
-
40. The web crawler of claim 29, wherein said enqueuing module addresses includes instructions for calculating a fingerprint for each referred data set address based on at least part of the host address included in the referred data set address, and instructions for allocating the address to one of the queues based on the fingerprint.
-
41. The web crawler of claim 40, wherein:
- the plurality of queues comprises N queues, each of the queues having an associated numerical identifier; and
the instructions for allocating assign each referred data set address to the queue having a numerical identifier equal to the referred data set address fingerprint modulo N.
- the plurality of queues comprises N queues, each of the queues having an associated numerical identifier; and
-
42. The web crawler of claim 29, wherein there are at least twice as many queues as threads.
Specification