Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness

US 6,263,364 B1
Filed: 11/02/1999
Issued: 07/17/2001
Est. Priority Date: 11/02/1999
Status: Expired due to Term

First Claim

Patent Images

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:

(a) obtaining at least one referring document set that includes addresses of one or more referred documents;

each referred document address including a host component;

(b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;

(c) substantially concurrently operating a plurality of threads;

(d) while operating each thread, repeatedly performing steps of;

(d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and

(d2) executing at least one application program for processing the downloaded document;

the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and

step (d) including determining a download priority level for maintaining freshness of the downloaded document and re-enqueuing the queue element for the downloaded document in one of the parallel priority level queues in accordance with the determined download priority level.

View all claims

7 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The Frontier includes a set of parallel “priority queues,” each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues. When performing a continuous crawl, the web crawler reinserts the queue element for a downloaded document into the Frontier in accordance with a download priority level associated with the downloaded document. For example, the download priority level may be determined as a function of an expiration date and time associated with document whose document address is denoted by the queue element.

Citations

30 Claims

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:
- (a) obtaining at least one referring document set that includes addresses of one or more referred documents;
  
  each referred document address including a host component;
  
  (b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;
  
  (c) substantially concurrently operating a plurality of threads;
  
  (d) while operating each thread, repeatedly performing steps of;
  
  (d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and
  
  (d2) executing at least one application program for processing the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  step (d) including determining a download priority level for maintaining freshness of the downloaded document and re-enqueuing the queue element for the downloaded document in one of the parallel priority level queues in accordance with the determined download priority level.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. The method of claim 1, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.
  - 3. The method of claim 1, wherein
4. The method of claim 3, wherein step (b) includes enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the underlying queues.
5. The method of claim 4, wherein referred data sets corresponding to referred data set addresses from different ones of the underlying queues are downloaded substantially concurrently, while referred data sets corresponding to referred data set addresses from any single one of the underlying queues are downloaded one at a time.
6. The method of claim 1, wherein the download priority level for each of a subset of the queue elements is determined as a function of an expiration date and time associated with document whose document address is denoted by the queue element.
7. The method of claim 1, wherein the download priority level for each of a subset of the queue elements is determined as a function of a host component of the document address denoted by the queue element.
8. The method of claim 1, wherein the download priority level for each of a subset of the queue elements is determined as a function of a historical rate of change of the document whose address is denoted by the queue element.

9. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:
- (a) obtaining at least one referring document set that includes addresses of one or more referred documents;
  
  each referred document address including a host component;
  
  (b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;
  
  (c) substantially concurrently operating a plurality of threads;
  
  (d) while operating each thread, repeatedly performing steps of;
  
  (d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and
  
  (d2) executing at least one application program for processing the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  step (b) including determining a download priority level for each queue element and enqueuing the queue element in one of the parallel priority level queues in accordance with the determined download priority level.
- View Dependent Claims (10)
- - 10. The method of claim 9, wherein the download priority level for each of a subset of the queue elements is determined as a function of the referred document address denoted by the queue element.

11. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses; and
  
  a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module including instructions that, when executed by a respective one of the threads, repeatedly perform the functions of;
  
  (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;
  
  (a2) executing at least one application program for processing the downloaded document; and
  
  (a3) determining a download priority level for maintaining freshness of the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  the dequeuing module including instructions for re-enqueuing the queue element for the downloaded document in one of the parallel priority level queues in accordance with the download priority level determined for the queue element.
- View Dependent Claims (12, 13, 14, 15, 16, 17, 18)
- - 12. The computer program product of claim 11, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.
  - 13. The computer program product of claim 11, wherein
14. The computer program product of claim 13, wherein the enqueuing module includes instructions for enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the underlying queues.
15. The computer program product of claim 14, wherein referred data sets corresponding to referred data set addresses from different ones of the underlying queues are downloaded substantially concurrently, while referred data sets corresponding to referred data set addresses from any single one of the underlying queues are downloaded one at a time.
16. The computer program product of claim 11, wherein the download priority level for each of a subset of the queue elements is determined as a function of an expiration date and time associated with document whose document address is denoted by the queue element.
17. The computer program product of claim 11, wherein the download priority level for each of a subset of the queue elements is determined as a function of a host component of the document address denoted by the queue element.
18. The computer program product of claim 11, wherein the download priority level for each of a subset of the queue elements is determined as a function of a historical rate of change of the document whose address is denoted by the queue element.

19. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses; and
  
  a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module including instructions that, when executed by a respective one of the threads, repeatedly perform the functions of;
  
  (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and
  
  (a2) executing at least one application program for processing the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  the enqueuing module including instructions for determining a download priority level for each queue element and enqueuing the queue element in one of the parallel priority level queues in accordance with the determined download priority level.
- View Dependent Claims (20)
- - 20. The computer program product of claim 19, wherein the download priority level for each of a subset of the queue elements is determined as a function of the referred document address denoted by the queue element.

21. A web crawler for downloading documents from among a plurality of host computers, comprising:
- at least one central processing unit;
  
  a plurality of threads of execution that are executed by the at least one central processing unit;
  
  memory for storing a plurality of queues;
  
  an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in the plurality of queues; and
  
  a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module including instructions that, when executed by a respective one of the threads, repeatedly perform the functions of;
  
  (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequening the identified queue element;
  
  (a2) executing at least one application program for processing the downloaded document; and
  
  (a3) determining a download priority level for maintaining freshness of the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  the dequeuing module including instructions for re-enqueuing the queue element for the downloaded document in one of the parallel priority level queues in accordance with the download priority level determined for the queue element.
- View Dependent Claims (22, 23, 24, 25, 26, 27, 28)
- - 22. The web crawler of claim 21, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.
  - 23. The web crawler of claim 21, wherein
24. The web crawler of claim 23, wherein the enqueuing module includes instructions for enqueuing those of the referred data set addresses sharing a respective common host address into a respective common one of the underlying queues.
25. The web crawler of claim 24, wherein referred data sets corresponding to referred data set addresses from different ones of the underlying queues are downloaded substantially concurrently, while referred data sets corresponding to referred data set addresses from any single one of the underlying queues are downloaded one at a time.
26. The web crawler of claim 21, wherein the download priority level for each of a subset of the queue elements is determined as a function of an expiration date and time associated with document whose document address is denoted by the queue element.
27. The web crawler of claim 21, wherein the download priority level for each of a subset of the queue elements is determined as a function of a host component of the document address denoted by the queue element.
28. The web crawler of claim 21, wherein the download priority level for each of a subset of the queue elements is determined as a function of a historical rate of change of the document whose address is denoted by the queue element.

29. A web crawler for downloading documents from among a plurality of host computers, comprising:
- at least one central processing unit;
  
  a plurality of threads of execution that are executed by the at least one central processing unit;
  
  memory for storing a plurality of queues;
  
  an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in the plurality of queues; and
  
  a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module including instructions that, when executed by a respective one of the threads, repeatedly perform the functions of;
  
  (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and
  
  (a2) executing at least one application program for processing the downloaded document;
  
  the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and
  
  the enqueuing module including instructions for determining a download priority level for each queue element and enqueuing the queue element in one of the parallel priority level queues in accordance with the determined download priority level.
- View Dependent Claims (30)
- - 30. The web crawler of claim 29, wherein the download priority level for each of a subset of the queue elements is determined as a function of the referred document address denoted by the queue element.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
R2 Solutions LLC (Acacia Research Corporation)
Original Assignee
AltaVista Company (Apollo Global Management, Inc.)
Inventors
Wiener, Janet Lynn, Najork, Marc Alexander, Heydon, Clark Allan
Primary Examiner(s)
Luu, Le Hien

Application Number

US09/433,007
Time in Patent Office

623 Days
Field of Search

709/217, 709/201, 709/226, 709/223, 709/225, 710/5, 707/10, 707/3, 707/104
US Class Current

709/217
CPC Class Codes

G06F 16/951 Indexing; Web crawling tech...

Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness

First Claim

7 Assignments

0 Petitions

Accused Products

Abstract

Citations

30 Claims

Specification

Solutions

Use Cases

Quick Links

Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness

First Claim

7 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

30 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links