System and method for associating an extensible set of data with documents downloaded by a web crawler

US 6,351,755 B1
Filed: 11/02/1999
Issued: 02/26/2002
Est. Priority Date: 11/02/1999
Status: Expired due to Term

First Claim

Patent Images

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:

(a) obtaining at least one referring document set that includes addresses of one or more referred documents;

each referred document address including a host component;

(b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;

each queue element including a download history comprising zero or more records;

(c) substantially concurrently operating a plurality of threads;

(d) while operating each thread, repeatedly performing steps of;

(d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;

(d2) adding a record to the queue element;

(d3) executing at least one application program, distinct from a web crawler application that performs the downloading and dequeuing, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and

(d4) storing the queue element, including the added record, in a predefined data structure for further processing.

View all claims

7 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A web crawler downloads documents from among a plurality of host computers. The web crawler enqueues document addresses in a data structure called the Frontier. The Frontier generally includes a set of queues, with all document addresses sharing a respective common host component being stored in a respective common one of the queues. Multiple threads substantially concurrently process the document addresses in the queues. The web crawler includes a set of tools for storing an extensible set of data with each document address in the Frontier. These tools enable the applications to which the web crawler passes downloaded documents to store a record of information associated with each download, where each record of information includes an extensible set of name/value pairs specified by the applications. The applications also determine how many records of information to retain for each document, when to delete records of information, and so on. In another aspect of the present invention, the Frontier include a set of parallel “priority queues,” each associated with a distinct priority level. Queue elements for documents to be downloaded are assigned a priority level, and then stored in the corresponding priority queue. Queue elements are then distributed from the priority queues to a set of underlying queues in accordance with their relative priorities. The threads then process the queue elements in the underlying queues.

257 Citations

29 Claims

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:
- (a) obtaining at least one referring document set that includes addresses of one or more referred documents;
  
  each referred document address including a host component;
  
  (b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;
  
  each queue element including a download history comprising zero or more records;
  
  (c) substantially concurrently operating a plurality of threads;
  
  (d) while operating each thread, repeatedly performing steps of;
  
  (d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;
  
  (d2) adding a record to the queue element;
  
  (d3) executing at least one application program, distinct from a web crawler application that performs the downloading and dequeuing, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and
  
  (d4) storing the queue element, including the added record, in a predefined data structure for further processing.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.
  - 3. The method of claim 2, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.
  - 4. The method of claim 1, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.
  - 5. The method of claim 1, wherein
6. The method of claim 5, whereinthe name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document'"'"'s content changed between the downloads of the document corresponding to the records in which the content based values are stored;
- step d4 includes determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.
7. The method of claim 6, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.
8. The method of claim 5, whereinthe name/value pairs stored in each of the records in the queue element include a purported expiration date and time;
- and step d4 includes comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.
9. The method of claim 1, wherein the name/value pairs stored in each of the records by the at least one application program are dynamically extensible by the at least one application program.

10. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;
  
  each queue element including a download history comprising zero or more records; and
  
  a dequeuing module that is substantially concurrently executed by each of a plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;
  
  (a2) adding a record to the queue element;
  
  (a3) executing at least one application program, distinct from the enqueuing module and dequeuing module, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and
  
  (a4) storing the queue element, including the added record, in a predefined data structure for further processing.
- View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18, 19)
- - 11. The computer program product of claim 10, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.
  - 12. The computer program product of claim 10, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.
  - 13. The computer program product of claim 10, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.
  - 14. The computer program product of claim 10, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the records in the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.
  - 15. The computer program product of claim 10, wherein
16. The computer program product of claim 15, whereinthe name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document'"'"'s content changed between the downloads of the document corresponding to the records in which the content based values are stored;
- and the computer program product includes instructions for determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.
17. The computer program product of claim 16, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.
18. The computer program product of claim 15, whereinthe name/value pairs stored in each of the records in the queue element include a purported expiration date and time;
- and the computer program product includes instructions for comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.
19. The computer program product of claim 10, wherein the name/value pairs stored in each of the records by the at least one application program are dynamically extensible by the at least one application program.

20. A computer system for downloading documents from among a plurality of host computers, comprising:
- a plurality of threads of execution;
  
  an enqueuing module that, when executed by the computer system, obtains at least one referring document that includes addresses of one or more referred documents, each referred document address including a host component corresponding to a host computer, and enqueues queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;
  
  each queue element including a download history comprising zero or more records; and
  
  a dequeuing module that is substantially concurrently executed by each of the plurality of threads so as to process the referred document addresses in the queues;
  
  the dequeuing module, when executed by a respective one of the threads, repeatedly performs the functions of;
  
  (a1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;
  
  (a2) adding a record to the queue element;
  
  (a3) executing at least one application program, distinct from the enqueuing module and dequeuing module, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value is determined by the at least one application program; and
  
  (a4) storing the queue element, including the added record, in a predefined data structure for further processing.
- View Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29)
- - 21. The computer system of claim 20, wherein said enqueuing module is configured to use at least one of the downloaded referred documents as a new referring document.
  - 22. The computer system of claim 20, wherein the at least one application program includes instructions for deleting a subset of the records in the queue element in accordance with predefined record deletion criteria.
  - 23. The computer system of claim 20, wherein the at least one application program includes instructions for determining a number corresponding to how many records are in the queue element, and the predefined record deletion criteria include the number of records in the queue element exceeding a threshold value.
  - 24. The computer system of claim 20, wherein the at least one application program includes instructions for reading the name/value pairs in at least one of the records in the queue element and for conditionally performing an action based on the value in at least one of the name/value pairs read by the at least one application program.
  - 25. The computer system of claim 20, wherein
26. The computer system of claim 25, whereinthe name/value pairs stored in each of the records in the queue element include at least one content based value which can be compared with a corresponding content based value in another of the records to determine whether the document'"'"'s content changed between the downloads of the document corresponding to the records in which the content based values are stored;
- and the at least one application program includes instructions for determining a download priority level for the document associated with the queue element as a function of whether the content based value in a last one of the records in the queue element is not equal to the corresponding content based value in an earlier one of the records in the queue element.
27. The computer system of claim 25, wherein the content base value is a checksum of the contents of the document corresponding to the queue element.
28. The computer system of claim 25, whereinthe name/value pairs stored in each of the records in the queue element include a purported expiration date and time;
- and the at least one application program includes instructions for comparing the purported expiration date and time with at least one other date and time value and assigning the queue element a download priority level in accordance with an outcome of the comparison.
29. The computer system of claim 20, wherein the name/value pairs stored in each of the records by at the least one application program are dynamically extensible by the at least one application program.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
R2 Solutions LLC (Acacia Research Corporation)
Original Assignee
AltaVista Company (Apollo Global Management, Inc.)
Inventors
Najork, Marc Alexander, Heydon, Clark Allan
Primary Examiner(s)
Hong, Stephen S.

Application Number

US09/433,006
Time in Patent Office

847 Days
Field of Search

707/3, 707/4, 707/5, 707/102, 707/501, 707/513
US Class Current

715/206
CPC Class Codes

G06F 16/951 Indexing; Web crawling tech...

System and method for associating an extensible set of data with documents downloaded by a web crawler

First Claim

7 Assignments

0 Petitions

Accused Products

Abstract

257 Citations

29 Claims

Specification

Use Cases

Quick Links

Others

System and method for associating an extensible set of data with documents downloaded by a web crawler

First Claim

7 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

257 Citations

29 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others