Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness

  • US 6,263,364 B1
  • Filed: 11/02/1999
  • Issued: 07/17/2001
  • Est. Priority Date: 11/02/1999
  • Status: Expired due to Term
First Claim

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:

    (a) obtaining at least one referring document set that includes addresses of one or more referred documents;

    each referred document address including a host component;

    (b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;

    (c) substantially concurrently operating a plurality of threads;

    (d) while operating each thread, repeatedly performing steps of;

    (d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element; and

    (d2) executing at least one application program for processing the downloaded document;

    the plurality of queues including a plurality of parallel priority level queues, each having a distinct associated download priority level, the download priority level corresponding to a probability of the queue elements enqueued in the associated priority level queue therein being processed by the threads; and

    step (d) including determining a download priority level for maintaining freshness of the downloaded document and re-enqueuing the queue element for the downloaded document in one of the parallel priority level queues in accordance with the determined download priority level.
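The claimed loop can be illustrated with a minimal Python sketch; it is not the patent's implementation. It assumes three priority levels with illustrative weights, a caller-supplied `fetch` function standing in for the download step, and a simple freshness policy (promote a document whose content changed since the last visit, demote one that did not). The names `PriorityLevelQueues`, `next_level`, and `crawl_step` are hypothetical and do not come from the patent.

```python
import random
import threading
from collections import deque


class PriorityLevelQueues:
    """Parallel FIFO queues; a level's weight sets its chance of being served."""

    def __init__(self, weights, seed=None):
        self._weights = list(weights)          # index 0 = highest priority (assumption)
        self._queues = [deque() for _ in self._weights]
        self._rng = random.Random(seed)
        self._lock = threading.Lock()          # lets several threads share the queues

    def enqueue(self, address, level):
        with self._lock:
            self._queues[level].append(address)

    def dequeue(self):
        """Return (level, address), or None when every queue is empty."""
        with self._lock:
            ready = [i for i, q in enumerate(self._queues) if q]
            if not ready:
                return None
            weights = [self._weights[i] for i in ready]
            # Weighted random choice among the non-empty levels.
            level = self._rng.choices(ready, weights=weights)[0]
            return level, self._queues[level].popleft()

    def sizes(self):
        with self._lock:
            return [len(q) for q in self._queues]


def next_level(changed, level, num_levels):
    # Assumed freshness policy: changed documents move up one level,
    # unchanged documents move down one level.
    if changed:
        return max(0, level - 1)
    return min(num_levels - 1, level + 1)


def crawl_step(queues, fetch, process, seen, num_levels):
    """One pass of step (d): dequeue, download, process, re-enqueue."""
    item = queues.dequeue()
    if item is None:
        return False
    level, address = item
    body = fetch(address)                      # (d1) download the referred document
    process(address, body)                     # (d2) run the application program
    changed = seen.get(address) != body
    seen[address] = body
    queues.enqueue(address, next_level(changed, level, num_levels))
    return True
```

Each worker thread would call `crawl_step` in a loop against the same `PriorityLevelQueues` instance. Because every downloaded address is re-enqueued, the crawl is continuous: documents that keep changing drift toward the levels that are served most often, while stable documents drift toward the levels that are served least often.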

  • 7 Assignments