System and method for associating an extensible set of data with documents downloaded by a web crawler

  • US 6,351,755 B1
  • Filed: 11/02/1999
  • Issued: 02/26/2002
  • Est. Priority Date: 11/02/1999
  • Status: Expired due to Term
First Claim

1. A method of performing a continuous crawl for locating and downloading documents from among a plurality of host computers, comprising:

    (a) obtaining at least one referring document set that includes addresses of one or more referred documents;

    each referred document address including a host component;

    (b) enqueuing queue elements in a plurality of queues, each queue element denoting one of the referred document addresses;

    each queue element including a download history comprising zero or more records;

    (c) substantially concurrently operating a plurality of threads;

    (d) while operating each thread, repeatedly performing steps of:

    (d1) identifying a queue element in a selected one of the queues, downloading a referred document corresponding to a referred document address in the identified queue element, and dequeuing the identified queue element;

    (d2) adding a record to the queue element;

    (d3) executing at least one application program, distinct from a web crawler application that performs the downloading and dequeuing, for processing the downloaded document, the at least one application program including instructions that store name/value pairs in the record added to the queue element, wherein the name of each name/value pair is specified by the at least one application program and the value of each name/value pair is determined by the at least one application program; and

    (d4) storing the queue element, including the added record, in a predefined data structure for further processing.
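
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `QueueElement`, `crawl`, `fake_download`, and `length_app` are hypothetical, the download is simulated rather than fetched over a network, and assigning addresses to queues by host component is an assumption made only for illustration. The sketch also performs a single pass, whereas a continuous crawl would re-enqueue the stored elements.

```python
import queue
import threading
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class QueueElement:
    """Denotes one referred document address (step b)."""
    address: str
    # Download history: zero or more records, each a dict of name/value pairs.
    history: list = field(default_factory=list)


def fake_download(address):
    # Stand-in for a real fetch so the sketch runs without network access.
    return f"<html>contents of {address}</html>"


def length_app(document, record):
    # An "application program" separate from the crawler (step d3):
    # it chooses both the name and the value of the pair it stores.
    record["length"] = len(document)


def crawl(addresses, apps, num_queues=2, num_threads=2, download=fake_download):
    queues = [queue.Queue() for _ in range(num_queues)]
    for address in addresses:
        host = urlsplit(address).hostname or ""
        # Assumption: pick a queue from the host component of the address.
        queues[sum(host.encode()) % num_queues].put(QueueElement(address))

    store = {}  # the "predefined data structure" of step d4
    lock = threading.Lock()

    def worker(index):
        while True:
            # Step d1: identify and dequeue an element from some queue.
            for offset in range(num_queues):
                try:
                    element = queues[(index + offset) % num_queues].get_nowait()
                    break
                except queue.Empty:
                    continue
            else:
                return  # every queue is empty, so this thread stops
            document = download(element.address)
            record = {}
            element.history.append(record)  # step d2
            for app in apps:                # step d3
                app(document, record)
            with lock:                      # step d4
                store[element.address] = element

    # Step c: operate several threads substantially concurrently.
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store
```

Calling `crawl(["http://a.example/x", "http://b.example/y"], [length_app])` returns a dictionary keyed by address, where each stored element carries one history record holding the `length` pair written by the application. Because the crawler only passes a record to each application, new applications can attach new names without any change to the crawler itself, which is the extensibility the title describes.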
