
Scheduler for search engine crawler

  • US 8,042,112 B1
  • Filed: 06/30/2004
  • Issued: 10/18/2011
  • Est. Priority Date: 07/03/2003
  • Status: Active Grant
First Claim

1. A scheduler system for a search engine crawler, comprising:

    a memory for storing a set of document identifiers corresponding to documents on a network and associated status data collected during one or more previous crawls by the search engine crawler; and

    a plurality of schedulers configured to select a subset of the document identifiers for crawling, at least some of the schedulers configured to compute priority scores for the subset of document identifiers and to schedule for crawling at least a portion of the subset of document identifiers based on the priority scores and status data;

    wherein the scheduling includes removing from the subset one or more document identifiers that were unreachable in a plurality of consecutive prior crawls or had download errors in a plurality of consecutive prior crawls; and

    wherein at least some of the schedulers are configured to:

    select one or more document identifiers whose corresponding documents will be reused based on historical status data; and

    retrieve and index documents corresponding to the selected document identifiers, wherein the documents are retrieved from a repository distinct from web hosts corresponding to the document identifiers.
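
The scheduling steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation. The record fields, the thresholds `MAX_FAILURES` and `REUSE_AFTER`, the use of a single `rank` value as the priority score, and the "unchanged in recent crawls" reuse rule are all assumptions made for illustration; the claim only requires priority scores, status data from previous crawls, removal after a plurality of consecutive failures, and reuse based on historical status data.

```python
from dataclasses import dataclass


@dataclass
class UrlRecord:
    """Hypothetical per-document status data kept from previous crawls."""
    url: str
    rank: float                 # assumed input to the priority score
    unreachable_streak: int = 0  # consecutive crawls where the host was unreachable
    error_streak: int = 0        # consecutive crawls with download errors
    unchanged_streak: int = 0    # consecutive crawls where content did not change


MAX_FAILURES = 2  # assumed threshold; "a plurality" means at least two
REUSE_AFTER = 3   # assumed threshold for reusing a stored copy


def priority_score(rec: UrlRecord) -> float:
    # Assumption: the score is just the stored rank value.
    return rec.rank


def schedule(records, budget):
    """Return (urls to download from web hosts, urls to reuse from a repository)."""
    # Remove identifiers that repeatedly failed in consecutive prior crawls.
    kept = [
        r for r in records
        if r.unreachable_streak < MAX_FAILURES and r.error_streak < MAX_FAILURES
    ]
    # Schedule the highest-priority portion of the remaining subset.
    ranked = sorted(kept, key=priority_score, reverse=True)[:budget]
    # Documents judged stable are reused from the repository instead of refetched.
    reuse = [r.url for r in ranked if r.unchanged_streak >= REUSE_AFTER]
    crawl = [r.url for r in ranked if r.unchanged_streak < REUSE_AFTER]
    return crawl, reuse
```

In this sketch, the `reuse` list stands for documents that would be read back from a repository of earlier downloads and indexed without contacting their web hosts, while the `crawl` list goes to the crawler as usual.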
