Scheduler for search engine crawler
First Claim
1. A scheduler system for a search engine crawler, comprising:
a memory for storing a set of document identifiers corresponding to documents on a network and associated status data collected during one or more previous crawls by the search engine crawler; and
a plurality of schedulers configured to select a subset of the document identifiers for crawling, at least some of the schedulers configured to compute priority scores for the subset of document identifiers and to schedule for crawling at least a portion of the subset of document identifiers based on the priority scores and status data;
wherein the scheduling includes removing from the subset one or more document identifiers that were unreachable in a plurality of consecutive prior crawls or had download errors in a plurality of consecutive prior crawls; and
wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
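The scoring and removal steps recited in the claim can be illustrated with a minimal sketch. The record fields, the `max_failures` threshold of 2 (one reading of "a plurality of consecutive prior crawls"), and ordering the survivors by descending score are all assumptions made for illustration; the claim states only that the priority score is the product of page importance and a scalar boost factor.

```python
from dataclasses import dataclass


@dataclass
class CrawlRecord:
    """Hypothetical per-URL status data kept between crawls."""
    url: str
    page_importance: float
    boost_factor: float = 1.0  # >1 promotes, <1 demotes, 1 leaves the score unchanged
    consecutive_unreachable: int = 0
    consecutive_download_errors: int = 0


def priority_score(record):
    """Priority score as the product of page importance and boost factor."""
    return record.page_importance * record.boost_factor


def schedule(records, max_failures=2):
    """Drop URLs that failed in max_failures or more consecutive crawls,
    then order the rest by descending priority score (assumed ordering)."""
    kept = [
        r for r in records
        if r.consecutive_unreachable < max_failures
        and r.consecutive_download_errors < max_failures
    ]
    return sorted(kept, key=priority_score, reverse=True)
```

In this sketch a boost factor of 2.0 lifts a page with importance 0.5 above a page with importance 0.8 and the default boost, which is the promotion effect the claim describes.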
Abstract
A search engine crawler includes a distributed set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers (for crawling) for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled (or scheduled for crawling) during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in each of the last X crawls. Other filtering mechanisms may also be used to filter out some of the document identifiers in the starting set. The resulting list of document identifiers is written to a scheduled output file for use in a next crawl cycle.
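The per-segment filtering described in the abstract might look roughly like the following sketch. The hash-based segment assignment, the status string `"unreachable"`, the default `x=3`, and all function names are assumptions chosen for illustration; the abstract does not specify how segments are assigned or how crawl history is stored.

```python
import hashlib


def segment_for(url, num_segments):
    """Map a URL to a segment index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def unreachable_in_last(history, x):
    """True if at least x outcomes are recorded and the last x are all
    'unreachable'. x is assumed to be a positive integer."""
    recent = history[-x:]
    return len(recent) == x and all(status == "unreachable" for status in recent)


def build_schedule(starting_set, histories, segment, num_segments, x=3):
    """Return the URLs in this scheduler's segment that survive filtering."""
    return [
        url for url in starting_set
        if segment_for(url, num_segments) == segment
        and not unreachable_in_last(histories.get(url, []), x)
    ]


def write_schedule(urls, out):
    """Write one URL per line to a file-like object for the next crawl cycle."""
    for url in urls:
        out.write(url + "\n")
```

Each scheduler would call `build_schedule` with its own `segment` value, so every URL in the starting set is considered by exactly one scheduler. A URL with fewer than `x` recorded crawls is kept, since it cannot yet have been unreachable in each of the last `x` crawls.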
15 Claims
5. A method of scheduling a search engine crawler using a scheduler system, comprising:
at a server with one or more processors and memory, and one or more programs stored in the memory that execute on the one or more processors;
selecting a first subset of document identifiers from a set of document identifiers corresponding to documents on a network;
computing priority scores for the first subset of document identifiers;
forming a second subset of document identifiers based on the priority scores and status data collected during one or more previous crawls by the search engine crawler and by removing from the first subset one or more document identifiers identified as unreachable in a plurality of prior crawls or associated with download errors in a plurality of prior crawls; and
scheduling for crawling the second subset of document identifiers;
wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
11. A non-transitory computer-readable medium having stored thereon instructions, which when executed by a processor in a scheduler system for a search engine crawler, cause the processor to perform the operations of:
selecting a first subset of document identifiers from a set of document identifiers corresponding to documents on a network;
computing priority scores for the first subset of document identifiers;
forming a second subset of document identifiers based on the priority scores and status data collected during one or more previous crawls by the search engine crawler and by removing from the first subset one or more document identifiers identified as unreachable in a plurality of prior crawls or associated with download errors in a plurality of prior crawls; and
scheduling for crawling the second subset of document identifiers;
wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
Specification