Scheduler for search engine crawler

US 8,707,313 B1
Filed: 02/18/2011
Issued: 04/22/2014
Est. Priority Date: 07/03/2003
Status: Expired due to Fees

Abstract

A search engine crawler includes a distributed set of schedulers that are associated with one or more segments of document identifiers (e.g., URLs) corresponding to documents on a network (e.g., WWW). Each scheduler handles the scheduling of document identifiers (for crawling) for a subset of the known document identifiers. Using a starting set of document identifiers, such as the document identifiers crawled (or scheduled for crawling) during the most recent completed crawl, the scheduler removes from the starting set those document identifiers that have been unreachable in each of the last X crawls. Other filtering mechanisms may also be used to filter out some of the document identifiers in the starting set. The resulting list of document identifiers is written to a scheduled output file for use in a next crawl cycle.
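The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes each document identifier carries a list of per-crawl reachability results, oldest first; the function name `filter_unreachable`, the `history` mapping, and the parameter `x` are hypothetical names chosen for illustration and do not come from the patent.

```python
from typing import Dict, List


def filter_unreachable(history: Dict[str, List[bool]], x: int) -> List[str]:
    """Return the identifiers that survive the unreachable filter.

    `history` maps a document identifier to its reachability results,
    oldest crawl first (True = reachable). This layout is an assumption.
    An identifier is dropped only if it was unreachable in each of its
    last `x` crawls; identifiers with fewer than `x` results are kept.
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    kept = []
    for doc_id, results in history.items():
        recent = results[-x:]
        unreachable_every_time = len(recent) == x and not any(recent)
        if not unreachable_every_time:
            kept.append(doc_id)
    return kept


if __name__ == "__main__":
    crawl_history = {
        "http://a.example/": [True, False, False],
        "http://b.example/": [False, False, False],
    }
    # With x = 3, only the identifier that failed all three crawls is removed.
    print(filter_unreachable(crawl_history, x=3))
```

Keeping identifiers with a short history is a design choice of this sketch: a newly discovered identifier has not yet had `x` chances to be reached.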

15 Claims

1. A scheduler system for a search engine crawler, comprising:
- a memory for storing a set of document identifiers corresponding to documents on a network and associated status data collected during one or more previous crawls by the search engine crawler; and
  
  a plurality of schedulers configured to select a subset of the document identifiers for crawling, at least some of the schedulers configured to compute priority scores for the subset of document identifiers and to schedule for crawling at least a portion of the subset of document identifiers based on the priority scores and status data;
  
  wherein the scheduling includes removing from the subset one or more document identifiers that were unreachable in a plurality of consecutive prior crawls or had download errors in a plurality of consecutive prior crawls; and
  
  wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
- Dependent claims (2, 3, 4):
  - 2. The system of claim 1, wherein the document identifiers stored in the memory are divided into multiple segments, and wherein each of a plurality of schedulers is configured to schedule a partition of the document identifiers that includes document identifiers assigned to each of the multiple segments.
  - 3. The system of claim 1, further comprising an unscheduled document identifier file for storing the document identifiers not scheduled for crawling by one or more of the plurality of schedulers.
  - 4. The system of claim 1, wherein each of the plurality of schedulers is configured to sort document identifiers by priority scores and schedule a predefined number N of the sorted document identifiers having highest priority scores.
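
The priority computation in claim 1 and the top-N selection in claim 4 can be sketched as follows. This is an illustrative sketch only: the `Candidate` class, the `schedule_top_n` function, and the convention that a boost above 1.0 promotes while a boost below 1.0 demotes are assumptions, not details stated in the claims.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    doc_id: str
    page_importance: float
    boost: float = 1.0  # assumed convention: >1 promotes, <1 demotes

    @property
    def priority(self) -> float:
        # Priority score as a product of page importance and boost factor.
        return self.page_importance * self.boost


def schedule_top_n(candidates: List[Candidate], n: int) -> List[Candidate]:
    """Sort by priority score, highest first, and keep the first n."""
    ranked = sorted(candidates, key=lambda c: c.priority, reverse=True)
    return ranked[: max(n, 0)]


if __name__ == "__main__":
    pool = [
        Candidate("http://a.example/", page_importance=0.8, boost=0.5),
        Candidate("http://b.example/", page_importance=0.5),
        Candidate("http://c.example/", page_importance=0.3, boost=2.0),
    ]
    for c in schedule_top_n(pool, n=2):
        print(c.doc_id, c.priority)
```

In this example the boost factor reorders the pool: the page with the highest raw importance is demoted below the other two.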

5. A method of scheduling a search engine crawler using a scheduler system, comprising:
- at a server with one or more processors and memory, and one or more programs stored in the memory that execute on the one or more processors;
  
  selecting a first subset of document identifiers from a set of document identifiers corresponding to documents on a network;
  
  computing priority scores for the first subset of document identifiers;
  
  forming a second subset of document identifiers based on the priority scores and status data collected during one or more previous crawls by the search engine crawler and by removing from the first subset one or more document identifiers identified as unreachable in a plurality of prior crawls or associated with download errors in a plurality of prior crawls; and
  
  scheduling for crawling the second subset of document identifiers;
  
  wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
- Dependent claims (6, 7, 8, 9, 10):
  - 6. The method of claim 5, further comprising:
    - applying at least one exception filter to at least a plurality of the document identifiers to identify document identifiers to be excluded from a next crawl.
  - 7. The method of claim 5, further comprising:
    - removing document identifiers from the second subset in accordance with one or more scheduler limits.
  - 8. The method of claim 5, further comprising:
    - storing unscheduled document identifiers for crawling at a later time.
  - 9. The method of claim 5, further comprising:
    - sorting the first subset of document identifiers by priority scores; and
      
      scheduling a portion of the sorted document identifiers to be crawled.
  - 10. The method of claim 5, wherein the document identifiers are Uniform Resource Locators (URLs) and the documents are located on the World Wide Web (WWW).
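
Claims 6 through 8 add exception filters, scheduler limits, and storage of unscheduled identifiers. The sketch below shows one way these could fit together. The claims do not say what form a scheduler limit takes, so the per-host cap used here is purely an assumption, as are the names `apply_limits`, `ExceptionFilter`, and `max_per_host`. Identifiers matched by an exception filter are dropped from both output lists, which is also an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import urlsplit

# Hypothetical filter type: returns True when a URL must be excluded.
ExceptionFilter = Callable[[str], bool]


def apply_limits(
    ranked_urls: Iterable[str],
    exception_filters: List[ExceptionFilter],
    max_per_host: int,
) -> Tuple[List[str], List[str]]:
    """Split priority-ordered URLs into (scheduled, unscheduled) lists."""
    scheduled: List[str] = []
    unscheduled: List[str] = []
    per_host: Dict[str, int] = defaultdict(int)
    for url in ranked_urls:
        if any(f(url) for f in exception_filters):
            continue  # excluded from the next crawl entirely
        host = urlsplit(url).hostname or ""
        if per_host[host] >= max_per_host:
            unscheduled.append(url)  # kept for crawling at a later time
        else:
            per_host[host] += 1
            scheduled.append(url)
    return scheduled, unscheduled


if __name__ == "__main__":
    urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
        "http://a.example/private/x",
    ]
    block_private = lambda u: "/private/" in u
    print(apply_limits(urls, [block_private], max_per_host=1))
```

Because the input is assumed to be sorted by priority already, the cap retains the highest-priority identifiers for each host and defers the rest.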

11. A non-transitory computer-readable medium having stored thereon instructions, which when executed by a processor in a scheduler system for a search engine crawler, cause the processor to perform the operations of:
- selecting a first subset of document identifiers from a set of document identifiers corresponding to documents on a network;
  
  computing priority scores for the first subset of document identifiers;
  
  forming a second subset of document identifiers based on the priority scores and status data collected during one or more previous crawls by the search engine crawler and by removing from the first subset one or more document identifiers identified as unreachable in a plurality of prior crawls or associated with download errors in a plurality of prior crawls; and
  
  scheduling for crawling the second subset of document identifiers;
  
  wherein each priority score is computed as a product of a respective page importance and a respective boost factor, the respective boost factor being a scalar that is used to demote or promote the priority score of the respective document identifier.
- Dependent claims (12, 13, 14, 15):
  - 12. The non-transitory computer-readable medium of claim 11, further comprising instructions for applying at least one exception filter to at least a plurality of the document identifiers to identify document identifiers to be excluded from a next crawl.
  - 13. The non-transitory computer-readable medium of claim 11, further comprising instructions for removing document identifiers from the second subset in accordance with one or more scheduler limits.
  - 14. The non-transitory computer-readable medium of claim 11, further comprising instructions for storing unscheduled document identifiers for crawling at a later time.
  - 15. The non-transitory computer-readable medium of claim 11, further comprising:
    - instructions for selecting one or more document identifiers whose corresponding documents will be reused based on historical status data; and
      
      instructions for retrieving and indexing documents corresponding to the selected document identifiers, wherein the documents are retrieved from a repository distinct from web hosts corresponding to the document identifiers.
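
The reuse step in claim 15 can be illustrated with a small sketch. The claim says only that reuse is decided from historical status data; the rule used here (reuse when the last `k` recorded content checksums are identical) is an assumption made for illustration, and `should_reuse`, `fetch`, `repository`, and `download` are hypothetical names.

```python
from typing import Callable, Dict, List


def should_reuse(checksums: List[str], k: int) -> bool:
    """True if at least k checksums exist and the last k are identical.

    Treating an unchanged checksum as the reuse signal is an assumption.
    """
    if k < 1 or len(checksums) < k:
        return False
    recent = checksums[-k:]
    return all(c == recent[0] for c in recent)


def fetch(
    doc_id: str,
    history: Dict[str, List[str]],
    repository: Dict[str, str],
    download: Callable[[str], str],
    k: int = 3,
) -> str:
    """Return the stored copy for reusable documents, else download."""
    if doc_id in repository and should_reuse(history.get(doc_id, []), k):
        return repository[doc_id]  # read from the repository, not the web host
    return download(doc_id)


if __name__ == "__main__":
    history = {"u1": ["x", "x", "x"], "u2": ["x", "y", "y"]}
    repository = {"u1": "stored copy of u1", "u2": "stored copy of u2"}
    download = lambda d: "fresh download of " + d
    print(fetch("u1", history, repository, download))
    print(fetch("u2", history, repository, download))
```

The `download` callable stands in for a request to the web host, so a reused document never touches that host in this sketch.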

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Zhu, Huican; Ibel, Maximilian; Acharya, Anurag; Gobioff, Howard Bradley
Primary Examiner(s): Truong, Camquy
Application Number: US13/031,011
Time in Patent Office: 1,159 Days
Field of Search: None
US Class Current: 718/102
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
