Scheduler for search engine crawler

US 7,725,452 B1
Filed: 05/20/2004
Issued: 05/25/2010
Est. Priority Date: 07/03/2003
Status: Active Grant

Abstract

A scheduler for a search engine crawler includes a history log containing document identifiers (e.g., URLs) corresponding to documents (e.g., web pages) on a network (e.g., Internet). The scheduler is configured to process each document identifier in a set of the document identifiers by determining a content change frequency of the document corresponding to the document identifier, determining a first score for the document identifier that is a function of the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and scheduling the corresponding document for indexing based on the results of the comparison.
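
The flow in the abstract (score each document identifier, compare against a threshold, schedule on the result) can be illustrated with a minimal sketch. The record layout, the function names, the use of the change frequency itself as the score, and the "at or above the threshold" rule are all assumptions made for illustration; the abstract only states that the score is a function of the content change frequency.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """Hypothetical history-log record for one document identifier."""
    url: str
    change_frequency: float  # assumed: fraction of past downloads that changed


def first_score(entry: HistoryEntry) -> float:
    # Assumption: the score is simply the change frequency. The abstract
    # only requires the score to be "a function of" that frequency.
    return entry.change_frequency


def schedule(entries, threshold):
    """Return the URLs whose first score is at or above the threshold."""
    return [e.url for e in entries if first_score(e) >= threshold]


log = [
    HistoryEntry("http://example.com/news", 0.9),
    HistoryEntry("http://example.com/about", 0.1),
]
scheduled = schedule(log, threshold=0.5)
print(scheduled)
```

In this sketch only the frequently changing page is scheduled; the rarely changing page falls below the threshold and is skipped.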

18 Claims

1. A method of scheduling document indexing, comprising:
- at a search engine crawler system having one or more processors and memory storing programs for execution by the one or more processors;
  
  retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
  
  for each retrieved document identifier and its corresponding document, determining a query-independent score indicative of a page rank of the corresponding document relative to other documents in a set of documents;
  
  determining a content change frequency of the corresponding document by comparing information stored for successive downloads of the corresponding document;
  
  determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document by the crawler system;
  
  determining a first score for the document identifier that is a function of the determined query-independent score and the determined content change frequency and the determined age of the corresponding document;
  
  comparing the first score against a threshold value; and
  
  conditionally scheduling the document for indexing based on the result of the comparison.
- - 2. The method of claim 1, wherein the scheduling of a document for indexing includes scheduling the document for a particular index segment indicated by a segment identifier associated with the document identifier.
  - 3. The method of claim 1, wherein the content change frequency for a respective document identifier is determined by comparing content checksums stored in a history log for successive downloads of the document corresponding to the document identifier.
  - 4. The method of claim 1, wherein the first score is determined from a content checksum stored in a history log.
  - 5. The method of claim 1, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 6. The method of claim 5, wherein the threshold value is further determined using a target size of a set of documents to be crawled.
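
As a rough illustration of claims 1 and 3, the sketch below derives a content change frequency from the checksums of successive downloads and combines it with a page rank value and an age. The function names, the "fraction of changed download pairs" definition, and the product used to combine the three quantities are assumptions; the claims only require that the first score be a function of all three.

```python
def change_frequency(checksums):
    """Fraction of successive download pairs whose content checksum differs."""
    if len(checksums) < 2:
        return 0.0  # not enough history to observe a change
    changes = sum(1 for a, b in zip(checksums, checksums[1:]) if a != b)
    return changes / (len(checksums) - 1)


def first_score(page_rank, checksums, last_download, now):
    """Combine page rank, change frequency, and age into one score."""
    age = now - last_download  # age is tied to the time of the last download
    # Assumption: a plain product. Any function of the three values would
    # satisfy the wording of claim 1.
    return page_rank * change_frequency(checksums) * age


history = ["a", "b", "b", "c", "d"]  # hypothetical checksums, oldest first
frequency = change_frequency(history)
score = first_score(page_rank=2.0, checksums=history, last_download=6, now=10)
print(frequency, score)
```

Here three of the four successive pairs differ, so the frequency is 0.75, and with a page rank of 2.0 and an age of 4 time units the score is 6.0.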

7. A scheduler system for a search engine crawler, comprising:
- a computer;
  
  a history log containing document identifiers corresponding to documents on a network previously indexed by the search engine crawler, wherein each document identifier has a corresponding document; and
  
  a scheduler, executed by the computer, and configured to process each document identifier in a set of the document identifiers in the history log by determining a query-independent score indicative of a page rank of the corresponding document relative to other documents on the network, determining a content change frequency of the document corresponding to the document identifier by comparing information stored for successive downloads of the corresponding document, determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document by the crawler, determining a first score for the document identifier that is a function of the query-independent score and the determined content change frequency and the determined age of the corresponding document, comparing the first score against a threshold value, and conditionally scheduling the corresponding document for indexing based on the results of the comparison, wherein the history log and scheduler are stored on computer-readable media.
- - 8. The scheduler system of claim 7, wherein the document is scheduled for a particular index segment indicated by a segment identifier in the history log.
  - 9. The scheduler system of claim 7, wherein the scheduler is configured to determine the content change frequency for a respective document identifier by comparing content checksums stored in the history log for successive downloads of the document corresponding to the document identifier.
  - 10. The scheduler system of claim 7, wherein the scheduler is configured to determine the content change frequency from a content checksum stored in a history log.
  - 11. The scheduler system of claim 7, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 12. The scheduler system of claim 11, wherein the threshold value is further determined using a target size of a set of documents to be crawled.
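
Claims 11 and 12 say the threshold is determined from scores computed over a sample set and from a target size for the crawl, without giving a formula. One plausible reading, sketched below purely as an assumption, is to pick the sample score at the rank that would let roughly the target number of documents through. The function name and the `total_documents` parameter are hypothetical.

```python
def threshold_from_sample(sample_scores, target_size, total_documents):
    """Pick a cutoff so roughly target_size of total_documents pass it."""
    if not sample_scores:
        raise ValueError("sample_scores must not be empty")
    # Fraction of all known documents that the crawl has room for.
    keep_fraction = min(1.0, target_size / total_documents)
    ranked = sorted(sample_scores, reverse=True)
    # Number of sample entries that should pass, at least one.
    keep_count = max(1, round(keep_fraction * len(ranked)))
    return ranked[keep_count - 1]


sample = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.2, 0.6, 0.5, 1.0]
threshold = threshold_from_sample(sample, target_size=300, total_documents=1000)
print(threshold)
```

With room for 300 of 1,000 documents, the top three of the ten sample scores should pass, so the cutoff lands on the third-highest score, 0.8. Documents scoring at or above that value would then be scheduled.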

13. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the operations of:
- retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network;
  
  for each retrieved document identifier, determining a query-independent score indicative of a page rank of the corresponding document relative to other documents in a set of documents;
  
  determining a content change frequency of the corresponding document identifier by comparing information stored for successive downloads of the corresponding document, determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document, and determining a first score for the document identifier that is a function of the query-independent score and the determined content change frequency and the determined age of the corresponding document;
  
  comparing the first score against a threshold value; and
  
  conditionally scheduling the document for indexing based on the result of the comparison.
- - 14. A computer-readable storage medium of claim 13, wherein the scheduling of a document for indexing includes scheduling the document for a particular index segment indicated by a segment identifier associated with the document identifier.
  - 15. A computer-readable storage medium of claim 13, wherein determining the content change frequency for a respective document identifier includes comparing content checksums stored in a history log for successive downloads of the document corresponding to the document identifier.
  - 16. A computer-readable storage medium of claim 13, wherein determining the content change frequency for a respective document identifier comprises determining the content change frequency from a content checksum stored in a history log.
  - 17. A computer-readable storage medium of claim 13, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 18. A computer-readable storage medium of claim 17, wherein the threshold value is further determined using a target size of a set of documents to be crawled.
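
Claim 14 adds that a scheduled document is assigned to the index segment named by a segment identifier associated with its document identifier. A minimal sketch of that grouping follows; the tuple layout `(url, segment_id, score)`, the function name, and the "at or above the threshold" rule are assumptions made for illustration.

```python
from collections import defaultdict


def schedule_by_segment(records, threshold):
    """Group the URLs that pass the threshold by their segment identifier."""
    segments = defaultdict(list)
    for url, segment_id, score in records:
        if score >= threshold:  # assumed comparison direction
            segments[segment_id].append(url)
    return dict(segments)


records = [
    ("http://example.com/a", 0, 0.9),
    ("http://example.com/b", 1, 0.2),
    ("http://example.com/c", 0, 0.7),
]
plan = schedule_by_segment(records, threshold=0.5)
print(plan)
```

Only segment 0 receives work in this example, because the single document tied to segment 1 scores below the threshold.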

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Randall, Keith H.
Primary Examiner(s): Stevens; Robert
Application Number: US10/853,627
Time in Patent Office: 2,196 Days
Field of Search: 707/2, 707/10, 707/104.1, 707/705, 707/709
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
