Scheduler for search engine crawler

US 8,161,033 B2
Filed: 05/25/2010
Issued: 04/17/2012
Est. Priority Date: 07/03/2003
Status: Active Grant

First Claim

1. A method of scheduling document indexing, comprising:

at a search engine crawler system having one or more processors and memory storing programs for execution by the one or more processors;

retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and

for each retrieved document identifier and its corresponding document, determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;

determining a content change frequency of the corresponding document by comparing information stored for successive downloads of the corresponding document;

determining a first score for the document identifier that is a function of both the determined query-independent score and the determined content change frequency of the corresponding document;

comparing the first score against a threshold value; and

conditionally scheduling the document for indexing based on the result of the comparison.

Abstract

A scheduler for a search engine crawler includes a history log containing document identifiers (e.g., URLs) corresponding to documents (e.g., web pages) on a network (e.g., Internet). The scheduler is configured to process each document identifier in a set of the document identifiers by determining a content change frequency of the document corresponding to the document identifier, determining a first score for the document identifier that is a function of the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and scheduling the corresponding document for indexing based on the results of the comparison.
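The flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the information stored for each download is a content checksum, that change frequency is measured as observed changes per day, that the first score is simply the change frequency itself, and that a document is scheduled when its score meets or exceeds the threshold. The names `Download`, `change_frequency`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Download:
    timestamp: float  # seconds since epoch when the document was fetched
    checksum: str     # assumed stored information: a digest of the content


def change_frequency(history: list[Download]) -> float:
    """Estimate changes per day by comparing checksums of successive downloads."""
    if len(history) < 2:
        return 0.0
    ordered = sorted(history, key=lambda d: d.timestamp)
    changes = sum(
        1 for prev, cur in zip(ordered, ordered[1:]) if prev.checksum != cur.checksum
    )
    elapsed_days = (ordered[-1].timestamp - ordered[0].timestamp) / 86400.0
    return changes / elapsed_days if elapsed_days > 0 else 0.0


def schedule(history_log: dict[str, list[Download]], threshold: float) -> list[str]:
    """Return the document identifiers whose score meets the threshold."""
    scheduled = []
    for url, downloads in history_log.items():
        # Assumption: the first score is the change frequency itself.
        score = change_frequency(downloads)
        if score >= threshold:
            scheduled.append(url)
    return scheduled
```

In this sketch a page fetched three times over two days with one observed change gets a score of 0.5 changes per day, while a page that never changed gets 0.0 and is left out of the schedule.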

15 Claims

1. A method of scheduling document indexing, comprising:
- at a search engine crawler system having one or more processors and memory storing programs for execution by the one or more processors;
  
  retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
  
  for each retrieved document identifier and its corresponding document, determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
  
  determining a content change frequency of the corresponding document by comparing information stored for successive downloads of the corresponding document;
  
  determining a first score for the document identifier that is a function of both the determined query-independent score and the determined content change frequency of the corresponding document;
  
  comparing the first score against a threshold value; and
  
  conditionally scheduling the document for indexing based on the result of the comparison.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1, wherein generating the first score includes generating the first score from the product of the square of the query-independent score and the determined content change frequency.
  - 3. The method of claim 1, wherein the scheduling of a document for indexing includes assigning the document to a particular crawl segment by associating the document identifier with a segment identifier corresponding to the crawl segment.
  - 4. The method of claim 1, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 5. The method of claim 4, wherein the threshold value is further determined using a target size of a set of documents to be crawled.
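Claims 2, 4, and 5 can be read together as a scoring rule plus a threshold derived from a sample. The Python sketch below illustrates one possible reading. The score formula follows claim 2 directly. The threshold rule is an assumption, since the claims only state that the threshold uses sample scores and a target crawl size: here it is the sample score at the rank matching the fraction `target_size / population_size`. The names `first_score`, `threshold_from_sample`, and `population_size` are hypothetical.

```python
def first_score(query_independent_score: float, change_freq: float) -> float:
    """Claim 2: product of the squared query-independent score and the change frequency."""
    return query_independent_score ** 2 * change_freq


def threshold_from_sample(
    sample_scores: list[float], target_size: int, population_size: int
) -> float:
    """Pick a threshold so roughly target_size of population_size documents pass.

    Assumption: the sample is representative, so the same fraction of the
    sample should score at or above the threshold.
    """
    if not sample_scores:
        raise ValueError("sample_scores must not be empty")
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    ranked = sorted(sample_scores, reverse=True)
    # Number of sample documents that should pass, clamped to [1, len(ranked)].
    keep = (target_size * len(ranked)) // population_size
    keep = max(1, min(len(ranked), keep))
    return ranked[keep - 1]
```

For example, with ten sample scores from 1 through 10, a target of 30 documents, and a population of 100, the top three sample scores should pass, so the threshold lands on 8.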

6. A scheduler system for a search engine crawler, comprising:
- a computer;
  
  a history log containing document identifiers corresponding to documents on a network previously indexed by the search engine crawler, wherein each document identifier has a corresponding document; and
  
  a scheduler, executed by the computer, and configured to process each document identifier in a set of the document identifiers in the history log by determining a query-independent score indicative of a rank of the corresponding document relative to other documents on the network, determining a content change frequency of the document corresponding to the document identifier by comparing information stored for successive downloads of the corresponding document, determining a first score for the document identifier that is a function of both the query-independent score and the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and conditionally scheduling the corresponding document for indexing based on the results of the comparison, wherein the history log and scheduler are stored on computer-readable media.
- Dependent claims (7, 8, 9, 10):
  - 7. The scheduler system of claim 6, wherein the scheduler is configured to generate the first score from the product of the square of the query-independent score and the determined content change frequency.
  - 8. The scheduler system of claim 6, wherein the document is assigned to a particular crawl segment using a record in the history log that associates the document identifier with a segment identifier corresponding to the crawl segment.
  - 9. The scheduler system of claim 6, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 10. The scheduler system of claim 9, wherein the threshold value is further determined using a target size of a set of documents to be crawled.
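The segment assignment in claim 8 amounts to storing a segment identifier alongside the document identifier in a history log record. The Python sketch below shows one way this could look. The claims do not say how a segment is chosen, so hashing the URL into a fixed number of segments is an assumption, as are the dictionary-based log and the names `segment_for` and `assign_segment`.

```python
import hashlib


def segment_for(url: str, num_segments: int) -> int:
    """Map a document identifier to a segment identifier (assumed hash-based)."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the mapping is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_segments


def assign_segment(history_log: dict[str, dict], url: str, num_segments: int) -> int:
    """Record the segment identifier in the history log entry for this URL."""
    segment_id = segment_for(url, num_segments)
    history_log.setdefault(url, {})["segment_id"] = segment_id
    return segment_id
```

A stable hash keeps a document in the same segment from one scheduling pass to the next, which is a design choice of this sketch rather than something the claims require.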

11. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the operations of:
- retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network;
  
  for each retrieved document identifier, determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
  
  determining a content change frequency of the corresponding document identifier by comparing information stored for successive downloads of the corresponding document and determining a first score for the document identifier that is a function of both the query-independent score and the determined content change frequency of the corresponding document;
  
  comparing the first score against a threshold value; and
  
  conditionally scheduling the document for indexing based on the result of the comparison.
- Dependent claims (12, 13, 14, 15):
  - 12. A computer-readable storage medium of claim 11, wherein generating the first score includes generating the first score from the product of the square of the query-independent score and the determined content change frequency.
  - 13. A computer-readable storage medium of claim 11, wherein the scheduling of a document for indexing includes assigning the document to a particular crawl segment by associating the document identifier with a segment identifier corresponding to the crawl segment.
  - 14. A computer-readable storage medium of claim 11, wherein the threshold value is determined using a score computed for each document identifier in a sample set of document identifiers.
  - 15. A computer-readable storage medium of claim 14, wherein the threshold value is further determined using a target size of a set of documents to be crawled.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Randall, Keith H.
Primary Examiner(s): Stevens, Robert
Application Number: US12/787,321
Publication Number: US 20100241621A1
Time in Patent Office: 693 Days
Field of Search: 707/709, 707/711, 707/E17.108, 707/705
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
