Scheduler for search engine crawler

US 8,775,403 B2
Filed: 04/17/2012
Issued: 07/08/2014
Est. Priority Date: 07/03/2003
Status: Active Grant

Abstract

A scheduler for a search engine crawler includes a history log containing document identifiers (e.g., URLs) corresponding to documents (e.g., web pages) on a network (e.g., Internet). The scheduler is configured to process each document identifier in a set of the document identifiers by determining a content change frequency of the document corresponding to the document identifier, determining a first score for the document identifier that is a function of the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and scheduling the corresponding document for indexing based on the results of the comparison.
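
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `HistoryEntry` record, the function names, the use of the change frequency itself as the first score, and the "at or above the threshold" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """Hypothetical history-log record for one document identifier."""
    url: str
    change_frequency: float  # assumed unit: observed content changes per day


def first_score(entry: HistoryEntry) -> float:
    # Assumption: the first score is the change frequency itself.
    return entry.change_frequency


def schedule_for_indexing(history_log, threshold):
    """Return the URLs whose first score meets the threshold (assumed rule)."""
    scheduled = []
    for entry in history_log:
        if first_score(entry) >= threshold:
            scheduled.append(entry.url)
    return scheduled


log = [
    HistoryEntry("http://example.com/news", 2.0),
    HistoryEntry("http://example.com/about", 0.1),
]
print(schedule_for_indexing(log, threshold=1.0))
```

With the sample log above, only the frequently changing page passes the comparison and is scheduled.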

42 Claims

1. A method of scheduling document indexing, comprising:
- at a computing system having one or more processors and memory storing programs for execution by the one or more processors:
  
  retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
  
  for each retrieved document identifier and its corresponding document, determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
  
  determining a first score for the document identifier that is a function of the determined query-independent score, a determined content change frequency of the corresponding document, and an age of the corresponding document;
  
  comparing the first score against a threshold value thereby obtaining a result, wherein the threshold value is a function of a speed of the engine crawler system; and
  
  conditionally scheduling the corresponding document for indexing based on the result.
  - 2. The method of claim 1, wherein the first score for the document identifier is a function of the determined query-independent score.
  - 3. The method of claim 1, wherein the first score for the document identifier is a function of the determined content change frequency of the corresponding document.
  - 4. The method of claim 3, wherein the content change frequency of the corresponding document is determined by computing content checksums stored in a history log for successive downloads of the corresponding document.
  - 5. The method of claim 3, wherein the function is the identity function.
  - 6. The method of claim 1, wherein the first score for the document identifier is a function of an age of the corresponding document.
  - 7. The method of claim 6, wherein the age of the corresponding document is determined by computing time since last download of the corresponding document by the search engine crawler system.
  - 8. The method of claim 6, wherein the function is the identity function.
  - 9. The method of claim 1, wherein the function is a product of two factors selected from the group consisting of the determined query-independent score, the determined content change frequency of the corresponding document, and the age of the corresponding document.
  - 10. The method of claim 9, wherein a factor is raised to a power.
  - 11. The method of claim 1, wherein the conditionally scheduling the corresponding document for indexing based on the result assigns the document to a periodic crawl layer when the result is deemed satisfactory.
  - 12. The method of claim 11, wherein the periodic crawl layer is a daily crawl layer.
  - 13. The method of claim 11, wherein the periodic crawl layer is a real-time crawl layer.
  - 14. The method of claim 11, the method further comprising assigning a crawl frequency to the corresponding document.
  - 15. The method of claim 14, wherein the crawl frequency is computed based on a historical change frequency of the corresponding document and the query-independent score of the corresponding document.
  - 16. The method of claim 11, the method further comprising assigning a crawl score to the corresponding document.
  - 17. The method of claim 16, wherein the crawl score is determined by a function of the query-independent score, a determined content change frequency of the corresponding document, an age of the corresponding document or a crawl history of the corresponding document.
  - 18. The method of claim 16, wherein the crawl score is determined by a function of one or more features selected from the group consisting of the query-independent score, a determined content change frequency of the corresponding document, an age of the corresponding document and a crawl history of the corresponding document.
  - 19. The method of claim 11, the method further comprising calculating a keep score for the corresponding document, where the corresponding document is removed from the periodic crawl layer when the keep score is sufficiently low relative to the keep score of other documents in the periodic crawl layer.
  - 20. The method of claim 19, wherein the keep score of the corresponding document is set to equal the query-independent score of the corresponding document.
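
Claims 4, 9, and 10, together with the threshold limitation of claim 1, can be sketched as follows. This is a hedged illustration only: the claims do not specify how checksum comparisons become a frequency, which factors are multiplied, or how the threshold depends on crawler speed. The function names, the "fraction of changed downloads" estimate, and the inverse relation between speed and threshold are all assumptions.

```python
def change_frequency(checksums):
    """Estimate change frequency from checksums of successive downloads.

    Assumption: frequency is the fraction of consecutive download pairs
    whose checksums differ.
    """
    if len(checksums) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(checksums, checksums[1:]) if prev != cur)
    return changes / (len(checksums) - 1)


def product_score(factor_a, factor_b, power=1.0):
    """Product of two factors, with the second raised to a power."""
    return factor_a * factor_b ** power


def threshold_for_speed(pages_per_second, base=100.0):
    """Hypothetical threshold: a faster crawler admits lower-scoring documents."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return base / pages_per_second


history = ["a1", "a1", "b7", "b7", "c3"]    # checksums from five downloads
freq = change_frequency(history)            # two changes over four intervals
score = product_score(0.8, freq, power=2.0) # 0.8 stands in for a query-independent score
print(freq, score, threshold_for_speed(50.0))
```

Raising a factor to a power, as in claim 10, lets one signal dominate the product; here the exponent of 2 penalizes documents that rarely change.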

21. A computing system, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
  
  for each retrieved document identifier and its corresponding document, determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
  
  determining a first score for the document identifier that is a function of the determined query-independent score, a determined content change frequency of the corresponding document, and an age of the corresponding document;
  
  comparing the first score against a threshold value thereby obtaining a result, wherein the threshold value is a function of a speed of the engine crawler system; and
  
  conditionally scheduling the corresponding document for indexing based on the result.
  - 22. The system of claim 21, wherein the first score for the document identifier is a function of the determined query-independent score.
  - 23. The system of claim 21, wherein the first score for the document identifier is a function of the determined content change frequency of the corresponding document.
  - 24. The system of claim 23, wherein the content change frequency of the corresponding document is determined by computing content checksums stored in a history log for successive downloads of the corresponding document.
  - 25. The system of claim 23, wherein the function is the identity function.
  - 26. The system of claim 21, wherein the first score for the document identifier is a function of an age of the corresponding document.
  - 27. The system of claim 26, wherein the age of the corresponding document is determined by computing time since last download of the corresponding document by the search engine crawler system.
  - 28. The system of claim 26, wherein the function is the identity function.
  - 29. The system of claim 21, wherein the function is a product of two factors selected from the group consisting of the determined query-independent score, the determined content change frequency of the corresponding document, and the age of the corresponding document.
  - 30. The system of claim 29, wherein a factor is raised to a power.
  - 31. The system of claim 21, wherein the conditionally scheduling the corresponding document for indexing based on the result assigns the document to a periodic crawl layer when the result is deemed satisfactory.
  - 32. The system of claim 31, wherein the periodic crawl layer is a daily crawl layer.
  - 33. The system of claim 31, wherein the periodic crawl layer is a real-time crawl layer.
  - 34. The system of claim 31, the method further comprising assigning a crawl frequency to the corresponding document.
  - 35. The system of claim 34, wherein the crawl frequency is computed based on a historical change frequency of the corresponding document and the query-independent score of the corresponding document.
  - 36. The system of claim 31, the method further comprising assigning a crawl score to the corresponding document.
  - 37. The system of claim 36, wherein the crawl score is determined by a function of the query-independent score, a determined content change frequency of the corresponding document, an age of the corresponding document or a crawl history of the corresponding document.
  - 38. The system of claim 36, wherein the crawl score is determined by a function of one or more features selected from the group consisting of the query-independent score, a determined content change frequency of the corresponding document, an age of the corresponding document and a crawl history of the corresponding document.
  - 39. The system of claim 31, the method further comprising calculating a keep score for the corresponding document, where the corresponding document is removed from the periodic crawl layer when the keep score is sufficiently low relative to the keep score of other documents in the periodic crawl layer.
  - 40. The system of claim 39, wherein the keep score of the corresponding document is set to equal the query-independent score of the corresponding document.
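
The keep-score eviction of claims 39 and 40 (and the parallel method claims) might look like the sketch below. The claims only say a document is removed when its keep score is "sufficiently low relative to" the others; interpreting that as "falls outside the top `capacity` documents" is an assumption, as are the function name and the dictionary layout.

```python
def evict_low_keep_scores(layer, capacity):
    """Split a periodic crawl layer into kept and evicted documents.

    layer maps each URL to its keep score, set equal to the document's
    query-independent score as in claim 40. capacity is a hypothetical
    limit on the number of documents the layer may hold.
    """
    ranked = sorted(layer, key=lambda url: layer[url], reverse=True)
    return ranked[:capacity], ranked[capacity:]


daily_layer = {
    "http://example.com/a": 0.9,
    "http://example.com/b": 0.1,
    "http://example.com/c": 0.5,
}
kept, evicted = evict_low_keep_scores(daily_layer, capacity=2)
print(kept, evicted)
```

A relative cutoff of this kind keeps the layer size bounded regardless of how the absolute scores drift over time.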

41. A non-transitory computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computer system with one or more processors, cause the computer system to:
- retrieve a number of document identifiers, each document identifier identifying a corresponding document on a network; and
  
  for each retrieved document identifier and its corresponding document, determine a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
  
  determine a first score for the document identifier that is a function of the determined query-independent score, a determined content change frequency of the corresponding document, and an age of the corresponding document;
  
  compare the first score against a threshold value thereby obtaining a result, wherein the threshold value is a function of a speed of the engine crawler system; and
  
  conditionally schedule the corresponding document for indexing based on the result.
  - 42. The non-transitory computer readable storage medium of claim 41, wherein the first score for the document identifier is a function of the determined query-independent score.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Randall, Keith H.
Primary Examiner(s): Alam, Hosain
Assistant Examiner(s): Abraham, Ahmed M
Application Number: US13/449,228
Publication Number: US 20120317089A1
Time in Patent Office: 812 Days
Field of Search: 707/705, 707/706, 707/709, 707/710, 707/711, 707/712
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
