Minimizing visibility of stale content in web searching including revising web crawl intervals of documents

US 8,407,204 B2
Filed: 06/22/2011
Issued: 03/26/2013
Est. Priority Date: 08/30/2004
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A method and system are disclosed for associating an appropriate web crawl interval with a document so that the probability of the document's stale content being used by a search engine is below an acceptable level when the search engine crawls the document at its associated web crawl interval. The web crawl interval of a document is determined through an iterative process and updated dynamically by the search engine after every visit to the document by a web crawler. A multi-tier data structure is employed for managing the web crawl order of billions of documents on the Internet. The search engine may move a document from one tier to another if its web crawl interval changes significantly.
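The multi-tier structure described in the abstract can be illustrated with a minimal sketch. The tier boundaries, the doubling/halving factor, and all names here (`TIER_BOUNDS_HOURS`, `TieredSchedule`, `record_visit`) are assumptions made for illustration; the patent text shown here does not specify them.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass, field

# Hypothetical tier boundaries in hours (not from the patent):
# tier 0 covers [0, 24), tier 1 covers [24, 168), tier 2 covers 168 and up.
TIER_BOUNDS_HOURS = [24.0, 168.0]


def tier_for(interval_hours: float) -> int:
    """Return the index of the tier whose interval range contains the value."""
    return bisect_right(TIER_BOUNDS_HOURS, interval_hours)


@dataclass
class TieredSchedule:
    # Current crawl interval per document URL.
    intervals: dict[str, float] = field(default_factory=dict)
    # One set of URLs per tier.
    tiers: list[set[str]] = field(
        default_factory=lambda: [set() for _ in range(len(TIER_BOUNDS_HOURS) + 1)]
    )

    def add(self, url: str, initial_interval_hours: float) -> None:
        """Associate an initial interval and file the document in its tier."""
        self.intervals[url] = initial_interval_hours
        self.tiers[tier_for(initial_interval_hours)].add(url)

    def record_visit(self, url: str, content_changed: bool, factor: float = 2.0) -> float:
        """Shorten the interval if content changed, lengthen it otherwise.

        The multiplicative factor is an assumed update rule, chosen only
        to make the interval move in the direction the claims require.
        """
        old = self.intervals[url]
        new = old / factor if content_changed else old * factor
        old_tier, new_tier = tier_for(old), tier_for(new)
        if new_tier != old_tier:
            # The revised interval falls in a different tier's range: move it.
            self.tiers[old_tier].discard(url)
            self.tiers[new_tier].add(url)
        self.intervals[url] = new
        return new
```

A document is moved only when its revised interval crosses a tier boundary, so small adjustments leave the tier membership untouched.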

102 Citations

45 Claims

1. A computer-implemented method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors;
- associating with each of a plurality of documents a respective initial web crawl interval;
- partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier in the plurality of tiers having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
- associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
- moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15):
  - 2. The method of claim 1, further comprising, for a respective tier of the plurality of tiers, scheduling downloads of at least a subset of the documents assigned to the tier in accordance with their respective web crawl intervals.
  - 3. The method of claim 1, wherein associating the revised web crawl interval with the respective document comprises:
    - updating the web crawl interval of the respective document after retrieving a new copy of the respective document's content from its host and detecting content changes to the respective document based on the new copy.
  - 4. The method of claim 3, further comprising:
    - identifying a first average interval between times where the respective document's content has not changed;
    - identifying a second average interval between times where the respective document's content has changed; and
    - updating the web crawl interval of the respective document based on the first average interval and the second average interval.
  - 5. The method of claim 4, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
  - 6. The method of claim 4, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the respective document's content has not changed, and is updated in accordance with the second average interval if the respective document's content has changed.
  - 7. The method of claim 4, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
  - 8. The method of claim 3, wherein the changes to the content of the respective document comprise critical content changes and non-critical content changes, and the computer system considers the critical content changes to the respective document and ignores the non-critical content changes to the document.
  - 9. The method of claim 1, further comprising:
    - determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
    - associating the revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
    - downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
  - 10. The method of claim 1, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
  - 11. The method of claim 1, wherein the initial web crawl interval of the respective document is determined in accordance with a score corresponding to the respective document's popularity.
  - 12. The method of claim 1, wherein the revised web crawl interval of the respective document is smaller than the respective document's content update interval and larger than half of the respective document's content update interval.
  - 13. The method of claim 1, further comprising:
    - dynamically adjusting the revised web crawl interval of the respective document after a new copy of the respective document's content is retrieved.
  - 14. The method of claim 1, wherein the revised web crawl interval of the respective document is a time interval such that the search engine, on average, will retrieve a unique version of document content at least twice according to the revised web crawl interval.
  - 15. The method of claim 1, wherein the revised web crawl interval of the respective document is determined in accordance with the respective document's user click interval when the respective document's user click rate is larger than the respective document's content update rate.
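One possible reading of the update rule in claims 4, 6, and 7 is sketched below. It assumes that the "first average interval" is the mean of past crawl gaps after which the content was found unchanged, and the "second average interval" is the mean of gaps after which it was found changed. That interpretation, the function name `revise_interval`, and the fallback for empty history are assumptions. Claim 5 describes an alternative in which the interval is simply left alone when the first average exceeds the second.

```python
from statistics import mean
from typing import Sequence


def revise_interval(
    current: float,
    unchanged_gaps: Sequence[float],
    changed_gaps: Sequence[float],
    content_changed: bool,
) -> float:
    """Return a revised crawl interval from two running averages.

    unchanged_gaps: past crawl gaps after which content was unchanged.
    changed_gaps:   past crawl gaps after which content had changed.
    """
    if not unchanged_gaps or not changed_gaps:
        # Assumption: without both kinds of history, keep the interval.
        return current

    first_avg = mean(unchanged_gaps)
    second_avg = mean(changed_gaps)

    if first_avg < second_avg:
        # Claim 7: use the average of the two averages.
        return (first_avg + second_avg) / 2
    if first_avg > second_avg:
        # Claim 6 (one reading): follow the average that matches
        # what the latest crawl observed.
        return second_avg if content_changed else first_avg
    # Equal averages are not addressed by the claims; keep the interval.
    return current
```

For example, with unchanged gaps of 20 and 40 hours and changed gaps of 5 and 15 hours, an unchanged crawl yields 30 hours and a changed crawl yields 10 hours.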

16. A computer system, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- one or more central processing units for executing programs;
- memory storing a web crawl order scheduler to be executed by the one or more central processing units;
- the web crawl order scheduler comprising instructions for:
  - associating with each of a plurality of documents a respective initial web crawl interval;
  - partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
  - associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
  - moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
- Dependent claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30):
  - 17. The system of claim 16, further comprising instructions for:
    - for a respective tier of the plurality of tiers, scheduling downloads of at least a subset of the documents assigned to the tier in accordance with their respective web crawl intervals.
  - 18. The system of claim 16, wherein instructions for associating the revised web crawl interval with the respective document further comprise instructions for:
    - updating the web crawl interval of the respective document after retrieving a new copy of the respective document's content from its host and detecting content changes to the respective document based on the new copy.
  - 19. The system of claim 18, further comprising instructions for:
    - identifying a first average interval between times where the respective document's content has not changed;
    - identifying a second average interval between times where the respective document's content has changed; and
    - updating the web crawl interval of the respective document based on the first average interval and the second average interval.
  - 20. The system of claim 19, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
  - 21. The system of claim 19, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the respective document's content has not changed, and is updated in accordance with the second average interval if the respective document's content has changed.
  - 22. The system of claim 19, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
  - 23. The system of claim 18, wherein the changes to the content of the respective document comprise critical content changes and non-critical content changes, and the instructions further comprise instructions for considering the critical content changes to the respective document and ignoring the non-critical content changes to the document.
  - 24. The system of claim 16, wherein the web crawl order scheduler further comprises instructions for:
    - determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
    - associating the revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
    - downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
  - 25. The system of claim 16, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
  - 26. The system of claim 16, wherein the initial web crawl interval of the respective document is determined in accordance with a score corresponding to the respective document's popularity.
  - 27. The system of claim 16, wherein the revised web crawl interval of the respective document is smaller than the respective document's content update interval and larger than half of the respective document's content update interval.
  - 28. The system of claim 16, further comprising instructions for:
    - dynamically adjusting the revised web crawl interval of the respective document after a new copy of the respective document's content is retrieved.
  - 29. The system of claim 16, wherein the revised web crawl interval of the respective document is a time interval such that the search engine, on average, will retrieve a unique version of document content at least twice according to the revised web crawl interval.
  - 30. The system of claim 16, wherein the revised web crawl interval of the respective document is determined in accordance with the respective document's user click interval when the respective document's user click rate is larger than the respective document's content update rate.
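Claims 8 and 23 distinguish critical from non-critical content changes but, in the text shown here, do not say how the two are told apart. The sketch below is one hypothetical approach: strip material assumed to be non-critical (HTML comments, dates, clock times) before fingerprinting the page, so that only the remaining content can register as a change. The pattern list and function names are illustrative assumptions, not the patent's method.

```python
import hashlib
import re

# Hypothetical examples of non-critical content; a real system would
# need its own definition of what may be ignored.
NON_CRITICAL_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),        # HTML comments
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),        # ISO-style dates
    re.compile(r"\b\d{1,2}:\d{2}(:\d{2})?\b"),   # clock times
]


def critical_fingerprint(html: str) -> str:
    """Hash the page after removing content assumed to be non-critical."""
    text = html
    for pattern in NON_CRITICAL_PATTERNS:
        text = pattern.sub("", text)
    # Collapse whitespace so formatting-only edits do not count.
    text = " ".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_critical_change(old_html: str, new_html: str) -> bool:
    """True if the two copies differ after non-critical content is removed."""
    return critical_fingerprint(old_html) != critical_fingerprint(new_html)
```

Under these assumptions, a page whose only difference is a "served at" timestamp comment would be treated as unchanged, and its crawl interval would be lengthened rather than shortened.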

31. A non-transitory computer readable storage medium, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, storing one or more programs to be executed by a computer system, the one or more programs comprising instructions for:
- associating with each of a plurality of documents a respective initial web crawl interval;
- partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
- associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
- moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
- Dependent claims (32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45):
  - 32. The computer readable storage medium of claim 31, further comprising instructions for:
    - for a respective tier of the plurality of tiers, scheduling downloads of at least a subset of the documents assigned to the tier in accordance with their respective web crawl intervals.
  - 33. The computer readable storage medium of claim 31, wherein instructions for associating the revised web crawl interval with the respective document further comprise instructions for:
    - updating the web crawl interval of the respective document after retrieving a new copy of the respective document's content from its host and detecting content changes to the respective document based on the new copy.
  - 34. The computer readable storage medium of claim 33, further comprising instructions for:
    - identifying a first average interval between times where the respective document's content has not changed;
    - identifying a second average interval between times where the respective document's content has changed; and
    - updating the web crawl interval of the respective document based on the first average interval and the second average interval.
  - 35. The computer readable storage medium of claim 34, wherein the web crawl interval is not updated if the first average interval is greater than the second average interval.
  - 36. The computer readable storage medium of claim 34, wherein the web crawl interval is updated in accordance with the first average interval if the first average interval is greater than the second average interval and the respective document's content has not changed, and is updated in accordance with the second average interval if the respective document's content has changed.
  - 37. The computer readable storage medium of claim 34, wherein the web crawl interval is updated in accordance with an average between the first average interval and the second average interval if the first average interval is less than the second average interval.
  - 38. The computer readable storage medium of claim 33, wherein the changes to the content of the respective document comprise critical content changes and non-critical content changes, and the instructions further comprise instructions for considering the critical content changes to the respective document and ignoring the non-critical content changes to the document.
  - 39. The computer readable storage medium of claim 31, further comprising instructions for:
    - determining for respective documents of the plurality of documents, content update rates of the respective documents, user click rates of the respective documents, and at least one document importance metric of the respective documents;
    - associating the revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval, any changes to content of the document, the user click rate of the document, and the at least one document importance metric of the document; and
    - downloading and recording new copies of at least a subset of the documents in accordance with the determined web crawl order.
  - 40. The computer readable storage medium of claim 31, wherein the initial web crawl interval of a document assigned to a respective tier is determined based at least in part on an average web crawl interval of all other documents assigned to the respective tier.
  - 41. The computer readable storage medium of claim 31, wherein the initial web crawl interval of the respective document is determined in accordance with a score corresponding to the respective document's popularity.
  - 42. The computer readable storage medium of claim 31, wherein the revised web crawl interval of the respective document is smaller than the respective document's content update interval and larger than half of the respective document's content update interval.
  - 43. The computer readable storage medium of claim 31, further comprising instructions for:
    - dynamically adjusting the revised web crawl interval of the respective document after a new copy of the respective document's content is retrieved.
  - 44. The computer readable storage medium of claim 31, wherein the revised web crawl interval of the respective document is a time interval such that the search engine, on average, will retrieve a unique version of document content at least twice according to the revised web crawl interval.
  - 45. The computer readable storage medium of claim 31, wherein the revised web crawl interval of the respective document is determined in accordance with the respective document's user click interval when the respective document's user click rate is larger than the respective document's content update rate.
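The interval bounds in claims 12, 27, and 42 and the click-based rule in claims 15, 30, and 45 can be combined in a small sketch. It follows the claim wording literally: a higher click rate than update rate is treated as a shorter click interval than update interval, and "in accordance with" is taken to mean using the click interval directly. Both of these choices, the `fraction` parameter, and the function name are assumptions.

```python
def choose_crawl_interval(
    update_interval: float,
    click_interval: float,
    fraction: float = 0.75,
) -> float:
    """Pick a revised crawl interval from update and click intervals.

    update_interval: average time between content updates.
    click_interval:  average time between user clicks on the document.
    fraction:        assumed position between half and the full update
                     interval; must lie strictly between 0.5 and 1.0.
    """
    if not 0.5 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0.5 and 1.0")
    if update_interval <= 0 or click_interval <= 0:
        raise ValueError("intervals must be positive")

    if click_interval < update_interval:
        # Click rate exceeds update rate (claims 15/30/45, literal reading):
        # let the click interval determine the crawl interval.
        return click_interval

    # Otherwise stay within the bounds of claims 12/27/42:
    # larger than half the update interval, smaller than the whole.
    return fraction * update_interval
```

With a 10-hour update interval and a 20-hour click interval, the default fraction gives 7.5 hours, which lies between the 5-hour and 10-hour bounds of claim 12.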

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Carver, Anton P. T.
Primary Examiner(s): Ly, Anh
Application Number: US13/166,757
Publication Number: US 20110258176A1
Time in Patent Office: 643 Days
Field of Search: 707/709, 707/706, 707/803, 707/708, 707/710, 707/E17.116, 707/E17.117, 707/E17.108, 709/217, 709/224
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
