Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
First Claim
1. A computer-implemented method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors:
associating with each of a plurality of documents a respective initial web crawl interval;
partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier in the plurality of tiers having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
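The steps recited in the claim (assign an initial interval, partition into tiers by interval range, revise the interval after a crawl, and move the document when the revised interval falls in another tier's range) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the tier boundaries in `TIER_UPPER_BOUNDS`, the multiplicative `factor`, and the names `Scheduler`, `tier_for`, and `record_crawl` are all assumptions introduced here. The sketch also revises relative to the document's previous interval, which equals the initial interval on the first revision.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

# Hypothetical tier boundaries in hours (not from the patent):
# tier 0 covers [0, 24), tier 1 covers [24, 168), tier 2 covers 168 and up.
TIER_UPPER_BOUNDS = [24.0, 168.0]


def tier_for(interval_hours: float) -> int:
    """Return the index of the tier whose range contains the interval."""
    return bisect_right(TIER_UPPER_BOUNDS, interval_hours)


@dataclass
class Scheduler:
    # Current web crawl interval per document, keyed by URL.
    intervals: dict = field(default_factory=dict)
    # Per-tier data identifying the documents assigned to that tier.
    tiers: list = field(
        default_factory=lambda: [set() for _ in range(len(TIER_UPPER_BOUNDS) + 1)]
    )

    def add(self, url: str, initial_interval: float) -> None:
        """Associate an initial interval with a document and place it in a tier."""
        self.intervals[url] = initial_interval
        self.tiers[tier_for(initial_interval)].add(url)

    def record_crawl(self, url: str, content_changed: bool, factor: float = 2.0) -> float:
        """Revise the interval after a crawl and move tiers if needed.

        The interval shrinks when the content changed and grows when it did
        not. Dividing or multiplying by `factor` is an assumed update rule.
        """
        old = self.intervals[url]
        new = old / factor if content_changed else old * factor
        self.intervals[url] = new
        old_tier, new_tier = tier_for(old), tier_for(new)
        if new_tier != old_tier:
            self.tiers[old_tier].discard(url)
            self.tiers[new_tier].add(url)
        return new
```

With these assumed boundaries, a document added at 48 hours starts in tier 1, stays there after one changed crawl (24 hours), and moves to tier 0 after a second changed crawl (12 hours). Keeping a set of documents per tier means a move touches only the two tiers involved.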
Abstract
A method and system are disclosed for associating an appropriate web crawl interval with a document so that the probability of the document's stale content being used by a search engine is below an acceptable level when the search engine crawls the document at its associated web crawl interval. The web crawl interval of a document is determined through an iterative process and updated dynamically by the search engine after every visit to the document by a web crawler. A multi-tier data structure is employed for managing the web crawl order of billions of documents on the Internet. The search engine may move a document from one tier to another if its web crawl interval is changed significantly.
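To make the idea of keeping the probability of stale content "below an acceptable level" concrete, here is a minimal sketch under a strong assumption that the text above does not state: that a document's changes arrive as a Poisson process with a known rate. Under that assumption the probability that the document has changed at least once within a crawl interval t is 1 - exp(-rate * t), which can be solved for the longest interval that stays under a chosen threshold. The function names and the Poisson model itself are hypothetical illustrations, not the method the abstract describes.

```python
import math


def prob_changed(change_rate: float, interval: float) -> float:
    """Probability of at least one change within `interval`.

    Assumes changes follow a Poisson process with `change_rate` changes per
    unit time (an assumption of this sketch, not taken from the source).
    """
    return 1.0 - math.exp(-change_rate * interval)


def max_interval(change_rate: float, acceptable_prob: float) -> float:
    """Longest interval for which prob_changed stays at or below the threshold."""
    if change_rate <= 0:
        raise ValueError("change_rate must be positive")
    if not 0 < acceptable_prob < 1:
        raise ValueError("acceptable_prob must be strictly between 0 and 1")
    # Solve 1 - exp(-rate * t) = p for t; log1p keeps precision for small p.
    return -math.log1p(-acceptable_prob) / change_rate
```

Under this model, doubling the estimated change rate halves the permitted interval, which is consistent with the abstract's point that frequently changing documents need to be crawled more often.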
102 Citations
45 Claims
1. A computer-implemented method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors:
associating with each of a plurality of documents a respective initial web crawl interval;
partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier in the plurality of tiers having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
Dependent claims: 2 to 15
16. A computer system, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- one or more central processing units for executing programs;
memory storing a web crawl order scheduler to be executed by the one or more central processing units;
the web crawl order scheduler comprising instructions for:
associating with each of a plurality of documents a respective initial web crawl interval;
partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
Dependent claims: 17 to 30
31. A non-transitory computer readable storage medium, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, storing one or more programs to be executed by a computer system, the one or more programs comprising instructions for:
- associating with each of a plurality of documents a respective initial web crawl interval;
partitioning the plurality of documents into a plurality of tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
associating a revised web crawl interval with a respective document of the plurality of documents, including updating the web crawl interval of the respective document to be less than the initial web crawl interval when the respective document's content has changed, and updating the web crawl interval to be more than the initial web crawl interval when the respective document's content has not changed; and
moving the respective document between tiers of the plurality of tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the plurality of tiers than a previous web crawl interval of the respective document.
Dependent claims: 32 to 45
Specification