Minimizing visibility of stale content in web searching including revising web crawl intervals of documents
First Claim
1. A method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
- on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors;
associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
for each tier of the multiple tiers, determining a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl interval; and
moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
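The tier-related steps of the claim (partitioning by interval range, ordering within a tier, and moving a document when its revised interval falls in another tier's range) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the three tier boundaries, the `TieredSchedule` class and its method names, and the choice to order a tier by shortest interval first.

```python
from dataclasses import dataclass, field

# Hypothetical tier boundaries: upper bound of each tier's crawl-interval range, in hours.
# Tier 0: up to one day, tier 1: up to one week, tier 2: everything longer.
TIER_UPPER_BOUNDS_HOURS = [24.0, 24.0 * 7, float("inf")]


def tier_for_interval(interval_hours: float) -> int:
    """Return the index of the tier whose interval range contains interval_hours."""
    if not interval_hours > 0:
        raise ValueError("crawl interval must be positive")
    for index, upper_bound in enumerate(TIER_UPPER_BOUNDS_HOURS):
        if interval_hours <= upper_bound:
            return index
    raise ValueError("no tier covers this interval")


@dataclass
class TieredSchedule:
    # One mapping per tier: document URL -> current crawl interval in hours.
    tiers: list = field(
        default_factory=lambda: [{} for _ in TIER_UPPER_BOUNDS_HOURS]
    )

    def assign(self, url: str, interval_hours: float) -> int:
        """Record the document under the tier matching its interval.

        If the document was stored in a different tier, it is removed
        from that tier, which corresponds to moving it between tiers.
        """
        new_tier = tier_for_interval(interval_hours)
        for index, tier in enumerate(self.tiers):
            if index != new_tier:
                tier.pop(url, None)
        self.tiers[new_tier][url] = interval_hours
        return new_tier

    def crawl_order(self, tier_index: int) -> list:
        """Order one tier's documents, shortest interval first (an assumed policy)."""
        tier = self.tiers[tier_index]
        return sorted(tier, key=lambda url: (tier[url], url))
```

In this sketch, calling `assign` again with a revised interval is all that is needed to move a document: the same call both updates the stored interval and relocates the document when the interval crosses a tier boundary.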
Abstract
A method and system is disclosed for associating an appropriate web crawl interval with a document so that the probability of the document's stale content being used by a search engine is below an acceptable level when the search engine crawls the document at its associated web crawl interval. The web crawl interval of a document is determined through an iterative process and updated dynamically by the search engine after every visit to the document by a web crawler. A multi-tier data structure is employed for managing the web crawl order of billions of documents on the Internet. The search engine may move a document from one tier to another if its web crawl interval is changed significantly.
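The abstract says the crawl interval is determined iteratively and updated after every crawler visit, but this section does not give the update rule. The sketch below shows one possible rule under stated assumptions: importance is a score in [0, 1] mapped linearly to an initial interval, and the interval is multiplied by a shrink factor when the content changed and by a growth factor when it did not. The function names, the bounds, and the factors are all hypothetical and are not taken from the patent.

```python
def initial_interval_hours(
    importance: float, min_hours: float = 1.0, max_hours: float = 720.0
) -> float:
    """Map an assumed importance score in [0, 1] to an initial crawl interval.

    More important documents get shorter intervals (linear mapping, assumed).
    """
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must be in [0, 1]")
    return max_hours - importance * (max_hours - min_hours)


def revise_interval_hours(
    current_hours: float,
    content_changed: bool,
    shrink: float = 0.5,
    grow: float = 1.5,
    min_hours: float = 1.0,
    max_hours: float = 720.0,
) -> float:
    """Revise the interval after one crawl (assumed multiplicative rule).

    A detected change shortens the interval so the document is revisited
    sooner; an unchanged document has its interval lengthened. The result
    is clamped to [min_hours, max_hours].
    """
    factor = shrink if content_changed else grow
    return min(max_hours, max(min_hours, current_hours * factor))
```

Applying `revise_interval_hours` after each visit gives the iterative behaviour described: a document that keeps changing converges toward the minimum interval, and one that never changes drifts toward the maximum.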
48 Claims
1. A method for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
on a computer system having one or more processors and memory storing one or more programs for execution by the one or more processors;
associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with the documents' respective web crawl intervals;
for each tier of the multiple tiers, determining a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl interval; and
moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 30
16. A search engine system, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, comprising:
one or more central processing units for executing programs;
memory storing a web crawl order scheduler to be executed by the one or more central processing units;
the web crawl order scheduler comprising:
instructions for associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
instructions for retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
instructions for associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
instructions for partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including instructions for storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with their respective web crawl intervals;
instructions for determining, for each tier of the multiple tiers, a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl intervals; and
instructions for moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
Dependent claims: 17, 18, 19, 20, 21, 22, 31, 32, 33, 34, 35, 36, 37, 38, 39
23. A non-transitory computer readable storage medium, for scheduling documents to be crawled by a search engine in an appropriate order to reduce visibility of stale content in web searching, storing one or more programs to be executed by a computer system, wherein the computer system includes a search engine program, the one or more programs comprising:
instructions for associating with each of a plurality of documents a respective initial web crawl interval, wherein the initial web crawl interval is determined based at least in part on a document's importance metric;
instructions for retrieving each of the plurality of documents from a respective host at one or more times associated with each document's respective initial web crawl interval;
instructions for associating a revised web crawl interval with a respective document of the plurality of documents based on the document's respective initial web crawl interval and any changes to content of the document;
instructions for partitioning the plurality of documents into multiple tiers according to their respective web crawl intervals, each tier having a distinct associated range of web crawl intervals, including instructions for storing data for each tier identifying documents in the plurality of documents assigned to that tier in accordance with their respective web crawl intervals;
instructions for determining, for each tier of the multiple tiers, a web crawl order for the documents assigned to the tier in accordance with their respective revised web crawl intervals; and
instructions for moving a respective document between tiers of the multiple tiers when the respective revised web crawl interval of the respective document is associated with a different tier of the multiple tiers than a previous web crawl interval of the respective document.
Dependent claims: 24, 25, 26, 27, 28, 29, 40, 41, 42, 43, 44, 45, 46, 47, 48
Specification