Scheduler for search engine crawler
Abstract
A scheduler for a search engine crawler includes a history log containing document identifiers (e.g., URLs) corresponding to documents (e.g., web pages) on a network (e.g., Internet). The scheduler is configured to process each document identifier in a set of the document identifiers by determining a content change frequency of the document corresponding to the document identifier, determining a first score for the document identifier that is a function of the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and scheduling the corresponding document for indexing based on the results of the comparison.
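As an illustration of the change-frequency step, the following is a minimal sketch that assumes the history log keeps, for each download of a document, a timestamp and a content checksum. The record layout, the function name `change_frequency`, and the "changes per day" unit are assumptions made for this example; the abstract does not specify them.

```python
from typing import List, Tuple


def change_frequency(history: List[Tuple[float, str]]) -> float:
    """Estimate content changes per day from a document's download history.

    `history` holds (download_time_in_days, content_checksum) pairs, oldest
    first. This layout is hypothetical, chosen only for illustration.
    """
    if len(history) < 2:
        return 0.0  # a single download gives nothing to compare against
    # Count successive download pairs whose checksums differ.
    changes = sum(
        1 for (_, prev), (_, cur) in zip(history, history[1:]) if prev != cur
    )
    elapsed = history[-1][0] - history[0][0]
    return changes / elapsed if elapsed > 0 else 0.0


# Hypothetical log: four downloads over eight days, two observed changes.
log = [(0.0, "a1"), (2.0, "a1"), (4.0, "b7"), (8.0, "c3")]
rate = change_frequency(log)
```

Comparing checksums rather than full page bodies keeps the log small, at the cost of treating any byte-level difference as a change.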
18 Claims
1. A method of scheduling document indexing, comprising:
at a search engine crawler system having one or more processors and memory storing programs for execution by the one or more processors;
retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
for each retrieved document identifier and its corresponding document,
determining a query-independent score indicative of a page rank of the corresponding document relative to other documents in a set of documents;
determining a content change frequency of the corresponding document by comparing information stored for successive downloads of the corresponding document;
determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document by the crawler system;
determining a first score for the document identifier that is a function of the determined query-independent score and the determined content change frequency and the determined age of the corresponding document;
comparing the first score against a threshold value; and
conditionally scheduling the document for indexing based on the result of the comparison.
Dependent claims: 2, 3, 4, 5, 6
7. A scheduler system for a search engine crawler, comprising:
a computer;
a history log containing document identifiers corresponding to documents on a network previously indexed by the search engine crawler, wherein each document identifier has a corresponding document; and
a scheduler, executed by the computer, and configured to process each document identifier in a set of the document identifiers in the history log by
determining a query-independent score indicative of a page rank of the corresponding document relative to other documents on the network,
determining a content change frequency of the document corresponding to the document identifier by comparing information stored for successive downloads of the corresponding document,
determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document by the crawler,
determining a first score for the document identifier that is a function of the query-independent score and the determined content change frequency and the determined age of the corresponding document,
comparing the first score against a threshold value, and
conditionally scheduling the corresponding document for indexing based on the results of the comparison,
wherein the history log and scheduler are stored on computer-readable media.
Dependent claims: 8, 9, 10, 11, 12
13. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the operations of:
retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network;
for each retrieved document identifier,
determining a query-independent score indicative of a page rank of the corresponding document relative to other documents in a set of documents;
determining a content change frequency of the corresponding document identifier by comparing information stored for successive downloads of the corresponding document,
determining an age of the corresponding document, wherein the age is associated with the time of the last download of the corresponding document, and
determining a first score for the document identifier that is a function of the query-independent score and the determined content change frequency and the determined age of the corresponding document;
comparing the first score against a threshold value; and
conditionally scheduling the document for indexing based on the result of the comparison.
Dependent claims: 14, 15, 16, 17, 18
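The scoring and threshold steps shared by the independent claims can be sketched as follows. The claims say only that the first score is "a function of" the query-independent score, the content change frequency, and the age. The product used here is an assumption chosen for illustration, as are the names `DocRecord`, `first_score`, and `schedule_for_indexing`, the units, and the direction of the threshold comparison.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DocRecord:
    """Hypothetical per-document inputs to the scheduler."""
    url: str                 # document identifier
    page_rank: float         # query-independent score, assumed to lie in [0, 1]
    change_frequency: float  # estimated content changes per day
    age_days: float          # days since the crawler last downloaded the document


def first_score(doc: DocRecord) -> float:
    # Assumed combination: a plain product of the three inputs.
    # The claims do not state the actual form of the function.
    return doc.page_rank * doc.change_frequency * doc.age_days


def schedule_for_indexing(docs: Iterable[DocRecord], threshold: float) -> List[str]:
    # Conditionally schedule: keep only identifiers whose score meets the threshold.
    return [d.url for d in docs if first_score(d) >= threshold]


docs = [
    DocRecord("http://example.com/news", page_rank=0.8, change_frequency=0.5, age_days=4.0),
    DocRecord("http://example.com/about", page_rank=0.2, change_frequency=0.1, age_days=2.0),
]
scheduled = schedule_for_indexing(docs, threshold=1.0)
```

With these made-up values, only the frequently changing, higher-ranked page meets the threshold; the rarely changing page waits for a later pass.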
Specification