Scheduler for search engine crawler
Abstract
A scheduler for a search engine crawler includes a history log containing document identifiers (e.g., URLs) corresponding to documents (e.g., web pages) on a network (e.g., the Internet). The scheduler is configured to process each document identifier in a set of the document identifiers by determining a content change frequency of the document corresponding to the document identifier, determining a first score for the document identifier that is a function of the determined content change frequency of the corresponding document, comparing the first score against a threshold value, and scheduling the corresponding document for indexing based on the result of the comparison.
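The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `HistoryEntry` record, the `first_score` and `schedule_for_indexing` functions, the sample change-frequency values, the example URLs, and the threshold of 0.5 are all hypothetical. The abstract says only that the first score is some function of the change frequency and that scheduling depends on a threshold comparison.

```python
from dataclasses import dataclass


@dataclass
class HistoryEntry:
    """Hypothetical history-log record for one document identifier (URL)."""
    url: str
    change_frequency: float  # assumed unit: estimated content changes per day


def first_score(entry: HistoryEntry) -> float:
    # Assumption: the abstract only requires "a function of" the change
    # frequency, so the identity function is used as the simplest stand-in.
    return entry.change_frequency


def schedule_for_indexing(history_log: list[HistoryEntry], threshold: float) -> list[str]:
    """Return the URLs whose first score exceeds the threshold (assumed direction)."""
    scheduled = []
    for entry in history_log:
        if first_score(entry) > threshold:
            scheduled.append(entry.url)
    return scheduled


history_log = [
    HistoryEntry("http://example.com/news", 0.9),
    HistoryEntry("http://example.com/about", 0.05),
]
to_index = schedule_for_indexing(history_log, threshold=0.5)
print(to_index)
```

In this sketch only the frequently changing page clears the threshold, so only that page is scheduled for indexing.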
15 Claims
1. A method of scheduling document indexing, comprising:
at a search engine crawler system having one or more processors and memory storing programs for execution by the one or more processors;
retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network; and
for each retrieved document identifier and its corresponding document,
determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
determining a content change frequency of the corresponding document by comparing information stored for successive downloads of the corresponding document;
determining a first score for the document identifier that is a function of both the determined query-independent score and the determined content change frequency of the corresponding document;
comparing the first score against a threshold value; and
conditionally scheduling the document for indexing based on the result of the comparison.
Dependent claims: 2, 3, 4, 5
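As an illustration of the per-document steps in claim 1, the sketch below estimates a content change frequency by comparing checksums stored for successive downloads and combines it with a query-independent score. The claim does not say what information is stored or how the two quantities are combined. The checksum comparison, the product formula, the direction of the threshold test, and the names `checksum`, `change_frequency`, `first_score`, and `should_schedule` are assumptions made for this example only.

```python
import hashlib


def checksum(content: str) -> str:
    """Fingerprint of one downloaded copy (assumed form of the stored information)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def change_frequency(checksums: list[str]) -> float:
    """Fraction of successive download pairs whose checksums differ."""
    if len(checksums) < 2:
        return 0.0  # no pair of downloads to compare yet
    changes = sum(1 for prev, cur in zip(checksums, checksums[1:]) if prev != cur)
    return changes / (len(checksums) - 1)


def first_score(query_independent_score: float, frequency: float) -> float:
    # Assumption: a simple product; the claim only requires a function of both.
    return query_independent_score * frequency


def should_schedule(query_independent_score: float, checksums: list[str], threshold: float) -> bool:
    """Compare the first score against the threshold (assumed: schedule if above)."""
    return first_score(query_independent_score, change_frequency(checksums)) > threshold


downloads = ["v1", "v1", "v2", "v3"]  # four successive downloads of one page
sums = [checksum(d) for d in downloads]
freq = change_frequency(sums)
print(freq)
print(should_schedule(0.8, sums, threshold=0.5))
```

Here two of the three successive download pairs differ, so a document with a query-independent score of 0.8 clears a threshold of 0.5, while a document with the same history but a much lower score would not.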
6. A scheduler system for a search engine crawler, comprising:
a computer;
a history log containing document identifiers corresponding to documents on a network previously indexed by the search engine crawler, wherein each document identifier has a corresponding document; and
a scheduler, executed by the computer, and configured to process each document identifier in a set of the document identifiers in the history log by
determining a query-independent score indicative of a rank of the corresponding document relative to other documents on the network,
determining a content change frequency of the document corresponding to the document identifier by comparing information stored for successive downloads of the corresponding document,
determining a first score for the document identifier that is a function of both the query-independent score and the determined content change frequency of the corresponding document,
comparing the first score against a threshold value, and
conditionally scheduling the corresponding document for indexing based on the results of the comparison,
wherein the history log and scheduler are stored on computer-readable media.
Dependent claims: 7, 8, 9, 10
11. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to perform the operations of:
retrieving a number of document identifiers, each document identifier identifying a corresponding document on a network;
for each retrieved document identifier,
determining a query-independent score indicative of a rank of the corresponding document relative to other documents in a set of documents;
determining a content change frequency of the corresponding document identifier by comparing information stored for successive downloads of the corresponding document and
determining a first score for the document identifier that is a function of both the query-independent score and the determined content change frequency of the corresponding document;
comparing the first score against a threshold value; and
conditionally scheduling the document for indexing based on the result of the comparison.
Dependent claims: 12, 13, 14, 15
Specification