Document reuse in a search engine crawler
First Claim
1. A method of scheduling a document for crawling by a search engine crawler, comprising:
at a server system having one or more processors and memory storing one or more programs executed by the one or more processors:
retrieving a plurality of records corresponding to prior scheduled crawls of the document, each record comprising a set of entries, the set of entries comprising a document identifier, a crawl time, and a crawl type;
setting a reuse flag for the document based, at least in part, on information in the retrieved plurality of records;
outputting the reuse flag to a scheduler record, the scheduler record comprising a document identifier and the reuse flag; and
when performing a document crawling operation, and the reuse flag for the document has a predefined reuse value, reusing a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
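The claim leaves the flag-setting rule open beyond saying it depends on the retrieved records. The retrieve, set, and output steps might be sketched in Python as follows. The names `CrawlRecord`, `SchedulerRecord`, and `schedule`, the two crawl-type values, and the seven-day freshness rule are all assumptions made for illustration; none of them comes from the claim.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class CrawlType(Enum):
    # Hypothetical values; the claim only says a record has "a crawl type".
    DOWNLOAD = "download"
    REUSE = "reuse"


@dataclass(frozen=True)
class CrawlRecord:
    doc_id: str
    crawl_time: float  # seconds since the epoch
    crawl_type: CrawlType


@dataclass(frozen=True)
class SchedulerRecord:
    doc_id: str
    reuse: bool


def schedule(doc_id: str, log: Iterable[CrawlRecord], now: float,
             max_age_s: float = 7 * 86400) -> SchedulerRecord:
    # Retrieve the records of prior crawls of this document.
    records = [r for r in log if r.doc_id == doc_id]
    # Assumed rule: reuse only if a real download happened recently enough.
    downloads = [r.crawl_time for r in records
                 if r.crawl_type is CrawlType.DOWNLOAD]
    reuse = bool(downloads) and now - max(downloads) <= max_age_s
    # Output the flag to a scheduler record.
    return SchedulerRecord(doc_id=doc_id, reuse=reuse)
```

Under this assumed rule, a document with no recorded download is never flagged for reuse, so it is always fetched from its host server at least once.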
Abstract
A search engine crawler includes a scheduler for determining which documents to download from their respective host servers. Some documents, known to be stable based on one or more records from prior crawls, are reused from a document repository. A reuse flag is set in a scheduler record that also contains a document identifier, the reuse flag indicating whether the document should be retrieved from a first database, such as the World Wide Web, or a second database, such as a document repository. A set of such scheduler records is used during a crawl by the search engine crawler to determine which database to use when retrieving the documents identified in the scheduler records.
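How a crawler might consult the scheduler records to pick a source can be sketched in Python. The function `crawl`, the `fetch_from_web` callable, and the dictionary standing in for the document repository are hypothetical. Falling back to the web when the repository holds no stored copy is also an assumption, not something the abstract states.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class SchedulerRecord(NamedTuple):
    doc_id: str
    reuse: bool


def crawl(records: Iterable[SchedulerRecord],
          fetch_from_web: Callable[[str], str],
          repository: Dict[str, str]) -> Dict[str, str]:
    """Return doc_id -> content, choosing the source per scheduler record."""
    results: Dict[str, str] = {}
    for rec in records:
        if rec.reuse and rec.doc_id in repository:
            # Second database: reuse the previously downloaded copy.
            content = repository[rec.doc_id]
        else:
            # First database: download from the host server, then store
            # the copy so a later crawl can reuse it.
            content = fetch_from_web(rec.doc_id)
            repository[rec.doc_id] = content
        results[rec.doc_id] = content
    return results
```

Passing the fetcher in as a callable keeps the sketch independent of any particular HTTP library.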
37 Claims
1. A method of scheduling a document for crawling by a search engine crawler, comprising:
at a server system having one or more processors and memory storing one or more programs executed by the one or more processors:
retrieving a plurality of records corresponding to prior scheduled crawls of the document, each record comprising a set of entries, the set of entries comprising a document identifier, a crawl time, and a crawl type;
setting a reuse flag for the document based, at least in part, on information in the retrieved plurality of records;
outputting the reuse flag to a scheduler record, the scheduler record comprising a document identifier and the reuse flag; and
when performing a document crawling operation, and the reuse flag for the document has a predefined reuse value, reusing a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
Dependent Claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 35
14. A server system for retrieving a document for indexing by a search engine crawler, comprising:
one or more processors;
memory storing data and one or more programs for execution by the one or more processors;
the stored data and one or more programs comprising:
a log comprising at least a first record corresponding to the document; and
a decision module configured to set the state of a reuse flag corresponding to the document based, at least in part, on the first record;
wherein the at least the first record includes a first document identifier, a first document score, a first crawl time, and a first crawl type; and
a search engine crawler module that, when performing a document crawling operation, and the reuse flag corresponding to the document has a predefined reuse value, reuses a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
Dependent Claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 36
24. A non-transitory computer-readable medium having stored thereon instructions which, when executed by one or more processors of a computer system, cause the one or more processors to perform the operations of:
retrieving a first record corresponding to a document, the first record comprising a set of entries, the set of entries comprising a first document identifier, a first document score, a first crawl time, and a first crawl type;
setting a reuse flag corresponding to the document based, at least in part, on a subset of the set of entries;
outputting the reuse flag for the document to a scheduler record, the scheduler record comprising a document identifier and the reuse flag; and
performing a document crawling operation, including, when performing the document crawling operation, and the reuse flag for the document has a predefined reuse value, reusing a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
Dependent Claims: 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 37
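The indexing step recited in the claims pairs potential search terms with the documents that contain them. A minimal Python sketch of such an index follows, assuming whitespace tokenization and a hypothetical `build_index` helper; the claims specify neither. A document flagged for reuse would reach this step with its previously downloaded text rather than a fresh copy.

```python
from collections import defaultdict
from typing import Dict, Mapping, Set


def build_index(documents: Mapping[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the set of doc_ids containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)
```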
Specification