Document reuse in a search engine crawler
Abstract
Systems and methods are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.
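The reuse decision summarized above can be illustrated with a minimal sketch. It is not the patented implementation. The names `CrawlRecord`, `set_reuse_flags`, `crawl`, and `fetch` are hypothetical, the threshold value is supplied by the caller, and falling back to a download when no cached copy exists is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class CrawlRecord:
    """Hypothetical record of a prior scheduled crawl of one document."""
    url: str
    importance_score: float               # query-independent importance score
    cached_content: Optional[str] = None  # previously downloaded version, if any
    reuse: bool = False                   # reuse flag, set by set_reuse_flags()


def set_reuse_flags(records: Iterable[CrawlRecord], threshold: float) -> None:
    """Flag a document for reuse when its score does not exceed the threshold."""
    for record in records:
        record.reuse = record.importance_score <= threshold


def crawl(records: Iterable[CrawlRecord],
          fetch: Callable[[str], str]) -> Dict[str, str]:
    """Return content per URL, reusing cached copies for flagged documents."""
    results: Dict[str, str] = {}
    for record in records:
        if record.reuse and record.cached_content is not None:
            # Reuse the previously downloaded version; the host is not contacted.
            results[record.url] = record.cached_content
        else:
            # Assumption: download when the score exceeds the threshold,
            # or when there is no earlier copy available to reuse.
            record.cached_content = fetch(record.url)
            results[record.url] = record.cached_content
    return results
```

A caller would first run `set_reuse_flags(records, threshold)` and then `crawl(records, fetch)`, where `fetch` is whatever download function the crawler uses. In this sketch a score equal to the threshold counts as "does not exceed" and therefore leads to reuse.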
19 Claims
1. A method, comprising:
at a computing system having one or more processors and memory storing one or more programs executed by the one or more processors:
retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
9. A computer system comprising:
one or more processors;
memory storing data and one or more programs executed by the one or more processors, the stored data and the one or more programs comprising instructions for:
retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
(Dependent claims: 10, 11, 12, 13, 14, 15, 16)
17. A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors of a computer system, cause the one or more processors to perform operations of:
retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
(Dependent claims: 18, 19)
Specification