Document reuse in a search engine crawler

US 10,216,847 B2
Filed: 06/08/2017
Issued: 02/26/2019
Est. Priority Date: 07/03/2003
Status: Active Grant

Abstract

Systems and methods are provided for setting a respective reuse flag for a corresponding document in a plurality of documents based on a query-independent score associated with the corresponding document. A document crawling operation is performed on the plurality of documents in accordance with the reuse flag for respective documents in the plurality of documents. This document crawling operation includes reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer in accordance with a determination that the reuse flag associated with the respective document meets a predefined criterion.
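The flag-then-crawl flow described in the abstract can be sketched roughly as follows. This is an illustration only, not the patented implementation: the `SchedulerRecord` class, the `set_reuse_flags` and `crawl` functions, the 0..1 score scale, and the choice of "flag is true" as the predefined criterion are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class SchedulerRecord:
    url: str
    score: float         # query-independent score (hypothetical 0..1 scale)
    reuse: bool = False  # assumed criterion: True means reuse the stored copy


def set_reuse_flags(records, threshold):
    """Set the reuse flag on every record whose score does not exceed threshold."""
    for rec in records:
        rec.reuse = rec.score <= threshold
    return records


def crawl(records, fetch, cache):
    """Return {url: content}, reusing cached content where the flag is set."""
    out = {}
    for rec in records:
        if rec.reuse and rec.url in cache:
            out[rec.url] = cache[rec.url]   # previously downloaded version
        else:
            out[rec.url] = fetch(rec.url)   # current version from the host
    return out
```

Here `fetch` stands in for whatever downloads a page from its host and `cache` for the store of earlier downloads; a document flagged for reuse but missing from the cache falls back to a download, which is a design choice of this sketch rather than something the abstract states.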

19 Claims

1. A method, comprising:
- at a computing system having one or more processors and memory storing one or more programs executed by the one or more processors:
  
  retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
  
  performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
  - 2. The method of claim 1, wherein the plurality of records for a respective document include:
    - a first record including a first document identifier and a first crawl time for the respective document, and
      
      a second record including a second document identifier and a second crawl time for the respective document,
      
      wherein the first document identifier and the second document identifier correspond to the respective document, and
      
      wherein performing the document crawling operation on the plurality of documents further includes, when the document importance score does not exceed the first threshold:
      
      reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer when a difference between the first crawl time and a second crawl time does not exceed a second threshold, or
      
      downloading a current version of the respective document from the host computer when a difference between the first crawl time and a second crawl time exceeds the second threshold.
  - 3. The method of claim 1, further comprising:
    - retrieving, from a first database, at least a subset of contents of a first document in the plurality of documents, the first document corresponding to a document identifier in a first scheduler record, the first scheduler record including a reuse flag for the first document set to a first state; and
      
      retrieving, from a second database, at least a subset of contents of a second document in the plurality of documents, the second document corresponding to a document identifier in a second scheduler record, the second scheduler record including a reuse flag for the second document set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents, including the second document.
  - 4. The method of claim 3, wherein the first database includes a set of servers and the set of servers is interconnected by a network.
  - 5. The method of claim 3, wherein the first database includes the World Wide Web.
  - 6. The method of claim 3, wherein the first database includes a subset of the World Wide Web.
  - 7. The method of claim 3, further comprising adding the at least a subset of the contents of the first document to the second database.
  - 8. The method of claim 1, wherein the document importance score associated with the document is based, at least in part, on a number of other documents that reference the document.
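The decision logic recited in claims 1 and 2 can be pictured with a small sketch. It is an illustration under stated assumptions, not the claimed implementation: the function name `should_reuse`, the parameter names, the use of numeric timestamps in seconds, and taking the absolute value of the crawl-time difference are all choices made here.

```python
def should_reuse(score, first_crawl_time, second_crawl_time,
                 score_threshold, interval_threshold):
    """Return True to reuse the stored copy, False to download a current one.

    Claim 1: a score above the first threshold always triggers a download.
    Claim 2: otherwise, reuse only when the gap between the two recorded
    crawl times does not exceed the second threshold.
    """
    if score > score_threshold:
        return False
    # Assumption: times are numeric (for example, seconds since the epoch)
    # and the "difference" is taken as an absolute value.
    gap = abs(second_crawl_time - first_crawl_time)
    return gap <= interval_threshold


# Hypothetical values: thresholds of 0.5 (score) and 86400 seconds (one day).
decisions = [
    should_reuse(0.9, 0, 3600, 0.5, 86400),    # important document
    should_reuse(0.1, 0, 3600, 0.5, 86400),    # unimportant, crawled close together
    should_reuse(0.1, 0, 172800, 0.5, 86400),  # unimportant, crawls far apart
]
```

A score exactly equal to the first threshold is treated as "does not exceed", matching the claim wording.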

9. A computer system comprising:
- one or more processors;
  
  memory storing data and one or more programs executed by the one or more processors, the stored data and the one or more programs comprising instructions for:
  
  retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
  
  performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
  - 10. The computer system of claim 9, wherein the plurality of records for a respective document include:
    - a first record including a first document identifier and a first crawl time for the respective document, and
      
      a second record including a second document identifier and a second crawl time for the respective document,
      
      wherein the first document identifier and the second document identifier correspond to the respective document, and
      
      wherein performing the document crawling operation on the plurality of documents further includes, when the document importance score does not exceed the first threshold:
      
      reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer when a difference between the first crawl time and a second crawl time does not exceed a second threshold, or
      
      downloading a current version of the respective document from the host computer when a difference between the first crawl time and a second crawl time exceeds the second threshold.
  - 11. The computer system of claim 9, the stored data and the one or more programs comprising instructions for:
    - retrieving, from a first database, at least a subset of contents of a first document in the plurality of documents, the first document corresponding to a document identifier in a first scheduler record, the first scheduler record including a reuse flag for the first document set to a first state; and
      
      retrieving, from a second database, at least a subset of contents of a second document in the plurality of documents, the second document corresponding to a document identifier in a second scheduler record, the second scheduler record including a reuse flag for the second document set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents, including the second document.
  - 12. The computer system of claim 11, wherein the first database includes a set of servers and the set of servers is interconnected by a network.
  - 13. The computer system of claim 11, wherein the first database includes the World Wide Web.
  - 14. The computer system of claim 11, wherein the first database includes a subset of the World Wide Web.
  - 15. The computer system of claim 11, the stored data and one or more programs further comprising instructions for adding the at least a subset of the contents of the first document to the second database.
  - 16. The computer system of claim 9, wherein the document importance score associated with the document is based, at least in part, on a number of other documents that reference the document.
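Claims 11 through 15 (and their method counterparts, claims 3 through 7) describe routing each retrieval to one of two databases according to the reuse flag in the scheduler record. A minimal sketch of that routing follows. The names `retrieve`, `web`, and `repository`, the dictionary-shaped record, and the mapping of the "first state" to `False` and the "second state" to `True` are assumptions made for illustration.

```python
def retrieve(record, web, repository):
    """Fetch document content from the database selected by the reuse flag.

    record:     dict with 'doc_id' and 'reuse' keys (hypothetical layout)
    web:        callable doc_id -> content, standing in for the first database
    repository: dict doc_id -> content, standing in for the second database
    """
    doc_id = record["doc_id"]
    if record["reuse"]:
        # Second state: read the previously crawled copy.
        return repository[doc_id]
    # First state: download from the first database (for example, the Web).
    content = web(doc_id)
    # Claims 7 and 15: add the downloaded content to the second database.
    repository[doc_id] = content
    return content


repository = {"doc-2": "stored copy of doc-2"}
first = retrieve({"doc_id": "doc-1", "reuse": False},
                 lambda d: "fresh copy of " + d, repository)
second = retrieve({"doc_id": "doc-2", "reuse": True},
                  lambda d: "fresh copy of " + d, repository)
```

Writing the fresh download back into `repository` means a later crawl that flags `doc-1` for reuse can be served without contacting the host.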

17. A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors of a computer system, cause the one or more processors to perform operations of:
- retrieving a plurality of records corresponding to prior scheduled crawls of respective documents in a plurality of documents; and
  
  performing a document crawling operation on the plurality of documents, wherein the document crawling operation includes downloading a current version of a respective document from a host computer based on a determination that a document importance score for the respective document exceeds a first threshold, wherein the document importance score is based on a query-independent metric of an importance of the document for a search engine, or reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer based on a determination that the document importance score does not exceed a first threshold.
  - 18. The non-transitory computer-readable storage medium of claim 17, wherein the plurality of records for a respective document include:
    - a first record including a first document identifier and a first crawl time for the respective document, and
      
      a second record including a second document identifier and a second crawl time for the respective document,
      
      wherein the first document identifier and the second document identifier correspond to the respective document, and
      
      wherein performing the document crawling operation on the plurality of documents further includes, when the document importance score does not exceed the first threshold:
      
      reusing a previously downloaded version of a respective document in the plurality of documents instead of downloading a current version of the respective document from a host computer when a difference between the first crawl time and a second crawl time does not exceed a second threshold, or
      
      downloading a current version of the respective document from the host computer when a difference between the first crawl time and a second crawl time exceeds the second threshold.
  - 19. The non-transitory computer-readable storage medium of claim 17, wherein the instructions, when executed by the one or more processors, cause the one or more processors to perform operations of:
    - retrieving, from a first database, at least a subset of contents of a first document in the plurality of documents, the first document corresponding to a document identifier in a first scheduler record, the first scheduler record including a reuse flag for the first document set to a first state; and
      
      retrieving, from a second database, at least a subset of contents of a second document in the plurality of documents, the second document corresponding to a document identifier in a second scheduler record, the second scheduler record including a reuse flag for the second document set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents, including the second document.
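Claims 8 and 16 state that the document importance score is based, at least in part, on a number of other documents that reference the document. One simple query-independent signal of that kind is an inbound-link count, sketched below. The claims do not give a formula, so the function `inlink_counts`, the `(source, target)` pair representation, and the choice to ignore duplicate and self links are assumptions of this example only.

```python
from collections import Counter


def inlink_counts(links):
    """Count, for each target, the distinct other documents that link to it.

    links: iterable of (source, target) pairs (hypothetical representation).
    Duplicate pairs are counted once and self-links are ignored.
    """
    distinct = set(links)
    return Counter(target for source, target in distinct if source != target)


links = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "c"), ("a", "b")]
counts = inlink_counts(links)
```

Such a count could be compared against the first threshold directly or combined with other signals; either use goes beyond what the claims specify.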

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Zhu, Huican; Acharya, Anurag; Ibel, Max; Gobioff, Howard B.
Primary Examiner(s): Wai, Eric C
Application Number: US15/617,634
Publication Number: US 20180089317A1
Time in Patent Office: 628 Days
Field of Search: None
US Class Current
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
