Document reuse in a search engine crawler

US 8,707,312 B1
Filed: 06/30/2004
Issued: 04/22/2014
Est. Priority Date: 07/03/2003
Status: Active Grant


Abstract

A search engine crawler includes a scheduler for determining which documents to download from their respective host servers. Some documents, known to be stable based on one or more records from prior crawls, are reused from a document repository. A reuse flag is set in a scheduler record that also contains a document identifier, the reuse flag indicating whether the document should be retrieved from a first database, such as the World Wide Web, or a second database, such as a document repository. A set of such scheduler records is used during a crawl by the search engine crawler to determine which database to use when retrieving the documents identified in the scheduler records.
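
The scheduler record described in the abstract can be pictured with a minimal Python sketch. The names `SchedulerRecord` and `fetch`, the use of dictionaries to stand in for the host servers and the document repository, and the fallback to a download when no stored copy exists are all assumptions made for illustration; none of them come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerRecord:
    doc_id: str  # document identifier
    reuse: bool  # reuse flag: True means "take the stored copy"


def fetch(record, web, repository):
    """Return (source_name, content) for one scheduler record.

    `web` and `repository` are plain dicts standing in for the first and
    second databases (hypothetical stand-ins, not real services).
    """
    if record.reuse and record.doc_id in repository:
        return "repository", repository[record.doc_id]
    # Assumption: fall back to a download if no stored copy is available.
    content = web[record.doc_id]
    repository[record.doc_id] = content  # keep a copy for later crawls
    return "web", content


web = {"a": "fresh a", "b": "fresh b"}
repository = {"a": "stored a"}
schedule = [SchedulerRecord("a", True), SchedulerRecord("b", False)]
results = [fetch(r, web, repository) for r in schedule]
```

Here document `a` is served from the repository and document `b` is downloaded and then stored, so the flag alone decides which database each request goes to.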

37 Claims

1. A method of scheduling a document for crawling by a search engine crawler, comprising:
- at a server system having one or more processors and memory storing one or more programs executed by the one or more processors:
  
  retrieving a plurality of records corresponding to prior scheduled crawls of the document, each record comprising a set of entries, the set of entries comprising a document identifier, a crawl time, and a crawl type;
  
  setting a reuse flag for the document based, at least in part, on information in the retrieved plurality of records;
  
  outputting the reuse flag to a scheduler record, the scheduler record comprising a document identifier and the reuse flag; and
  
  when performing a document crawling operation, and the reuse flag for the document has a predefined reuse value, reusing a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
  
  at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 35):
  - 2. The method of claim 1, wherein setting the reuse flag is based, at least in part, on a document score associated with the document.
  - 3. The method of claim 1, wherein the retrieved records each further comprises a content summary, the method further including:
    - retrieving a first record, the first record comprising a first document identifier, a first document score, a first crawl time, a first crawl type, and a first content summary;
      
      retrieving a second record, the second record comprising a second document identifier, a second document score, a second crawl time, a second crawl type, and a second content summary;
      
      wherein the first document identifier and the second document identifier correspond to a single document; and
      
      wherein setting a reuse flag includes comparing the first content summary to the second content summary.
  - 4. The method of claim 3, wherein the reuse flag is set to a first state if the first content summary and the second content summary differ.
  - 5. The method of claim 1, further comprising:
    - retrieving a first record, the first record comprising a first document identifier, a first document score, a first crawl time, and a first crawl type;
      
      retrieving a second record, the second record comprising a second document identifier, a second document score, a second crawl time, and a second crawl type;
      
      wherein the first document identifier and the second document identifier correspond to a single document; and
      
      wherein setting the reuse flag includes comparing the first crawl type to the second crawl type.
  - 6. The method of claim 5, wherein setting the reuse flag further includes setting the reuse flag to a first state if the first crawl type and the second crawl type are both equal to a predefined value.
  - 7. The method of claim 1, further comprising:
    - retrieving, from a first database, at least a subset of the contents of a first document, the document corresponding to a document identifier in a first scheduler record, the first scheduler record including a reuse flag set to a first state; and
      
      retrieving, from a second database, at least a subset of the contents of a second document, the second document corresponding to a document identifier in a second scheduler record, the second scheduler record including a reuse flag set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents, including the second document.
  - 8. The method of claim 7, wherein the first database is a set of servers and the set of servers is interconnected by a network.
  - 9. The method of claim 7, wherein the first database is the World Wide Web.
  - 10. The method of claim 7, wherein the first database comprises a subset of the World Wide Web.
  - 11. The method of claim 7, further comprising adding the at least a subset of the contents of the first document to the second database.
  - 12. The method of claim 1, further comprising:
    - receiving the scheduler record;
      
      based on the state of the reuse flag, selecting a database from amongst a predetermined set of databases; and
      
      transmitting a request for a document corresponding to the document identifier in the scheduler record to the selected database.
  - 13. The method of claim 12, further comprising:
    - when the state of the reuse flag in the scheduler record is a predefined value:
      
      transmitting a request for a document corresponding to the first document identifier to a server, the request including information corresponding to the first crawl time;
      
      receiving a response from the server, the response including information indicative of the most recent time the document has been modified; and
      
      retrieving the document from a first database when the information in the response indicates that the document was last modified at a time later than the first crawl time; and
      
      otherwise retrieving the document from a second database.
  - 35. The method of claim 1, including accessing the previously downloaded version of the document from a local repository associated with the server, and performing content processing on both the previously downloaded version of the document and other documents downloaded from respective host servers.
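
Claims 3 through 6 set the reuse flag by comparing the content summaries and crawl types of two records for the same document. The following Python sketch shows one way such a decision could look. The claims only speak of a "first state" without naming it; the sketch assumes it means "download", assumes crawl types labelled `"download"` and `"reuse"`, and assumes that a document with fewer than two records is downloaded. `HistoryRecord` and `set_reuse_flag` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoryRecord:
    doc_id: str
    crawl_time: int       # e.g. seconds since the epoch
    crawl_type: str       # assumed labels: "download" or "reuse"
    content_summary: str  # e.g. a checksum of the document content


def set_reuse_flag(records):
    """Return True (reuse) or False (download) for a single document.

    `records` must all refer to the same document; only the two most
    recent crawls are compared.
    """
    latest = sorted(records, key=lambda r: r.crawl_time)[-2:]
    if len(latest) < 2:
        return False  # assumption: not enough history, so download
    older, newer = latest
    if older.content_summary != newer.content_summary:
        return False  # content changed between crawls: download
    if older.crawl_type == newer.crawl_type == "reuse":
        return False  # assumption: reused twice in a row, so refresh
    return True
```

A document whose last two crawls were downloads with identical summaries is marked for reuse; any change in the summary, or two consecutive reuses, sends it back to the host server.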

14. A server system for retrieving a document for indexing by a search engine crawler, comprising:
- one or more processors;
  
  memory storing data and one or more programs for execution by the one or more processors;
  
  the stored data and one or more programs comprising:
  
  a log comprising at least a first record corresponding to the document; and
  
  a decision module configured to set the state of a reuse flag corresponding to the document based, at least in part, on the first record;
  
  wherein the at least the first record includes a first document identifier, a first document score, a first crawl time, and a first crawl type; and
  
  a search engine crawler module that, when performing a document crawling operation, and the reuse flag corresponding to the document has a predefined reuse value, reuses a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
  
  at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
- Dependent claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 36):
  - 15. The server system of claim 14, further comprising:
    - a crawler comprising a first network interface to a first database and a second network interface to a second database;
      
      wherein the crawler transmits a request to the first database when the reuse flag is set to a first state; and
      
      wherein the crawler transmits a request to the second database when the reuse flag is set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents.
  - 16. The server system of claim 15, wherein the first database is the World Wide Web.
  - 17. The server system of claim 15, wherein the first database comprises a subset of the World Wide Web.
  - 18. The server system of claim 15, wherein the crawler is further configured to receive a document from the first database and store the received document to the second database.
  - 19. The server system of claim 15, wherein the crawler is further configured to retrieve a document corresponding to the first document identifier, when the reuse flag is set to a third state, by:
    - transmitting a request for a document corresponding to the first document identifier to the first database, the request including information corresponding to the first crawl time; and
      
      receiving a response from the first database, the response including information indicative of the most recent time the document has been modified; and
      
      receiving the document corresponding to the first document identifier when the document was last modified at a time later than the first crawl time; and
      
      otherwise receiving the document corresponding to the first document identifier from the second database.
  - 20. The server system of claim 14, further comprising:
    - a second record, the second record comprising a second document identifier, a second document score, a second crawl time, a second crawl type, and a second content summary;
      
      wherein the first record further includes a first document content summary;
      
      wherein the first document identifier and the second document identifier correspond to a single document; and
      
      wherein the decision module is configured to set the state of the reuse flag based, at least in part, on a comparison of the first content summary and the second content summary.
  - 21. The server system of claim 20, wherein the decision module is configured to set the state of the reuse flag to a first state if the first content summary and the second content summary differ.
  - 22. The server system of claim 14, further comprising:
    - a second record, the second record comprising a second document identifier, a second document score, a second crawl time, and a second crawl type;
      
      wherein the first document identifier and the second document identifier correspond to a single document; and
      
      wherein the decision module is configured to set the state of the reuse flag based, at least in part, on a comparison of the first crawl type and the second crawl type.
  - 23. The server system of claim 22, wherein the decision module is configured to set the state of the reuse flag to a first state if the first crawl type and the second crawl type are both equal to a predefined value.
  - 36. The server system of claim 14, including:
    - a local repository associated with the server system, the local repository storing the previously downloaded version of the document; and
      
      a content processing module that performs content processing on both the previously downloaded version of the document accessed from the local repository and other documents downloaded from respective host servers.
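
Claims 15 and 19 describe a crawler whose reuse flag can take three states, the third of which asks the first database when the document was last modified before deciding where to read it from. A minimal Python sketch of that control flow follows. The enum member names, the `retrieve` function, and the in-memory `host` dictionary that stands in for the server's response are assumptions for illustration; the claims do not name a protocol, although an HTTP conditional request would be one natural fit.

```python
from enum import Enum


class Reuse(Enum):
    DOWNLOAD = 1             # first state: always ask the first database
    REUSE = 2                # second state: always use the stored copy
    REUSE_IF_UNMODIFIED = 3  # third state: check the modification time


def retrieve(doc_id, flag, last_crawl_time, host, repository):
    """Return the document content chosen according to the reuse flag.

    host:       doc_id -> (last_modified, content), a stand-in for the
                first database and its "last modified" response
    repository: doc_id -> content, a stand-in for the second database
    """
    if flag is Reuse.REUSE:
        return repository[doc_id]
    if flag is Reuse.REUSE_IF_UNMODIFIED:
        last_modified, _ = host[doc_id]
        if last_modified <= last_crawl_time:
            return repository[doc_id]  # not modified since the last crawl
    # Download, then store the copy so later crawls can reuse it.
    _, content = host[doc_id]
    repository[doc_id] = content
    return content
```

With `host = {"d": (100, "new")}` and a stored copy `"old"`, a last crawl time of 150 yields the stored copy, while a last crawl time of 50 triggers a download and refreshes the repository.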

24. A non-transitory computer-readable medium having stored thereon instructions which, when executed by one or more processors of a computer system, cause the one or more processors to perform the operations of:
- retrieving a first record corresponding to a document, the first record comprising a set of entries, the set of entries comprising a first document identifier, a first document score, a first crawl time, and a first crawl type;
  
  setting a reuse flag corresponding to the document based, at least in part, on a subset of the set of entries;
  
  outputting the reuse flag for the document to a scheduler record, the scheduler record comprising a document identifier and the reuse flag; and
  
  performing a document crawling operation, including, when performing the document crawling operation, and the reuse flag for the document has a predefined reuse value, reusing a previously downloaded version of the document instead of downloading a current version of the document from a host server, wherein reusing the previously downloaded version of the document includes:
  
  at the server system, while indexing documents obtained from the document crawling operation to build a search index, indexing the document having the reuse flag with the predefined reuse value by indexing the previously downloaded version of the document, wherein the search index includes a plurality of potential search terms and documents associated with potential search terms.
- Dependent claims (25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 37):
  - 25. The computer-readable medium of claim 24, wherein the subset of the record includes the first document score.
  - 26. The computer-readable medium of claim 24, wherein the first record further comprises a first content summary, further comprising instructions for:
    - retrieving a second record, the second record comprising a second document identifier, a second document score, a second crawl time, a second crawl type, and a second content summary;
      
      wherein the first document identifier and the second document identifier correspond to a single document; and
      
      wherein setting a reuse flag includes comparing the first content summary to the second content summary.
  - 27. The computer-readable medium of claim 26, further including instructions to set the reuse flag to a first state if the first content summary and the second content summary differ.
  - 28. The computer-readable medium of claim 24, further including instructions for:
    - retrieving a second record, the second record comprising a second document identifier, a second document score, a second crawl time, and a second crawl type; and
      
      setting the reuse flag based on a comparison of the first crawl type to the second crawl type;
      
      wherein the first document identifier and the second document identifier correspond to a single document.
  - 29. The computer-readable medium of claim 28, further including instructions for setting the reuse flag to a first state if the first crawl type and the second crawl type are the same.
  - 30. The computer-readable medium of claim 24, further including instructions for:
    - retrieving, from a first database, at least a subset of the contents of a first document, the document corresponding to a document identifier in a first scheduler record, the first scheduler record including a reuse flag set to a first state; and
      
      retrieving, from a second database, at least a subset of the contents of a second document, the second document corresponding to a document identifier in a second scheduler record, the second scheduler record including a reuse flag set to a second state;
      
      wherein the first and second databases are not the same and wherein the second database stores content from previously crawled documents, including the second document.
  - 31. The computer-readable medium of claim 30, wherein the first database is the World Wide Web.
  - 32. The computer-readable medium of claim 30, further comprising instructions for adding the at least a subset of the contents of the first document to the second database.
  - 33. The computer-readable medium of claim 30, further including instructions for retrieving at least a subset of the contents of a third document corresponding to a document identifier in a third scheduler record, the third scheduler record including a reuse flag set to a third state, by:
    - transmitting a request for the third document corresponding to the third document identifier to the first database, the request including information corresponding to a crawl time; and
      
      receiving a response from the server, the response including information indicative of the most recent time the document has been modified; and
      
      retrieving the document from a second database when the information in the response indicates that the document was last modified at a time earlier than the crawl time; and
      
      otherwise retrieving the document from the first database.
  - 34. The computer-readable medium of claim 24, further including instructions for:
    - selecting a database from amongst a predetermined set of databases, based on the state of the reuse flag; and
      
      transmitting a request for a document corresponding to the document identifier in the scheduler record to the selected database.
  - 37. The computer-readable medium of claim 24, including instructions for accessing the previously downloaded version of the document from a local repository associated with the server, and instructions for performing content processing on both the previously downloaded version of the document and other documents downloaded from respective host servers.
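
Each independent claim ends by indexing the reused copy alongside freshly downloaded documents, so that the search index maps potential search terms to documents. The Python sketch below illustrates that last step under simplifying assumptions: the schedule is a list of `(doc_id, reuse_flag)` pairs, the databases are dictionaries, and terms are produced by lowercasing and splitting on whitespace. The function name `crawl_and_index` is hypothetical.

```python
from collections import defaultdict


def crawl_and_index(schedule, host, repository):
    """Build a term -> set of doc_ids index from a crawl schedule.

    schedule:   list of (doc_id, reuse_flag) pairs
    host:       doc_id -> current content on the host server (stand-in)
    repository: doc_id -> previously downloaded content (stand-in)
    """
    index = defaultdict(set)
    for doc_id, reuse in schedule:
        if reuse:
            text = repository[doc_id]  # index the stored version
        else:
            text = host[doc_id]        # download the current version
            repository[doc_id] = text
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


host = {"a": "crawler news today", "b": "search crawler"}
repository = {"a": "crawler news"}
index = crawl_and_index([("a", True), ("b", False)], host, repository)
```

Because document `a` is reused, the term `today`, which appears only in its current version on the host, does not enter the index; that staleness is the cost the reuse flag accepts in exchange for skipping the download.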

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Zhu, Huican; Ibel, Maximilian; Acharya, Anurag; Gobioff, Howard Bradley
Primary Examiner(s): Zhe, Mengyao
Application Number: US10/882,955
Time in Patent Office: 3,583 Days
Field of Search: 718/102
US Class Current: 718/102
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
