Duplicate document detection in a web crawler system

US 7,627,613 B1
Filed: 07/03/2003
Issued: 12/01/2009
Est. Priority Date: 07/03/2003
Status: Expired due to Fees

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
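The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `DupSet` class, the `process` function, the use of SHA-256 as the content identifier, and the "strictly higher rank wins" rule are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DupSet:
    """Hypothetical record for one set of same-content documents."""
    representative: str                          # URL currently treated as representative
    members: dict = field(default_factory=dict)  # URL -> query-independent rank


def content_id(body: bytes) -> str:
    # Assumption: a SHA-256 digest stands in for the document content identifier.
    return hashlib.sha256(body).hexdigest()


def process(table: dict, url: str, body: bytes, rank: float) -> bool:
    """Merge a newly crawled document into its duplicate set.

    Returns True when the new document becomes the representative
    (and would therefore be indexed).
    """
    cid = content_id(body)
    dup = table.get(cid)
    if dup is None:
        # No known duplicates: the new document represents itself.
        table[cid] = DupSet(representative=url, members={url: rank})
        return True
    dup.members[url] = rank
    # Placeholder for the "set of predefined conditions": higher rank wins.
    if rank > dup.members[dup.representative]:
        dup.representative = url
        return True
    return False
```

Calling `process` once per crawled page keeps one representative per content identifier; only calls that return `True` would lead to indexing.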

123 Citations

30 Claims

1. A computer-implemented method of detecting duplicate documents in a network crawling system, comprising, at a server having one or more processors and memory:
- constructing a plurality of tables, each table corresponding to a portion of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  receiving a newly crawled document, such document characterized by a document content identifier and a document rank;
  
  reading information stored in the plurality of tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;
  
  updating the information stored in at least one of the tables in accordance with the document ranks of the identified set of documents and the newly crawled document;
  
  determining a representative document for the newly crawled document and the identified set of documents;
  
  indexing the representative document when the representative document is the newly crawled document; and
  
  repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein information identifying the identified set of documents, including a particular document serving as the original representative document of the identified set, is stored in one or more tables.
  - 3. The method of claim 2, wherein the determining includes:
    - comparing the document rank of the newly crawled document with that of the particular document from the identified set in accordance with a set of predefined comparison criteria;
    - selecting the newly crawled document as the representative document if the set of predefined comparison criteria are met; and
    - keeping the particular document as the representative document if the set of predefined comparison criteria is not met.
  - 4. The method of claim 3, wherein the set of predefined comparison criteria comprise at least two parameters, one parameter for comparison with an absolute difference of document ranks between the newly crawled document and the particular document, and another parameter for comparison with a ratio of document ranks between the newly crawled document and the particular document.
  - 5. The method of claim 1, wherein the updating includes inserting information identifying the newly crawled document into the at least one table only when a predefined insertion condition is satisfied.
  - 6. The method of claim 5, wherein the predefined insertion condition is that the document rank of the newly crawled document is higher than the document rank of at least one document in the identified set of documents.
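Claims 3 through 6 can be illustrated with two small predicates. This is a sketch under stated assumptions: the function names, the default threshold values, the requirement that both thresholds be exceeded, and the handling of an empty set are hypothetical. The claims only state that one parameter is compared with the absolute rank difference and another with the rank ratio.

```python
def should_replace(new_rank: float, rep_rank: float,
                   min_diff: float = 0.1, min_ratio: float = 1.2) -> bool:
    """Decide whether the new document displaces the current representative.

    Assumption: both the absolute-difference and the ratio threshold
    must be exceeded. The threshold values are illustrative only.
    """
    if rep_rank <= 0:
        # Avoid dividing by zero; any positive rank wins (assumption).
        return new_rank > 0
    diff = abs(new_rank - rep_rank)
    ratio = new_rank / rep_rank
    return diff > min_diff and ratio > min_ratio


def should_insert(new_rank: float, set_ranks: list) -> bool:
    """Insertion condition of claim 6: the new document must outrank
    at least one document already in the identified set."""
    if not set_ranks:
        # Assumption: with no known duplicates, always record the document.
        return True
    return any(new_rank > r for r in set_ranks)
```

Requiring both thresholds gives a form of hysteresis: a duplicate that is only marginally better ranked does not cause the representative to flip back and forth between crawls.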

7. A computer-implemented method of detecting duplicate documents in a network crawling system, comprising, at a server having one or more processors and memory:
- constructing a plurality of tables, each table corresponding to a segment of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank, wherein the plurality of tables comprise N+1 tables where N is an integer greater than one, wherein the N+1 tables comprise N tables, each generated during a respective phase of a set of N crawling phases, and a current table generated during a current one of the N crawling phases, wherein an oldest one of the N tables was generated during a previous instance of the current crawling phase;
  
  receiving a newly crawled document, such document characterized by a document content identifier and a document rank;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  reading information stored in the N+1 tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;
  
  updating the information stored in the current table in accordance with the document rankings of the identified set of documents and the newly crawled document;
  
  determining a representative document for the newly crawled document and the identified set of documents;
  
  indexing the representative document when said representative document is the newly crawled document;
  
  repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed; and
  
  upon completion of the current crawling phase, retiring the oldest one of the N tables.
- Dependent claims (8, 9):
  - 8. The method of claim 7, wherein the reading comprises reading from a merged table that stores information from a plurality of the N tables, and reading from the current table.
  - 9. The method of claim 7, wherein information identifying the identified set of documents, including a particular document serving as the original representative document of the identified set, is stored in one or more tables.
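The N+1 table arrangement of claim 7 can be pictured as a fixed-length queue of per-phase tables plus one table for the phase in progress. The sketch below is an assumption-laden illustration: the `PhasedTables` class, its method names, and the dict-of-dicts layout are hypothetical and not taken from the patent.

```python
from collections import deque


class PhasedTables:
    """N tables from earlier crawling phases plus one current table."""

    def __init__(self, n_phases: int):
        if n_phases < 2:
            raise ValueError("N must be an integer greater than one")
        # Oldest table sits at the left end of the deque.
        self.old = deque({} for _ in range(n_phases))
        self.current = {}

    def lookup(self, cid: str) -> dict:
        """Read all N+1 tables and return {url: rank} for one content id.

        Newer tables are read last, so their entries override older ones.
        """
        merged = {}
        for table in list(self.old) + [self.current]:
            merged.update(table.get(cid, {}))
        return merged

    def record(self, cid: str, url: str, rank: float) -> None:
        # Updates go only to the current table, as in claim 7.
        self.current.setdefault(cid, {})[url] = rank

    def end_phase(self) -> None:
        """Retire the oldest table and start a fresh current table."""
        self.old.popleft()
        self.old.append(self.current)
        self.current = {}
```

Because the oldest table was built during the previous instance of the phase now finishing, dropping it removes only entries that the current table has had the chance to refresh.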

10. A system for detecting duplicate documents during network crawling, comprising:
- one or more central processing units for executing programs;
  
  a network interface for receiving documents; and
  
  a duplicate document detection engine executable by the one or more central processing units, the engine comprising:
  
  a plurality of tables, each table corresponding to a segment of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank, wherein the plurality of tables comprise N+1 tables where N is an integer greater than one, wherein the N+1 tables comprise N tables, each generated during a respective phase of a set of N crawling phases, and a current table generated during a current one of the N crawling phases, wherein an oldest one of the N tables was generated during a previous instance of the current crawling phase;
  
  instructions for receiving a newly crawled document, such document characterized by a document content identifier and a document rank;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  instructions for reading information stored in the N+1 tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;
  
  instructions for updating the information stored in the current table in accordance with the document rankings of the identified set of documents and the newly crawled document;
  
  instructions for determining a representative document for the newly crawled document and the identified set of documents;
  
  instructions for indexing the representative document when said representative document is the newly crawled document;
  
  instructions for repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed; and
  
  instructions for retiring the oldest one of the N tables upon completion of the current crawling phase.
- Dependent claims (11, 12):
  - 11. The system of claim 10 wherein the reading comprises reading from a merged table that stores information from a plurality of the N tables, and reading from the current table.
  - 12. The system of claim 10, wherein the identified set of documents, including a particular document serving as the original representative document of the identified set, are stored in one or more tables.

13. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- instructions for constructing a plurality of data structures for storing information of documents, each document characterized by a document content identifier and a document rank, the information stored in the plurality of data structures include the document content identifier and a document rank for each document;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  instructions for receiving a requesting document in association with its document content identifier and document rank;
  
  instructions for selecting from the plurality of data structures a set of documents sharing the same document content identifier as the requesting document, and ascertaining an original representative document for the identified set of documents;
  
  instructions for generating a new set of documents from the requesting document and the selected set of documents in accordance with their document rank;
  
  instructions for identifying a representative document of the new set of documents;
  
  instructions for indexing the representative document when said representative document is the requesting document; and
  
  instructions for repeating the receiving, selecting, generating, identifying, and indexing operations with respect to a plurality of requesting documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the requesting documents are determined to be representative documents and are indexed.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 21):
  - 14. The computer program product of claim 13, wherein the plurality of data structures include a data structure for storing information of multiple sets of documents, each set of documents sharing a same document content.
  - 15. The computer program product of claim 13, wherein the plurality of data structures include a data structure for storing information of multiple sets of documents, each set of documents sharing a same document address.
  - 16. The computer program product of claim 13, wherein the document content identifier is a fixed length fingerprint of document content of a document characterized by the document content identifier.
  - 17. The computer program product of claim 13, wherein the document content identifier is a fixed length fingerprint of an address of a document characterized by the document content identifier.
  - 18. The computer program product of claim 13, wherein the generating instructions include:
    - sorting the requesting document and the selected set of documents in accordance with a metric included in score information of the requesting document and selected set of documents; and
    - selecting a new set of documents, having at most a predefined number of documents, from the requesting document and the selected set of documents based on the sorting result.
  - 19. The computer program product of claim 13, wherein:
    - the score information for each document includes a document rank; and
    - the identifying instructions include:
      - comparing the document rank of the requesting document with that of a particular document from the selected set of documents in accordance with a set of predefined comparison criteria, wherein the particular document was previously determined to be the representative document for the selected set of documents;
      - selecting the requesting document as the representative document for the new set of documents if the set of predefined comparison criteria are met; and
      - keeping the particular document as the representative document for the new set of documents if the set of predefined comparison criteria is not met.
  - 20. The computer program product of claim 19, wherein the set of predefined comparison criteria comprise at least two parameters, one parameter for comparison with an absolute difference of document rank between the requesting document and the particular document, and another parameter for comparison with a ratio of document rank between the requesting document and the particular document.
  - 21. The computer program product of claim 13, wherein a document is a temporary redirect page comprising a document content, a source document address, and a target document address.
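Two of the dependent claims above lend themselves to a short illustration: the fixed-length fingerprint of claims 16 and 17, and the sort-then-truncate step of claim 18. The sketch is hypothetical throughout. The claims do not name a hash function, so truncated SHA-256 is an assumption, as are the function names and the default limit `k`.

```python
import hashlib


def fingerprint64(data: bytes) -> int:
    """Fixed-length (64-bit) fingerprint of content or of an address.

    Assumption: the first 8 bytes of a SHA-256 digest are used here.
    """
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def bounded_set(requesting: tuple, selected: list, k: int = 3) -> list:
    """Form the new duplicate set from (url, rank) pairs.

    Sorts the requesting document together with the selected set by rank,
    highest first, and keeps at most k documents.
    """
    candidates = selected + [requesting]   # new list; inputs are not mutated
    candidates.sort(key=lambda doc: doc[1], reverse=True)
    return candidates[:k]
```

The same `fingerprint64` helper can be applied to a page body (claim 16) or to a URL string encoded as bytes (claim 17). Capping the set at `k` entries keeps each table row bounded regardless of how many copies of a page exist.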

22. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- instructions for constructing a plurality of tables, each table corresponding to a portion of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  instructions for receiving a newly crawled document, such document characterized by a document content identifier and a document rank;
  
  instructions for reading information stored in the plurality of tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;
  
  instructions for updating the information stored in at least one of the tables in accordance with the document ranks of the identified set of documents and the newly crawled document;
  
  instructions for determining a representative document for the newly crawled document and the identified set of documents;
  
  instructions for indexing the representative document when said representative document is the newly crawled document; and
  
  instructions for repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed.
- Dependent claims (23, 24, 25, 26, 27):
  - 23. The computer program product of claim 22, wherein information identifying the identified set of documents, including a particular document serving as the original representative document of the identified set, is stored in one or more tables.
  - 24. The computer program product of claim 23, wherein the determining includes:
    - comparing the document rank of the newly crawled document with that of the particular document from the identified set in accordance with a set of predefined comparison criteria;
    - selecting the newly crawled document as the representative document if the set of predefined comparison criteria are met; and
    - keeping the particular document as the representative document if the set of predefined comparison criteria is not met.
  - 25. The computer program product of claim 23, wherein the set of predefined comparison criteria comprise at least two parameters, one parameter for comparison with an absolute difference of document ranks between the newly crawled document and the particular document, and another parameter for comparison with a ratio of document ranks between the newly crawled document and the particular document.
  - 26. The computer program product of claim 22, wherein the updating includes inserting information identifying the newly crawled document into the at least one table only when a predefined insertion condition is satisfied.
  - 27. The computer program product of claim 22, wherein the predefined insertion condition is that the document rank of the newly crawled document is higher than the document rank of at least one document in the identified set of documents.

28. A computer program product of detecting duplicate documents for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
- instructions for constructing a plurality of tables, each table corresponding to a segment of a document address space, storing information identifying documents having a same document content identifier and each identified document having an associated document rank, wherein the plurality of tables comprise N+1 tables where N is an integer greater than one, wherein the N+1 tables comprise N tables, each generated during a respective phase of a set of N crawling phases, and a current table generated during a current one of the N crawling phases, wherein an oldest one of the N tables was generated during a previous instance of the current crawling phase;
  
  instructions for receiving a newly crawled document, such document characterized by a document content identifier and a document rank;
  
  wherein documents having the same document content identifier have the same content and documents having different document content identifiers have different content;
  
  instructions for reading information stored in the N+1 tables to identify a set of documents sharing the document content identifier of the newly crawled document, and ascertaining an original representative document for the identified set of documents;
  
  instructions for updating the information stored in the current table in accordance with the document rankings of the identified set of documents and the newly crawled document;
  
  instructions for determining a representative document for the newly crawled document and the identified set of documents;
  
  instructions for indexing the representative document when said representative document is the newly crawled document;
  
  instructions for repeating the receiving, reading, updating, determining and indexing operations with respect to a plurality of newly crawled documents, each of which shares a respective document content identifier with a respective set of documents, such that at least some of the newly crawled documents are determined to be representative documents and are indexed; and
  
  instructions for retiring the oldest one of the N tables upon completion of the current crawling phase.
- Dependent claims (29, 30):
  - 29. The computer program product of claim 28, wherein the reading comprises reading from a merged table that stores information from a plurality of the N tables, and reading from the current table.
  - 30. The computer program product of claim 28, wherein the identified set of documents, including a particular document serving as the original representative document of the identified set, is stored in one or more tables.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Ghemawat, Sanjay; Dean, Jeffrey A.; Verstak, Alexandre A.; Dulitz, Daniel
Primary Examiner(s): Vo, Tim T.
Assistant Examiner(s): Morrison, Jay A
Application Number: US10/614,111
Time in Patent Office: 2,343 Days
Field of Search: 707/203
US Class Current: 1/1

CPC Class Codes

G06F 16/24578   using ranking
G06F 16/355   Class or cluster creation o...
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9535   Search customisation based ...
G06F 16/9538   Presentation of query results
Y10S 707/99931   Database or file accessing
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99954   Version management
