Representative document selection for sets of duplicate documents in a web crawler system

US 8,260,781 B2
Filed: 07/19/2011
Issued: 09/04/2012
Est. Priority Date: 07/03/2003
Status: Expired due to Term

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
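
The incremental update described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Doc` type, the `merge_crawled` function, the SHA-256 content key, the `MAX_SET_SIZE` cap, and the "highest score wins, URL breaks ties" rule are all assumptions introduced here, since the abstract only speaks of a query independent metric and a set of predefined conditions.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

MAX_SET_SIZE = 3  # hypothetical cap on how many duplicates a set retains


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # query independent metric; assumed here: higher is better


def content_key(content: str) -> str:
    """Key shared by documents with byte-identical content (assumed: SHA-256)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def merge_crawled(dup_sets: Dict[str, List[Doc]], doc: Doc) -> Doc:
    """Merge a newly crawled document into its duplicate set.

    Returns the representative of the updated set. A URL recrawled with
    changed content is not removed from its old set in this sketch.
    """
    key = content_key(doc.content)
    # Existing set with the same content, if any; drop a stale copy of this URL.
    members = [d for d in dup_sets.get(key, []) if d.url != doc.url]
    members.append(doc)
    # Rank by score (descending), then URL, so the outcome is deterministic.
    members.sort(key=lambda d: (-d.score, d.url))
    # Exclude the lowest-ranked duplicates beyond the cap.
    dup_sets[key] = members[:MAX_SET_SIZE]
    # Assumed selection condition: the top-ranked member is the representative.
    return dup_sets[key][0]
```

Calling `merge_crawled` once per crawled document keeps `dup_sets` current, and the returned document is the only one a downstream stage would need to process for that content.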

15 Claims

1. A method of detecting duplicate documents, comprising:
- at a server having one or more processors and memory;
  
  receiving documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
  
  generating a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
  
  indexing at least a subset of the received documents to produce an document index that maps terms to documents in the one or more databases of documents; and
  
  while performing the indexing,
  
  identifying respective sets of received documents having the same content identifier,
  
  selecting a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents,
  
  indexing the representative document, and
  
  with respect to each respective set of received documents, including only the representative document in the document index.
- Dependent claims (2, 3, 4, 5)
  - 2. The method of claim 1, wherein the document content identifier for a respective document comprises a fixed length fingerprint of document content of the respective document.
  - 3. The method of claim 1, including computing respective scores for respective received documents, including a respective score for each document in a respective set of documents having the same content identifier, and selecting the representative document in accordance with at least the scores for the documents in the respective set of documents.
  - 4. The method of claim 1, further including performing a data processing operation, with respect to each respective set of received documents, on only the representative document for the respective set of received documents.
  - 5. The method of claim 1, wherein the query independent score associated with a respective document indicates the respective document's importance or popularity.
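
The steps of claims 1 and 2 can be sketched in a few lines. This is a simplified batch illustration under stated assumptions, not the patented implementation: the names `fingerprint` and `build_index`, the 8-byte BLAKE2b digest, the whitespace tokenizer, and the "highest score wins, URL breaks ties" rule are choices made for this example. The claim describes identifying duplicate sets while indexing, whereas this sketch groups first and indexes second for brevity.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def fingerprint(content: str) -> bytes:
    """Fixed length (8-byte) content identifier; the hash choice is an assumption."""
    return hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()


def build_index(
    docs: Iterable[Tuple[str, str, float]],
) -> Tuple[Dict[str, Set[str]], Dict[str, List[str]]]:
    """Index only one representative per set of same-content documents.

    docs yields (url, content, query_independent_score) tuples.
    Returns (term -> representative URLs, representative URL -> duplicate URLs).
    """
    # Identify sets of documents having the same content identifier.
    groups = defaultdict(list)
    for url, content, score in docs:
        groups[fingerprint(content)].append((score, url, content))

    index: Dict[str, Set[str]] = defaultdict(set)
    duplicates: Dict[str, List[str]] = {}
    for members in groups.values():
        # Select the representative by score (assumed: higher is better).
        _, rep_url, rep_content = min(members, key=lambda m: (-m[0], m[1]))
        duplicates[rep_url] = sorted(u for _, u, _ in members if u != rep_url)
        # Only the representative contributes postings to the term index.
        for term in rep_content.lower().split():
            index[term].add(rep_url)
    return dict(index), duplicates
```

Because every member of a set shares the same content, indexing the representative alone loses no terms; the `duplicates` map keeps the excluded URLs recoverable.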

6. A system for detecting duplicated documents, comprising:
- one or more processors;
  
  memory storing one or more programs executable by the one or more processors, the one or more programs comprising instructions to;
  
  receive documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
  
  obtain a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
  
  index at least a subset of the received documents to produce an document index that maps terms to documents in the one or more databases of documents; and
  
  while performing the indexing,
  
  identify respective sets of received documents having the same content identifier,
  
  select a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents,
  
  index the representative document, and
  
  with respect to each respective set of received documents, include only the representative document in the document index.
- Dependent claims (7, 8, 9, 10)
  - 7. The system of claim 6, wherein the document content identifier for a respective document comprises a fixed length fingerprint of document content of the respective document.
  - 8. The system of claim 6, wherein the one or more programs further comprise instructions to:
    - compute respective scores for respective received documents, including a respective score for each document in a respective set of documents having the same content identifier; and
      
      select the representative document in accordance with at least the scores for the documents in the respective set of documents.
  - 9. The system of claim 6, wherein the one or more programs further comprise instructions that, when executed by the one or more processors, cause the system to perform a data processing operation, with respect to each respective set of received documents, on only the representative document for the respective set of received documents.
  - 10. The system of claim 6, wherein the query independent score associated with a respective document indicates the respective document's importance or popularity.

11. A non-transitory computer readable storage medium storing one or more programs that when executed by one or more processors of a computer system cause the computer system to:
- receive documents from one or more databases of documents, wherein each received document is associated with a respective query independent score;
  
  obtain a document content identifier for each received document, each document content identifier comprising an identifier of a respective document's content;
  
  index at least a subset of the received documents to produce an document index that maps terms to documents in the one or more databases of documents; and
  
  while performing the indexing, identify respective sets of received documents having the same content identifier;
  
  select a single document in each respective set of documents, in accordance with the query independent scores associated with the documents in the respective set of documents, as a representative document for the respective set of received documents;
  
  index the representative document; and
  
  with respect to each respective set of received documents, include only the representative document in the document index.
- Dependent claims (12, 13, 14, 15)
  - 12. The non-transitory computer readable medium of claim 11, wherein the document content identifier for a respective document comprises a fixed length fingerprint of document content of the respective document.
  - 13. The non-transitory computer readable medium of claim 11, wherein the one or more programs further comprising instructions that when executed by the one or more processors cause the computer system to:
    - compute respective scores for respective received documents, including a respective score for each document in a respective set of documents having the same content identifier, and
      
      select the representative document for the respective set of documents in accordance with at least the computed scores for the documents in the respective set of documents.
  - 14. The non-transitory computer readable medium of claim 11, wherein the one or more programs further comprising instructions that when executed by the one or more processors cause the computer system to:
    - perform a data processing operation, with respect to each respective set of received documents, on only the representative document for the respective set of received documents.
  - 15. The non-transitory computer readable medium of claim 11, wherein the query independent score associated with a respective document indicates the respective document's importance or popularity.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A.
Primary Examiner(s): MORRISON, JAY A
Application Number: US13/186,414
Publication Number: US 20110276561A1
Time in Patent Office: 413 Days
Field of Search: None
US Class Current: 707/736
CPC Class Codes:
- G06F 16/24578: using ranking
- G06F 16/355: Class or cluster creation o...
- G06F 16/951: Indexing; Web crawling tech...
- G06F 16/9535: Search customisation based ...
- G06F 16/9538: Presentation of query results
- Y10S 707/99931: Database or file accessing
- Y10S 707/99932: Access augmentation or opti...
- Y10S 707/99935: Query augmenting and refini...
- Y10S 707/99954: Version management
