Detecting duplicate and near-duplicate files

US 6,658,423 B1
Filed: 01/24/2001
Issued: 12/02/2003
Est. Priority Date: 01/24/2001
Status: Expired due to Term

First Claim

Patent Images

1. A method for determining whether documents, in a large collection of documents, are near-duplicates, the method comprising:

a) for each of at least some of the documents in the large collection of documents, generating at least two fingerprints;

b) preprocessing the fingerprints to identify any fingerprints that are associated with only one document; and

c) determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

Citations

38 Claims

1. A method for determining whether documents, in a large collection of documents, are near-duplicates, the method comprising:
- a) for each of at least some of the documents in the large collection of documents, generating at least two fingerprints;
  
  b) preprocessing the fingerprints to identify any fingerprints that are associated with only one document; and
  
  c) determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method of claim 1 wherein the act of determining whether or not documents are near-duplicate documents includes:
3. The method of claim 1 wherein the act of generating at least two fingerprints for each of the documents includes:
- i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating a predetermined number of lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list.
4. The method of claim 3 wherein the act of hashing each of the extracted parts to generate a hash value for each of the extracted parts uses a hash function that is repeatable, deterministic and not sensitive to state.
5. The method of claim 3 wherein the parts extracted from the document are selected from a group of parts consisting of characters, words, sentences, paragraphs and sections.
6. The method of claim 3 wherein the parts extracted from the document do not overlap.
7. The method of claim 3 wherein the parts extracted from the document overlap.
8. The method of claim 3 wherein each of the acts of determining a fingerprint uses a hashing function with a low probability of collision.
9. The method of claim 3 wherein the act of determining a fingerprint uses a function that is sensitive to an order of the parts within a list.
10. The method of claim 3 wherein the act of determining a fingerprint uses a function that is insensitive to an order of the parts within a list.

11. An apparatus for determining whether documents, in a large collection of documents, are near-duplicates, the apparatus comprising:
- a) a fingerprint generator for generating, for each of the documents in the large collection of documents, at least two fingerprints;
  
  b) a preprocessor for identifying any fingerprints that are associated with only one document; and
  
  c) a fingerprint comparison facility for determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document.
- View Dependent Claims (12)
- - 12. The apparatus of claim 11 wherein the fingerprint generator includes:

13. A method for clustering documents, the method comprising:
- a) for each of the documents, generating at least two fingerprints; and
  
  b) for each of the documents, i) determining whether or not the document is a near-duplicate of any of previously processed documents, based on fingerprints of the documents, ii) if it is determined that the document is not a near-duplicate of any previously processed document, then associating the document with a unique cluster identifier, and iii) if it is determined that the document is a near-duplicate of a previously processed document, then associating the document with a cluster identifier associated with the previously processed document.

14. A method for filtering search results to remove near-duplicates, the method comprising:
- a) for each of a predetermined number of candidate search results, determining whether the candidate search result is a near-duplicate of another candidate search result; and
  
  b) if it is determined that the candidate search result is a near-duplicate of another candidate search result, then rejecting the candidate search result wherein the act of determining whether a candidate search result is a near-duplicate of another candidate search result includes i) comparing a cluster identifier of the candidate search result with that of the other candidate search result, and ii) if the cluster identifiers of the two candidate search results match, then concluding that the two candidate search results are near-duplicates, and wherein cluster identifiers of the candidate search results are assigned by;
  
  i) determining whether or not a document corresponding to the candidate search result is a near-duplicate of any of previously processed documents, ii) if it is determined that the document corresponding to the candidate search result is not a near-duplicate of any previously processed document, then associating the document with a unique cluster identifier, and iii) if it is determined that the document corresponding to the candidate search result is a near-duplicate of a previously processed document, then associating the document corresponding to the candidate search result with a cluster identifier associated with the previously processed document.

15. A method for determining whether two documents are near-duplicates, the method comprising:
- a) for each of the two documents, generating at least two fingerprints by i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating at least two lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
  
  b) determining whether or not the two documents are near-duplicate documents based on their fingerprints.
- View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23)
- - 16. The method of claim 15 wherein the act of determining whether or not the two documents are near-duplicate documents includes:
17. The method of claim 15 wherein the act of hashing each of the extracted parts to generate a hash value for each of the extracted parts uses a hash function that is repeatable, deterministic and not sensitive to state.
18. The method of claim 15 wherein the parts extracted from the document are selected from a group of parts consisting of characters, words, sentences, paragraphs and sections.
19. The method of claim 15 wherein the parts extracted from the document do not overlap.
20. The method of claim 15 wherein the parts extracted from the document overlap.
21. The method of claim 15 wherein the act of determining a fingerprint uses a hashing function with a low probability of collision.
22. The method of claim 15 wherein the act of determining a fingerprint uses a function that is sensitive to an order of the parts within a list.
23. The method of claim 15 wherein the act of determining a fingerprint uses a function that is insensitive to an order of the parts within a list.

24. A method, for use in a crawling facility, for reducing processing and bandwidth used, the method comprising:
- a) for each of the documents, generating at least two fingerprints by i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating at least two lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list;
  
  b) determining whether or not the two documents are near-duplicate documents based on their fingerprints; and
  
  c) if it is determined that the two documents are near-duplicates, then indicating that one of the two documents is not to be processed during a subsequent crawl.

25. A method for treating broken links to document, the method comprising:
- a) determining whether a link to a first document is broken;
  
  b) if it is determined that a link to a first document is broken, determining whether there exists a second document that is a near-duplicate of the first document; and
  
  c) if it is determined that there exists a second document that is a near-duplicate of the first document, then replacing the broken link to the first document with a link to the second document, wherein the act of determining whether or not there exists a second document is a near-duplicate of the first document is performed by;
  
  i) for each of the documents, generating at least two fingerprints by A) extracting parts from the document, B) hashing each of the extracted parts to generate a hash value for each of the extracted parts, C) populating at least two lists with the extracted parts based on their respective hash values, and D) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
  
  ii) determining whether or not the two documents are near-duplicate documents based on their fingerprints.

26. An apparatus for determining whether two documents are near-duplicates, the apparatus comprising:
- a) a fingerprint generator for generating at least two fingerprints for each of the two documents, the fingerprint generator including i) an extractor for extracting parts from the document, ii) a hashing facility for hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) a list population facility for populating at least two lists with the extracted parts based on their respective hash values, and iv) means for determining, for each of the predetermined number of lists, a fingerprint based on the contents of the list; and
  
  b) a comparison facility for determining whether or not the two documents are near-duplicate documents based on their fingerprints.

27. An improved crawling facility, for reducing processing and bandwidth used, the crawling facility comprising:
- a) a fingerprint generator for generating, for each of the documents, at least two fingerprints, the fingerprint generator including i) an extractor for extracting parts from the document, ii) a hashing facility for hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) a list population facility for populating at least two lists with the extracted parts based on their respective hash values, and iv) means for determining, for each of the predetermined number of lists, a fingerprint based on the contents of the list;
  
  b) a comparison facility for determining whether or not the two documents are near-duplicate documents based on their fingerprints; and
  
  c) a document processor, wherein if it is determined that the two documents are near-duplicates, then the document processor indicates that one of the two documents is not to be processed during a subsequent crawl.

28. A machine-readable medium having stored thereon machine-executable instructions which, when executed by a machine:
- a) extract parts from a document, ii) hash each of the extracted parts to generate a hash value for each of the extracted parts, iii) populate a predetermined number of lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determine a fingerprint based on the contents of the list.

29. A method for generating at least two fingerprints for a document comprising:
- a) extracting parts from the document;
  
  b) hashing each of the extracted parts to generate a hash value for each of the extracted parts;
  
  c) populating a predetermined number of lists with the extracted parts based on their respective hash values; and
  
  d) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list.
- View Dependent Claims (30, 31)
- - 30. The method of claim 29 wherein each of the lists has an associated hashing function,
- 31. The method of claim 30 wherein for each hash function is dynamically adjusted such that the probability that the hash function will populate its associated list with a part decreases as the size of the document increases.

32. A method comprising:
- a) determining whether there exists a second document that is a near-duplicate of a first document; and
  
  b) indexing the first document but not the second document, wherein the act of determining whether or not there exists a second document is a near-duplicate of the first document is performed by;
  
  i) for each of the documents, generating at least two fingerprints by A) extracting parts from the document, B) hashing each of the extracted parts to generate a hash value for each of the extracted parts, C) populating at least two lists with the extracted parts based on their respective hash values, and D) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
  
  ii) determining whether or not the two documents are near-duplicate documents based on their fingerprints.

33. A method for determining whether two objects are near-duplicates, the method comprising:
- a) for each of the two objects, generating at least two fingerprints by i) extracting features from the object, ii) hashing each of the extracted features to generate a hash value for each of the extracted features, iii) populating at least two lists with the extracted features based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
  
  b) determining whether or not the two objects are near-duplicates based on their fingerprints.
- View Dependent Claims (34, 35, 36)
- - 34. The method of claim 33 wherein each of the two objects is a word, and
- 35. The method of claim 33 wherein each of the two objects is a word, andwherein, in each case, the extracted features are words that frequently occur in close proximity to the word.
- 36. The method of claim 33 wherein the two objects are words, andwherein if the two objects are determined to be near duplicates, then determining the two words to be synonyms.

37. A method for determining whether a first document and a second document in a collection of documents are near-duplicates, the method comprising:
- a) for each of the documents in the collection of documents, generating at least two fingerprints; and
  
  b) concluding that the first and second documents are near-duplicates if any one of the at least two fingerprints of the first document matches any one of the at least two fingerprints of the second document, wherein documents in the collection of documents without any common fingerprints are not checked to determine whether or not they are near duplicates.
- View Dependent Claims (38)
- - 38. The method of claim 37 further comprising:

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google Inc. (Alphabet Inc.)
Inventors
Pugh, William, Henzinger, Monika H.
Primary Examiner(s)
Coby, Frantz
Assistant Examiner(s)
NGUYEN, MERILYN P

Application Number

US09/768,947
Time in Patent Office

1,042 Days
Field of Search

707/1, 707/4, 707/102, 707/203, 707/3, 707/6, 707/103
US Class Current

1/1
CPC Class Codes

G06F 16/355   Class or cluster creation o...

G06F 16/951   Indexing; Web crawling tech...

G06F 16/9532   Query formulation

G06F 16/9538   Presentation of query results

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99943   Generating database or data...

Detecting duplicate and near-duplicate files

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

Citations

38 Claims

Specification

Solutions

Use Cases

Quick Links

Detecting duplicate and near-duplicate files

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

38 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links