Detecting duplicate and near-duplicate files

US 8,015,162 B2
Filed: 08/04/2006
Issued: 09/06/2011
Est. Priority Date: 08/04/2006
Status: Active Grant

Abstract

Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
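
The order and frequency properties named in the abstract can be illustrated with a toy sketch. It is not taken from the patent: `shingle_set` and `token_counts` are hypothetical helpers, and the window size `k=2` is chosen only to keep the example small. A set of token windows reacts to reordering but ignores repetition, while a token count does the opposite.

```python
from collections import Counter

def shingle_set(tokens, k=2):
    """Set of k-token windows: sensitive to order, blind to repeat counts."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def token_counts(tokens):
    """Token multiset: sensitive to counts, blind to order."""
    return Counter(tokens)

a = "spam eggs spam eggs spam".split()
b = sorted(a)                  # same tokens, different order
c = a + ["eggs", "spam"]       # same windows, higher token counts

# Reordering changes the shingle set but not the counts.
print(shingle_set(a) == shingle_set(b), token_counts(a) == token_counts(b))
# Repetition changes the counts but not the shingle set.
print(shingle_set(a) == shingle_set(c), token_counts(a) == token_counts(c))
```

The first line printed is `False True` and the second is `True False`.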

63 Claims

1. A computer-implemented method comprising:
- crawling documents accessible on a network to identify a set of documents, each document in the set of documents comprising a set of token sequence bit strings;
  
  processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
  
  processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
  
  processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
  
  removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
  - 2. The computer-implemented method of claim 1 wherein the first document similarity technique determines whether two documents are near-duplicates using representations based on a subset of the tokens of the documents, and wherein the second document similarity technique determines whether two documents are near-duplicates using representations based on all of the tokens of the documents.
  - 3. The computer-implemented method of claim 1 wherein the first document similarity technique uses set intersection to determine whether or not documents are near-duplicates, and wherein the second document similarity technique uses random projections to determine whether or not documents are near-duplicates.
  - 4. The computer-implemented method of claim 1 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - applying m different random permutation functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m random permutation functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 5. The computer-implemented method of claim 1 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - fingerprinting each shingle by applying m different fingerprinting functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m fingerprinting functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 6. The computer-implemented method of claim 5 wherein m=84, m′=6 and k is any value from 5 to 10.
  - 7. The computer-implemented method of claim 6 wherein k=8.
  - 8. The computer-implemented method of claim 1 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document;
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0, and
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the two documents agree.
  - 9. The computer-implemented method of claim 8 wherein b=384.
  - 10. The computer-implemented method of claim 8 wherein b is from 100 to 384.
  - 11. The computer-implemented method of claim 8 wherein b is set such that a bit string of 48 bytes is stored per document.
  - 12. The computer-implemented method of claim 1 wherein the act of processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes:
    - accepting the first set of near-duplicate documents,
    - for each pair of near-duplicate documents in the first set,
      - determining a similarity value using the second document similarity technique,
      - if the determined similarity value is less than a threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and
    - setting the second set to a most recent updated set of near-duplicate documents.
  - 13. The computer-implemented method of claim 12 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document,
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0,
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the two documents agree.
  - 14. The computer-implemented method of claim 13 wherein the predetermined number b is 384 and wherein the threshold is set to 372.
  - 15. The computer-implemented method of claim 13 wherein the threshold is set to approximately 97% of the predetermined number b.
  - 16. The computer-implemented method of claim 13 wherein the threshold is set to at least 96% of the predetermined number b.
  - 17. The computer-implemented method of claim 1 wherein each token sequence bit string was generated from a Web page.
  - 18. The computer-implemented method of claim 1 wherein the first document similarity technique requires less processing time than the second document similarity technique.
  - 19. The computer-implemented method of claim 1 wherein the first document similarity technique requires less storage to run than the second document similarity technique.
  - 20. The computer-implemented method of claim 1 wherein the second document similarity technique can be tuned to a finer degree than the first document similarity technique.
  - 21. The computer-implemented method of claim 1 further comprising removing boilerplate from the accepted set of documents to generate a set of preprocessed documents, wherein the act processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique operates on the set of preprocessed documents.
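
The way claim 1 combines the two techniques, together with the pair filtering of claim 12, can be sketched as follows. This is an illustrative reading rather than the patented implementation: `combine_techniques`, its parameters, and the toy scores are hypothetical, a pair is assumed to be accepted when its score is at or above a threshold, and the all-pairs scan for the third set is a simplification that would not scale to a crawl.

```python
from itertools import combinations

def combine_techniques(doc_ids, first_pairs, similarity, t1, t2):
    """Return the final set of near-duplicate pairs (second set union third set).

    first_pairs: pairs flagged by the first (order-dependent) technique,
                 each given as a sorted (id_a, id_b) tuple.
    similarity:  callable(id_a, id_b) -> score from the second technique.
    t1, t2:      first and second threshold values, with t2 > t1.
    """
    if not t1 < t2:
        raise ValueError("second threshold must be greater than the first")
    # Second set: first-technique pairs that the second technique confirms at t1.
    second = {p for p in first_pairs if similarity(*p) >= t1}
    # Third set: pairs the second technique accepts on its own at the stricter t2.
    third = {p for p in combinations(sorted(doc_ids), 2) if similarity(*p) >= t2}
    return second | third

# Toy scores standing in for the second technique (made-up values).
scores = {("a", "b"): 380, ("a", "c"): 360, ("a", "d"): 100,
          ("b", "c"): 120, ("b", "d"): 300, ("c", "d"): 90}
final = combine_techniques(
    doc_ids=["a", "b", "c", "d"],
    first_pairs={("a", "c"), ("b", "d")},
    similarity=lambda x, y: scores[(x, y)],
    t1=355, t2=372,
)
print(sorted(final))  # [('a', 'b'), ('a', 'c')]
```

The pair `("b", "d")` is dropped because the second technique scores it below the first threshold, while `("a", "b")` enters through the stricter second threshold even though the first technique missed it. Which document of each pair is removed before indexing is left out of the sketch.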

22. A non-transitory machine readable medium having stored thereon machine-executable instructions which, when executed by a machine, cause the machine to perform operations comprising:
- crawling documents accessible on a network to identify a set of documents, each document comprising a set of token sequence bit strings;
  
  processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
  
  processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
  
  processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
  
  removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
  - 23. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique determines whether two documents are near-duplicates using representations based on a subset of the tokens of the documents, and wherein the second document similarity technique determines whether two documents are near-duplicates using representations based on all of the tokens of the documents.
  - 24. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique uses set intersection to determine whether or not documents are near-duplicates, and wherein the second document similarity technique uses random projections to determine whether or not documents are near-duplicates.
  - 25. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - applying m different random permutation functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m random permutation functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 26. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - fingerprinting each shingle by applying m different fingerprinting functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m fingerprinting functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 27. The non-transitory machine readable medium of claim 26 wherein m=84, m′=6 and k is any value from 5 to 10.
  - 28. The non-transitory machine readable medium of claim 27 wherein k=8.
  - 29. The non-transitory machine readable medium of claim 22 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document;
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0, and
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the two documents agree.
  - 30. The non-transitory machine readable medium of claim 29 wherein b=384.
  - 31. The non-transitory machine readable medium of claim 29 wherein b is from 100 to 384.
  - 32. The non-transitory machine readable medium of claim 29 wherein b is set such that a bit string of 48 bytes is stored per document.
  - 33. The non-transitory machine readable medium of claim 22 wherein when the machine-executable instructions are executed by a machine, the act of processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes:
    - accepting the first set of near-duplicate documents,
    - for each pair of near-duplicate documents in the first set,
      - determining a similarity value using the second document similarity technique,
      - if the determine similarity value is less than a threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and
    - setting the second set to a most recent updated set of near-duplicate documents.
  - 34. The non-transitory machine readable medium of claim 33 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document,
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0,
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the two documents agree.
  - 35. The non-transitory machine readable medium of claim 34 wherein the predetermined number b is 384 and wherein the threshold is set to 372.
  - 36. The non-transitory machine readable medium of claim 34 wherein the threshold is set to approximately 97% of the predetermined number b.
  - 37. The non-transitory machine readable medium of claim 34 wherein the threshold is set to at least 96% of the predetermined number b.
  - 38. The non-transitory machine readable medium of claim 22 wherein each token sequence bit string was generated from a Web page.
  - 39. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique requires less processing time than the second document similarity technique.
  - 40. The non-transitory machine readable medium of claim 22 wherein the first document similarity technique requires less storage to run than the second document similarity technique.
  - 41. The non-transitory machine readable medium of claim 22 wherein the second document similarity technique can be tuned to a finer degree than the first document similarity technique.
  - 42. The non-transitory machine readable medium of claim 22 wherein the operations further comprise removing boilerplate from the accepted set of documents to generate a set of preprocessed documents, wherein the act processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique operates on the set of preprocessed documents.
  - 50. The system of claim 24 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document;
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0, and
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the documents agree.
  - 51. The system of claim 50 wherein b=384.
  - 52. The system of claim 50 wherein b is from 100 to 384.
  - 53. The system of claim 50 wherein b is set such that a bit string of 48 bytes is stored per document.
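
The shingling steps recited in claims 5 to 7 (and mirrored in claims 26 to 28) can be sketched as below, using m=84, m′=6 and k=8. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not name a fingerprint function, so salted BLAKE2b from Python's `hashlib` stands in for it, and `fp`, `supershingles` and `near_duplicates` are hypothetical names.

```python
import hashlib

def fp(data: bytes, seed: int = 0) -> int:
    """64-bit fingerprint. Assumption: salted BLAKE2b stands in for f_i."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def supershingles(tokens, k=8, m=84, m_prime=6):
    """Supershingle vector of length m_prime for one token sequence."""
    # Fingerprint every k-token sub-sequence: n - k + 1 shingles.
    shingles = [fp("\x1f".join(tokens[i:i + k]).encode())
                for i in range(len(tokens) - k + 1)]
    if not shingles:
        raise ValueError("document needs at least k tokens")
    # For each of the m functions keep the smallest value over all shingles.
    minvalues = [min(fp(s.to_bytes(8, "big"), seed=i + 1) for s in shingles)
                 for i in range(m)]
    # Fingerprint non-overlapping runs of m / m_prime minvalues.
    group = m // m_prime
    return [fp(b"".join(v.to_bytes(8, "big")
                        for v in minvalues[j * group:(j + 1) * group]))
            for j in range(m_prime)]

def near_duplicates(ss_a, ss_b, min_agree=2):
    """True when the vectors agree in at least min_agree positions."""
    return sum(x == y for x, y in zip(ss_a, ss_b)) >= min_agree

doc_a = [f"w{i}" for i in range(200)]
doc_b = list(doc_a)                      # exact copy
doc_c = [f"x{i}" for i in range(200)]    # no tokens in common
print(near_duplicates(supershingles(doc_a), supershingles(doc_b)))
print(near_duplicates(supershingles(doc_a), supershingles(doc_c)))
```

With m=84 and m′=6 each supershingle covers 14 minvalues, so two documents share a supershingle only when all 14 of those minvalues match.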

43. A system comprising:
- one or more computers programmed to perform operations comprising;
  
  crawling documents accessible on a network to identify a set of documents, each document comprising a set of token sequence bit strings;
  
  processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
  
  processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
  
  processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
  
  removing the final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
  - 44. The system of claim 43 wherein the first document similarity technique determines whether two documents are near-duplicates using representations based on a subset of the tokens of the documents, and wherein the second document similarity technique determines whether two documents are near-duplicates using representations based on all of the tokens of the documents.
  - 45. The system of claim 43 wherein the first document similarity technique uses set intersection to determine whether or not documents are near-duplicates, and wherein the second document similarity technique uses random projections to determine whether or not documents are near-duplicates.
  - 46. The system of claim 43 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - applying m different random permutation functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m random permutation functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 47. The system of claim 43 wherein the first document similarity technique includes:
    - fingerprinting every sub-sequence of tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles,
    - fingerprinting each shingle by applying m different fingerprinting functions f_i for 1 ≦ i ≦ m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m fingerprinting functions f_i,
    - determining, for each i, the smallest value to create an m-dimensional vector of minvalues,
    - reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and
    - concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 48. The system of claim 47 wherein m=84, m′=6 and k is any value from 5 to 10.
  - 49. The system of claim 48 wherein k=8.
  - 54. The system of claim 43 wherein processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes:
    - accepting the first set of near-duplicate documents,
    - for each pair of near-duplicate documents in the first set,
      - determining a similarity value using the second document similarity technique,
      - if the determined similarity value is less than a threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and
    - setting the second set to a most recent updated set of near-duplicate documents.
  - 55. The system of claim 54 wherein the second document similarity technique includes:
    - projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1},
    - for each document,
      - creating a b-dimensional vector for the document by adding the projections of all the tokens in a token sequence of the document,
      - creating a final vector for the document by setting every positive entry in the b-dimensional vector for the document to 1 and every non-positive entry to 0,
    - determining a similarity between two documents based on a number of bits in which the corresponding final vectors of the two documents agree.
  - 56. The system of claim 55 wherein the predetermined number b is 384 and wherein the threshold is set to 372.
  - 57. The system of claim 55 wherein the threshold is set to approximately 97% of the predetermined number b.
  - 58. The system of claim 55 wherein the threshold is set to at least 96% of the predetermined number b.
  - 59. The system of claim 43 wherein each token sequence bit string was generated from a Web page.
  - 60. The system of claim 43 wherein the first document similarity technique requires less processing time than the second document similarity technique.
  - 61. The system of claim 43 wherein the first document similarity technique requires less storage to run than the second document similarity technique.
  - 62. The system of claim 43 wherein the second document similarity technique can be tuned to a finer degree than the first document similarity technique.
  - 63. The system of claim 43 further programmed to perform operations comprising removing boilerplate from the accepted set of documents to generate a set of preprocessed documents, wherein the act processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique operates on the set of preprocessed documents.
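
The random-projection steps of claims 55 and 56 (b=384, threshold 372) can be sketched as follows. It is a sketch under stated assumptions rather than the patented implementation: the claims say only that each token's b entries are chosen randomly from {−1, 1}, so deriving them from a SHAKE-256 hash of the token is an assumption made here to keep a token's projection fixed across documents, and `token_projection`, `sketch`, `agreement` and `is_near_duplicate` are hypothetical names.

```python
import hashlib

B = 384           # bits per document; 384 / 8 = 48 bytes
THRESHOLD = 372   # agreeing bits required; 372 / 384 is about 96.9%

def token_projection(token, b=B):
    """Pseudo-random vector in {-1, +1}^b, fixed per token.

    Assumption: SHAKE-256 of the token supplies the random bits.
    """
    raw = hashlib.shake_256(token.encode()).digest((b + 7) // 8)
    return [1 if (raw[i // 8] >> (i % 8)) & 1 else -1 for i in range(b)]

def sketch(tokens, b=B):
    """b-bit vector for a document: sum the projections, then take signs."""
    total = [0] * b
    for tok in tokens:   # every occurrence is added, so counts matter
        for i, v in enumerate(token_projection(tok, b)):
            total[i] += v
    # Positive entries become 1, non-positive entries become 0.
    return [1 if x > 0 else 0 for x in total]

def agreement(s1, s2):
    """Number of bit positions in which two sketches agree."""
    return sum(x == y for x, y in zip(s1, s2))

def is_near_duplicate(s1, s2, threshold=THRESHOLD):
    return agreement(s1, s2) >= threshold

doc = [f"t{i}" for i in range(51)]
reordered = doc[::-1]                     # same tokens, reversed order
other = [f"u{i}" for i in range(51)]      # no tokens in common

print(agreement(sketch(doc), sketch(reordered)))        # 384
print(is_near_duplicate(sketch(doc), sketch(other)))    # False
```

Because addition is commutative, reordering the tokens leaves the sketch unchanged, which matches the order-independent property recited for the second technique. Unrelated documents are expected to agree in roughly half of the 384 bits, far below the threshold.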

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Henzinger, Monika H.
Primary Examiner(s): Popham; Jeffrey D
Application Number: US11/499,260
Publication Number: US 20080044016A1
Time in Patent Office: 1,859 Days
Field of Search: 707/664, 707/692, 707/749, 726/32
US Class Current: 707/692
CPC Class Codes:
G06F 16/958 Organisation or management ...
G06F 40/194 Calculation of difference b...
