Detecting duplicate and near-duplicate files

US 20080044016A1
Filed: 08/04/2006
Published: 02/21/2008
Est. Priority Date: 08/04/2006
Status: Active Grant

Abstract

Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
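
As an illustration of the two-stage idea in the abstract, the following Python sketch runs an order-dependent, frequency-independent test first and then re-checks only the surviving pairs with an order-independent, frequency-dependent test. The function names, the shingle length k=3, and both cutoffs are assumptions made for this example. Shingle-set Jaccard overlap and bag-of-words cosine are simple stand-ins with the stated properties, not the techniques the patent specifies, and the all-pairs loop is written for clarity rather than scale.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def shingle_overlap(a, b, k=3, min_jaccard=0.5):
    """Order dependent, frequency independent: Jaccard overlap of k-token shingle sets."""
    sa = {tuple(a[i:i + k]) for i in range(len(a) - k + 1)}
    sb = {tuple(b[i:i + k]) for i in range(len(b) - k + 1)}
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= min_jaccard


def bag_cosine(a, b, min_cosine=0.9):
    """Order independent, frequency dependent: cosine of token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return norm > 0 and dot / norm >= min_cosine


def two_stage_near_duplicates(docs, first=shingle_overlap, second=bag_cosine):
    """Return (first_set, second_set) of near-duplicate pairs of document names.

    docs maps a sortable name to a token list. The second technique is
    applied only to pairs that the first technique already reported.
    """
    first_set = {
        (x, y) for x, y in combinations(sorted(docs), 2) if first(docs[x], docs[y])
    }
    second_set = {(x, y) for x, y in first_set if second(docs[x], docs[y])}
    return first_set, second_set
```

In this sketch a page padded with one heavily repeated token can still pass the first test, because repeated shingles collapse in a set, while the frequency-aware second test rejects it.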

29 Claims

1. A computer-implemented method for identifying near-duplicate documents, the method comprising:
- a) accepting a set of documents;
  
  b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and
  
  c) processing the first set of near duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.
  - 2. The computer-implemented method of claim 1 wherein the first document similarity technique is token order dependent, and wherein the second document similarity technique is order independent.
  - 3. The computer-implemented method of claim 1 wherein the first document similarity technique is token frequency independent, and wherein the second document similarity technique is frequency dependent.
  - 4. The computer-implemented method of claim 1 wherein the first document similarity technique determines whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and wherein the second document similarity technique determines whether two documents are near-duplicates using representations based on all of the words or tokens of the documents.
  - 5. The computer-implemented method of claim 1 wherein the first document similarity technique is order dependent and frequency independent, and wherein the second document similarity technique is order independent and frequency dependent.
  - 6. The computer-implemented method of claim 1 wherein the first document similarity technique uses set intersection to determine whether or not documents are near-duplicates, and wherein the second document similarity technique uses random projections to determine whether or not documents are near-duplicates.
  - 7. The computer-implemented method of claim 1 wherein the first document similarity technique includes fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles, applying m different random permutation functions f_i for 1≦i≦m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m random permutation functions f_i, determining, for each i, the smallest value to create an m-dimensional vector of minvalues, reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 8. The computer-implemented method of claim 1 wherein the first document similarity technique includes fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles, fingerprinting each shingle with m different fingerprinting functions f_i for 1≦i≦m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m fingerprinting functions f_i, determining, for each i, the smallest value to create an m-dimensional vector of minvalues, reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  - 9. The computer-implemented method of claim 8 wherein m=84, m′=6 and k is any value from 5 to 10.
  - 10. The computer-implemented method of claim 9 wherein k=8.
  - 11. The computer-implemented method of claim 1 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
  - 12. The computer-implemented method of claim 11 wherein b=384.
  - 13. The computer-implemented method of claim 11 wherein b is from 100 to 384.
  - 14. The computer-implemented method of claim 11 wherein b is set such that a bit string of 48 bytes is stored per document.
  - 15. The computer-implemented method of claim 1 wherein the act of processing the first set of near duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes i) accepting the first set of near-duplicate documents, ii) for each pair of near duplicate documents in the first set, determining a similarity value using the second document similarity technique, if the determined similarity value is less than the threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and iii) setting the second set to a most recent updated set of near-duplicate documents.
  - 16. The computer-implemented method of claim 15 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
  - 17. The computer-implemented method of claim 16 wherein the predetermined number b is 384 and wherein the threshold is set to 372.
  - 18. The computer-implemented method of claim 16 wherein the threshold is set to approximately 97% of the predetermined number b.
  - 19. The computer-implemented method of claim 16 wherein the threshold is set to at least 96% of the predetermined number b.
  - 20. The computer-implemented method of claim 1 wherein the set of documents is a set of token sequence bit strings, each of which token sequence bit strings was generated from a Web page.
  - 21. The computer-implemented method of claim 1 wherein the first document similarity technique requires less processing time than the second document similarity technique.
  - 22. The computer-implemented method of claim 1 wherein the first document similarity technique requires less storage to run than the second document similarity technique.
  - 23. The computer-implemented method of claim 1 wherein the second document similarity technique can be tuned to a finer degree than the first document similarity technique.
  - 24. The computer-implemented method of claim 1 further comprising removing boilerplate from the accepted set of documents to generate a set of preprocessed documents, wherein the act of processing the set of documents to determine an initial set of near-duplicate documents using a first document similarity technique operates on the set of preprocessed documents.
  - 25. The computer-implemented method of claim 1 further comprising:
    - d) processing the set of documents to determine a third set of near-duplicate documents using the second document similarity technique; and
      
      e) determining a fourth set of near duplicate documents by determining the union of the second set of near duplicate documents and the third set of near-duplicate documents.
  - 26. The computer-implemented method of claim 25 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
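
The supershingle technique recited in claims 8 to 10 might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patented implementation: salted BLAKE2b from the standard library stands in for the m fingerprinting functions f_i, the sketch uses variant (A) with n−k+1 shingles, and the function names and the token separator byte are hypothetical. The defaults m=84, m′=6 and k=8 are the values given in claims 9 and 10.

```python
import hashlib


def fingerprint(data: bytes, salt: int = 0) -> int:
    """64-bit fingerprint. Salted BLAKE2b is a stand-in chosen for this sketch."""
    h = hashlib.blake2b(data, digest_size=8, salt=salt.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def supershingles(tokens, k=8, m=84, m_prime=6):
    """Return the m'-dimensional supershingle vector of a token sequence."""
    if m % m_prime != 0:
        raise ValueError("m must be a multiple of m_prime")
    # Variant (A): fingerprint every run of k tokens, giving n - k + 1 shingles.
    shingles = [
        fingerprint("\x1f".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]
    if not shingles:
        raise ValueError("document is shorter than k tokens")
    # For each function f_i, keep the smallest value over all shingles.
    minvalues = [
        min(fingerprint(s.to_bytes(8, "big"), salt=i + 1) for s in shingles)
        for i in range(m)
    ]
    # Fingerprint non-overlapping runs of m / m' minvalues each.
    step = m // m_prime
    return [
        fingerprint(b"".join(v.to_bytes(8, "big") for v in minvalues[j:j + step]))
        for j in range(0, m, step)
    ]


def near_duplicates_by_supershingles(a, b, min_agree=2):
    """True if the two supershingle vectors agree in at least min_agree positions."""
    agree = sum(x == y for x, y in zip(supershingles(a), supershingles(b)))
    return agree >= min_agree
```

With the default values each supershingle covers 84 / 6 = 14 minvalues, so a single differing minvalue changes only one of the six supershingles.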

27. A computer-implemented method for identifying near-duplicate documents, the method comprising:
- a) accepting a set of documents; and
  
  b) processing the set of documents to determine near-duplicate documents, wherein a first document similarity technique is used to determine near-duplicate documents for documents from the same Website, and wherein a second document similarity technique is used to determine near-duplicate documents for documents from different Websites.
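
The per-Website selection in claim 27 could be sketched in Python as below. The claim does not say how "the same Website" is decided, so comparing URL hostnames is an assumption of this example, and `same_site` and `is_near_duplicate` are hypothetical names. The two similarity tests are passed in as plain callables.

```python
from urllib.parse import urlsplit


def same_site(url_a: str, url_b: str) -> bool:
    """Assumption for this sketch: two pages share a Website if their hostnames match."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def is_near_duplicate(url_a, doc_a, url_b, doc_b, same_site_test, cross_site_test):
    """Apply one technique to same-site pairs and another to cross-site pairs."""
    test = same_site_test if same_site(url_a, url_b) else cross_site_test
    return test(doc_a, doc_b)
```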

28. A machine-readable medium having stored thereon machine-executable instructions which, when executed by a machine, perform a method comprising:
- a) accepting a set of documents;
  
  b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and
  
  c) processing the first set of near duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.

29. The machine-readable medium of claim 28 wherein when the machine-executable instructions are executed by a machine, the act of processing the first set of near duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes i) accepting the first set of near-duplicate documents, ii) for each pair of near duplicate documents in the first set, determining a similarity value using the second document similarity technique, if the determined similarity value is less than the threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and iii) setting the second set to a most recent updated set of near-duplicate documents.
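
The pair-filtering step recited here and in claim 15, combined with the random-projection technique of claims 11 and 16 to 19, might look roughly like the Python sketch below. It is a minimal illustration under assumptions: each token's b entries from {−1, 1} come from a pseudo-random generator seeded with the token text so that the same token always gets the same projection, and all function names are hypothetical. The defaults b=384 and threshold 372 are the values in claim 17.

```python
import random


def token_projection(token: str, b: int = 384):
    """Pseudo-randomly choose b entries from {-1, 1}, fixed per token (sketch assumption)."""
    bits = random.Random(token).getrandbits(b)
    return [1 if (bits >> j) & 1 else -1 for j in range(b)]


def document_bits(tokens, b: int = 384):
    """Add the projections of all tokens, then map positive sums to 1 and the rest to 0."""
    totals = [0] * b
    for token in tokens:
        for j, entry in enumerate(token_projection(token, b)):
            totals[j] += entry
    return [1 if t > 0 else 0 for t in totals]


def agreement(bits_a, bits_b) -> int:
    """Number of positions in which the two bit vectors agree."""
    return sum(x == y for x, y in zip(bits_a, bits_b))


def filter_pairs(first_set, docs, b: int = 384, threshold: int = 372):
    """Drop every pair whose similarity value is less than the threshold."""
    vectors = {name: document_bits(tokens, b) for name, tokens in docs.items()}
    return {
        (x, y) for x, y in first_set
        if agreement(vectors[x], vectors[y]) >= threshold
    }
```

A 384-bit vector occupies 48 bytes per document, as in claim 14, and 372 of 384 bits is about 97% agreement, in line with claim 18.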

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Henzinger, Monika H.
Granted Patent: US 8,015,162 B2
US Class Current: 380/201
CPC Class Codes:
G06F 16/958 Organisation or management ...
G06F 40/194 Calculation of difference b...
