Detecting duplicate and near-duplicate files
Abstract
Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
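The contrast the abstract draws between the two techniques can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names, the whitespace tokenizer, the shingle length `k`, the SHA-256 token hash, and the 64-bit signature width are all hypothetical choices. For simplicity the first function compares full shingle sets rather than a sampled subset of them.

```python
import hashlib
from collections import Counter


def tokens(text):
    # Assumption: lowercase whitespace tokenization stands in for real parsing.
    return text.lower().split()


def shingles(text, k=3):
    # Set of k-token sequences: token-order dependent, frequency independent.
    toks = tokens(text)
    if len(toks) < k:
        return {tuple(toks)}
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def shingle_similarity(a, b, k=3):
    # Set intersection over set union (Jaccard coefficient) of the shingle sets.
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def projection_bits(text, nbits=64):
    # Each token is hashed to a fixed +1/-1 pattern, weighted by how often the
    # token occurs, and only the sign of each coordinate is kept. The result is
    # order independent and frequency dependent.
    totals = [0] * nbits
    for tok, freq in Counter(tokens(text)).items():
        h = int.from_bytes(hashlib.sha256(tok.encode()).digest()[:8], "big")
        for i in range(nbits):
            totals[i] += freq if (h >> i) & 1 else -freq
    return [t > 0 for t in totals]


def projection_similarity(a, b, nbits=64):
    # Fraction of signature bits on which the two documents agree.
    ba, bb = projection_bits(a, nbits), projection_bits(b, nbits)
    return sum(x == y for x, y in zip(ba, bb)) / nbits
```

With these definitions, `shingle_similarity("a b c", "c b a")` is `0.0` because the single 3-token shingle differs once the order is reversed, while `projection_similarity("a b c", "c b a")` is `1.0` because the token counts are identical.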
29 Claims
1. A computer-implemented method for identifying near-duplicate documents, the method comprising:
a) accepting a set of documents;
b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and
c) processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.
27. A computer-implemented method for identifying near-duplicate documents, the method comprising:
a) accepting a set of documents; and
b) processing the set of documents to determine near-duplicate documents, wherein a first document similarity technique is used to determine near-duplicate documents for documents from the same Website, and wherein a second document similarity technique is used to determine near-duplicate documents for documents from different Websites.
28. A machine-readable medium having stored thereon machine-executable instructions which, when executed by a machine, perform a method comprising:
a) accepting a set of documents;
b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and
c) processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.
29. The machine-readable medium of claim 28 wherein, when the machine-executable instructions are executed by a machine, the act of processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes:
i) accepting the first set of near-duplicate documents;
ii) for each pair of near-duplicate documents in the first set, determining a similarity value using the second document similarity technique and, if the determined similarity value is less than the threshold, removing the current pair of near-duplicate documents from the first set to generate an updated set; and
iii) setting the second set to a most recent updated set of near-duplicate documents.
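The pair-filtering steps recited in claim 29 can be sketched in Python as follows. This is a hedged illustration only: `refine_pairs`, the `docs` mapping from identifier to text, and the threshold value are hypothetical, and `cosine_similarity` over token counts is merely a stand-in for whatever second document similarity technique is actually used.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    # Stand-in second technique (an assumption): cosine similarity of
    # token-frequency vectors, which ignores token order.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def refine_pairs(first_set, docs, similarity, threshold):
    # i) accept the first set of near-duplicate pairs.
    updated = set(first_set)
    # ii) score each pair with the second technique and remove the pair
    #     when its similarity value is below the threshold.
    for pair in first_set:
        a, b = pair
        if similarity(docs[a], docs[b]) < threshold:
            updated.discard(pair)
    # iii) the most recent updated set becomes the second set.
    return updated


docs = {
    "d1": "the quick brown fox",
    "d2": "fox brown quick the",
    "d3": "completely different text here",
}
first_set = {("d1", "d2"), ("d1", "d3")}
second_set = refine_pairs(first_set, docs, cosine_similarity, threshold=0.9)
```

Here `second_set` keeps only `("d1", "d2")`: the two documents contain the same tokens with the same counts, whereas `d1` and `d3` share no tokens and score `0.0`.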
Specification