Detecting duplicate and near-duplicate files
Abstract
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches.
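The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say what a "part" is, how parts are assigned to lists, or which fingerprint function is used. Here the parts are assumed to be lowercase whitespace-separated words, each part is assumed to go to exactly one list chosen by a stable hash, and SHA-1 is assumed as the fingerprint function. The names `extract_parts`, `fingerprints`, `near_duplicates`, and `NUM_LISTS` are hypothetical.

```python
import hashlib

NUM_LISTS = 8  # assumed value for the "predetermined number of lists"


def _stable_hash(text: str) -> int:
    """Process-independent hash (the built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def extract_parts(document: str) -> list[str]:
    """Step (i): extract parts. Assumption: parts are lowercase words."""
    return document.lower().split()


def fingerprints(document: str, num_lists: int = NUM_LISTS) -> dict[int, str]:
    """Steps (ii) and (iii): fill the lists, then fingerprint each populated list."""
    lists: list[list[str]] = [[] for _ in range(num_lists)]
    for part in extract_parts(document):
        # Assumption: each part is assigned to one list, chosen by its hash.
        lists[_stable_hash(part) % num_lists].append(part)
    return {
        index: hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
        for index, parts in enumerate(lists)
        if parts  # only populated lists receive a fingerprint
    }


def near_duplicates(doc_a: str, doc_b: str, num_lists: int = NUM_LISTS) -> bool:
    """Two documents are near-duplicates if any one fingerprint matches."""
    fp_a = fingerprints(doc_a, num_lists)
    fp_b = fingerprints(doc_b, num_lists)
    return any(fp_a[i] == fp_b[i] for i in fp_a.keys() & fp_b.keys())


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(near_duplicates(a, b))
```

Under these assumptions, replacing a single word changes at most two lists (the one that loses the old word and the one that gains the new word), so the fingerprints of the remaining populated lists still agree. With a single list, the same change alters the only fingerprint and the documents no longer match, which is why spreading parts over several lists tolerates small edits.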
2 Claims
1. A method for filtering a plurality of candidate search results to remove near-duplicates, the method comprising:
- a) for one of the plurality of candidate search results, determining whether the one candidate search result is a near-duplicate of another of the plurality of candidate search results by 1) comparing a cluster identifier of the one candidate search result with a cluster identifier of the other candidate search result, and 2) if the cluster identifiers of the one and the other candidate search results match, then concluding that the one candidate search is a near-duplicate of the other candidate search result; and
- b) in response to a determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
2. A search filter for processing a plurality of search results to remove near-duplicates, the search filter comprising:
- a) a near-duplicate determination facility for determining, for one of the plurality of candidate search results, whether the one candidate search result is a near-duplicate of another of the plurality of candidate search results, and wherein the near-duplicate determination facility includes a comparison facility for comparing a cluster identifier of the one candidate search result with a cluster identifier of the other candidate search result, and wherein if the cluster identifiers of the one candidate search result and the other candidate search result match, then it is concluded that the one candidate search result and the other candidate search result are near-duplicates; and
- b) a filter for rejecting the one candidate search result if it is determined that the one candidate search result is a near-duplicate of the other candidate search result and passing the one candidate search result if it is not determined that the one candidate search result is a near-duplicate of the other candidate search result, thereby defining a filtered set of search results including only those of the plurality of candidate results that have not been rejected by the filter.
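The filtering recited in the claims can be sketched in a few lines of Python. This is an illustration under assumptions, not the claimed implementation: the claims do not say how cluster identifiers are produced or which member of a matching pair is retained. Here each result is assumed to already carry a cluster identifier, and the earlier result in the list is assumed to be the one kept. The names `SearchResult` and `filter_near_duplicates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    url: str
    cluster_id: str  # assumed to have been assigned before filtering


def filter_near_duplicates(candidates: list[SearchResult]) -> list[SearchResult]:
    """Reject any candidate whose cluster identifier matches an earlier one."""
    seen_clusters: set[str] = set()
    filtered: list[SearchResult] = []
    for result in candidates:
        # Step a): compare this result's cluster identifier with those of
        # the results already accepted; a match means near-duplicate.
        if result.cluster_id in seen_clusters:
            continue  # Step b): reject the near-duplicate
        seen_clusters.add(result.cluster_id)
        filtered.append(result)
    return filtered


if __name__ == "__main__":
    results = [
        SearchResult("a.example/1", "c1"),
        SearchResult("b.example/1", "c2"),
        SearchResult("a.example/mirror", "c1"),
    ]
    for kept in filter_near_duplicates(results):
        print(kept.url)
```

Tracking the accepted cluster identifiers in a set keeps the comparison at one lookup per candidate instead of comparing every pair of results. Keeping the first result of each cluster is a design choice of this sketch; a ranking-aware filter could choose differently.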
Specification