DETECTING DUPLICATE AND NEAR-DUPLICATE FILES
Abstract
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints matches.
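The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The choices below are assumptions made for illustration only: parts are lowercase whitespace-delimited words, each part is assigned to exactly one list by a hash of the part, SHA-1 is the fingerprint function, and the names `extract_parts`, `list_index`, `fingerprints`, and `near_duplicates` are hypothetical.

```python
import hashlib


def extract_parts(text):
    """Step (i): extract parts. Assumption: parts are lowercase words."""
    return text.lower().split()


def list_index(part, num_lists):
    """Step (ii) helper: pick a list for a part from a stable hash of the part."""
    digest = hashlib.sha1(part.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_lists


def fingerprints(text, num_lists=4):
    """Return one fingerprint per populated list (steps (ii) and (iii))."""
    lists = [[] for _ in range(num_lists)]
    for part in extract_parts(text):
        lists[list_index(part, num_lists)].append(part)
    # Step (iii): hash the contents of each populated list; empty lists are skipped.
    return {
        hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
        for parts in lists
        if parts
    }


def near_duplicates(text_a, text_b, num_lists=4):
    """Two documents are treated as near-duplicates if any fingerprint matches."""
    return bool(fingerprints(text_a, num_lists) & fingerprints(text_b, num_lists))
```

Under these assumptions, replacing a single word changes at most two lists (the one that loses the old word and the one that gains the new word), so any other populated list still yields a matching fingerprint. Raising `num_lists` makes the test more tolerant of small edits, at the cost of more false matches.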
21 Claims
1. A method for filtering a plurality of candidate search results to remove near-duplicates, the method comprising:
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
   1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
   2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
Dependent claims: 2, 3, 4, 5, 6, 7
8. An apparatus, comprising:
at least one processor; and
at least one storage device storing a processor executable program which, when executed by the at least one processor, causes the at least one processor to perform operations comprising;
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
   1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
   2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer storage device encoded with a computer program, the computer program comprising instructions that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
a) for one or more of the plurality of candidate search results, determining that one candidate search result of the one or more candidate search results is a near-duplicate of another of the plurality of candidate search results by
   1) determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result; and
   2) in response to determining that a cluster identifier of the one candidate search result matches a cluster identifier of the other candidate search result, concluding that the one candidate search is a near-duplicate of the other candidate search result; and
b) in response to the determination that the one candidate search result is a near-duplicate of the other candidate search result, rejecting the one candidate search result thereby defining a filtered set of search results including only those of the plurality of candidate search results that have not been rejected.
Dependent claims: 16, 17, 18, 19, 20, 21
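The filtering recited in the independent claims (match on cluster identifier, then reject the matching result) can be sketched in Python as follows. This is an illustrative sketch only, not the claimed implementation. It assumes that candidates arrive in ranked order as `(result, cluster_id)` pairs, that the first result seen for a cluster is the one kept, and that a `cluster_id` of `None` marks a result that belongs to no cluster. The function name `filter_near_duplicates` is hypothetical.

```python
def filter_near_duplicates(candidates):
    """Return the results that survive cluster-identifier filtering.

    candidates: iterable of (result, cluster_id) pairs in ranked order.
    Assumption: a cluster_id of None means "unclustered" and is never rejected.
    """
    seen_cluster_ids = set()
    filtered = []
    for result, cluster_id in candidates:
        if cluster_id is not None and cluster_id in seen_cluster_ids:
            # Matching cluster identifier: treat as a near-duplicate and reject.
            continue
        if cluster_id is not None:
            seen_cluster_ids.add(cluster_id)
        filtered.append(result)
    return filtered


candidates = [
    ("a.html", 7),
    ("b.html", 3),
    ("a-mirror.html", 7),  # same cluster as a.html, so it is rejected
    ("c.html", None),
    ("d.html", None),
]
print(filter_near_duplicates(candidates))
# prints ['a.html', 'b.html', 'c.html', 'd.html']
```

Keeping the first-ranked member of each cluster is a design choice of this sketch. The claims only require that a result whose cluster identifier matches that of another result be rejected.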
Specification