DETECTING DUPLICATE AND NEAR-DUPLICATE FILES
Abstract
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
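The three steps in the abstract, together with the "any one fingerprint matches" test, can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the extracted "parts" are lowercase words, that each part is assigned to exactly one list by hashing it modulo the number of lists, and that a list's fingerprint is a truncated SHA-1 digest of its contents. The names `extract_parts`, `assign_to_lists`, `fingerprints`, `near_duplicates`, and `NUM_LISTS` are hypothetical and chosen only for this example.

```python
import hashlib
import re

NUM_LISTS = 4  # assumed value for the "predetermined number of lists"


def _stable_hash(text: str) -> int:
    """64-bit hash that is stable across runs (the built-in hash() is salted)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def extract_parts(document: str) -> list[str]:
    """Step (i): extract parts. Assumption: a part is a lowercase word."""
    return re.findall(r"\w+", document.lower())


def assign_to_lists(parts: list[str], num_lists: int = NUM_LISTS) -> list[list[str]]:
    """Step (ii): assign each part to a list. Assumption: hash modulo num_lists."""
    lists: list[list[str]] = [[] for _ in range(num_lists)]
    for part in parts:
        lists[_stable_hash(part) % num_lists].append(part)
    return lists


def fingerprints(document: str, num_lists: int = NUM_LISTS) -> set[int]:
    """Step (iii): generate one fingerprint from each populated list."""
    result = set()
    for index, bucket in enumerate(assign_to_lists(extract_parts(document), num_lists)):
        if bucket:  # empty lists produce no fingerprint
            result.add(_stable_hash(f"{index}:" + " ".join(bucket)))
    return result


def near_duplicates(doc_a: str, doc_b: str, num_lists: int = NUM_LISTS) -> bool:
    """Two documents are near-duplicates if any one of their fingerprints match."""
    return not fingerprints(doc_a, num_lists).isdisjoint(fingerprints(doc_b, num_lists))
```

Under these assumptions, editing a single word changes at most two lists (the one that loses the old word and the one that gains the new word), so the fingerprints of the remaining populated lists still match and the documents are still reported as near-duplicates. Choosing a larger `NUM_LISTS` makes the test more tolerant of edits, at the cost of more fingerprints per document.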
14 Claims
1. A computer-implemented method for determining whether two documents are near-duplicates, the computer-implemented method comprising:
   a) for each of the two documents, generating at least two different fingerprints; and
   b) determining whether or not the two documents are near-duplicate documents by
      1) determining whether or not any one of the at least two fingerprints of a first of the two documents matches any one of the at least two fingerprints of a second of the two documents, and
      2) if it is determined that any one fingerprint of the at least two fingerprints of the first of the two documents does match any one fingerprint of the at least two fingerprints of the second of the two documents, then concluding that the two documents are near-duplicates; and
   c) using the determination of whether or not the two documents are near-duplicates in at least one of (A) an act of serving search results corresponding to documents, (B) an act of crawling documents, (C) an act of indexing documents, and (D) an act of fixing a broken link to at least one of the two documents.
2. A machine-readable medium having stored thereon a plurality of records, each of the records comprising:
   a) a first field for storing a document identifier; and
   b) a plurality of lists, each of the plurality of lists containing elements of a document identified by the document identifier stored in the first field, wherein each of the elements are contained in one of the plurality of lists in accordance with a result of hashing the element using a hash function.
   Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. Apparatus for determining whether two documents are near-duplicates, the apparatus comprising:
   a) at least one processor; and
   b) at least one storage device storing a processor executable program which, when executed by the at least one processor, performs a method including
      1) for each of the two documents, generating at least two different fingerprints; and
      2) determining whether or not the two documents are near-duplicate documents by
         A) determining whether or not any one of the at least two fingerprints of a first of the two documents matches any one of the at least two fingerprints of a second of the two documents, and
         B) if it is determined that any one fingerprint of the at least two fingerprints of the first of the two documents does match any one fingerprint of the at least two fingerprints of the second of the two documents, then concluding that the two documents are near-duplicates; and
         C) using the determination of whether or not the two documents are near-duplicates in at least one of (i) an act of serving search results corresponding to documents, (ii) an act of crawling documents, (iii) an act of indexing documents, and (iv) an act of fixing a broken link to at least one of the two documents.
Specification