Detecting duplicate and near-duplicate files
First Claim
Patent Images
1. A method for determining whether documents, in a large collection of documents, are near-duplicates, the method comprising:
- a) for each of at least some of the documents in the large collection of documents, generating at least two fingerprints;
b) preprocessing the fingerprints to identify any fingerprints that are associated with only one document; and
c) determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document.
2 Assignments
0 Petitions
Accused Products
Abstract
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.
-
Citations
38 Claims
-
1. A method for determining whether documents, in a large collection of documents, are near-duplicates, the method comprising:
-
a) for each of at least some of the documents in the large collection of documents, generating at least two fingerprints;
b) preprocessing the fingerprints to identify any fingerprints that are associated with only one document; and
c) determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
i) for any two documents, determining whether or not any fingerprints of a first of the two documents matches any fingerprints of a second of the two documents, and ii) if it is determined that a fingerprint of the first of the two documents does match a fingerprint of the second of the two documents, then concluding that the two documents are near-duplicates.
-
-
3. The method of claim 1 wherein the act of generating at least two fingerprints for each of the documents includes:
-
i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating a predetermined number of lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list.
-
-
4. The method of claim 3 wherein the act of hashing each of the extracted parts to generate a hash value for each of the extracted parts uses a hash function that is repeatable, deterministic and not sensitive to state.
-
5. The method of claim 3 wherein the parts extracted from the document are selected from a group of parts consisting of characters, words, sentences, paragraphs and sections.
-
6. The method of claim 3 wherein the parts extracted from the document do not overlap.
-
7. The method of claim 3 wherein the parts extracted from the document overlap.
-
8. The method of claim 3 wherein each of the acts of determining a fingerprint uses a hashing function with a low probability of collision.
-
9. The method of claim 3 wherein the act of determining a fingerprint uses a function that is sensitive to an order of the parts within a list.
-
10. The method of claim 3 wherein the act of determining a fingerprint uses a function that is insensitive to an order of the parts within a list.
-
11. An apparatus for determining whether documents, in a large collection of documents, are near-duplicates, the apparatus comprising:
-
a) a fingerprint generator for generating, for each of the documents in the large collection of documents, at least two fingerprints;
b) a preprocessor for identifying any fingerprints that are associated with only one document; and
c) a fingerprint comparison facility for determining whether or not documents are near-duplicate documents based on fingerprints other than those identified as being associated with only one document. - View Dependent Claims (12)
i) an extractor for extracting parts from the document, ii) a hashing facility for hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) list population facility for populating a predetermined number of lists with the extracted parts based on their respective hash values, and iv) means for determining a fingerprint for each of the predetermined number of lists, based on the contents of the list.
-
-
13. A method for clustering documents, the method comprising:
-
a) for each of the documents, generating at least two fingerprints; and
b) for each of the documents, i) determining whether or not the document is a near-duplicate of any of previously processed documents, based on fingerprints of the documents, ii) if it is determined that the document is not a near-duplicate of any previously processed document, then associating the document with a unique cluster identifier, and iii) if it is determined that the document is a near-duplicate of a previously processed document, then associating the document with a cluster identifier associated with the previously processed document.
-
-
14. A method for filtering search results to remove near-duplicates, the method comprising:
-
a) for each of a predetermined number of candidate search results, determining whether the candidate search result is a near-duplicate of another candidate search result; and
b) if it is determined that the candidate search result is a near-duplicate of another candidate search result, then rejecting the candidate search result wherein the act of determining whether a candidate search result is a near-duplicate of another candidate search result includes i) comparing a cluster identifier of the candidate search result with that of the other candidate search result, and ii) if the cluster identifiers of the two candidate search results match, then concluding that the two candidate search results are near-duplicates, and wherein cluster identifiers of the candidate search results are assigned by;
i) determining whether or not a document corresponding to the candidate search result is a near-duplicate of any of previously processed documents, ii) if it is determined that the document corresponding to the candidate search result is not a near-duplicate of any previously processed document, then associating the document with a unique cluster identifier, and iii) if it is determined that the document corresponding to the candidate search result is a near-duplicate of a previously processed document, then associating the document corresponding to the candidate search result with a cluster identifier associated with the previously processed document.
-
-
15. A method for determining whether two documents are near-duplicates, the method comprising:
-
a) for each of the two documents, generating at least two fingerprints by i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating at least two lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
b) determining whether or not the two documents are near-duplicate documents based on their fingerprints. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23)
i) determining whether or not any fingerprints of a first of the two documents matches any fingerprints of a second of the two documents, and ii) if it is determined that a fingerprint of the first of the two documents does match a fingerprint of the second of the two documents, then concluding that the two documents are near-duplicates.
-
-
17. The method of claim 15 wherein the act of hashing each of the extracted parts to generate a hash value for each of the extracted parts uses a hash function that is repeatable, deterministic and not sensitive to state.
-
18. The method of claim 15 wherein the parts extracted from the document are selected from a group of parts consisting of characters, words, sentences, paragraphs and sections.
-
19. The method of claim 15 wherein the parts extracted from the document do not overlap.
-
20. The method of claim 15 wherein the parts extracted from the document overlap.
-
21. The method of claim 15 wherein the act of determining a fingerprint uses a hashing function with a low probability of collision.
-
22. The method of claim 15 wherein the act of determining a fingerprint uses a function that is sensitive to an order of the parts within a list.
-
23. The method of claim 15 wherein the act of determining a fingerprint uses a function that is insensitive to an order of the parts within a list.
-
24. A method, for use in a crawling facility, for reducing processing and bandwidth used, the method comprising:
-
a) for each of the documents, generating at least two fingerprints by i) extracting parts from the document, ii) hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) populating at least two lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list;
b) determining whether or not the two documents are near-duplicate documents based on their fingerprints; and
c) if it is determined that the two documents are near-duplicates, then indicating that one of the two documents is not to be processed during a subsequent crawl.
-
-
25. A method for treating broken links to document, the method comprising:
-
a) determining whether a link to a first document is broken;
b) if it is determined that a link to a first document is broken, determining whether there exists a second document that is a near-duplicate of the first document; and
c) if it is determined that there exists a second document that is a near-duplicate of the first document, then replacing the broken link to the first document with a link to the second document, wherein the act of determining whether or not there exists a second document is a near-duplicate of the first document is performed by;
i) for each of the documents, generating at least two fingerprints by A) extracting parts from the document, B) hashing each of the extracted parts to generate a hash value for each of the extracted parts, C) populating at least two lists with the extracted parts based on their respective hash values, and D) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
ii) determining whether or not the two documents are near-duplicate documents based on their fingerprints.
-
-
26. An apparatus for determining whether two documents are near-duplicates, the apparatus comprising:
-
a) a fingerprint generator for generating at least two fingerprints for each of the two documents, the fingerprint generator including i) an extractor for extracting parts from the document, ii) a hashing facility for hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) a list population facility for populating at least two lists with the extracted parts based on their respective hash values, and iv) means for determining, for each of the predetermined number of lists, a fingerprint based on the contents of the list; and
b) a comparison facility for determining whether or not the two documents are near-duplicate documents based on their fingerprints.
-
-
27. An improved crawling facility, for reducing processing and bandwidth used, the crawling facility comprising:
-
a) a fingerprint generator for generating, for each of the documents, at least two fingerprints, the fingerprint generator including i) an extractor for extracting parts from the document, ii) a hashing facility for hashing each of the extracted parts to generate a hash value for each of the extracted parts, iii) a list population facility for populating at least two lists with the extracted parts based on their respective hash values, and iv) means for determining, for each of the predetermined number of lists, a fingerprint based on the contents of the list;
b) a comparison facility for determining whether or not the two documents are near-duplicate documents based on their fingerprints; and
c) a document processor, wherein if it is determined that the two documents are near-duplicates, then the document processor indicates that one of the two documents is not to be processed during a subsequent crawl.
-
-
28. A machine-readable medium having stored thereon machine-executable instructions which, when executed by a machine:
-
a) extract parts from a document, ii) hash each of the extracted parts to generate a hash value for each of the extracted parts, iii) populate a predetermined number of lists with the extracted parts based on their respective hash values, and iv) for each of the predetermined number of lists, determine a fingerprint based on the contents of the list.
-
-
29. A method for generating at least two fingerprints for a document comprising:
-
a) extracting parts from the document;
b) hashing each of the extracted parts to generate a hash value for each of the extracted parts;
c) populating a predetermined number of lists with the extracted parts based on their respective hash values; and
d) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list. - View Dependent Claims (30, 31)
wherein each of the extracted parts can be contained in none of the lists, one of the lists, or more of the lists based on the hash functions for the lists. -
31. The method of claim 30 wherein for each hash function is dynamically adjusted such that the probability that the hash function will populate its associated list with a part decreases as the size of the document increases.
-
-
32. A method comprising:
-
a) determining whether there exists a second document that is a near-duplicate of a first document; and
b) indexing the first document but not the second document, wherein the act of determining whether or not there exists a second document is a near-duplicate of the first document is performed by;
i) for each of the documents, generating at least two fingerprints by A) extracting parts from the document, B) hashing each of the extracted parts to generate a hash value for each of the extracted parts, C) populating at least two lists with the extracted parts based on their respective hash values, and D) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
ii) determining whether or not the two documents are near-duplicate documents based on their fingerprints.
-
-
33. A method for determining whether two objects are near-duplicates, the method comprising:
-
a) for each of the two objects, generating at least two fingerprints by i) extracting features from the object, ii) hashing each of the extracted features to generate a hash value for each of the extracted features, iii) populating at least two lists with the extracted features based on their respective hash values, and iv) for each of the predetermined number of lists, determining a fingerprint based on the contents of the list; and
b) determining whether or not the two objects are near-duplicates based on their fingerprints. - View Dependent Claims (34, 35, 36)
wherein the extracted features define context vectors. -
35. The method of claim 33 wherein each of the two objects is a word, and
wherein, in each case, the extracted features are words that frequently occur in close proximity to the word. -
36. The method of claim 33 wherein the two objects are words, and
wherein if the two objects are determined to be near duplicates, then determining the two words to be synonyms.
-
-
37. A method for determining whether a first document and a second document in a collection of documents are near-duplicates, the method comprising:
-
a) for each of the documents in the collection of documents, generating at least two fingerprints; and
b) concluding that the first and second documents are near-duplicates if any one of the at least two fingerprints of the first document matches any one of the at least two fingerprints of the second document, wherein documents in the collection of documents without any common fingerprints are not checked to determine whether or not they are near duplicates. - View Dependent Claims (38)
a2) for each of the documents in the collection of documents, generating a document-fingerprint pair for each of the at least two fingerprints; and
a3) sorting the fingerprint-document pairs based on values of the fingerprints.
-
Specification