Detecting duplicate and near-duplicate files
First Claim
1. A computer-implemented method comprising:
crawling documents accessible on a network to identify a set of documents, each document in the set of documents comprising a set of token sequence bit strings;
processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
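The steps recited in the first claim can be sketched in code. What follows is a minimal illustration under stated assumptions, not the patented implementation. The function names, the pairwise comparison of every document, the reading of each "threshold value" as a minimum similarity score, and the choice to drop the later document of each pair are all assumptions made for illustration. The two toy similarity functions (`bigram_jaccard`, `count_cosine`) are simple stand-ins that merely have the properties the claim names; the claim does not specify them.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]
Similarity = Callable[[str, str], float]


def bigram_jaccard(a: str, b: str) -> float:
    """Stand-in first technique: order dependent, frequency independent."""
    def bigrams(text: str) -> Set[Tuple[str, str]]:
        words = text.split()
        return set(zip(words, words[1:]))  # a set ignores repeat counts
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def count_cosine(a: str, b: str) -> float:
    """Stand-in second technique: order independent, frequency dependent."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def pairs_at_or_above(docs: Dict[str, str], ids: Iterable[str],
                      sim: Similarity, threshold: float) -> Set[Pair]:
    """All id pairs (smaller id first) whose similarity meets the threshold."""
    return {(x, y) for x, y in combinations(sorted(ids), 2)
            if sim(docs[x], docs[y]) >= threshold}


def final_near_duplicates(docs: Dict[str, str],
                          sim_first: Similarity, first_cutoff: float,
                          sim_second: Similarity,
                          low_threshold: float,
                          high_threshold: float) -> Set[Pair]:
    if not high_threshold > low_threshold:
        raise ValueError("second threshold must exceed the first")
    # First set: first technique over the whole document set.
    first = pairs_at_or_above(docs, docs, sim_first, first_cutoff)
    # Second set: second technique, lower threshold, applied only to the first set.
    second = {(x, y) for x, y in first
              if sim_second(docs[x], docs[y]) >= low_threshold}
    # Third set: second technique, higher threshold, over the whole document set.
    third = pairs_at_or_above(docs, docs, sim_second, high_threshold)
    # Final set: union of the second and third sets.
    return second | third


def documents_to_index(docs: Dict[str, str], final: Set[Pair]) -> List[str]:
    """Assumption: keep the first document of each pair and drop the other."""
    removed = {later for _, later in final}
    return sorted(set(docs) - removed)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "dog lazy the over jumps fox brown quick the",
        "d": "completely unrelated text about patents",
    }
    final = final_near_duplicates(docs, bigram_jaccard, 0.5,
                                  count_cosine, 0.8, 0.95)
    print(sorted(final), documents_to_index(docs, final))
```

In the toy data, documents "a" and "b" differ by one word, so they pass the first technique and are confirmed at the lower threshold. Document "c" is "a" with its words reversed, so only the order-independent technique at the higher threshold catches it. Comparing every pair is quadratic and is used here only to keep the sketch short.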
Abstract
Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
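The two kinds of representation described in the abstract can be illustrated with a short sketch. It assumes, purely for illustration, a shingle-based sketch that keeps only the smallest hashes (a subset of the tokens, compared by set intersection) for the first technique, and a sign-of-weighted-sum bit vector (random projections over all tokens) for the second. The function names, the use of BLAKE2b as the hash, and the parameters `k`, `m`, and `n_bits` are hypothetical choices, not taken from the patent.

```python
import hashlib
from collections import Counter
from typing import List, Set


def _hash64(text: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingle_sketch(tokens: List[str], k: int = 3, m: int = 8) -> Set[int]:
    """Hash every run of k consecutive tokens and keep the m smallest hashes.

    Order dependent (shingles are sequences) and frequency independent
    (repeated shingles collapse in the set).
    """
    shingles = {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    return set(sorted(_hash64(s) for s in shingles)[:m])


def sketch_resemblance(a: Set[int], b: Set[int], m: int = 8) -> float:
    """Set-intersection estimate: share of the m smallest hashes of the
    union that appear in both sketches."""
    smallest = sorted(a | b)[:m]
    if not smallest:
        return 0.0
    return sum(1 for h in smallest if h in a and h in b) / len(smallest)


def projection_bits(tokens: List[str], n_bits: int = 64) -> int:
    """Project the token-frequency vector onto n_bits pseudo-random +/-1
    directions and keep one sign bit per direction (n_bits <= 64 here).

    Order independent (only counts matter) and frequency dependent
    (each token contributes in proportion to its count).
    """
    totals = [0] * n_bits
    for token, count in Counter(tokens).items():
        h = _hash64(token)
        for i in range(n_bits):
            totals[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def bit_agreement(x: int, y: int, n_bits: int = 64) -> float:
    """Fraction of bit positions on which two projections agree."""
    return 1.0 - bin(x ^ y).count("1") / n_bits


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    shuffled = list(reversed(words))
    print(sketch_resemblance(shingle_sketch(words), shingle_sketch(shuffled)))
    print(bit_agreement(projection_bits(words), projection_bits(shuffled)))
```

Reversing the word order changes every shingle, so the first representation sees the two texts as unrelated, while the second representation is unchanged because the word counts are the same.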
63 Claims
1. A computer-implemented method comprising:
crawling documents accessible on a network to identify a set of documents, each document in the set of documents comprising a set of token sequence bit strings;
processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
Dependent claims: 2 to 21.
22. A non-transitory machine readable medium having stored thereon machine-executable instructions which, when executed by a machine, cause the machine to perform operations comprising:
crawling documents accessible on a network to identify a set of documents, each document comprising a set of token sequence bit strings;
processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
removing a final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
Dependent claims: 23 to 42 and 50 to 53.
43. A system comprising:
one or more computers programmed to perform operations comprising:
crawling documents accessible on a network to identify a set of documents, each document comprising a set of token sequence bit strings;
processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique, wherein the first document similarity technique is token order dependent and token frequency independent;
processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique, wherein the second document similarity technique is token order independent and token frequency dependent, and the second set of near-duplicate documents are determined based on a first threshold value;
processing the set of documents to identify a third set of near-duplicate documents using the second document similarity technique, and wherein the third set of near-duplicate documents are identified based on a second threshold value greater than the first threshold value; and
removing the final set of near-duplicate documents from the set of documents, and then indexing any remaining documents in the set of documents, wherein the final set of near-duplicate documents is a union of the second set of near-duplicate documents and the third set of near-duplicate documents.
Dependent claims: 44 to 49 and 54 to 63.
Specification