Decreasing the fragility of duplicate document detecting algorithms
First Claim
1. A method used in detecting a duplicate document, the method comprising:
generating at least a first and second lexicon of attributes, the second lexicon being generated by eliminating a random fraction of attributes from the first lexicon independently of a characteristic of the attributes in the first lexicon;
determining unique attributes in a document;
determining a first set of intersection attributes based on an intersection between the unique attributes in the document and the first lexicon;
determining a second set of intersection attributes based on an intersection between the unique attributes in the document and the second lexicon;
calculating a first sub-signature based on the first set of intersection attributes;
calculating a second sub-signature based on the second set of intersection attributes;
setting a signature of the document to include the first and second sub-signatures; and
accessing a second signature associated with a second document, the second signature including a third sub-signature calculated based on an intersection between unique attributes in the second document and the first lexicon, and a fourth sub-signature calculated based on an intersection between unique attributes in the second document and the second lexicon; and
if the first sub-signature matches the third sub-signature or if the second sub-signature matches the fourth sub-signature, determining that the document and the second document are duplicates.
7 Assignments
0 Petitions
Abstract
In a signature-based duplicate detection system, multiple different lexicons are used to generate a signature for a document that comprises multiple sub-signatures. The signature of an e-mail or other document may be defined as the set of signatures generated based on the multiple different lexicons. When a collection of sub-signatures is used as a document's signature, two documents may be considered as being duplicates when a sub-signature generated based on a particular lexicon in the collection for the first document matches a signature generated based on the same lexicon in the collection for the second document.
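The scheme summarized above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names (`derive_lexicon`, `sub_signature`, `signature`, `is_duplicate`), whitespace tokenization, the use of SHA-1 as the hash, and the treatment of an empty intersection as "no sub-signature" are all assumptions made for the example.

```python
import hashlib
import random


def derive_lexicon(base, drop_fraction, seed):
    """Build a smaller lexicon by dropping a random fraction of terms.

    The terms to drop are picked at random, without looking at any
    property of the terms themselves (hypothetical helper).
    """
    rng = random.Random(seed)
    terms = sorted(base)  # sorted so the result depends only on the seed
    keep = len(terms) - int(len(terms) * drop_fraction)
    return frozenset(rng.sample(terms, keep))


def sub_signature(unique_attrs, lexicon):
    """Hash the intersection of a document's unique attributes and one lexicon."""
    common = sorted(unique_attrs & lexicon)
    if not common:
        return None  # assumption: an empty intersection gives no sub-signature
    return hashlib.sha1(" ".join(common).encode("utf-8")).hexdigest()


def signature(text, lexicons):
    """A document signature is the list of its per-lexicon sub-signatures."""
    unique_attrs = set(text.lower().split())  # assumption: attributes are words
    return [sub_signature(unique_attrs, lex) for lex in lexicons]


def is_duplicate(sig_a, sig_b):
    """Duplicates if sub-signatures from the same lexicon match for any lexicon."""
    return any(a is not None and a == b for a, b in zip(sig_a, sig_b))


if __name__ == "__main__":
    first = frozenset({"cheap", "pills", "offer", "free", "now", "click"})
    second = derive_lexicon(first, drop_fraction=0.3, seed=42)
    doc_a = signature("cheap pills offer click now", [first, second])
    doc_b = signature("cheap pills offer click now free", [first, second])
    print(is_duplicate(doc_a, doc_b))
```

The point of the second lexicon is that a word added to one copy of a document changes a sub-signature only if that word is in the corresponding lexicon. If the added word happens to have been dropped from the second lexicon, the second sub-signatures still agree and the two documents are still flagged as duplicates.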
54 Citations
18 Claims
1. A method used in detecting a duplicate document, the method comprising:
generating at least a first and second lexicon of attributes, the second lexicon being generated by eliminating a random fraction of attributes from the first lexicon independently of a characteristic of the attributes in the first lexicon;
determining unique attributes in a document;
determining a first set of intersection attributes based on an intersection between the unique attributes in the document and the first lexicon;
determining a second set of intersection attributes based on an intersection between the unique attributes in the document and the second lexicon;
calculating a first sub-signature based on the first set of intersection attributes;
calculating a second sub-signature based on the second set of intersection attributes;
setting a signature of the document to include the first and second sub-signatures; and
accessing a second signature associated with a second document, the second signature including a third sub-signature calculated based on an intersection between unique attributes in the second document and the first lexicon, and a fourth sub-signature calculated based on an intersection between unique attributes in the second document and the second lexicon; and
if the first sub-signature matches the third sub-signature or if the second sub-signature matches the fourth sub-signature, determining that the document and the second document are duplicates.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-usable medium storing a computer program used in detecting a duplicate document, the computer program comprising instructions for causing a computer to perform the following operations:
generate at least a first and second lexicon of attributes, the second lexicon being generated by eliminating a random fraction of attributes from the first lexicon independently of a characteristic of the attributes in the first lexicon;
determine unique attributes in a document;
determine a first set of intersection attributes based on an intersection between the unique attributes in the document and the first lexicon;
determine a second set of intersection attributes based on an intersection between the unique attributes in the document and the second lexicon;
calculate a first sub-signature based on the first set of intersection attributes;
calculate a second sub-signature based on the second set of intersection attributes;
set a signature of the document to include the first and second sub-signatures; and
access a second signature associated with a second document, the second signature including a third sub-signature calculated based on an intersection between unique attributes in the second document and the first lexicon, and a fourth sub-signature calculated based on an intersection between unique attributes in the second document and the second lexicon; and
if the first sub-signature matches the third sub-signature or if the second sub-signature matches the fourth sub-signature, determine that the document and the second document are duplicates.
Dependent claims: 14, 15, 16, 17
18. An apparatus used in detecting a duplicate document, the apparatus comprising:
means for generating at least a first and second lexicon of attributes, the second lexicon being generated by eliminating a random fraction of attributes from the first lexicon independently of a characteristic of the attributes in the first lexicon;
one or more processing devices and a computer-readable medium coupled to the one or more processing devices, the medium storing instructions which, when executed by the one or more processing devices, cause the one or more processing devices to perform operations comprising:
determining unique attributes in a document;
determining a first set of intersection attributes based on an intersection between the unique attributes in the document and the first lexicon;
determining a second set of intersection attributes based on an intersection between the unique attributes in the document and the second lexicon;
calculating a first sub-signature based on the first set of intersection attributes;
calculating a second sub-signature based on the second set of intersection attributes;
setting a signature of the document to include the first and second sub-signatures; and
accessing a second signature associated with a second document, the second signature including a third sub-signature calculated based on an intersection between unique attributes in the second document and the first lexicon, and a fourth sub-signature calculated based on an intersection between unique attributes in the second document and the second lexicon; and
if the first sub-signature matches the third sub-signature or if the second sub-signature matches the fourth sub-signature, determining that the document and the second document are duplicates.
Specification