System and method for detecting content similarity within email documents by sparse subset hashing
First Claim
1. A method for detecting content similarity in email documents comprising:
generating a token value for each of a plurality of character sequences of a first email document;
generating a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
selecting a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
selecting a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
generating a first hash value corresponding to the selected first subset of character sequences;
generating a second hash value corresponding to the selected second subset of character sequences; and
comparing the first and second hash values with one or more hash values generated from a second email document.
Abstract
Systems and methods for detecting content similarity in email documents are disclosed. In one embodiment, a method comprises generating a first token value for each of a plurality of character sequences of a first email document, selecting a first subset of the plurality of character sequences based on the first token values, and generating one or more hash values corresponding to the selected first subset of character sequences. The method further comprises generating a second token value for each of a plurality of character sequences of a second email document, selecting a second subset of the plurality of character sequences based on the second token values, and generating one or more hash values corresponding to the selected second subset of character sequences. The method additionally comprises comparing the one or more hash values corresponding to the selected first subset with the one or more hash values corresponding to the selected second subset.
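The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the 5-character sliding window, the use of CRC-32 as the token value, the modulus of 8, the chosen remainders 0 and 1, SHA-1 as the subset hash, and all function names are assumptions made for illustration only.

```python
import hashlib
import zlib


def char_sequences(text, n=5):
    """Overlapping n-character sequences (assumed tokenization)."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def token_value(seq):
    """Token value for one character sequence (CRC-32 is an assumed choice)."""
    return zlib.crc32(seq.encode("utf-8"))


def subset_hashes(text, modulus=8, remainders=(0, 1), n=5):
    """Return {remainder: hash of the subset sharing that remainder}.

    A subset with no members maps to None so that two empty subsets
    are never treated as a match.
    """
    buckets = {r: [] for r in remainders}
    for seq in char_sequences(text, n):
        r = token_value(seq) % modulus  # remainder after dividing by the integer
        if r in buckets:
            buckets[r].append(seq)

    hashes = {}
    for r, seqs in buckets.items():
        if seqs:
            joined = "\x00".join(seqs).encode("utf-8")
            hashes[r] = hashlib.sha1(joined).hexdigest()
        else:
            hashes[r] = None
    return hashes


def shares_hash(doc_a, doc_b, **kwargs):
    """True if any non-empty subset hash of doc_a also occurs in doc_b."""
    a = {h for h in subset_hashes(doc_a, **kwargs).values() if h is not None}
    b = {h for h in subset_hashes(doc_b, **kwargs).values() if h is not None}
    return bool(a & b)


if __name__ == "__main__":
    first = "Limited time offer: click here to claim your free prize today!"
    second = "Limited time offer: click here to claim your free prize today!!"
    print(subset_hashes(first))
    print(shares_hash(first, second))
```

In this sketch each hash covers only the character sequences in one remainder class, so an edit leaves a given hash unchanged whenever none of the sequences it removes or introduces fall into that class. Computing two hashes over different remainder classes gives two independent chances for a match against the second document.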
19 Claims
1. A method for detecting content similarity in email documents comprising:
generating a token value for each of a plurality of character sequences of a first email document;
generating a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
selecting a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
selecting a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
generating a first hash value corresponding to the selected first subset of character sequences;
generating a second hash value corresponding to the selected second subset of character sequences; and
comparing the first and second hash values with one or more hash values generated from a second email document.
Dependent Claims (2, 3, 4)
5. A non-transitory computer-readable medium storing program instructions that are computer-executable to:
generate a token value for each of a plurality of character sequences of a first email document;
generate a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
select a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
select a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
generate a first hash value corresponding to the selected first subset of character sequences;
generate a second hash value corresponding to the selected second subset of character sequences; and
compare the first and second hash values with one or more hash values generated from a second email document.
Dependent Claims (6, 7, 8, 9, 10, 11, 12)
13. A system for detecting content similarity in email documents comprising:
one or more processors; and
memory storing program instructions that are executable by the one or more processors to:
generate a token value for each of a plurality of character sequences of a first email document;
generate a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
select a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
select a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
generate a first hash value corresponding to the selected first subset of character sequences;
generate a second hash value corresponding to the selected second subset of character sequences; and
compare the first and second hash values with one or more hash values generated from a second email document.
Dependent Claims (14, 15, 16, 17, 18, 19)
Specification