System and method for detecting content similarity within email documents by sparse subset hashing

  • US 8,275,842 B2
  • Filed: 03/31/2008
  • Issued: 09/25/2012
  • Est. Priority Date: 09/30/2007
  • Status: Expired due to Fees
First Claim

1. A method for detecting content similarity in email documents comprising:

    generating a token value for each of a plurality of character sequences of a first email document;

    generating a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;

    selecting a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;

    selecting a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;

    generating a first hash value corresponding to the selected first subset of character sequences;

    generating a second hash value corresponding to the selected second subset of character sequences; and

    comparing the first and second hash values with one or more hash values generated from a second email document.
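
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify how character sequences are delimited, how token values are computed, or which hash function is used. The sketch assumes whitespace-delimited lowercase words as the character sequences, a truncated SHA-1 digest as the token value, SHA-1 over the joined subset as the subset hash, and illustrative defaults for the predetermined integer (`modulus`) and the chosen remainder values (`remainders`). The function names `character_sequences`, `token_value`, `subset_hashes`, and `matching_hashes` are hypothetical.

```python
import hashlib
import re


def character_sequences(text):
    # Assumption: a "character sequence" is a whitespace-delimited, lowercased word.
    return re.findall(r"\S+", text.lower())


def token_value(sequence):
    # Assumption: the token value is the first 8 bytes of a SHA-1 digest, as an integer.
    digest = hashlib.sha1(sequence.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def subset_hashes(text, modulus=8, remainders=(0, 1)):
    """Return {remainder: hash} for each selected, non-empty subset of sequences."""
    buckets = {r: [] for r in remainders}
    for seq in character_sequences(text):
        # Remainder of the token value divided by the predetermined integer.
        r = token_value(seq) % modulus
        if r in buckets:
            buckets[r].append(seq)  # sequences sharing a remainder form one subset
    hashes = {}
    for r, subset in buckets.items():
        if subset:
            joined = "\x00".join(subset).encode("utf-8")
            hashes[r] = hashlib.sha1(joined).hexdigest()
    return hashes


def matching_hashes(first_doc, second_doc, modulus=8, remainders=(0, 1)):
    # Compare the subset hashes of the first document with those of the second.
    first = set(subset_hashes(first_doc, modulus, remainders).values())
    second = set(subset_hashes(second_doc, modulus, remainders).values())
    return first & second
```

Under these assumptions, a non-empty result from `matching_hashes` means at least one sparse subset of the two documents is identical. An edit that only touches sequences with one remainder value changes that subset's hash but leaves the hash of the other subset unchanged, so the two documents can still be flagged as similar.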

  • 2 Assignments