System and method for detecting content similarity within email documents by sparse subset hashing

US 8,275,842 B2
Filed: 03/31/2008
Issued: 09/25/2012
Est. Priority Date: 09/30/2007
Status: Expired due to Fees

2 Assignments

0 Petitions

Abstract

Systems and methods for detecting content similarity in email documents are disclosed. In one embodiment, a method comprises generating a first token value for each of a plurality of character sequences of a first email document, selecting a first subset of the plurality of character sequences based on the first token values, and generating one or more hash values corresponding to the selected first subset of character sequences. The method further comprises generating a second token value for each of a plurality of character sequences of a second email document, selecting a second subset of the plurality of character sequences based on the second token values, and generating one or more hash values corresponding to the selected second subset of character sequences. The method additionally comprises comparing the one or more hash values corresponding to the selected first subset with the one or more hash values corresponding to the selected second subset.

17 Citations

19 Claims

1. A method for detecting content similarity in email documents comprising:
- generating a token value for each of a plurality of character sequences of a first email document;
  
  generating a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
  
  selecting a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
  
  selecting a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
  
  generating a first hash value corresponding to the selected first subset of character sequences;
  
  generating a second hash value corresponding to the selected second subset of character sequences; and
  
  comparing the first and second hash values with one or more hash values generated from a second email document.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1 further comprising disregarding predetermined content from the first email document when selecting the first and second subsets.
  - 3. The method of claim 2, wherein the predetermined content includes email header information.
  - 4. The method of claim 1, further comprising generating a similarity indication in response to the comparing.
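
The steps of claim 1, together with the header exclusion of claims 2 and 3, can be sketched in Python as follows. This is one illustrative reading, not the patented implementation. The modulus of 8, the choice of remainders 0 and 1, whitespace-delimited words as the character sequences, the sum-of-ordinals token function, SHA-1 over the space-joined subset, and all function names are assumptions introduced here.

```python
import hashlib

MODULUS = 8  # assumed "predetermined integer value"


def strip_headers(raw_email: str) -> str:
    """Drop everything up to the first blank line (assumed header/body split)."""
    _, sep, body = raw_email.partition("\n\n")
    return body if sep else raw_email


def token_value(seq: str) -> int:
    """Assumed token function: sum of the ordinal values of the characters."""
    return sum(ord(ch) for ch in seq)


def subset_hashes(raw_email: str, remainders=(0, 1)) -> dict[int, str]:
    """Build one subset of words per selected remainder and hash each subset."""
    words = strip_headers(raw_email).split()
    hashes = {}
    for r in remainders:
        # All words in a subset share the same remainder value r.
        subset = [w for w in words if token_value(w) % MODULUS == r]
        hashes[r] = hashlib.sha1(" ".join(subset).encode("utf-8")).hexdigest()
    return hashes


# Two hypothetical messages with different headers but the same body.
first = "Subject: offer\nFrom: a@example.com\n\nbuy cheap pills today and save"
second = "Subject: re: offer\nFrom: b@example.com\n\nbuy cheap pills today and save"

h1 = subset_hashes(first)
h2 = subset_hashes(second)
matches = [r for r in h1 if h1[r] == h2[r]]
print(matches)  # prints [0, 1] because the two bodies are identical
```

Because each hash covers only the words that fall into one remainder class, a change to a word in some other class leaves that hash unchanged in this sketch.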

5. A non-transitory computer-readable medium storing program instructions that are computer-executable to:
- generate a token value for each of a plurality of character sequences of a first email document;
  
  generate a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
  
  select a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
  
  select a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
  
  generate a first hash value corresponding to the selected first subset of character sequences;
  
  generate a second hash value corresponding to the selected second subset of character sequences; and
  
  compare the first and second hash values with one or more hash values generated from a second email document.
- Dependent claims (6, 7, 8, 9, 10, 11, 12):
  - 6. The non-transitory computer-readable medium of claim 5, wherein the program instructions are further executable to generate a similarity indication in response to comparing the first and second hash values with the one or more hash values generated from the second email document.
  - 7. The non-transitory computer-readable medium of claim 5, wherein the program instructions are executable to generate a similarity indication in response to a predetermined ratio of hash value matches to hash value mismatches.
  - 8. The non-transitory computer-readable medium of claim 5, wherein the program instructions are executable to generate a similarity indication in response to a user-specified threshold level of matching hash values.
  - 9. The non-transitory computer-readable medium of claim 5, wherein the first and second hash values are generated using an MD5 or SHA-1 hashing algorithm.
  - 10. The non-transitory computer-readable medium of claim 5, wherein the plurality of character sequences includes words, and wherein the token values includes token values that correspond to each of the words.
  - 11. The non-transitory computer-readable medium of claim 5, wherein a given one of the token values is generated based on ASCII ordinal values of each character in a character sequence.
  - 12. The non-transitory computer-readable medium of claim 5, wherein a given one of the token values is generated based on character positions of each character in a character sequence.
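
Claims 9, 11 and 12 name concrete building blocks: MD5 or SHA-1 for the hash values, and token values derived from ASCII ordinal values or from character positions. The claims do not give exact formulas, so the sketch below is only one possible reading. The plain sum of ordinals, the position-weighted sum, the space-joined subset encoding, and the function names are all assumptions made for illustration.

```python
import hashlib


def token_from_ordinals(seq: str) -> int:
    """Claim 11 reading (assumed formula): sum of the ordinal values."""
    return sum(ord(ch) for ch in seq)


def token_from_positions(seq: str) -> int:
    """Claim 12 reading (assumed formula): ordinals weighted by 1-based position."""
    return sum(pos * ord(ch) for pos, ch in enumerate(seq, start=1))


def hash_subset(words: list[str], algorithm: str = "sha1") -> str:
    """Claim 9: digest a subset with MD5 or SHA-1 (space-joining is an assumption)."""
    if algorithm not in ("md5", "sha1"):
        raise ValueError("expected 'md5' or 'sha1'")
    return hashlib.new(algorithm, " ".join(words).encode("utf-8")).hexdigest()


# The plain sum gives anagrams the same token; the positional variant does not.
print(token_from_ordinals("evil"), token_from_ordinals("vile"))
print(token_from_positions("evil"), token_from_positions("vile"))
print(hash_subset(["buy", "cheap"], "md5"))
```

Under these assumed formulas, the choice of token function changes which words land in the same remainder class, and therefore which words are hashed together.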

13. A system for detecting content similarity in email documents comprising:
- one or more processors; and
  
  memory storing program instructions that are executable by the one or more processors to:
  
  generate a token value for each of a plurality of character sequences of a first email document;
  
  generate a remainder value for each of the plurality of character sequences of the first email document by dividing each of the token values by a predetermined integer value;
  
  select a first subset of the plurality of character sequences of the first email document, wherein the first subset includes character sequences that have the same remainder values;
  
  select a second subset of the plurality of character sequences of the first email document, wherein the second subset includes character sequences that have the same remainder values, and wherein the remainder values of the second subset are different than the remainder values of the first subset;
  
  generate a first hash value corresponding to the selected first subset of character sequences;
  
  generate a second hash value corresponding to the selected second subset of character sequences; and
  
  compare the first and second hash values with one or more hash values generated from a second email document.
- Dependent claims (14, 15, 16, 17, 18, 19):
  - 14. The system of claim 13, wherein the program instructions are executable to disregard predetermined content from the first email document when selecting the first and second subsets.
  - 15. The system of claim 14, wherein the predetermined content includes email header information.
  - 16. The system of claim 13, wherein the program instructions are further executable to generate a similarity indication in response to the first and second hash values matching one or more hash values generated from the second email document.
  - 17. The system of claim 13, wherein the program instructions are further executable to generate a similarity indication in response to a predetermined ratio of hash value matches to hash value mismatches.
  - 18. The system of claim 13 wherein the program instructions are further executable to generate a similarity indication in response to a user-specified threshold level of matching hash values.
  - 19. The system of claim 13, wherein the program instructions are further executable to generate a similarity indication in response to the first and second hash values not matching one or more hash values generated from the second email document.
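
The similarity indications of claims 7, 8, 17 and 18 (a predetermined ratio of matches to mismatches, or a user-specified threshold of matching hash values) might be computed along the following lines. The function name, the parameter names, and the handling of the zero-mismatch case are assumptions; the claims do not prescribe them.

```python
def similarity_indication(first_hashes, second_hashes,
                          min_ratio=None, min_matches=None):
    """Return True when the hash comparison meets the configured criterion."""
    second = set(second_hashes)
    matches = sum(1 for h in first_hashes if h in second)
    mismatches = len(first_hashes) - matches

    if min_ratio is not None:
        # Ratio of matches to mismatches; with no mismatches, any match qualifies.
        if mismatches == 0:
            return matches > 0
        return matches / mismatches >= min_ratio
    if min_matches is not None:
        # User-specified threshold on the number of matching hash values.
        return matches >= min_matches
    # Default (assumed): a single matching hash value is enough.
    return matches > 0


# Hypothetical hash values: three of the four subset hashes agree.
a = ["h1", "h2", "h3", "h4"]
b = ["h1", "h2", "h3", "zz"]
print(similarity_indication(a, b, min_ratio=2.0))   # 3 matches : 1 mismatch
print(similarity_indication(a, b, min_matches=4))   # only 3 matches
```

A ratio criterion scales with the number of subsets hashed, while an absolute threshold does not, which is one reason a caller might prefer one over the other.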

Current Assignee: CA, Inc. (d/b/a CA Technologies) (Broadcom, Inc.)
Original Assignee: Symantec Operating Corporation (NortonLifeLock Inc.)
Inventors: Ngan, Tsuen Wan
Primary Examiner(s): Moore, Ian N
Assistant Examiner(s): Zuniga, Jackie
Application Number: US12/059,148
Publication Number: US 20090089384A1
Time in Patent Office: 1,639 Days
Field of Search: 709/204-207
US Class Current: 709/206
CPC Class Codes: G06F 15/16 Combinations of two or more...
