SYSTEM AND METHOD FOR DETECTING CONTENT SIMILARITY WITHIN EMAILS DOCUMENTS EMPLOYING SELECTIVE TRUNCATION

US 20090089383A1
Filed: 03/31/2008
Published: 04/02/2009
Est. Priority Date: 09/30/2007
Status: Abandoned Application

Abstract

A system and a method for detecting content similarities in different emails employing selective truncation are disclosed. In one embodiment, a method comprises generating a first token value dependent on a first subset of characters at a beginning portion of a first email document, generating a second token value dependent on a second subset of characters at an ending portion of the first email document, and, depending upon the first and second token values, selectively generating one or more hash values corresponding to a sequence of characters between the first subset and the second subset. The method further comprises generating a third token value dependent on a third subset of characters at a beginning portion of a second email document, generating a fourth token value dependent on a fourth subset of characters at an ending portion of the second email document, and, depending upon the third and fourth token values, selectively generating one or more hash values corresponding to a sequence of characters between the third subset and the fourth subset. Finally, the method comprises comparing the one or more hash values corresponding to the sequence of characters between the first subset and the second subset with the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
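
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the token function (a sum of character code points), the selection rule (`token % 4 == 0`), the fixed window length, and the names `token_value`, `selective_hash`, and `similar` are all assumptions introduced here for illustration.

```python
import hashlib
from typing import Callable, Optional


def token_value(chars: str) -> int:
    """Assumed token function: sum of the characters' code points."""
    return sum(ord(c) for c in chars)


def selective_hash(body: str, window: int,
                   select: Callable[[int], bool]) -> Optional[str]:
    """Hash the characters between the leading and trailing subsets,
    but only when both token values pass the (assumed) selection rule."""
    if len(body) <= 2 * window:
        return None  # nothing lies between the two subsets
    first_token = token_value(body[:window])    # beginning portion
    second_token = token_value(body[-window:])  # ending portion
    if not (select(first_token) and select(second_token)):
        return None  # hash generation is skipped for this email
    middle = body[window:-window]
    return hashlib.sha1(middle.encode("utf-8")).hexdigest()


def similar(a: str, b: str, window: int = 4,
            select: Callable[[int], bool] = lambda t: t % 4 == 0) -> bool:
    """Compare the selectively generated hashes of two email bodies."""
    hash_a = selective_hash(a, window, select)
    hash_b = selective_hash(b, window, select)
    return hash_a is not None and hash_a == hash_b


if __name__ == "__main__":
    first = "AAAA" + "shared middle text" + "BBBB"
    second = "CCCC" + "shared middle text" + "DDDD"
    print(similar(first, second))
```

In this sketch the two messages differ at both ends, but the text between the truncated subsets is identical, so the hashes match. A production system would presumably evaluate many candidate subsets rather than a single fixed window at each end.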

20 Claims

1. A method, comprising:
- generating a first token value dependent on a first subset of characters at a beginning portion of a first email document;
  
  generating a second token value dependent on a second subset of characters at an ending portion of the first email document;
  
  depending upon the first and second token values, selectively generating one or more hash values corresponding to a sequence of characters between the first subset and the second subset;
  
  generating a third token value dependent on a third subset of characters at a beginning portion of a second email document;
  
  generating a fourth token value dependent on a fourth subset of characters at an ending portion of the second email document;
  
  depending upon the third and fourth token values, selectively generating one or more hash values corresponding to a sequence of characters between the third subset and the fourth subset; and
  
  comparing the one or more hash values corresponding to the sequence of characters between the first subset and the second subset with the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 2. The method of claim 1, further comprising:
    - iteratively generating a token value corresponding to each of a plurality of additional subsets of characters at the beginning portion of the first email document; and
      
      selecting a plurality of truncation positions at the beginning portion of the first email document depending upon the token values.
  - 3. The method of claim 1, further comprising:
    - iteratively generating a token value corresponding to each of a plurality of additional subsets of characters at the ending portion of the first email document; and
      
      selecting a plurality of truncation positions at the ending portion of the first email document depending upon the token values.
  - 4. The method of claim 2, further comprising generating a plurality of hash values wherein each hash value is generated based on a corresponding sequence of characters between a respective one of the plurality of truncation positions at the beginning portion of the first email document and a respective one at the end portion of the first email document.
  - 5. The method of claim 1, further comprising generating a similarity indication in response to the comparing.
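
Claims 2 through 5 can be read as the following sketch, offered under stated assumptions rather than as the claimed implementation. A window slides over the first and last `region` characters, each window yields a token value, and positions whose token passes a selection rule become truncation positions. One hash is then generated per (start, end) pair. The sum-of-code-points token, the `token % 4 == 0` rule, and every function and parameter name here are hypothetical.

```python
import hashlib
from itertools import product
from typing import Callable, List, Set


def token_value(chars: str) -> int:
    """Assumed token function: sum of the characters' code points."""
    return sum(ord(c) for c in chars)


def truncation_positions(text: str, window: int, region: int,
                         select: Callable[[int], bool],
                         from_end: bool = False) -> List[int]:
    """Iterate over subsets in the first (or last) `region` offsets and
    collect a truncation position for each subset whose token is selected."""
    positions = []
    n = len(text)
    limit = max(min(region, n - window + 1), 0)
    for offset in range(limit):
        if from_end:
            end = n - offset
            chunk = text[end - window:end]
            position = end - window     # truncate just before this subset
        else:
            chunk = text[offset:offset + window]
            position = offset + window  # truncate just after this subset
        if select(token_value(chunk)):
            positions.append(position)
    return positions


def hash_set(text: str, window: int = 4, region: int = 16,
             select: Callable[[int], bool] = lambda t: t % 4 == 0) -> Set[str]:
    """One hash per pair of (beginning, ending) truncation positions."""
    starts = truncation_positions(text, window, region, select)
    ends = truncation_positions(text, window, region, select, from_end=True)
    return {
        hashlib.sha1(text[s:e].encode("utf-8")).hexdigest()
        for s, e in product(starts, ends)
        if s < e
    }


def similarity_indication(a: str, b: str, window: int = 4, region: int = 16,
                          select: Callable[[int], bool] = lambda t: t % 4 == 0
                          ) -> bool:
    """Indicate similarity when the two hash sets share at least one value."""
    return bool(hash_set(a, window, region, select)
                & hash_set(b, window, region, select))
```

Generating several hashes per email lets two messages match even when one of them carries extra leading or trailing text, since some pair of truncation positions may still enclose the same character sequence in both.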

6. A computer-readable memory medium, storing program instructions that are computer-executable to:
- generate a first token value dependent on a first subset of characters at a beginning portion of a first email document;
  
  generate a second token value dependent on a second subset of characters at an ending portion of the first email document;
  
  depending upon the first and second token values, selectively generate one or more hash values corresponding to a sequence of characters between the first subset and the second subset;
  
  generate a third token value dependent on a third subset of characters at a beginning portion of a second email document;
  
  generate a fourth token value dependent on a fourth subset of characters at an ending portion of the second email document;
  
  depending upon the third and fourth token values, selectively generate one or more hash values corresponding to a sequence of characters between the third subset and the fourth subset; and
  
  compare the one or more hash values corresponding to the sequence of characters between the first subset and the second subset with the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 7. The computer-readable memory medium of claim 6, wherein the program instructions are further executable to generate a similarity indication in response to comparing the one or more hash values corresponding to the sequence of characters between the first subset and the second subset with the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 8. The computer-readable memory medium of claim 6, wherein the program instructions are further executable to:
    - iteratively generate a token value corresponding to each of a plurality of additional subsets of characters at the beginning portion of the first email document; and
      
      select a plurality of truncation positions at the beginning portion of the first email document depending upon the token values.
  - 9. The computer-readable memory medium of claim 8, wherein the program instructions are further executable to generate a plurality of hash values, wherein each hash value is generated based on a corresponding sequence of characters between a respective one of the plurality of truncation positions at the beginning portion of the first email document and a respective one at the end portion of the first email document.
  - 10. The computer-readable memory medium of claim 6, wherein one or more of the generated hash values are generated using an MD5 or SHA-1 hashing algorithm.
  - 11. The computer-readable memory medium of claim 6, wherein the first and second subsets of characters include words and wherein the first and second token values are generated based on one or more of the words.
  - 12. The computer-readable memory medium of claim 6, wherein the token values are generated based on ASCII ordinal values of each character in a subset of characters.
  - 13. The computer-readable memory medium of claim 6, wherein the token values are generated based on character positions of each character in a subset of characters.
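
Claims 10 through 13 name several ways to derive token values and two hashing algorithms. The sketch below shows one possible reading of each. The specific formulas (a plain sum of ordinals, a position-weighted sum, a token taken from the first word) and all function names are assumptions, since the claims do not fix them. Only `hashlib.new` with the `"md5"` and `"sha1"` algorithm names comes from the Python standard library.

```python
import hashlib


def ordinal_token(chars: str) -> int:
    """Token based on the ordinal value of each character (cf. claim 12)."""
    return sum(ord(c) for c in chars)


def positional_token(chars: str) -> int:
    """Token that also depends on each character's position (cf. claim 13).
    Assumed formula: each ordinal weighted by its 1-based position."""
    return sum(i * ord(c) for i, c in enumerate(chars, start=1))


def word_token(chars: str) -> int:
    """Token based on a word in the subset (cf. claim 11).
    Assumed rule: use the first whitespace-delimited word."""
    words = chars.split()
    return ordinal_token(words[0]) if words else 0


def content_hash(segment: str, algorithm: str = "sha1") -> str:
    """Hash a character sequence with MD5 or SHA-1 (cf. claim 10)."""
    if algorithm not in ("md5", "sha1"):
        raise ValueError("expected 'md5' or 'sha1'")
    return hashlib.new(algorithm, segment.encode("utf-8")).hexdigest()
```

The plain ordinal sum gives the same token for `"ab"` and `"ba"`, while the position-weighted variant distinguishes them, which is one reason a position-dependent token might be preferred.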

14. A system, comprising:
- one or more processors; and
  
  memory storing program instructions that are executable by the one or more processors to:
  
  generate a first token value dependent on a first subset of characters at a beginning portion of a first email document;
  
  generate a second token value dependent on a second subset of characters at an ending portion of the first email document;
  
  depending upon the first and second token values, selectively generate one or more hash values corresponding to a sequence of characters between the first subset and the second subset;
  
  generate a third token value dependent on a third subset of characters at a beginning portion of a second email document;
  
  generate a fourth token value dependent on a fourth subset of characters at an ending portion of the second email document;
  
  depending upon the third and fourth token values, selectively generate one or more hash values corresponding to a sequence of characters between the third subset and the fourth subset; and
  
  compare the one or more hash values corresponding to the sequence of characters between the first subset and the second subset with the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 15. The system of claim 14, wherein the program instructions are further executable to disregard predetermined content from the first and second email documents prior to generating the one or more hash values corresponding to the sequence of characters between the first subset and the second subset and the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 16. The system of claim 15, wherein the predetermined content includes email header information.
  - 17. The system of claim 14, wherein the program instructions are further executable to generate a similarity indication in response to comparing the one or more hash values corresponding to the sequence of characters between the first subset and the second subset and the one or more hash values corresponding to the sequence of characters between the third subset and the fourth subset.
  - 18. The system of claim 14, wherein the program instructions are further executable to generate a similarity indication in response to a user-specified minimum number of matching hash values between the first and second email documents.
  - 19. The system of claim 14, wherein the program instructions are further executable to:
    - iteratively generate a token value corresponding to each of a plurality of additional subsets of characters at the beginning portion of the first email document; and
      
      select a plurality of truncation positions at the beginning portion of the first email document depending upon the token values.
  - 20. The system of claim 19, wherein the program instructions are further executable to generate a plurality of hash values, wherein each hash value is generated based on a corresponding sequence of characters between a respective one of the plurality of truncation positions at the beginning portion of the first email document and a respective one at the end portion of the first email document.
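
Claims 15, 16, and 18 add two practical steps around the hashing: disregarding header information before hashing, and requiring a user-specified minimum number of matching hash values. A minimal sketch of both follows. It assumes the header block ends at the first blank line, which is the usual convention for internet mail, and the names `strip_headers`, `similarity_indication`, and `min_matches` are hypothetical.

```python
from typing import Set


def strip_headers(raw_email: str) -> str:
    """Disregard header information by dropping everything up to the first
    blank line. If no blank line is found, the text is assumed to be all body."""
    normalized = raw_email.replace("\r\n", "\n")
    _headers, separator, body = normalized.partition("\n\n")
    return body if separator else normalized


def similarity_indication(hashes_a: Set[str], hashes_b: Set[str],
                          min_matches: int = 1) -> bool:
    """Indicate similarity once the number of matching hash values
    reaches the user-specified minimum."""
    return len(hashes_a & hashes_b) >= min_matches


if __name__ == "__main__":
    raw = "From: a@example.com\r\nSubject: hi\r\n\r\nHello\r\n"
    print(repr(strip_headers(raw)))
    print(similarity_indication({"h1", "h2"}, {"h2", "h3"}, min_matches=1))
```

Raising `min_matches` trades recall for precision: a higher threshold reports fewer pairs as similar but makes a chance match on a single short sequence less likely to trigger an indication.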

Current Assignee: Symantec Corporation (NortonLifeLock Inc.)
Original Assignee: Symantec Corporation (NortonLifeLock Inc.)
Inventors: Ngan, Tsuen Wan
Application Number: US12/059,130
Publication Number: US 20090089383A1
US Class Current: 709/206
CPC Class Codes: G06F 15/16 Combinations of two or more...
