Apparatus and method for obfuscation detection within a spam filtering model

US 8,489,689 B1
Filed: 05/31/2006
Issued: 07/16/2013
Est. Priority Date: 05/31/2006
Status: Active Grant

Abstract

A computer-implemented system and method are described for detecting obfuscated words in email messages and using this information to determine whether each email message is spam or valid email (ham). For example, a method according to one embodiment of the invention comprises: providing an obfuscation feature set for detecting obfuscation within email messages, the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words; analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis includes determining the similarity of one or more words in the email message with each of the FOWs; generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is obfuscated; firing one or more of the obfuscation detection features based, at least in part, on the value of the similarity metric; analyzing the email message to detect whether the email contains one or more additional spam features unrelated to obfuscation; and determining whether the email message is spam based on the combined obfuscation detection features and the additional spam features.
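
The abstract describes comparing each word of a message against a list of FOWs and firing a feature when a normalized similarity value reaches a threshold. The text shown here does not say which similarity function is used, so the following is only a minimal sketch that assumes a normalized Levenshtein (edit-distance) similarity. The FOW list, the threshold value, and all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(word: str, fow: str) -> float:
    """Assumed metric: 1 - distance / longest length, so values lie in [0, 1]."""
    longest = max(len(word), len(fow))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(word.lower(), fow.lower()) / longest


def best_fow_match(word, fows):
    """Return the (FOW, similarity) pair with the highest similarity."""
    return max(((f, similarity(word, f)) for f in fows), key=lambda p: p[1])


# Hypothetical FOW list and "first specified threshold"; not taken from the patent.
FOWS = ["viagra", "mortgage", "pharmacy"]
FIRST_THRESHOLD = 0.6


def similarity_feature_fires(word, fows=FOWS, threshold=FIRST_THRESHOLD):
    """Fire when the best similarity is above or equal to the threshold."""
    _, score = best_fow_match(word, fows)
    return score >= threshold
```

Under these assumptions, `similarity("v1agra", "viagra")` is 5/6, so the feature fires for "v1agra" at a threshold of 0.6. An unobfuscated FOW scores 1.0 and fires as well, which matches the claim wording ("similar to one of the FOWs") rather than a stricter "obfuscated but not identical" reading.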

18 Claims

1. A computer-implemented method for detecting obfuscated words in email messages comprising:
- providing an obfuscation feature set for detecting obfuscation within email messages, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words;
  
  analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs;
  
  generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
  
  applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
  
  wherein the analysis for a second subset of features includes one or more of:
  
  (1) determining a number of non-alphanumeric characters in each word;
  
  (2) determining the length of the word;
  
  (3) determining a number of digits in the word excluding the boundaries; and
  
  (4) determining whether the word is found in a dictionary;
  
  applying a second obfuscation detection feature if (1) the number of non-alphanumeric characters is above a second specified threshold;
  
  (2) the length of the word is below a third specified threshold or above a fourth specified threshold;
  
  (3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
  
  (4) the word is not found in a dictionary;
  
  executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation;
  
  applying weights to each of the obfuscation detection features detected in the email message; and
  
  determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features;
  
  summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- - 2. The method as in claim 1 wherein the obfuscation parameters in addition to the similarity metric include a number of non-alphanumeric characters in each word, a length of each word, a number of digits in each word and/or an indication of the presence of each word in a dictionary.
  - 3. The method as in claim 1 wherein the similarity metric comprises normalized values ranging from 0 to 1.
  - 4. The method as in claim 1 further comprising:
    - analyzing the email message to detect whether the email contains one or more additional spam features unrelated to obfuscation; and
      
      determining whether the email message is spam based on the combined obfuscation detection features and the additional spam features.
  - 5. The method as in claim 1 further comprising:
    - executing a machine learning algorithm on a training corpus of email messages containing known obfuscated and true words to generate the obfuscation feature set.
  - 6. The method as in claim 5 wherein the machine learning algorithm comprises logistic regression.
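
Claims 1 and 2 list four word-level parameters besides the similarity metric: the count of non-alphanumeric characters, the word length, the count of digits excluding the boundaries, and dictionary membership. A minimal sketch of extracting them follows. It assumes that "excluding the boundaries" means ignoring the first and last character, and it treats the "and/or" in claim 1 as "any condition holds". The threshold values, the `Thresholds` class, and the toy dictionary are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values standing in for the claim's specified thresholds.
    max_non_alnum: int = 1     # "second specified threshold"
    min_length: int = 3        # "third specified threshold"
    max_length: int = 15       # "fourth specified threshold"
    max_inner_digits: int = 0  # "fifth specified threshold"


def word_parameters(word, dictionary):
    """Compute the four non-similarity obfuscation parameters for one word."""
    inner = word[1:-1]  # assumed reading of "excluding the boundaries"
    return {
        "non_alnum": sum(1 for c in word if not c.isalnum()),
        "length": len(word),
        "inner_digits": sum(1 for c in inner if c.isdigit()),
        "in_dictionary": word.lower() in dictionary,
    }


def second_subset_conditions(word, dictionary, t=Thresholds()):
    """Evaluate each of the four conditions from claim 1 separately."""
    p = word_parameters(word, dictionary)
    return {
        "too_many_symbols": p["non_alnum"] > t.max_non_alnum,
        "odd_length": p["length"] < t.min_length or p["length"] > t.max_length,
        "inner_digits": p["inner_digits"] > t.max_inner_digits,
        "not_in_dictionary": not p["in_dictionary"],
    }


def second_feature_fires(word, dictionary, t=Thresholds()):
    """Assumed 'and/or' reading: fire when any single condition holds."""
    return any(second_subset_conditions(word, dictionary, t).values())


# Toy dictionary for illustration only.
DICTIONARY = {"meeting", "mortgage", "invoice"}
```

With these assumptions, "m0rtgage" fires (one inner digit and not in the dictionary), "meeting" does not, and a leading digit as in "7up" is not counted as an inner digit. Returning the conditions separately leaves room for a stricter combination rule than `any`.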

7. An obfuscation detection system for detecting obfuscation within email messages based on an obfuscation feature set, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words, the obfuscation detection system comprising:
- an obfuscation model feature extractor analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs; and
  
  generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
  
  an obfuscation detection model applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
  
  wherein the analysis for a second subset of features includes one or more of:
  
  (1) determining a number of non-alphanumeric characters in each word;
  
  (2) determining the length of the word;
  
  (3) determining a number of digits in the word excluding the boundaries; and
  
  (4) determining whether the word is found in a dictionary; and
  
  applying a second obfuscation detection feature if (1) the number of non-alphanumeric characters is above a second specified threshold;
  
  (2) the length of the word is below a third specified threshold or above a fourth specified threshold;
  
  (3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
  
  (4) the word is not found in a dictionary;
  
  an obfuscation model training module executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation; and
  
  applying weights to each of the obfuscation detection features detected in the email message of the obfuscation detection model; and
  
  a base model spam filter determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features;
  
  summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- - 8. The obfuscation detection system as in claim 7 wherein the obfuscation parameters in addition to the similarity metric include a number of non-alphanumeric characters in each word, a length of each word, a number of digits in each word and/or an indication of the presence of each word in a dictionary.
  - 9. The obfuscation detection system as in claim 7 wherein the similarity metric comprises normalized values ranging from 0 to 1.
  - 10. The obfuscation detection system as in claim 7 further comprising:
    - analyzing the email message to detect whether the email contains one or more additional spam features unrelated to obfuscation; and
      
      determining whether the email message is spam based on the combined obfuscation detection features and the additional spam features.
  - 11. The obfuscation detection system as in claim 7 comprising additional program code to cause the processor to perform the operations of:
    - executing a machine learning algorithm on a training corpus of email messages containing known obfuscated and true words to generate the obfuscation feature set.
  - 12. The obfuscation detection system as in claim 11 wherein the machine learning algorithm comprises logistic regression.
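
Claims 11 and 12 name logistic regression as the learning algorithm, and the independent claims call for weights whose usefulness is judged by classification accuracy estimated with cross validation. The sketch below shows both ideas in plain Python on binary feature vectors. The training loop, learning rate, epoch count, interleaved fold split, and toy data are all assumptions made for illustration; the patent text shown here does not specify them.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(rows, labels, epochs=200, lr=0.5):
    """Fit one weight per feature plus a bias by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x) -> int:
    """Return 1 when the modelled probability is at least 0.5."""
    return int(sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5)


def cross_validated_accuracy(rows, labels, folds=5):
    """k-fold cross validation with a deterministic interleaved split."""
    correct = 0
    for f in range(folds):
        held_out = set(range(f, len(rows), folds))
        train_x = [r for i, r in enumerate(rows) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        w, b = train(train_x, train_y)
        correct += sum(predict(w, b, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


# Toy data: each row is [similarity feature fired, second feature fired].
# In this made-up corpus only the first feature separates the two classes.
ROWS = [[1, 1], [1, 0], [0, 0], [0, 1]] * 5
LABELS = [1, 1, 0, 0] * 5
```

Calling `train(ROWS, LABELS)` on this toy data gives the first feature a clearly larger weight than the second, which is the behaviour the claims rely on: features that help separate the classes end up with larger weights.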

13. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of:
- providing an obfuscation feature set for detecting obfuscation within email messages, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words;
  
  analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs;
  
  generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
  
  applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
  
  wherein the analysis for a second subset of features includes one or more of:
  
  (1) determining a number of non-alphanumeric characters in each word;
  
  (2) determining the length of the word;
  
  (3) determining a number of digits in the word excluding the boundaries; and
  
  (4) determining whether the word is found in a dictionary;
  
  applying a second obfuscation detection feature if (1) the number of non-alphanumeric characters is above a second specified threshold;
  
  (2) the length of the word is below a third specified threshold or above a fourth specified threshold;
  
  (3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
  
  (4) the word is not found in a dictionary;
  
  executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation;
  
  applying weights to each of the obfuscation detection features detected in the email message;
  
  determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features; and
  
  summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- - 14. The machine-readable medium as in claim 13 wherein the obfuscation parameters in addition to the similarity metric include a number of non-alphanumeric characters in each word, a length of each word, a number of digits in each word and/or an indication of the presence of each word in a dictionary.
  - 15. The machine-readable medium as in claim 13 wherein the similarity metric comprises normalized values ranging from 0 to 1.
  - 16. The machine-readable medium as in claim 13 comprising additional program code to cause the machine to perform the operations of:
    - analyzing the email message to detect whether the email contains one or more additional spam features unrelated to obfuscation; and
      
      determining whether the email message is spam based on the combined obfuscation detection features and the additional spam features.
  - 17. The machine-readable medium as in claim 13 comprising additional program code to cause the machine to perform the operations of:
    - executing a machine learning algorithm on a training corpus of email messages containing known obfuscated and true words to generate the obfuscation feature set.
  - 18. The machine-readable medium as in claim 17 wherein the machine learning algorithm comprises logistic regression.
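
Each independent claim ends with the same two steps: sum the weights of the obfuscation detection features and of any additional spam features that fired, then label the message spam if the sum is above a threshold. A minimal sketch of that scoring step follows. The feature names, weight values, and threshold are hypothetical, and treating an unknown feature as weight 0 is an assumption.

```python
def spam_score(fired_features, weights):
    """Sum the learned weights of every feature that fired on a message."""
    return sum(weights.get(name, 0.0) for name in fired_features)


def is_spam(fired_features, weights, threshold):
    """Spam only when the score is strictly above the threshold."""
    return spam_score(fired_features, weights) > threshold


# Hypothetical weights: two obfuscation features and two unrelated features.
WEIGHTS = {
    "OBFU_SIMILAR_TO_FOW": 2.5,
    "OBFU_ODD_WORD_SHAPE": 1.0,
    "HAS_REMOTE_IMAGE": 0.75,
    "KNOWN_SENDER": -2.0,
}
SPAM_THRESHOLD = 3.0
```

For example, a message firing both obfuscation features scores 3.5 and is labeled spam at a threshold of 3.0, while one firing `OBFU_SIMILAR_TO_FOW` together with `KNOWN_SENDER` scores 0.5 and is not.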

Current Assignee: Proofpoint Incorporated
Original Assignee: Proofpoint Incorporated
Inventors: Sharma, Vipul; Lewis, Steve
Primary Examiner(s): Wong, Warner
Assistant Examiner(s): Huynh, Dung B
Application Number: US11/444,543
Time in Patent Office: 2,603 Days
Field of Search: None
US Class Current: 709/206
CPC Class Codes: G06Q 10/107 Computer-aided management o...; H04L 51/212 using filtering or selectiv...
