
Apparatus and method for obfuscation detection within a spam filtering model

  • US 8,489,689 B1
  • Filed: 05/31/2006
  • Issued: 07/16/2013
  • Est. Priority Date: 05/31/2006
  • Status: Active Grant
First Claim

1. A computer-implemented method for detecting obfuscated words in email messages comprising:

    providing an obfuscation feature set for detecting obfuscation within email messages, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words;

    analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs;

    generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;

    applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;

    wherein the analysis for a second subset of features includes one or more of:

    (1) determining a number of non-alphanumeric characters in each word;

    (2) determining the length of the word;

    (3) determining a number of digits in the word excluding the boundaries; and

    (4) determining whether the word is found in a dictionary;

    applying a second obfuscation detection feature if (1) the number of non-alphanumeric characters is above a second specified threshold;

    (2) the length of the word is below a third specified threshold or above a fourth specified threshold;

    (3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or

    (4) the word is not found in a dictionary;

    executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation;

    applying weights to each of the obfuscation detection features detected in the email message; and

    determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features;

    summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and

    identifying the email message as spam if the spam score is above a specified threshold value.
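
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and almost everything concrete in it is an assumption. The claim does not say how similarity is computed, so the sketch uses `difflib.SequenceMatcher` as a stand-in. The FOW list, the dictionary, and every threshold are hypothetical. The weights are set by hand, whereas the claim has a machine learning algorithm learn them from a spam/ham corpus with cross validation. The "and/or" conditions are read here as "any one condition suffices", and the additional non-obfuscation spam features mentioned in the summing step are omitted.

```python
from difflib import SequenceMatcher

# All values below are hypothetical; the claim names them but gives no numbers.
FOWS = ("viagra", "mortgage", "rolex")      # frequently obfuscated words
DICTIONARY = {"hello", "please", "review", "the",
              "attached", "report", "buy", "now"}
SIMILARITY_THRESHOLD = 0.7   # "first specified threshold"
MAX_NON_ALNUM = 1            # "second specified threshold"
MIN_LENGTH = 2               # "third specified threshold"
MAX_LENGTH = 15              # "fourth specified threshold"
MAX_INNER_DIGITS = 0         # "fifth specified threshold"
# Hand-set stand-ins for weights the claim says are learned from a corpus.
WEIGHTS = {"similar_to_fow": 2.0, "suspicious_shape": 1.0}
SPAM_THRESHOLD = 2.5


def similarity(word):
    """Highest similarity between the word and any FOW (assumed metric)."""
    lowered = word.lower()
    return max(SequenceMatcher(None, lowered, fow).ratio() for fow in FOWS)


def inner_digit_count(word):
    """Count digits in the word, excluding its first and last characters."""
    return sum(ch.isdigit() for ch in word[1:-1])


def detect_features(word):
    """Return the set of obfuscation features that fire for one word."""
    features = set()
    # First subset of features: similarity to a frequently obfuscated word.
    if similarity(word) >= SIMILARITY_THRESHOLD:
        features.add("similar_to_fow")
    # Second subset: character counts, length, inner digits, dictionary lookup.
    non_alnum = sum(not ch.isalnum() for ch in word)
    if (non_alnum > MAX_NON_ALNUM
            or len(word) < MIN_LENGTH
            or len(word) > MAX_LENGTH
            or inner_digit_count(word) > MAX_INNER_DIGITS
            or word.lower() not in DICTIONARY):
        features.add("suspicious_shape")
    return features


def spam_score(message):
    """Sum the weights of every feature detected in every word."""
    return sum(WEIGHTS[feature]
               for word in message.split()
               for feature in detect_features(word))


def is_spam(message):
    """Flag the message when its score exceeds the spam threshold."""
    return spam_score(message) > SPAM_THRESHOLD
```

With these placeholder values, `is_spam("buy v1agra now")` returns `True` because `v1agra` triggers both features, while `is_spam("please review the attached report")` returns `False`. In practice the dictionary check would need a full word list; the eight-word set here exists only to keep the example short.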
