Apparatus and method for obfuscation detection within a spam filtering model
Abstract
A computer-implemented system and method are described for detecting obfuscated words in email messages and using this information to determine whether each email message is spam or valid email (ham). For example, a method according to one embodiment of the invention comprises: providing an obfuscation feature set for detecting obfuscation within email messages, the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words; analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis includes determining the similarity of one or more words in the email message with each of the FOWs; generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is obfuscated; firing one or more of the obfuscation detection features based, at least in part, on the value of the similarity metric; analyzing the email message to detect whether the email contains one or more additional spam features unrelated to obfuscation; and determining whether the email message is spam based on the combined obfuscation detection features and the additional spam features.
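The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the FOW list, the `normalize` helper, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` ratio as the similarity measure are all assumptions made for illustration, since the text above does not specify how similarity is computed.

```python
from difflib import SequenceMatcher

# Hypothetical FOW list; the source does not enumerate the actual words.
FOW = ["viagra", "mortgage", "rolex"]


def normalize(word):
    """Lowercase the word and drop non-alphanumeric padding such as dots."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def similarity_metric(word, fows=FOW):
    """Return (score, fow): the best similarity ratio in [0, 1] and its FOW."""
    cleaned = normalize(word)
    scored = [(SequenceMatcher(None, cleaned, f).ratio(), f) for f in fows]
    return max(scored, key=lambda pair: pair[0])


def first_feature_fires(word, threshold=0.8):
    """Fire when the metric is above or equal to the (assumed) threshold."""
    score, _ = similarity_metric(word)
    return score >= threshold


print(similarity_metric("v.i.a.g.r.a"))  # padded spelling of a FOW
print(first_feature_fires("V1agra"))     # digit substituted for a letter
print(first_feature_fires("hello"))      # unrelated word
```

Under these assumptions a plain, unobfuscated spelling of a FOW also scores 1.0, so the sketch measures closeness to a FOW rather than obfuscation as such.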
18 Claims
1. A computer-implemented method for detecting obfuscated words in email messages comprising:
providing an obfuscation feature set for detecting obfuscation within email messages, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words;
analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs;
generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
wherein the analysis for a second subset of features includes one or more of:
(1) determining a number of non-alphanumeric characters in each word;
(2) determining the length of the word;
(3) determining a number of digits in the word excluding the boundaries; and
(4) determining whether the word is found in a dictionary;
applying a second obfuscation detection feature if
(1) the number of non-alphanumeric characters is above a second specified threshold;
(2) the length of the word is below a third specified threshold or above a fourth specified threshold;
(3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
(4) the word is not found in a dictionary;
executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation;
applying weights to each of the obfuscation detection features detected in the email message; and
determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features;
summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
identifying the email message as spam if the spam score is above a specified threshold value.
Dependent claims: 2, 3, 4, 5, 6.
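The word-level checks and the weighted sum recited in claim 1 can be sketched as follows. This is one possible reading under stated assumptions: the dictionary, every threshold value, the feature names, and the weights are hypothetical, each of the four conditions is treated as its own feature, and "excluding the boundaries" is interpreted as ignoring the first and last character of the word.

```python
# Hypothetical dictionary, thresholds, and weights; the claim specifies none.
DICTIONARY = {"free", "meeting", "offer", "report"}
MAX_NON_ALNUM = 1          # stands in for the "second specified threshold"
MIN_LEN, MAX_LEN = 3, 15   # stand in for the third and fourth thresholds
MAX_INNER_DIGITS = 0       # stands in for the "fifth specified threshold"
SPAM_THRESHOLD = 2.0       # stands in for the final spam-score threshold

WEIGHTS = {
    "many_non_alnum": 1.5,
    "odd_length": 0.5,
    "inner_digits": 1.5,
    "not_in_dictionary": 0.5,
}


def non_alnum_count(word):
    """Count characters that are neither letters nor digits."""
    return sum(not ch.isalnum() for ch in word)


def inner_digit_count(word):
    """Count digits, ignoring the first and last character (assumed reading)."""
    return sum(ch.isdigit() for ch in word[1:-1])


def second_subset_features(word):
    """Return the names of the second-subset features that fire for a word."""
    fired = []
    if non_alnum_count(word) > MAX_NON_ALNUM:
        fired.append("many_non_alnum")
    if len(word) < MIN_LEN or len(word) > MAX_LEN:
        fired.append("odd_length")
    if inner_digit_count(word) > MAX_INNER_DIGITS:
        fired.append("inner_digits")
    if word.lower() not in DICTIONARY:
        fired.append("not_in_dictionary")
    return fired


def spam_score(words, extra_features=()):
    """Sum the weights of all fired features plus any additional spam features."""
    fired = [name for w in words for name in second_subset_features(w)]
    fired.extend(extra_features)
    return sum(WEIGHTS.get(name, 0.0) for name in fired)


def is_spam(words, extra_features=()):
    """Identify the message as spam if the score is above the threshold."""
    return spam_score(words, extra_features) > SPAM_THRESHOLD


print(second_subset_features("fr3e"))
print(is_spam(["fr3e", "f.r.e.e"]), is_spam(["free", "meeting"]))
```

In the claim the weights come from the training step rather than being set by hand; the fixed `WEIGHTS` table here only stands in for that output.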
7. An obfuscation detection system for detecting obfuscation within email messages based on an obfuscation feature set, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words, the obfuscation detection system comprising:
an obfuscation model feature extractor analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs; and
generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
an obfuscation detection model applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
wherein the analysis for a second subset of features includes one or more of:
(1) determining a number of non-alphanumeric characters in each word;
(2) determining the length of the word;
(3) determining a number of digits in the word excluding the boundaries; and
(4) determining whether the word is found in a dictionary; and
applying a second obfuscation detection feature if
(1) the number of non-alphanumeric characters is above a second specified threshold;
(2) the length of the word is below a third specified threshold or above a fourth specified threshold;
(3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
(4) the word is not found in a dictionary;
an obfuscation model training module executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation; and
applying weights to each of the obfuscation detection features detected in the email message of the obfuscation detection model; and
a base model spam filter determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features;
summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
identifying the email message as spam if the spam score is above a specified threshold value.
Dependent claims: 8, 9, 10, 11, 12.
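The training module's role, learning feature weights from labeled spam and ham and estimating accuracy by cross validation, can be sketched in a few lines. The claim does not name a learning algorithm, so the perceptron used here is a stand-in chosen for brevity, and the three-feature toy corpus, the function names, and the fold count are all hypothetical.

```python
def score(model, x):
    """Weighted sum of the feature vector plus bias."""
    weights, bias = model
    return sum(w * v for w, v in zip(weights, x)) + bias


def predict(model, x):
    """Return 1 (spam) if the weighted sum is positive, else 0 (ham)."""
    return 1 if score(model, x) > 0 else 0


def train_perceptron(samples, n_features, epochs=20):
    """Learn one weight per feature from (feature_vector, label) pairs."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict((weights, bias), x)
            if error:
                weights = [w + error * v for w, v in zip(weights, x)]
                bias += error
    return weights, bias


def cross_val_accuracy(samples, n_features, k=4):
    """Estimate accuracy by k-fold cross validation."""
    folds = [samples[i::k] for i in range(k)]
    correct = 0
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_perceptron(train, n_features)
        correct += sum(predict(model, x) == y for x, y in held_out)
    return correct / len(samples)


# Toy corpus: [similar_to_fow, many_non_alnum, inner_digits]; 1 = spam, 0 = ham.
CORPUS = [
    ([1, 1, 0], 1), ([1, 0, 1], 1), ([0, 1, 1], 1), ([1, 1, 1], 1),
    ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0),
]

print(cross_val_accuracy(CORPUS, n_features=3))
print(train_perceptron(CORPUS, n_features=3))
```

The toy corpus is trivially separable, so the cross-validated estimate says nothing about real mail; the point is only the shape of the loop: hold out a fold, fit weights on the rest, and score the held-out messages.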
13. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of:
providing an obfuscation feature set for detecting obfuscation within email messages, each feature in the obfuscation feature set built from a group of obfuscation parameters including a similarity metric, the similarity metric using a set of frequently obfuscated words (FOW) selected from a larger set of obfuscated words;
analyzing an email message to detect whether the email message contains features within the obfuscation feature set, wherein the analysis for a first subset of features within the feature set includes determining the similarity of one or more words in the email message with each of the FOWs;
generating the similarity metric based on the analysis, the similarity metric providing a relative likelihood that each of the one or more words is similar to one of the FOWs;
applying a first obfuscation detection feature of the one or more obfuscation detection features if the value of the similarity metric is above or equal to a first specified threshold value;
wherein the analysis for a second subset of features includes one or more of:
(1) determining a number of non-alphanumeric characters in each word;
(2) determining the length of the word;
(3) determining a number of digits in the word excluding the boundaries; and
(4) determining whether the word is found in a dictionary;
applying a second obfuscation detection feature if
(1) the number of non-alphanumeric characters is above a second specified threshold;
(2) the length of the word is below a third specified threshold or above a fourth specified threshold;
(3) the number of digits in the word excluding boundaries is above a fifth specified threshold; and/or
(4) the word is not found in a dictionary;
executing a machine learning algorithm on an email corpus of both known spam and known ham messages to apply weights to each of the features in the obfuscation feature set according to whether a high classification accuracy in differentiating between the known ham and known spam messages can be achieved, wherein the accuracy is estimated using cross validation;
applying weights to each of the obfuscation detection features detected in the email message;
determining whether the email message is spam based, at least in part, on both the applied obfuscation detection features and the weights applied to the obfuscation detection features; and
summing weights associated with each of the obfuscation detection features and each of the additional spam features to generate a spam score; and
identifying the email message as spam if the spam score is above a specified threshold value.
Dependent claims: 14, 15, 16, 17, 18.
Specification