Apparatus and method for auxiliary classification for generating features for a spam filtering model

US 8,112,484 B1
Filed: 05/31/2006
Issued: 02/07/2012
Est. Priority Date: 05/31/2006
Status: Active Grant

Abstract

A computer-implemented system and method are described for integrating a series of auxiliary spam detection models with a base spam detection model, thereby improving the accuracy and efficiency of the overall spam detection engine. For example, a system according to one embodiment of the invention comprises: a base spam filter feature extractor to detect a first set of features from incoming email messages; one or more auxiliary model feature extractors, each of the auxiliary model feature extractors to detect a different set of features from the incoming email messages; one or more auxiliary detection modules, each of the auxiliary detection modules to receive an indication of the different sets of features detected by a corresponding one of the auxiliary model feature extractor modules and to apply weights to the detected features; and a base spam filter module to receive an indication of the first set of features from the base spam filter feature extractor and the weights generated by the auxiliary detection modules, the base spam filter module to assign base spam filter weights to the first set of features and to determine whether an email message is spam based on the weights of the first set of features and the weights of the different set of features identified by the auxiliary model feature extractors.
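
The data flow in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class names, feature names, weights, and threshold are all hypothetical, and the features are assumed to have been fired already by their extractors.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AuxiliaryDetectionModule:
    """Applies learned weights to the features one auxiliary extractor fired."""
    weights: Dict[str, float]

    def apply(self, fired: Set[str]) -> float:
        # Features without a learned weight contribute nothing.
        return sum(self.weights.get(name, 0.0) for name in fired)


@dataclass
class BaseSpamFilter:
    """Adds auxiliary contributions to the base feature weights."""
    base_weights: Dict[str, float]
    threshold: float
    auxiliaries: List[AuxiliaryDetectionModule] = field(default_factory=list)

    def score(self, base_fired: Set[str], aux_fired: List[Set[str]]) -> float:
        total = sum(self.base_weights.get(name, 0.0) for name in base_fired)
        # aux_fired[i] holds the features fired for auxiliaries[i].
        for module, fired in zip(self.auxiliaries, aux_fired):
            total += module.apply(fired)
        return total

    def is_spam(self, base_fired: Set[str], aux_fired: List[Set[str]]) -> bool:
        return self.score(base_fired, aux_fired) > self.threshold


# Hypothetical weights and threshold, chosen only for illustration.
obfuscation = AuxiliaryDetectionModule(weights={"separated_letters": 1.5})
spam_filter = BaseSpamFilter(
    base_weights={"has_url": 2.0, "known_sender": -1.0},
    threshold=3.0,
    auxiliaries=[obfuscation],
)
print(spam_filter.is_spam({"has_url"}, [{"separated_letters"}]))  # True: 2.0 + 1.5 = 3.5 > 3.0
```

Keeping each auxiliary model behind its own module means further auxiliary models can be appended to the list without changing how the base filter computes its decision.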

9 Claims

1. A system for filtering spam email messages, the system comprising:
- a memory for storing program code and a processor for processing program code to generate a plurality of modules further comprising;
  
  a base spam filter feature extractor comprising first machine learning logic performing machine learning operations to detect base spam features from a training corpus of ham and spam email messages the base spam filter feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within a base spam feature set;
  
  firing one or more of the base spam features based, at least in part, on the analysis;
  
  assigning a weight to each of the base spam features according to how well the base spam features correctly differentiate between ham and spam email messages;
  
  an auxiliary obfuscation model feature extractor comprising second machine learning logic performing machine learning operations to detect text obfuscation within the training corpus of ham and spam email messages, the auxiliary obfuscation model feature extractor comprising an obfuscation feature set for detecting obfuscation within email messages, the auxiliary obfuscation model feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within the obfuscation feature set;
  
  firing one or more of the obfuscation detection features based, at least in part, on the analysis;
  
  assigning a weight to each of the obfuscation detection features according to how well the obfuscation detection features correctly detect text obfuscation in email messages;
  
  an auxiliary obfuscation detection module to receive an indication of the different sets of features and associated weights detected by the auxiliary obfuscation model feature extractor and to apply the associated weights to the detected features in a stream of incoming email messages; and
  
  a base spam filter module to receive an indication of the base spam features and associated weights from the base spam filter feature extractor and the weights applied by the auxiliary obfuscation detection module to the stream of incoming email messages, the base spam filter module to apply base spam filter weights to the base spam features detected in the stream of incoming email messages and to determine whether an email message is spam based on a combined weights of the base spam features and the weights of the obfuscation features applied by the auxiliary obfuscation detection module;
  
  summing the weights of the base spam features and the weights of features applied by the auxiliary obfuscation model feature extractor to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- 2. The system as in claim 1 wherein the auxiliary obfuscation model feature extractor and the base spam filter feature extractor each implement different types of machine learning algorithms to calculate the weights of the features.
- 3. The system as in claim 1 wherein the training corpus is comprised of a first plurality of known spam email messages and a second plurality of known ham email messages, the training corpus used by the first and second machine learning logic to generate the base spam and obfuscation features and the weights associated with each of the features.
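
As an illustration of the "analyzing" and "firing" operations recited in claim 1, the sketch below treats each feature as a named test on the message text and fires every feature whose test passes. The feature names and regular expressions are assumptions made for this example; the claims do not enumerate the contents of either feature set.

```python
import re
from typing import Callable, Dict, Set

Feature = Callable[[str], bool]

# Hypothetical base spam feature set.
BASE_FEATURES: Dict[str, Feature] = {
    "mentions_free": lambda text: "free" in text.lower(),
    "has_url": lambda text: re.search(r"https?://", text) is not None,
}

# Hypothetical obfuscation feature set.
OBFUSCATION_FEATURES: Dict[str, Feature] = {
    # Single letters split by punctuation, e.g. "v.i.a.g.r.a" or "f-r-e-e".
    "separated_letters": lambda text: re.search(
        r"\b(?:[A-Za-z][.\-_*]){3,}[A-Za-z]\b", text
    ) is not None,
    # A digit wedged between letters, e.g. "m0ney".
    "digit_inside_word": lambda text: re.search(
        r"[A-Za-z][0-9][A-Za-z]", text
    ) is not None,
}


def fire(text: str, feature_set: Dict[str, Feature]) -> Set[str]:
    """Return the names of all features in feature_set that the text triggers."""
    return {name for name, test in feature_set.items() if test(text)}


message = "Get f-r-e-e m0ney at http://example.test"
print(fire(message, BASE_FEATURES))
print(fire(message, OBFUSCATION_FEATURES))
```

In this example the base feature `mentions_free` does not fire because the word is split by hyphens, while both obfuscation features do, which is the kind of gap an auxiliary obfuscation model is meant to cover.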

4. A computer-implemented method for filtering spam email messages comprising:
- providing a base spam filter feature extractor executed by a processor, comprising first machine learning logic performing machine learning operations to detect base spam features from a training corpus of ham and spam email messages the base spam filter feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within a base spam feature set;
  
  firing one or more of the base spam features based, at least in part, on the analysis;
  
  assigning a weight to each of the base spam features according to how well the base spam features correctly differentiate between ham and spam email messages;
  
  providing an auxiliary obfuscation model feature extractor comprising second machine learning logic performing machine learning operations to detect text obfuscation within the training corpus of ham and spam email messages, the auxiliary obfuscation model feature extractor comprising an obfuscation feature set for detecting obfuscation within email messages, the auxiliary obfuscation model feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within the obfuscation feature set;
  
  firing one or more of the obfuscation detection features based, at least in part, on the analysis;
  
  assigning a weight to each of the obfuscation detection features according to how well the obfuscation detection features correctly detect text obfuscation in email messages;
  
  providing an auxiliary obfuscation detection module to receive an indication of the different sets of features and associated weights detected by the auxiliary obfuscation model feature extractor and to apply the associated weights to the detected features in a stream of incoming email messages; and
  
  providing a base spam filter module to receive an indication of the base spam features and associated weights from the base spam filter feature extractor and the weights applied by the auxiliary obfuscation detection module to the stream of incoming email messages, the base spam filter module to apply base spam filter weights to the base spam features detected in the stream of incoming email messages and to determine whether an email message is spam based on a combined weights of the base spam features and the weights of the obfuscation features applied by the auxiliary obfuscation detection module;
  
  summing the weights of the base spam features and the weights of features applied by the auxiliary obfuscation model feature extractor to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- 5. The method as in claim 4 wherein the auxiliary obfuscation model feature extractor and the base spam filter feature extractor each implement different types of machine learning algorithms to calculate the weights of the features.
- 6. The method as in claim 4 wherein the training corpus is comprised of a first plurality of known spam email messages and a second plurality of known ham email messages, the training corpus used by the first and second machine learning logic to generate the base spam and obfuscation features and the weights associated with each of the features.
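
The claims do not name a specific learning algorithm for deriving weights from the training corpus. One possible approach, shown here purely as an assumption, is a smoothed log-odds weight: a feature that fires more often in known spam than in known ham gets a positive weight, and the reverse gets a negative one. The function name `learn_weights` and the sample feature names are hypothetical.

```python
import math
from typing import Dict, Iterable, Set


def learn_weights(
    spam: Iterable[Set[str]],
    ham: Iterable[Set[str]],
    smoothing: float = 1.0,
) -> Dict[str, float]:
    """Weight each feature by the log ratio of its smoothed firing rates.

    spam and ham are collections of fired-feature sets, one set per message.
    """
    spam, ham = list(spam), list(ham)
    names = set().union(*spam, *ham)
    weights: Dict[str, float] = {}
    for name in names:
        # Additive smoothing keeps rates away from 0 so the log is defined.
        p_spam = (sum(name in m for m in spam) + smoothing) / (len(spam) + 2 * smoothing)
        p_ham = (sum(name in m for m in ham) + smoothing) / (len(ham) + 2 * smoothing)
        weights[name] = math.log(p_spam / p_ham)
    return weights


# Tiny hypothetical corpus: fired features for two spam and two ham messages.
known_spam = [{"has_url", "separated_letters"}, {"has_url"}]
known_ham = [{"has_url"}, {"greeting"}]
weights = learn_weights(known_spam, known_ham)
print(sorted(weights, key=weights.get))  # features ordered from most ham-like to most spam-like
```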

7. A non-transitory machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of:
- providing a base spam filter feature extractor comprising first machine learning logic performing machine learning operations to detect base spam features from a training corpus of ham and spam email messages the base spam filter feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within a base spam feature set;
  
  firing one or more of the base spam features based, at least in part, on the analysis;
  
  assigning a weight to each of the base spam features according to how well the base spam features correctly differentiate between ham and spam email messages;
  
  providing an auxiliary obfuscation model feature extractor comprising second machine learning logic performing machine learning operations to detect text obfuscation within the training corpus of ham and spam email messages, the auxiliary obfuscation model feature extractor comprising an obfuscation feature set for detecting obfuscation within email messages, the auxiliary obfuscation model feature extractor performing the operations of;
  
  analyzing an email message to detect whether the email message contains features within the obfuscation feature set;
  
  firing one or more of the obfuscation detection features based, at least in part, on the analysis;
  
  assigning a weight to each of the obfuscation detection features according to how well the obfuscation detection features correctly detect text obfuscation in email messages;
  
  providing an auxiliary obfuscation detection module to receive an indication of the different sets of features and associated weights detected by the auxiliary obfuscation model feature extractor and to apply the associated weights to the detected features in a stream of incoming email messages; and
  
  providing a base spam filter module to receive an indication of the base spam features and associated weights from the base spam filter feature extractor and the weights applied by the auxiliary obfuscation detection module to the stream of incoming email messages, the base spam filter module to apply base spam filter weights to the base spam features detected in the stream of incoming email messages and to determine whether an email message is spam based on a combined weights of the base spam features and the weights of the obfuscation features applied by the auxiliary obfuscation detection module;
  
  summing the weights of the base spam features and the weights of features applied by the auxiliary obfuscation model feature extractor to generate a spam score; and
  
  identifying the email message as spam if the spam score is above a specified threshold value.
- 8. The machine-readable medium as in claim 7 wherein the auxiliary obfuscation model feature extractor and the base spam filter feature extractor each implement different types of machine learning algorithms to calculate the weights of the features.
- 9. The machine-readable medium as in claim 7 wherein the training corpus is comprised of a first plurality of known spam email messages and a second plurality of known ham email messages, the training corpus used by the first and second machine learning logic to generate the base spam and obfuscation features and the weights associated with each of the features.
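
Dependent claims 2, 5, and 8 recite that the two extractors use different types of machine learning algorithms, without saying which ones. As one hypothetical choice for the obfuscation model, the sketch below trains a simple perceptron on examples labeled as obfuscated (+1) or not obfuscated (-1). The function name, feature names, and training data are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple


def train_perceptron(
    examples: List[Tuple[Set[str], int]],
    epochs: int = 10,
    learning_rate: float = 1.0,
) -> Dict[str, float]:
    """Learn per-feature weights with the classic perceptron update rule.

    Each example is (fired feature names, label), where label is +1 for
    obfuscated text and -1 for plain text. No bias term is used.
    """
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for fired, label in examples:
            activation = sum(weights.get(name, 0.0) for name in fired)
            if label * activation <= 0:  # misclassified or undecided
                for name in fired:
                    weights[name] = weights.get(name, 0.0) + learning_rate * label
    return weights


# Hypothetical labeled examples of fired obfuscation features.
training = [
    ({"separated_letters"}, 1),
    ({"plain_word"}, -1),
    ({"separated_letters", "plain_word"}, 1),
]
print(train_perceptron(training))
```

Unlike a count-based weighting, the perceptron only adjusts weights when an example is misclassified, so the two extractors could end up with weights on different scales; a real system would need to account for that before summing them.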

Current Assignee: Proofpoint Incorporated
Original Assignee: Proofpoint Incorporated
Inventors: Sharma, Vipul; Lewis, Steve
Primary Examiner(s): Moore, Ian N
Assistant Examiner(s): Nguyen, Thai
Application Number: US11/444,593
Time in Patent Office: 2,078 Days
Field of Search: 709/203-207
US Class Current: 709/206
CPC Class Codes:
H04L 51/212 using filtering or selectiv...
H04L 63/0227 Filtering policies mail mes...
