Advanced spam detection techniques
Abstract
The subject invention provides for an advanced and robust system and method that facilitates detecting spam. The system and method include components as well as other operations which enhance or promote finding characteristics that are difficult for the spammer to avoid and finding characteristics in non-spam that are difficult for spammers to duplicate. Exemplary characteristics include analyzing character and/or number sequences, strings, and sub-strings, detecting various entropy levels of one or more character sequences, strings and/or sub-strings and analyzing message headers.
20 Claims
1. A computer-implemented method for filtering messages, comprising:
receiving a first electronic mail (email) message;
analyzing a portion of the first email message by searching for character sequences that are indicative of spam, wherein the character sequences correspond to one or more runs of characters of a particular run length including individual lengths of characters and sub-lengths of characters that are not restricted to whole words or space-separated words;
determining a degree of randomness associated with an individual character sequence of the character sequences;
generating a feature relating to the individual character sequence based at least partly on the degree of randomness associated with the individual character sequence;
training a machine learning filter using at least the feature to generate a trained machine learning filter;
employing the trained machine learning filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
filtering the second email message based at least in part on the verdict.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
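The analysis and feature-generation steps of claim 1 can be illustrated with a minimal sketch. The claim does not say how the "degree of randomness" is measured; the sketch assumes Shannon entropy over the characters of each run, which is only one possible reading. The function names, the feature labels, and the 2.5-bit threshold are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def char_runs(text, n):
    """Return every run of n consecutive characters.

    Runs slide one character at a time and ignore word and space
    boundaries, so they are not restricted to whole words.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def shannon_entropy(s):
    """Shannon entropy of the character distribution of s, in bits per character.

    Assumption: used here as a stand-in for the claimed "degree of randomness".
    """
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_feature(run, threshold=2.5):
    """Turn a run into a coarse feature label (threshold is arbitrary)."""
    if shannon_entropy(run) >= threshold:
        return "run_entropy_high"
    return "run_entropy_low"
```

Under these assumptions, `entropy_feature("xq7zk9wp")` yields the high-entropy label because all eight characters differ, while a repetitive run such as `"aaaaaaab"` yields the low-entropy label.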
14. A computer-implemented method for filtering messages, comprising:
receiving a first electronic mail (email) message;
analyzing one or more features of a message header associated with the first email message;
analyzing a portion of the first email message by searching for character sequences that are indicative of spam, the character sequences corresponding to one or more runs of characters of a particular run length;
determining a degree of randomness for an individual run of characters of the one or more runs of characters;
determining an average degree of randomness for the portion of the first email message within which the individual run of characters occurs;
generating a feature relating to the individual run of characters based at least in part on a comparison between the degree of randomness and the average degree of randomness; and
training a machine learning spam filter using the feature to generate a trained machine learning spam filter;
employing the trained machine learning spam filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
filtering the second email message based at least in part on the verdict.
Dependent claims: 15, 16, 17, 18, 19
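Claim 14 differs from claim 1 mainly in comparing each run's degree of randomness with the average for the surrounding portion of the message. The following sketch shows one way such a comparison could produce a feature. It again assumes Shannon entropy as the randomness measure. The `margin` parameter and the three labels are hypothetical and do not come from the claim.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Shannon entropy of s in bits per character (assumed randomness measure)."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def relative_entropy_features(portion, n=8, margin=0.5):
    """Label each n-character run by comparing its entropy with the portion average.

    Returns a list of (run, label) pairs. The labels and the margin are
    illustrative choices, not values taken from the patent.
    """
    runs = [portion[i:i + n] for i in range(len(portion) - n + 1)]
    if not runs:
        return []
    entropies = [shannon_entropy(r) for r in runs]
    average = sum(entropies) / len(entropies)

    features = []
    for run, h in zip(runs, entropies):
        if h - average > margin:
            label = "above_average"
        elif average - h > margin:
            label = "below_average"
        else:
            label = "near_average"
        features.append((run, label))
    return features
```

Comparing against the local average, rather than a fixed threshold, means a random-looking run stands out in an otherwise repetitive portion but not in a portion that is uniformly high-entropy, such as encoded content.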
20. A computer storage device having computer executable instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to:
analyze a first portion of a first electronic mail (email) message by searching for particular character sequences that are indicative of spam, wherein the particular character sequences correspond to one or more runs of characters of a particular run length;
analyze a second portion of the first email message by searching for instances of strings of random characters that are indicative of spam;
analyze a message header associated with the first email message;
determining a degree of randomness associated with at least one of a run of characters of the one or more runs of characters, an instance of the instances of strings of random characters, or the message header;
generate features comprising character sequence features relating to the particular character sequences, strings of random character features relating to the strings of random characters, message header features relating to the message header, and a feature based on the degree of randomness; and
train a machine learning spam filter using the features that are generated to generate a trained machine learning spam filter;
employing the trained machine learning spam filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
filtering the second email message based at least in part on the verdict.
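The training, verdict, and filtering steps shared by the independent claims can be sketched as follows. The claims do not name a learning algorithm; a simple perceptron over sets of feature names is used here purely as a stand-in. The feature names, the `"junk"` and `"inbox"` folder labels, and all function names are hypothetical.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Train a perceptron on (feature_set, is_spam) pairs.

    The perceptron is an illustrative substitute for the unspecified
    "machine learning spam filter". Returns (weights, bias).
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for features, is_spam in examples:
            score = bias + sum(weights[f] for f in features)
            predicted = score > 0
            if predicted != is_spam:
                # Move the weights toward the correct label on a mistake.
                step = 1.0 if is_spam else -1.0
                bias += step
                for f in features:
                    weights[f] += step
    return dict(weights), bias


def verdict(model, features):
    """Return True if the model scores the feature set as likely spam."""
    weights, bias = model
    return bias + sum(weights.get(f, 0.0) for f in features) > 0


def filter_message(model, features):
    """Route a second message based on the verdict (folder names are hypothetical)."""
    return "junk" if verdict(model, features) else "inbox"


# Toy training data mixing character-sequence, random-string, header,
# and randomness features, as enumerated in claim 20.
examples = [
    ({"run_entropy_high", "header:missing_to"}, True),
    ({"run_entropy_high", "random_string"}, True),
    ({"run_entropy_low"}, False),
    ({"run_entropy_low", "header:has_to"}, False),
]
model = train_perceptron(examples)
```

With this toy data, a second message whose only feature is `run_entropy_high` is routed to `"junk"`, and one whose only feature is `run_entropy_low` is routed to `"inbox"`.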
Specification