Advanced spam detection techniques

US 9,305,079 B2
Filed: 08/01/2013
Issued: 04/05/2016
Est. Priority Date: 06/23/2003
Status: Expired due to Term

Abstract

The subject invention provides for an advanced and robust system and method that facilitates detecting spam. The system and method include components as well as other operations which enhance or promote finding characteristics that are difficult for the spammer to avoid and finding characteristics in non-spam that are difficult for spammers to duplicate. Exemplary characteristics include analyzing character and/or number sequences, strings, and sub-strings, detecting various entropy levels of one or more character sequences, strings and/or sub-strings and analyzing message headers.
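
As an illustration of the character-sequence analysis the abstract mentions, the following minimal Python sketch counts overlapping character n-grams without regard to word or space boundaries. The function name `char_ngrams` and the default `n = 3` are assumptions chosen for the example; nothing in this record prescribes them.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count every run of n consecutive characters in text.

    Windows slide one character at a time and cross spaces and
    punctuation, so the runs are not restricted to whole words.
    (Hypothetical helper, not taken from the patent.)
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Example call: the counts could serve as features for a filter.
subject_counts = char_ngrams("fr.ee m0ney", n=3)
```

Because the windows ignore whitespace, the same function can be applied to text whose words are not space-separated.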

20 Claims

1. A computer-implemented method for filtering messages, comprising:
- receiving a first electronic mail (email) message;
  
  analyzing a portion of the first email message by searching for character sequences that are indicative of spam, wherein the character sequences correspond to one or more runs of characters of a particular run length including individual lengths of characters and sub-lengths of characters that are not restricted to whole words or space-separated words;
  
  determining a degree of randomness associated with an individual character sequence of the character sequences;
  
  generating a feature relating to the individual character sequence based at least partly on the degree of randomness associated with the individual character sequence;
  
  training a machine learning filter using at least the feature to generate a trained machine learning filter;
  
  employing the trained machine learning filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
  
  filtering the second email message based at least in part on the verdict.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. A method as recited in claim 1, wherein the character sequences comprise character n-grams that are indicative of spam-like messages.
  - 3. A method as recited in claim 2, wherein the character n-grams are located in at least one of a from address, a subject line, a text body, an html body, or an attachment.
  - 4. A method as recited in claim 1, wherein the first email message comprises at least one of foreign language text, Unicode character types, or other character types not common to English.
  - 5. A method as recited in claim 4, wherein the foreign language text comprises substantially non-space separated words.
  - 6. A method as recited in claim 1, wherein the character sequences that are indicative of spam comprise strings of random characters.
  - 7. A method as recited in claim 1, wherein the analyzing the portion of the first email message comprises processing at least a portion of the first email message in which the individual character sequence occurs.
  - 8. A method as recited in claim 7, wherein the processing at least the portion of the first email message comprises determining an average degree of randomness associated with the portion of the first email message, and wherein the feature relates to a comparison between the degree of randomness associated with the individual character sequence and the average degree of randomness associated with the portion of the first email message.
  - 9. A method as recited in claim 7, further comprising calculating an entropy of a particular run of characters of the one or more runs of characters and employing the entropy as an additional feature in connection with training the machine learning filter.
  - 10. A method as recited in claim 9, wherein the entropy is an average entropy calculated as an entropy per character of the particular run of characters.
  - 11. A method as recited in claim 9, wherein the entropy is a relative entropy determined by a comparison of a first entropy of a first particular run of characters at a first location within the first email message relative to a second entropy of a second particular run of characters at a second location within the first email message.
  - 12. A method as recited in claim 11, wherein the first location and the second location comprise one of a beginning of a message body of the first email message, a middle of the message body of the first email message, or an end of the message body of the first email message, wherein the first location is different from the second location.
  - 13. A method as recited in claim 1, wherein employing the trained machine learning filter to obtain the verdict as to whether the one or more features of the second email message indicate that the second email message is likely to be spam comprises:
    - receiving the second email message;
      
      generating the one or more features of the second email message based on at least one of one or more runs of characters in the second email message and entropy determinations of the one or more runs of characters in the second email message;
      
      passing the one or more features of the second email message through the trained machine learning filter; and
      
      obtaining the verdict as to whether the one or more features of the second email message indicate that the second email message is likely to be spam.
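
The entropy-based features in claims 8 through 12 can be sketched as follows. This is a minimal illustration that assumes Shannon entropy of a run's empirical character distribution as the "degree of randomness"; the claims do not fix a formula, and the function names, the non-overlapping split used for the average, and the window size are all hypothetical.

```python
import math
from collections import Counter


def entropy_per_char(run: str) -> float:
    """Shannon entropy, in bits per character, of one run of characters.

    Assumption: randomness is measured on the run's own character
    frequencies. The claims leave the measure open.
    """
    if not run:
        return 0.0
    total = len(run)
    counts = Counter(run)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def run_vs_average(portion: str, start: int, length: int) -> float:
    """Entropy of one run minus the mean entropy over the portion.

    The mean is taken over non-overlapping runs of the same length
    (an assumed way to get an 'average degree of randomness').
    """
    runs = [portion[i:i + length] for i in range(0, len(portion), length)]
    if not runs:
        return 0.0
    average = sum(entropy_per_char(r) for r in runs) / len(runs)
    return entropy_per_char(portion[start:start + length]) - average


def end_vs_beginning(body: str, window: int = 32) -> float:
    """Entropy at the end of the body minus entropy at the beginning."""
    return entropy_per_char(body[-window:]) - entropy_per_char(body[:window])
```

A large positive value from either comparison would indicate a run that is more random than its surroundings, which is the kind of signal the claims describe feeding to the filter as a feature.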

14. A computer-implemented method for filtering messages, comprising:
- receiving a first electronic mail (email) message;
  
  analyzing one or more features of a message header associated with the first email message;
  
  analyzing a portion of the first email message by searching for character sequences that are indicative of spam, the character sequences corresponding to one or more runs of characters of a particular run length;
  
  determining a degree of randomness for an individual run of characters of the one or more runs of characters;
  
  determining an average degree of randomness for the portion of the first email message within which the individual run of characters occurs;
  
  generating a feature relating to the individual run of characters based at least in part on a comparison between the degree of randomness and the average degree of randomness; and
  
  training a machine learning spam filter using the feature to generate a trained machine learning spam filter;
  
  employing the trained machine learning spam filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
  
  filtering the second email message based at least in part on the verdict.
- View Dependent Claims (15, 16, 17, 18, 19)
- - 15. A method as recited in claim 14, wherein the one or more features of the message header comprise at least one of a presence or absence of at least one message header type, the at least one message header type comprising at least one of X-Priority, mail software, or a header line for unsubscribing.
  - 16. A method as recited in claim 15, wherein the one or more features of the message header further comprise content associated with the at least one message header type.
  - 17. A method as recited in claim 14, further comprising:
    - analyzing at least a portion of the first email message for images and related image information;
      
      generating image features relating to one of the images and the related image information; and
      
      further training the machine learning spam filter using the image features.
  - 18. A method as recited in claim 17, wherein the related image information comprises one or more of image size, image quantity, location of image, image dimensions, or image type.
  - 19. A method as recited in claim 17, wherein the related image information comprises a first URL and a second URL such that the image is represented within a hyperlink.
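
The header features of claims 15 and 16 (presence or absence of a header type, plus its content) can be sketched with Python's standard `email` package. Mapping "mail software" to the `X-Mailer` header and "a header line for unsubscribing" to `List-Unsubscribe` is an assumption made for this example, as are the function name and the feature-key format.

```python
from email import message_from_string

# Assumed concrete header names for the header types named in claim 15.
HEADER_TYPES = ("X-Priority", "X-Mailer", "List-Unsubscribe")


def header_features(raw_message: str) -> dict:
    """Return presence flags and, when present, the content of each header."""
    msg = message_from_string(raw_message)
    features = {}
    for name in HEADER_TYPES:
        value = msg.get(name)  # None when the header is absent
        features[f"has:{name}"] = value is not None
        if value is not None:
            features[f"value:{name}"] = value.strip()
    return features


example = header_features(
    "From: sender@example.com\n"
    "X-Priority: 1\n"
    "Subject: hello\n"
    "\n"
    "message body\n"
)
```

Both the flag and the value are kept, since an absent header can be as informative as an unusual one.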

20. A computer storage device having computer executable instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to:
- analyze a first portion of a first electronic mail (email) message by searching for particular character sequences that are indicative of spam, wherein the particular character sequences correspond to one or more runs of characters of a particular run length;
  
  analyze a second portion of the first email message by searching for instances of strings of random characters that are indicative of spam;
  
  analyze a message header associated with the first email message;
  
  determining a degree of randomness associated with at least one of a run of characters of the one or more runs of characters, an instance of the instances of strings of random characters, or the message header;
  
  generate features comprising character sequence features relating to the particular character sequences, strings of random character features relating to the strings of random characters, message header features relating to the message header, and a feature based on the degree of randomness; and
  
  train a machine learning spam filter using the features that are generated to generate a trained machine learning spam filter;
  
  employing the trained machine learning spam filter to obtain a verdict as to whether one or more features of a second email message indicate that the second email message is likely to be spam, and
  
  filtering the second email message based at least in part on the verdict.
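
To make the train-then-verdict flow of the independent claims concrete, here is a small self-contained sketch. It swaps in a naive Bayes classifier with add-one smoothing as the "machine learning filter"; the claims do not name a learning algorithm, so that choice, the class and function names, the 3-gram features, and the coarse entropy bucket are all assumptions.

```python
import math
from collections import Counter


def features(text: str, n: int = 3) -> list:
    """Character n-grams plus one coarse entropy-bucket feature."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(text) or 1
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in Counter(text).values()
    )
    return grams + [f"entropy_bucket:{int(entropy)}"]


class NaiveBayesFilter:
    """Hypothetical stand-in for the trained machine learning filter."""

    def __init__(self) -> None:
        self.feature_counts = {True: Counter(), False: Counter()}
        self.message_counts = {True: 0, False: 0}

    def train(self, text: str, is_spam: bool) -> None:
        self.feature_counts[is_spam].update(features(text))
        self.message_counts[is_spam] += 1

    def spam_log_odds(self, text: str) -> float:
        vocab = set(self.feature_counts[True]) | set(self.feature_counts[False])
        # Smoothed class prior, then smoothed per-feature likelihood ratios.
        score = math.log(
            (self.message_counts[True] + 1) / (self.message_counts[False] + 1)
        )
        for label, sign in ((True, 1), (False, -1)):
            total = sum(self.feature_counts[label].values())
            for f in features(text):
                p = (self.feature_counts[label][f] + 1) / (total + len(vocab) + 1)
                score += sign * math.log(p)
        return score

    def verdict(self, text: str, threshold: float = 0.0) -> bool:
        """True means the message is judged likely to be spam."""
        return self.spam_log_odds(text) > threshold
```

A caller would train on labeled first messages, then filter a second message by acting on `verdict(...)`, for example by moving it to a junk folder when the result is true.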

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Starbuck, Bryan T.; Rounthwaite, Robert L.; Heckerman, David E.; Goodman, Joshua T.; Gillum, Eliot C.; Howell, Nathan D.; Aldinger, Kenneth R.
Primary Examiner(s): Bayard, Djenane
Application Number: US13/957,313
Publication Number: US 20130318116A1
Time in Patent Office: 978 Days
US Class Current: 1/1
CPC Class Codes:
G06F 16/30   of unstructured textual dat...
G06Q 10/107   Computer-aided management o...
H04L 51/212   using filtering or selectiv...
