Using message features and sender identity for email spam filtering

US 7,899,866 B1
Filed: 12/31/2004
Issued: 03/01/2011
Est. Priority Date: 12/31/2004
Status: Expired due to Fees

Abstract

Email spam filtering is performed based on a sender reputation and message features. When an email message is received, a preliminary spam determination is made based, at least in part, on a combination of a reputation associated with the sender of the email message and one or more features of the email message. If the preliminary spam determination indicates that the message is spam, then a secondary spam determination is made based on one or more features of the received email message. If both the preliminary and secondary spam determinations indicate that the received email message is likely spam, then the message is treated as spam.
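
The two-stage decision described in the abstract can be sketched roughly as follows. This is only an illustration, not the patented implementation: the function names, the use of a plain average as the "combination", and the threshold values 0.6 and 0.5 are all assumptions chosen for the example.

```python
def combine(sender_prob: float, feature_prob: float) -> float:
    """Hypothetical combination: a plain average of the two probabilities."""
    return (sender_prob + feature_prob) / 2.0


def treat_as_spam(
    sender_prob: float,
    feature_prob: float,
    first_threshold: float = 0.6,   # assumed value, not from the patent
    second_threshold: float = 0.5,  # assumed value, not from the patent
) -> bool:
    """Return True only if both determinations indicate spam."""
    # Preliminary determination: sender reputation combined with features.
    first_score = combine(sender_prob, feature_prob)
    if first_score <= first_threshold:
        return False
    # Secondary determination: message features alone.
    second_score = feature_prob
    return second_score > second_threshold
```

With these assumed values, `treat_as_spam(0.95, 0.3)` returns `False`: a poor sender reputation pushes the first score over its threshold, but the message features alone do not confirm the verdict, so the message is not treated as spam.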

163 Citations

14 Claims

1. A method comprising:
- receiving an email message;
  
  determining a sender identity associated with the email message based on a particular IP address from which the email message was sent;
  
  calculating a probability that the email message from the sender is spam based on the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email address is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email message is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
  
  identifying features of the email message;
  
  determining a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
  
  comparing the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
  
  in an event that the first spam score is greater than the first spam score threshold;
  
  determining a second spam score based, at least in part, on the data associated with the features of the email message;
  
  applying different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the second spam score is greater than 2a, deleting the email message;
  
  in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, delivering the email message into a junk email box; and
  
  in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flagging the email message as potentially being spam and delivering the email message into an email inbox, wherein:
  
  the features include:
  
  a particular word or phrase appeared in the email message;
  
  a particular word or phrase appeared in an attachment of the email message;
  
  a header line of the email message;
  
  a day of a week the email message is sent;
  
  a time of a day the email message is sent; and
  
  a structure of a body of the email message;
  
  each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
  
  the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
  
  the indication is reflected by a numerical value; and
  
  the determining operation of the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
- Dependent claims (2, 3, 4, 5, 6, 7):
- - 2. The method as recited in claim 1, wherein determining the sender identity further comprises determining at least one of a group of sequentially numbered IP addresses of which an IP address from which the email address was sent is a member, an allocation of IP addresses of which an IP address from which the email address was sent is a member, a domain from which the email address appears to have been sent.
  - 3. The method as recited in claim 1, wherein determining the first spam score comprises:
    - determining a first probability that the email message is spam, given the sender identity;
      
      determining a second probability that the email message is spam, given the one or more features of the email message; and
      
      combining the first probability and the second probability.
  - 4. The method as recited in claim 3, wherein the combining comprises multiplying the first probability by the second probability.
  - 5. The method as recited in claim 1, wherein the first spam score comprises a real number in the range [0, 1].
  - 6. The method as recited in claim 3, wherein determining a first probability that the email message is spam, given the sender identity comprises calculating S_count/(S_count+G_count), where S_count represents a number of previously received spam email messages associated with the sender identity and where G_count represents a number of previously received non-spam email messages associated with the sender identity.
  - 7. The method as recited in claim 1, wherein determining the second spam score comprises calculating a probability that the email message is spam given the one or more features of the email message.
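
The sender-side calculation in claims 1, 3, 4, and 6 (fall back from the exact IP address to its top 24 bits, then to its top 16 bits, compute S_count/(S_count+G_count), and multiply by the feature probability) might look like the sketch below. The `counts` dictionary layout, the `MIN_MESSAGES` cutoff for "enough data", and returning `None` when no level has enough data are assumptions made for illustration; the claims do not specify them.

```python
import ipaddress
from typing import Dict, Optional, Tuple

MIN_MESSAGES = 10  # hypothetical cutoff for "enough data"

# Hypothetical store: key -> (S_count, G_count), keyed by exact IP or CIDR range.
Counts = Dict[str, Tuple[int, int]]


def sender_spam_probability(ip: str, counts: Counts) -> Optional[float]:
    """Try the exact IP, then its /24 range, then its /16 range."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    keys = [
        str(addr),
        str(ipaddress.ip_network(f"{addr}/24", strict=False)),  # top 24 bits
        str(ipaddress.ip_network(f"{addr}/16", strict=False)),  # top 16 bits
    ]
    for key in keys:
        s_count, g_count = counts.get(key, (0, 0))
        if s_count + g_count >= MIN_MESSAGES:
            return s_count / (s_count + g_count)  # formula from claim 6
    return None  # assumption: the claims do not say what happens here


def first_spam_score(sender_prob: float, feature_prob: float) -> float:
    """Combine by multiplication, as recited in claim 4."""
    return sender_prob * feature_prob
```

For example, with `counts = {"203.0.113.7": (1, 2), "203.0.113.0/24": (30, 10)}`, the exact address has only three messages on record, so the lookup falls back to the /24 range and returns 30/40 = 0.75.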

8. One or more computer-readable storage media comprising computer-executable instructions that, when executed, direct a computing system to perform a method, the method comprising:
- receiving an email message;
  
  determining a sender identity associated with the email message based on a particular IP address from which the email message was sent;
  
  calculating a probability that the email message from the sender is spam based on the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email message is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email address is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
  
  identifying features of the email message;
  
  determining a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
  
  comparing the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
  
  in an event that the first spam score is greater than the first spam score threshold;
  
  determining a second spam score based, at least in part, on the data associated with the features of the email message;
  
  applying different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the message second spam score is greater than 2a, deleting the email message;
  
  in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, delivering the email message into a junk email box; and
  
  in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flagging the email message as potentially being spam and delivering the email message into an email inbox, wherein:
  
  the features include:
  
  a particular word or phrase appeared in the email message;
  
  a particular word or phrase appeared in an attachment of the email message;
  
  a header line of the email message;
  
  a day of a week the email message is sent;
  
  a time of a day the email message is sent; and
  
  a structure of a body of the email message;
  
  each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
  
  the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
  
  the indication is reflected by a numerical value; and
  
  the determining operation of the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
- Dependent claims (9, 10, 11, 12, 13):
- - 9. The one or more computer-readable storage media as recited in claim 8, wherein the probability that the email message from the sender is spam is calculated based on a domain name, a sending email address, or a combination thereof.
  - 10. The one or more computer-readable storage media as recited in claim 8, wherein the features of the email message comprise a message subject, message header data, textual message contents, a message attachment filename, a textual content of a message attachment, or a combination thereof.
  - 11. The one or more computer-readable storage media as recited in claim 8, the method further comprising calculating the first spam score as a mathematical probability that the email message is spam given the entity from which the email message was sent and given the features of the email message.
  - 12. The one or more computer-readable storage media as recited in claim 8, wherein the method further comprising:
    - calculating a first mathematical probability that the email message is spam given the entity from which the email message was sent, based on data generated from email messages that were previously received from the entity from which the email message was sent;
      
      calculating a second mathematical probability that the email message is spam given the features of the email message, based on data generated from email message that were previously received having the one or more features of the email message; and
      
      calculating the first spam score as a combination of the first and second mathematical probabilities.
  - 13. The one or more computer-readable storage media as recited in claim 8, embodied as at least one of an electronic mail server system or an electronic mail client application.
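
The tiered treatments recited in claim 8 (and mirrored in claims 1 and 14) can be illustrated with a small lookup. The numeric threshold values below are hypothetical; they were picked only to satisfy the recited ordering (1a > 1b > 1c, 2a > 2b > 2c, 1a > 2a, and the two sets differing). The `"unspecified"` result is also an assumption, used for score combinations that the claim does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    a: float  # highest threshold
    b: float
    c: float  # lowest threshold


# Hypothetical values that respect the ordering recited in the claim.
FIRST = Thresholds(a=0.9, b=0.7, c=0.5)   # 1a, 1b, 1c
SECOND = Thresholds(a=0.8, b=0.6, c=0.4)  # 2a, 2b, 2c


def treatment(first_score: float, second_score: float,
              first: Thresholds = FIRST, second: Thresholds = SECOND) -> str:
    """Map a pair of spam scores to one of the recited treatments."""
    if first_score > first.a and second_score > second.a:
        return "delete"
    if first.b < first_score < first.a and second.b < second_score < second.a:
        return "junk"  # deliver into the junk email box
    if first.c < first_score < first.b and second.c < second_score < second.b:
        return "flag"  # flag as potential spam, deliver into the inbox
    return "unspecified"  # combination not covered by the claim language
```

Under these assumed thresholds, `treatment(0.95, 0.85)` yields `"delete"`, `treatment(0.8, 0.7)` yields `"junk"`, and `treatment(0.6, 0.5)` yields `"flag"`.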

14. An electronic mail filter computing system comprising:
- one or more processors;
  
  one or more memories communicatively coupled to the one or more processors, the one or more memories having stored instructions that, when executed, configure the computing system to implement an email spam filter comprising a sender reputation data store, a features data store, a sender identification module, a message features extraction module, a features spam score determination module, and a sender spam score determination module, the email spam filter configured to:
  
  receive an email message;
  
  determine a sender identity associated with the email message based on a particular IP address from which the email message was sent;
  
  calculate a probability that the email message from the sender is spam based on the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculate the probability that the sender's email message is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
  
  in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculate the probability that the sender's email address is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
  
  identify features of the email message;
  
  determine a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
  
  compare the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
  
  in an event that the first spam score is greater than the first spam score threshold;
  
  determine a second spam score based, at least in part, on the data associated with the features of the email message;
  
  apply different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the message second spam score is greater than 2a, delete the email message;
  
  in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, deliver the email message into a junk email box; and
  
  in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flag the email message as potentially being spam and deliver the email message into an email inbox, wherein:
  
  the features include:
  
  a particular word or phrase appeared in the email message;
  
  a particular word or phrase appeared in an attachment of the email message;
  
  a header line of the email message;
  
  a day of a week the email message is sent;
  
  a time of a day the email message is sent; and
  
  a structure of a body of the email message;
  
  each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
  
  the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
  
  the indication is reflected by a numerical value; and
  
  the operation to determine the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
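
The second spam score recited in the independent claims, a Sigmoid function applied to a sum of per-feature weights, can be sketched as below. The claims say only that each feature's value is based on how often it appears in good versus spam email; using the log of the frequency ratio as the weight, the smoothing constant `eps`, and the feature names in the usage note are assumptions made for this example.

```python
import math
from typing import Dict, Iterable


def feature_weight(spam_freq: float, good_freq: float, eps: float = 1e-6) -> float:
    """Hypothetical weight: log ratio of spam frequency to good frequency."""
    return math.log((spam_freq + eps) / (good_freq + eps))


def sigmoid(w: float) -> float:
    """Compute 1 / (1 + e^(-w)) without overflow for large negative w."""
    if w >= 0:
        return 1.0 / (1.0 + math.exp(-w))
    z = math.exp(w)
    return z / (1.0 + z)


def second_spam_score(features: Iterable[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the features present, then squash into [0, 1]."""
    w = sum(weights.get(name, 0.0) for name in features)  # unknown features add 0
    return sigmoid(w)
```

A message with no recognized features gives w = 0 and therefore a score of exactly 0.5; features seen more often in spam push w positive and the score toward 1, while features seen more often in good mail push it toward 0. For instance, with hypothetical weights `{"word:free": 2.0, "sent:weekday": -0.5}`, a message carrying both features has w = 1.5 and a score of about 0.82.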

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Buckingham, Jay T.; Rounthwaite, Robert L.; Hulten, Geoffrey J.; Goodman, Joshua T.
Primary Examiner(s): Zhang, Shirley X
Application Number: US11/027,895
Time in Patent Office: 2,251 Days
Field of Search: 706/206
US Class Current: 709/206
CPC Class Codes: H04L 51/212 using filtering or selectiv...
