Using message features and sender identity for email spam filtering
First Claim
1. A method comprising:
receiving an email message;
determining a sender identity associated with the email message based on a particular IP address from which the email message was sent;
calculating a probability that the email message from the sender is spam based on the particular IP address;
in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email address is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email message is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
identifying features of the email message;
determining a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
comparing the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
in an event that the first spam score is greater than the first spam score threshold;
determining a second spam score based, at least in part, on the data associated with the features of the email message;
applying different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the second spam score is greater than 2a, deleting the email message;
in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, delivering the email message into a junk email box; and
in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flagging the email message as potentially being spam and delivering the email message into an email inbox, wherein;
the features include;
a particular word or phrase appeared in the email message;
a particular word or phrase appeared in an attachment of the email message;
a header line of the email message;
a day of a week the email message is sent;
a time of a day the email message is sent; and
a structure of a body of the email message;
each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
the indication is reflected by a numerical value; and
the determining operation of the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
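The IP-address fallback recited in the claim (exact address, then the range sharing its top 24 bits, then the range sharing its top 16 bits) can be illustrated with a minimal sketch. The class name `SenderStats`, the cutoff `MIN_MESSAGES`, and the use of a simple spam fraction as the probability are assumptions made for illustration only; the claim does not say how much data counts as "enough" or how the probability is computed.

```python
import ipaddress
from collections import defaultdict

# Assumed cutoff for "enough data"; the claim does not give a number.
MIN_MESSAGES = 10


class SenderStats:
    """Hypothetical store of spam / non-spam counts per IP and per prefix."""

    def __init__(self):
        # key -> [spam_count, good_count]
        self.counts = defaultdict(lambda: [0, 0])

    @staticmethod
    def _keys(ip):
        """Return lookup keys from most specific to least specific."""
        addr = int(ipaddress.IPv4Address(ip))
        return [
            ("ip", addr),           # the particular IP address
            ("top24", addr >> 8),   # range sharing the top 24 bits
            ("top16", addr >> 16),  # range sharing the top 16 bits
        ]

    def record(self, ip, is_spam):
        """Count one observed message at every level of granularity."""
        for key in self._keys(ip):
            self.counts[key][0 if is_spam else 1] += 1

    def spam_probability(self, ip):
        """Use the most specific level that has enough data.

        Returns (level, probability), or (None, None) if no level qualifies.
        """
        for key in self._keys(ip):
            spam, good = self.counts.get(key, (0, 0))
            total = spam + good
            if total >= MIN_MESSAGES:
                return key[0], spam / total
        return None, None


stats = SenderStats()
for i in range(12):
    stats.record(f"203.0.113.{i}", is_spam=True)
level, p = stats.spam_probability("203.0.113.200")
print(level, p)
```

In this usage example the queried address has never been seen, so the lookup falls back to the range that shares its top 24 bits.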
Abstract
Email spam filtering is performed based on a sender reputation and message features. When an email message is received, a preliminary spam determination is made based, at least in part, on a combination of a reputation associated with the sender of the email message and one or more features of the email message. If the preliminary spam determination indicates that the message is spam, then a secondary spam determination is made based on one or more features of the received email message. If both the preliminary and secondary spam determinations indicate that the received email message is likely spam, then the message is treated as spam.
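The two-stage determination described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the weighted average used to combine reputation with the feature score, the parameter `reputation_weight`, and both threshold values are hypothetical. Only the Sigmoid form $1/(1+e^{-w})$ for the feature-based score comes from the claims.

```python
import math


def sigmoid(w):
    """Map a sum of feature weights to a probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-w))


def feature_score(features, weights):
    """Feature-only spam score: Sigmoid of the summed weights."""
    w = sum(weights.get(f, 0.0) for f in features)
    return sigmoid(w)


def is_spam(features, sender_reputation, weights,
            first_threshold=0.9, second_threshold=0.8,
            reputation_weight=0.5):
    """Two-stage check; thresholds and combination rule are assumptions.

    sender_reputation is taken to be a spam probability in [0, 1].
    """
    # Preliminary determination: reputation combined with message features.
    first = (reputation_weight * sender_reputation
             + (1.0 - reputation_weight) * feature_score(features, weights))
    if first <= first_threshold:
        return False
    # Secondary determination: message features only.
    second = feature_score(features, weights)
    return second > second_threshold


weights = {"free money": 3.0, "meeting": -2.0, "newsletter": 0.4}
print(is_spam(["free money"], 0.95, weights))
print(is_spam(["meeting"], 0.99, weights))
```

The secondary check keeps a poor sender reputation from condemning a message on its own: a message is treated as spam only when its content also looks like spam.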
14 Claims
8. One or more computer-readable storage media comprising computer-executable instructions that, when executed, direct a computing system to perform a method, the method comprising:
receiving an email message;
determining a sender identity associated with the email message based on a particular IP address from which the email message was sent;
calculating a probability that the email message from the sender is spam based on the particular IP address;
in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email message is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email address is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
identifying features of the email message;
determining a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
comparing the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
in an event that the first spam score is greater than the first spam score threshold;
determining a second spam score based, at least in part, on the data associated with the features of the email message;
applying different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the message second spam score is greater than 2a, deleting the email message;
in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, delivering the email message into a junk email box; and
in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flagging the email message as potentially being spam and delivering the email message into an email inbox, wherein;
the features include;
a particular word or phrase appeared in the email message;
a particular word or phrase appeared in an attachment of the email message;
a header line of the email message;
a day of a week the email message is sent;
a time of a day the email message is sent; and
a structure of a body of the email message;
each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
the indication is reflected by a numerical value; and
the determining operation of the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
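The threshold bands recited in the claim (1a > 1b > 1c for the first score, 2a > 2b > 2c for the second, with 1a > 2a) map onto three treatments. The sketch below is one possible reading, with hypothetical names (`Thresholds`, `treatment`) and made-up threshold values. The claim does not say what happens when the two scores fall into different bands or below the lowest band, so ordinary inbox delivery is assumed for those cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Three descending threshold values (a > b > c) for one spam score."""
    a: float
    b: float
    c: float

    def __post_init__(self):
        if not (self.a > self.b > self.c):
            raise ValueError("thresholds must satisfy a > b > c")


def treatment(first, second, t1, t2):
    """Pick a treatment from the band both scores fall into."""
    if not t1.a > t2.a:
        raise ValueError("1a must be higher than 2a")
    if first > t1.a and second > t2.a:
        return "delete"
    if t1.b < first < t1.a and t2.b < second < t2.a:
        return "junk"
    if t1.c < first < t1.b and t2.c < second < t2.b:
        return "flag_and_inbox"
    # Assumption: anything else is delivered normally.
    return "inbox"


# Illustrative values only; the claim gives no numbers.
T1 = Thresholds(a=0.95, b=0.80, c=0.60)
T2 = Thresholds(a=0.90, b=0.70, c=0.50)

for scores in [(0.97, 0.95), (0.90, 0.80), (0.70, 0.60), (0.30, 0.20)]:
    print(scores, treatment(*scores, T1, T2))
```

Validating the ordering constraints up front keeps a misconfigured threshold set from silently producing overlapping or empty bands.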
14. An electronic mail filter computing system comprising:
one or more processors;
one or more memories communicatively coupled to the one or more processors, the one or more memories having stored instructions that, when executed, configure the computing system to implement an email spam filter comprising a sender reputation data store, a features date store, a sender identification module, a message features extraction module, a features spam score determination module, and a sender spam score determination module, the email spam filter configured to;
receive an email message;
determine a sender identity associated with the email message based on a particular IP address from which the email message was sent;
calculate a probability that the email message from the sender is spam based on the particular IP address;
in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculate the probability that the sender's email message is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;
in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculate the probability that the sender's email address is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;
identify features of the email message;
determine a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;
compare the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and
in an event that the first spam score is greater than the first spam score threshold;
determine a second spam score based, at least in part, on the data associated with the features of the email message;
apply different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the message second spam score is greater than 2a, delete the email message;
in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, deliver the email message into a junk email box; and
in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flag the email message as potentially being spam and deliver the email message into an email inbox, wherein;
the features include;
a particular word or phrase appeared in the email message;
a particular word or phrase appeared in an attachment of the email message;
a header line of the email message;
a day of a week the email message is sent;
a time of a day the email message is sent; and
a structure of a body of the email message;
each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;
the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;
the indication is reflected by a numerical value; and
the operation to determine the second spam score comprises calculating a Sigmoid function, $1/(1+e^{-w})$, w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
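Two of the data sources named in the claim, per-feature numerical values and sender reputation from user feedback, can be sketched as below. The claim says only that a feature's value is based on how often it appears in good versus spam messages, so the smoothed log-ratio used here is an assumption, as are the function names and the choice of a plain spam fraction for reputation.

```python
import math
from collections import Counter


def feature_weights(spam_messages, good_messages):
    """Assign each feature a weight from its spam vs. good frequency.

    Each message is a list of feature names. The smoothed log-ratio is an
    illustrative choice; positive weights lean toward spam.
    """
    spam_counts = Counter(f for msg in spam_messages for f in set(msg))
    good_counts = Counter(f for msg in good_messages for f in set(msg))
    n_spam, n_good = len(spam_messages), len(good_messages)
    weights = {}
    for f in set(spam_counts) | set(good_counts):
        # Add-one smoothing avoids division by zero and log(0).
        p_spam = (spam_counts[f] + 1) / (n_spam + 2)
        p_good = (good_counts[f] + 1) / (n_good + 2)
        weights[f] = math.log(p_spam / p_good)
    return weights


def sender_reputation(spam_reports, good_reports):
    """Fraction of user feedback marking the sender's mail as spam."""
    total = spam_reports + good_reports
    return spam_reports / total if total else None


spam = [["viagra", "sent_sunday"], ["viagra"]]
good = [["meeting"], ["meeting", "sent_sunday"]]
w = feature_weights(spam, good)
print(sorted(w.items()))
print(sender_reputation(8, 2))
```

Weights produced this way can be summed and passed through the Sigmoid function recited in the claim to obtain a feature-based spam probability.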
Specification