Using message features and sender identity for email spam filtering

  • US 7,899,866 B1
  • Filed: 12/31/2004
  • Issued: 03/01/2011
  • Est. Priority Date: 12/31/2004
  • Status: Expired due to Fees
First Claim

1. A method comprising:

  • receiving an email message;

    determining a sender identity associated with the email message based on a particular IP address from which the email message was sent;

    calculating a probability that the email message from the sender is spam based on the particular IP address;

    in an event that there is not enough data that has been previously collected in association with the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email address is spam according to data associated with a first range of IP addresses based on a top 24 bits of the particular IP address;

    in an event that there is not enough data that has been previously collected in association with the top 24 bits of the particular IP address to calculate the probability that the email message from the sender is spam, calculating the probability that the sender's email message is spam according to data associated with a second range of IP addresses based on a top 16 bits of the particular IP address;

    identifying features of the email message;

    determining a first spam score based, at least in part, on a combination of a reputation associated with the sender identity and data associated with the features of the email message, wherein the reputation indicates whether email messages received from the sender identity tend to be spam, and wherein the data associated with the features of the email message indicates whether email messages having the features tend to be spam;

    comparing the first spam score to a first spam score threshold 1a, wherein 1a represents a maximum allowable first spam score; and

    in an event that the first spam score is greater than the first spam score threshold:

    determining a second spam score based, at least in part, on the data associated with the features of the email message;

    applying different treatments to the email message based on a combination of a first relationship between the first spam score and three first spam score threshold values including 1a, 1b, and 1c, and a second relationship between the second spam score and three second spam score threshold values including 2a, 2b, and 2c, 1a being higher than 1b, 1b being higher than 1c, 2a being higher than 2b, and 2b being higher than 2c, the three first spam score threshold values being different from the three second spam score threshold values and 1a being higher than 2a, in an event that the first spam score is greater than 1a and the second spam score is greater than 2a, deleting the email message;

    in an event that the first spam score is smaller than 1a but greater than 1b and the second spam score is smaller than 2a but greater than 2b, delivering the email message into a junk email box; and

    in an event that the first spam score is smaller than 1b but greater than 1c and the second spam score is smaller than 2b but greater than 2c, flagging the email message as potentially being spam and delivering the email message into an email inbox, wherein:

    the features include:

    a particular word or phrase appeared in the email message;

    a particular word or phrase appeared in an attachment of the email message;

    a header line of the email message;

    a day of a week the email message is sent;

    a time of a day the email message is sent; and

    a structure of a body of the email message;

    each particular features is assigned a numerical value based on a frequency of each particular feature is found in a good email message compared to a frequency of each particular feature is found in a spam email message;

    the reputation associated with the sender identity is gathered from a user's feedback indicating a number of spam email message from the sender identity and a number of non-spam email message from the sender identity;

    the indication is reflected by a numerical value; and

    the determining operation of the second spam score comprises calculating a Sigmoid function, 1/(1+e^w), w being a sum of weighted values associated with the features of the email message, the Sigmoid function converting the sum of weighted values to a probability as a real number in a range [0, 1].
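
The claim describes three mechanical steps that may be easier to follow in code: the fallback from an exact IP address to its top 24 bits and then its top 16 bits, the Sigmoid conversion of a weighted feature sum, and the selection of a treatment from two sets of thresholds. The Python below is only an illustrative sketch, not the patented implementation. The function names, the `MIN_MESSAGES` cutoff for "enough data", the shape of the `counts` mapping, and the "unspecified" result for score combinations the claim does not address are all assumptions. The exponent is taken as it reads in the claim text, 1/(1+e^w); under that form a larger weighted sum gives a smaller output, so how the weights are signed is also an assumption of the sketch.

```python
import math
from ipaddress import IPv4Network

# Hypothetical cutoff for "enough data"; the claim does not state a number.
MIN_MESSAGES = 20


def sender_spam_probability(ip, counts, min_messages=MIN_MESSAGES):
    """Estimate the spam probability for a sending IPv4 address.

    `counts` (assumed shape) maps an IPv4Network to (spam_count, good_count).
    The exact address is tried first, then the range sharing its top 24 bits,
    then the range sharing its top 16 bits. Returns None if no level has
    at least `min_messages` observations.
    """
    for prefix in (32, 24, 16):
        network = IPv4Network((ip, prefix), strict=False)
        spam, good = counts.get(network, (0, 0))
        if spam + good >= min_messages:
            return spam / (spam + good)
    return None


def content_score(weights, features):
    """Second spam score: 1/(1+e^w), with w the sum of the feature weights.

    Features missing from `weights` contribute 0 (an assumption).
    """
    w = sum(weights.get(feature, 0.0) for feature in features)
    return 1.0 / (1.0 + math.exp(w))


def treatment(first, second, first_thresholds, second_thresholds):
    """Pick a treatment from the two scores.

    `first_thresholds` is (1a, 1b, 1c) and `second_thresholds` is (2a, 2b, 2c),
    each in descending order as the claim requires.
    """
    a1, b1, c1 = first_thresholds
    a2, b2, c2 = second_thresholds
    if first > a1 and second > a2:
        return "delete"
    if b1 < first < a1 and b2 < second < a2:
        return "junk"
    if c1 < first < b1 and c2 < second < b2:
        return "flag"
    # The claim does not say what happens for other combinations.
    return "unspecified"
```

As a usage illustration with made-up numbers, thresholds such as (0.9, 0.7, 0.5) for the first score and (0.8, 0.6, 0.4) for the second satisfy the claim's ordering constraints, including 1a being higher than 2a; calling `treatment(0.8, 0.7, (0.9, 0.7, 0.5), (0.8, 0.6, 0.4))` then falls in the junk-folder band.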
