Spam filtering based on statistics and token frequency modeling
Abstract
Embodiments are directed towards classifying messages as spam using a two-phase approach. The first phase employs a statistical classifier to classify messages based on message content. The second phase targets specific message types to capture dynamic characteristics of the messages and identifies spam messages using a token-frequency-based approach. A client component receives messages and sends them to the statistical classifier, which determines the probability that a message belongs to a particular class. The statistical classifier also provides other information about a message, including a token list and token thresholds. The message class, token list, and thresholds are provided to the second phase, where the number of spam tokens in a given message for a given message class is determined. Based on the threshold, the client component then determines whether the message is spam or non-spam.
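The two-phase flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: the tokenizer, the `phase_one` callable standing in for the statistical classifier, the per-class `spam_tokens` sets, and the "greater than or equal" comparison against the threshold are all assumptions introduced here.

```python
import re
from typing import Callable, Dict, List, Set


def tokenize(body: str) -> List[str]:
    """Split a message body into lowercase word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", body.lower())


def two_phase_is_spam(
    body: str,
    phase_one: Callable[[List[str]], str],  # hypothetical stand-in for the statistical classifier
    spam_tokens: Dict[str, Set[str]],       # assumed per-class spam token lists
    thresholds: Dict[str, int],             # assumed per-class matched-token thresholds
) -> bool:
    tokens = tokenize(body)

    # Phase 1: the statistical classifier picks a message class from the content.
    message_class = phase_one(tokens)

    # Phase 2: count the distinct message tokens found in that class's token list.
    token_list = spam_tokens.get(message_class, set())
    matched = sum(1 for token in set(tokens) if token in token_list)

    # Assumption: a class without a configured threshold is never flagged.
    return matched >= thresholds.get(message_class, float("inf"))
```

For example, with a toy `phase_one` that labels any message containing "sale" as class `"promo"`, a token list `{"free", "winner", "click"}` and a threshold of 2, the body "Big SALE! Click now, you are a winner" would be flagged, while "Sale ends Friday" would not.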
15 Claims
1. A network device, comprising:
a transceiver device that is operative to send and receive data over a network; and
a processor device that is operative to perform actions, comprising:
receiving a message;
determining a plurality of tokens from the received message based in part on a text body of the received message;
analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, including a spam message and a non-spam message;
selecting a message class for the received message based on a comparison of the assigned probability values, wherein a probability value is associated with each of the plurality of message classes, wherein the assigned probability values represent a plurality of complement probability values for each of the plurality of message classes, and wherein selecting the message class further comprises selecting a message class having a lowest complement probability value;
providing the message class selected, a list of tokens with associated token frequencies, and the plurality of tokens to a token frequency component that is configured for the selected message class, wherein the list of tokens is determined for the message class selected; and
using the token frequency component to determine a number of tokens in the plurality of tokens that result in an associated token frequency for each matching token in the list of tokens exceeding a token frequency threshold, wherein each number of tokens resulting in the associated token frequency is selectively decremented over time as a period of time expires for each corresponding token; and
based on a comparison between a number of matching tokens in the received message for the selected message class to a matched token threshold provided by the frequency threshold component, identifying whether the received message is a spam message or a non-spam message.
Dependent claims: 2, 3, 4.
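The class-selection step in claim 1 (pick the class with the lowest complement probability) can be illustrated with a small sketch. The claim does not specify the probability model, so the version below assumes a complement-naive-Bayes-style score with Laplace smoothing; the function names, the `alpha` parameter, and the per-class token count tables are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def complement_log_scores(
    tokens: List[str],
    class_token_counts: Dict[str, Counter],
    alpha: float = 1.0,  # assumed Laplace smoothing constant
) -> Dict[str, float]:
    """For each class, score the tokens against the training counts of all OTHER classes."""
    vocab = set(tokens)
    for counts in class_token_counts.values():
        vocab.update(counts)

    scores = {}
    for cls in class_token_counts:
        # Pool the token counts of every class except `cls` (its complement).
        complement = Counter()
        for other, counts in class_token_counts.items():
            if other != cls:
                complement.update(counts)
        total = sum(complement.values()) + alpha * len(vocab)
        # Log-probability of the message tokens under the complement of `cls`.
        scores[cls] = sum(math.log((complement[t] + alpha) / total) for t in tokens)
    return scores


def select_class(tokens: List[str], class_token_counts: Dict[str, Counter]) -> str:
    """Select the class whose complement explains the message worst (lowest score)."""
    scores = complement_log_scores(tokens, class_token_counts)
    return min(scores, key=scores.get)
```

The intuition behind this assumed model: if the tokens are very unlikely under "everything except class C", then C itself is the best fit, so the minimum complement score is selected.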
5. A processor readable non-transitory storage medium that includes data and instructions, wherein the execution of the instructions on a computing device provides for managing messages by enabling actions, comprising:
receiving a message;
determining a plurality of tokens from the received message based in part on a text body of the received message;
employing a statistical classifier to:
analyze the plurality of tokens to assign a plurality of probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is assigned to each of the plurality of message classes;
select a message class for the received message based on a comparison of the plurality of probability values, wherein the assigned probability values represent a plurality of complement probability values for each of the plurality of message classes, and wherein selecting the message class further comprises selecting a message class having a lowest complement probability value; and
employing a token frequency component that is operative based on the selected message class to determine a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold, wherein each token count associated with each token in the list is selectively decremented over time as a period of time expires for each corresponding token;
performing a comparison between the number of matching tokens in the received message for the selected message class to a matched token threshold provided by the token frequency component to identify whether the received message is a spam message or a non-spam message; and
employing the message identification as spam or non-spam to at least one of tagging the message, or distributing the message into a message folder.
Dependent claims: 6, 7, 8, 9, 10.
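The token frequency component recited in claim 5 keeps a count per token, decrements counts as time periods expire, and reports how many message tokens have a count above the token frequency threshold. A minimal sketch follows, under stated assumptions: the class name, the one-decrement-per-expired-period rule, the lazy decay applied only when a token is touched, and the explicit `now` timestamps (in seconds) are illustrative choices, not details taken from the claims.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class TokenFrequencyTable:
    frequency_threshold: int   # a token "matches" when its count exceeds this
    decay_period: float        # assumed: seconds per single decrement
    counts: Dict[str, int] = field(default_factory=dict)
    last_tick: Dict[str, float] = field(default_factory=dict)

    def _decay(self, token: str, now: float) -> None:
        """Decrement the token's count once for each fully expired period."""
        if token not in self.counts:
            return
        expired = int((now - self.last_tick[token]) // self.decay_period)
        if expired > 0:
            self.counts[token] = max(0, self.counts[token] - expired)
            self.last_tick[token] += expired * self.decay_period

    def observe(self, token: str, now: float) -> None:
        """Record one more sighting of the token at time `now`."""
        self._decay(token, now)
        self.counts[token] = self.counts.get(token, 0) + 1
        self.last_tick.setdefault(token, now)

    def count_matches(self, tokens: Iterable[str], now: float) -> int:
        """Number of distinct message tokens whose decayed count exceeds the threshold."""
        matched = 0
        for token in set(tokens):
            self._decay(token, now)
            if self.counts.get(token, 0) > self.frequency_threshold:
                matched += 1
        return matched
```

Passing `now` explicitly rather than reading a clock keeps the sketch deterministic; a deployment would presumably supply the current time instead.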
11. A method for managing a message delivery, comprising:
receiving a message by a network device;
employing the network device to determine a plurality of tokens from the received message based in part on a text body of the received message;
employing the network device to analyze the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is associated with each of the plurality of message classes;
employing the network device to select a message class for the received message based on a comparison of the assigned probability values, wherein the assigned probability values represent a plurality of complement probability values for each of the plurality of message classes, and wherein selecting the message class further comprises selecting a message class having a lowest complement probability value;
employing the network device to determine a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold based on the selected message class, wherein each token count for each token in the list is selectively decremented over time as a period of time expires for each corresponding token; and
based on a comparison between the number of matching tokens in the received message that exceed the token frequency threshold for the selected message class to a matched token threshold provided by a frequency threshold component, employing the network device to identify whether the received message is a spam message or a non-spam message; and
employing the message identification as spam or non-spam to at least one of tagging the message, or distributing the message into a message folder.
Dependent claims: 12, 13, 14, 15.
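The last two steps of claim 11 (comparing the matched-token count to the matched token threshold, then tagging or foldering the message) might look like the sketch below. The claims name neither a header nor any folders, so the `X-Spam-Flag` header, the `Junk` and `Inbox` folder names, and the "greater than or equal" comparison are assumptions made purely for illustration.

```python
from email.message import EmailMessage


def identify(matched_tokens: int, matched_token_threshold: int) -> str:
    """Label the message; the >= direction of the comparison is an assumption."""
    return "spam" if matched_tokens >= matched_token_threshold else "non-spam"


def tag_and_route(msg: EmailMessage, label: str) -> str:
    """Tag the message with a header and return the folder it should be filed in."""
    # Remove any earlier flag so the header appears only once (hypothetical header name).
    del msg["X-Spam-Flag"]
    msg["X-Spam-Flag"] = "YES" if label == "spam" else "NO"
    # Hypothetical folder names.
    return "Junk" if label == "spam" else "Inbox"
```

A caller would feed `identify` the match count produced by the token frequency step and then pass the resulting label to `tag_and_route`, using the returned folder name for delivery.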
Specification