SPAM FILTERING BASED ON STATISTICS AND TOKEN FREQUENCY MODELING
First Claim
1. A network device, comprising:
- a transceiver to send and receive data over a network; and
- a processor that is operative to perform actions, comprising:
  - receiving a message;
  - determining a plurality of tokens from the received message based in part on a text body of the received message;
  - analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, including a spam message and a non-spam message;
  - selecting a message class for the received message based on a comparison of the assigned probability values, wherein a probability value is associated with each of the plurality of message classes;
  - providing the message class selected, a list of tokens with associated token frequencies, and the plurality of tokens to a token frequency component that is configured for the selected message class, wherein the list of tokens is determined for the message class selected;
  - using the token frequency component to determine a number of tokens in the plurality of tokens that result in an associated token frequency for each matching token in the list of tokens exceeding a token frequency threshold; and
  - based on a comparison between the number of tokens exceeding the token frequency threshold to a matched token threshold, identifying the received message as a spam message or a non-spam message.
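The last two actions of the claim (counting matched tokens, then comparing that count to a matched token threshold) can be illustrated with a minimal sketch. The claim does not fix the details, so the following are assumptions made only for illustration: each distinct message token is counted once, "exceeding" means strictly greater than, and a count above the matched token threshold is read as spam. The function names and sample values are hypothetical.

```python
def count_matched_tokens(message_tokens, class_token_frequencies,
                         token_frequency_threshold):
    """Count distinct message tokens that appear in the class token list
    with an associated frequency above the token frequency threshold."""
    matched = 0
    for token in set(message_tokens):  # assumption: count each token once
        frequency = class_token_frequencies.get(token)
        if frequency is not None and frequency > token_frequency_threshold:
            matched += 1
    return matched


def identify_message(message_tokens, class_token_frequencies,
                     token_frequency_threshold, matched_token_threshold):
    """Label the message by comparing the matched-token count to the
    matched token threshold (assumption: above the threshold means spam)."""
    matched = count_matched_tokens(
        message_tokens, class_token_frequencies, token_frequency_threshold
    )
    return "spam" if matched > matched_token_threshold else "non-spam"


# Hypothetical token list for one selected message class.
token_list = {"viagra": 120, "unsubscribe": 45, "meeting": 3}

label = identify_message(
    ["cheap", "viagra", "unsubscribe", "now", "viagra"],
    token_list,
    token_frequency_threshold=40,
    matched_token_threshold=1,
)
print(label)
```

Here "viagra" and "unsubscribe" both exceed the frequency threshold of 40, so two tokens match, which is above the matched token threshold of 1.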
Abstract
Embodiments are directed towards classifying messages as spam using a two-phase approach. The first phase employs a statistical classifier to classify messages based on message content. The second phase targets specific message types to capture dynamic characteristics of the messages and identifies spam messages using a token-frequency-based approach. A client component receives messages and sends them to the statistical classifier, which determines a probability that a message belongs to a particular message class. The statistical classifier also provides other information about a message, including a token list and token thresholds. The message class, token list, and thresholds are provided to the second phase, which determines the number of spam tokens in a given message for a given message class. Based on the threshold, the client component then determines whether the message is spam or non-spam.
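The first phase can be sketched as follows. The abstract does not name a particular statistical classifier, so this example assumes a multinomial naive Bayes model with add-one smoothing purely as a stand-in; the function names, training counts, and priors are hypothetical. It assigns one probability value per message class and selects the class by comparing those values.

```python
import math


def class_probabilities(tokens, token_counts_by_class, class_priors):
    """Return one probability per message class for the given tokens.

    Assumption: multinomial naive Bayes with add-one (Laplace) smoothing,
    used here only as an example of a statistical classifier.
    """
    vocabulary = set()
    for counts in token_counts_by_class.values():
        vocabulary.update(counts)
    vocab_size = len(vocabulary)

    log_scores = {}
    for message_class, counts in token_counts_by_class.items():
        total = sum(counts.values())
        score = math.log(class_priors[message_class])
        for token in tokens:
            score += math.log((counts.get(token, 0) + 1) / (total + vocab_size))
        log_scores[message_class] = score

    # Normalize in log space so the probabilities sum to 1 without underflow.
    peak = max(log_scores.values())
    weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


def select_message_class(probabilities):
    """Select the class with the highest assigned probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical per-class training counts and priors.
training_counts = {
    "spam": {"offer": 8, "free": 6},
    "non-spam": {"meeting": 7, "report": 5},
}
priors = {"spam": 0.5, "non-spam": 0.5}

probabilities = class_probabilities(["free", "offer"], training_counts, priors)
selected = select_message_class(probabilities)
print(selected, probabilities)
```

The selected class would then determine which token list and thresholds are handed to the second phase.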
123 Citations
20 Claims
1. A network device, comprising the elements recited under First Claim above.
Dependent claims: 2, 3, 4, 5
6. A processor readable storage medium that includes data and instructions, wherein the execution of the instructions on a computing device provides for managing messages by enabling actions, comprising:
- receiving a message;
- determining a plurality of tokens from the received message based in part on a text body of the received message;
- employing a statistical classifier to:
  - analyze the plurality of tokens to assign a plurality of probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is assigned to each of the plurality of message classes;
  - select a message class for the received message based on a comparison of the plurality of probability values; and
- employing a token frequency component configured based on the selected message class to determine a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold;
- performing a comparison between the number of tokens to a matched token threshold to identify the received message as a spam message or a non-spam message; and
- employing the message identification as spam or non-spam to at least one of tagging the message, or distributing the message into a message folder.
Dependent claims: 7, 8, 9, 10, 11
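The final action of claim 6, using the spam or non-spam identification to tag the message or distribute it into a message folder, might look like the sketch below. The `Message` type, the folder names "Spam" and "Inbox", and the `tag` and `distribute` flags are hypothetical; the claim does not specify any of them.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    subject: str
    body: str
    tags: list = field(default_factory=list)


def apply_identification(message, identification, folders,
                         tag=True, distribute=True):
    """Tag the message and/or file it into a folder based on its label.

    Returns the folder name used, or None if the message was not filed.
    Folder names are illustrative assumptions.
    """
    if identification not in ("spam", "non-spam"):
        raise ValueError(f"unexpected identification: {identification!r}")
    if tag:
        message.tags.append(identification)
    if distribute:
        folder_name = "Spam" if identification == "spam" else "Inbox"
        folders.setdefault(folder_name, []).append(message)
        return folder_name
    return None


folders = {}
suspect = Message("Win big", "free offer inside")
print(apply_identification(suspect, "spam", folders))
```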
12. A method for managing a message delivery, comprising:
- receiving a message;
- determining a plurality of tokens from the received message based in part on a text body of the received message;
- analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is associated with each of the plurality of message classes;
- selecting a message class for the received message based on a comparison of the assigned probability values;
- determining a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold; and
- based on a comparison between the number of tokens that exceed the token frequency threshold to a matched token threshold for the message class, identifying the received message as a spam message or a non-spam message.
Dependent claims: 13, 14, 15, 16
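The claims say only that tokens are determined "based in part on a text body" of the message and leave the tokenization scheme open. As one possible reading, the sketch below lowercases the text and extracts runs of letters, digits, dollar signs, and apostrophes, optionally including the subject line. The pattern and the `subject` parameter are assumptions, not something the claims specify.

```python
import re

# Assumed token shape: runs of lowercase letters, digits, "$", and "'".
TOKEN_PATTERN = re.compile(r"[a-z0-9$']+")


def tokenize_message(text_body, subject=""):
    """Split a message into lowercase tokens, subject first, then body."""
    text = f"{subject} {text_body}".lower()
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_message("Claim your FREE prize now!", subject="You won $1000")
print(tokens)
```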
17. A system for enabling communications over a network, comprising:
- a statistical classifier that is configured to perform actions, including:
  - determining a plurality of tokens from the received message based in part on a text body of the received message;
  - analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is assigned to each of the plurality of message classes; and
  - selecting a message class for the received message based on a comparison of the determined plurality of probability values; and
- a token frequency component that is configured to perform actions, including:
  - receiving the message class selected, a list of tokens and associated count of each token in the list of tokens, a token frequency threshold, and the plurality of tokens;
  - determining a number of tokens for the received message that result in a respective token count in the list of tokens for the selected message class exceeding the token frequency threshold; and
  - providing the number of tokens exceeding the token frequency threshold to a client component, such that the client component can perform a comparison between the number of tokens to a matched token threshold for the message class to identify the received message as a spam message or a non-spam message.
Dependent claims: 18, 19, 20
Specification