SPAM FILTERING BASED ON STATISTICS AND TOKEN FREQUENCY MODELING

US 20100145900A1
Filed: 12/04/2008
Published: 06/10/2010
Est. Priority Date: 12/04/2008
Status: Active Grant


Abstract

Embodiments are directed towards classifying messages as spam using a two-phase approach. The first phase employs a statistical classifier to classify messages based on message content. The second phase targets specific message types to capture dynamic characteristics of the messages and identifies spam messages using a token-frequency-based approach. A client component receives messages and sends them to the statistical classifier, which determines a probability that a message belongs to a particular message class. The statistical classifier also provides other information about a message, including a token list and token thresholds. The message class, token list, and thresholds are provided to the second phase, where the number of spam tokens in a given message for a given message class is determined. Based on the threshold, the client component then determines whether the message is spam or non-spam.
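
The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`tokenize`, `pick_class`, `count_hot_tokens`, `is_spam`), the whitespace tokenizer, the pluggable `classifier` callable, and the rule that a message is spam when the count of matched tokens exceeds the matched token threshold are all assumptions made for the example.

```python
from collections import Counter


def tokenize(body):
    # Hypothetical tokenizer: lowercase the text body and split on whitespace.
    return body.lower().split()


def pick_class(class_probabilities):
    # Phase 1 result: select the message class with the highest probability.
    return max(class_probabilities, key=class_probabilities.get)


def count_hot_tokens(tokens, token_counts, freq_threshold):
    # Phase 2: for every listed token that appears in the message, add its
    # occurrences to the stored count, then count how many of those tokens
    # now exceed the token frequency threshold.
    hot = 0
    for token, occurrences in Counter(tokens).items():
        if token in token_counts:
            token_counts[token] += occurrences
            if token_counts[token] > freq_threshold:
                hot += 1
    return hot


def is_spam(body, classifier, token_lists, freq_threshold, matched_threshold):
    # classifier: any callable mapping a token list to {class: probability}.
    # token_lists: {class: {token: running count}}, one list per message class.
    tokens = tokenize(body)
    message_class = pick_class(classifier(tokens))
    class_counts = token_lists.setdefault(message_class, {})
    hot = count_hot_tokens(tokens, class_counts, freq_threshold)
    # Assumed decision rule: spam when enough listed tokens are "hot".
    return hot > matched_threshold
```

Any statistical classifier that returns one probability per message class can be passed as `classifier`; the per-class token lists are mutated in place so that counts accumulate across messages.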

20 Claims

1. A network device, comprising:
- a transceiver to send and receive data over a network; and
  
  a processor that is operative to perform actions, comprising:
  
  receiving a message;
  
  determining a plurality of tokens from the received message based in part on a text body of the received message;
  
  analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, including a spam message and a non-spam message;
  
  selecting a message class for the received message based on a comparison of the assigned probability values, wherein a probability value is associated with each of the plurality of message classes;
  
  providing the message class selected, a list of tokens with associated token frequencies, and the plurality of tokens to a token frequency component that is configured for the selected message class, wherein the list of tokens are determined for the message class selected; and
  
  using the token frequency component to determine a number of tokens in the plurality of tokens that result in an associated token frequency for each matching token in the list of tokens exceeding a token frequency threshold; and
  
  based on a comparison between the number of tokens exceeding the token frequency threshold to a matched token threshold identifying the received message as a spam message or a non-spam message.
- - 2. The network device of claim 1, wherein analyzing the plurality of tokens and selecting a message class comprises employing a naïve Bayesian Classifier.
  - 3. The network device of claim 1, wherein the assigned probability values represent a plurality of complement probability values for each of the plurality of message classes, and wherein selecting the message class further comprises selecting a message class having a lowest complement probability value.
  - 4. The network device of claim 1, wherein the tokens in the list of tokens for the message class selected expire over time.
  - 5. The network device of claim 1, wherein the processor that is operative to perform actions, further comprising employing a white list to further classify the received message.
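
Claim 3's selection by lowest complement probability can be illustrated with a small complement naive Bayes style scorer. This is a sketch under assumptions: the names `complement_scores` and `select_class`, the Laplace smoothing parameter `alpha`, and the use of log-scores in place of raw probabilities are choices made for the example, not details taken from the claims.

```python
import math
from collections import Counter


def complement_scores(tokens, class_token_counts, alpha=1.0):
    # class_token_counts: {class: {token: training count}}.
    # For each class, score the message against the token statistics of
    # all OTHER classes (the complement). Scores are log-likelihoods.
    vocab = set()
    for counts in class_token_counts.values():
        vocab.update(counts)

    scores = {}
    for cls in class_token_counts:
        complement = Counter()
        for other, counts in class_token_counts.items():
            if other != cls:
                complement.update(counts)
        total = sum(complement.values()) + alpha * len(vocab)
        scores[cls] = sum(
            math.log((complement[t] + alpha) / total)
            for t in tokens
            if t in vocab  # tokens never seen in training are skipped
        )
    return scores


def select_class(scores):
    # A message fits a class best when it fits that class's complement
    # worst, so the class with the lowest complement score is selected.
    return min(scores, key=scores.get)
```

Because the scores are logarithms, "lowest" here means most negative; the ordering is the same as it would be for the underlying probabilities.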

6. A processor readable storage medium that includes data and instructions, wherein the execution of the instructions on a computing device provides for managing messages by enabling actions, comprising:
- receiving a message;
  
  determining a plurality of tokens from the received message based in part on a text body of the received message;
  
  employing a statistical classifier to:
  
  analyze the plurality of tokens to assign a plurality of probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is assigned to each of the plurality of message classes;
  
  select a message class for the received message based on a comparison of the plurality of probability values; and
  
  employing a token frequency component configured based on the selected message class to determine a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold;
  
  performing a comparison between the number of tokens to a matched token threshold to identify the received message as a spam message or a non-spam message; and
  
  employing the message identification as spam or non-spam to at least one of tagging the message, or distributing the message into a message folder.
- - 7. The processor readable storage medium of claim 6, wherein the statistical classifier is configured to operate as at least one of a Bayesian classifier, a Support-Vector machine, logistic regression classifier, perceptron, a Markovian discrimination classifier, a neural network, or a decision tree.
  - 8. The processor readable storage medium of claim 6, wherein the statistical classifier is further configured to employ at least one of a length normalization, term frequency transformation, or an inverse document transformation in assigning a plurality of probability values.
  - 9. The processor readable storage medium of claim 6, wherein determining a number of tokens in the plurality of tokens that result in a respective token count in a list of tokens exceeding a token frequency threshold further comprises:
    - modifying a token count for each token in the list of tokens based on the plurality of tokens;
      
      comparing the resulting modified token count to the token frequency threshold to determine if the modified token count exceeds the token frequency threshold, and if so, incrementing the number of tokens that exceed the token frequency threshold.
  - 10. The processor readable storage medium of claim 6, wherein each token in the list of tokens is configured to expire over time such that a given token associated with the message class is removed from the list of tokens after a defined time period.
  - 11. The processor readable storage medium of claim 6, wherein the plurality of probability values are complement probability values.
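
The token expiry described in claim 10 could be modeled along the following lines. This is only a sketch: the class name `ExpiringTokenList`, the `ttl_seconds` parameter, and the choice to measure the defined time period from the moment a token was added are assumptions, since the claim does not say how the period is tracked.

```python
import time


class ExpiringTokenList:
    """Per-class token list whose entries are dropped after a fixed period."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # token -> (count, timestamp when added)

    def add(self, token, count=0, now=None):
        # Record a token with its current count and the time it was added.
        now = time.time() if now is None else now
        self.entries[token] = (count, now)

    def purge(self, now=None):
        # Remove every token whose age has reached the time-to-live.
        now = time.time() if now is None else now
        expired = [
            token
            for token, (_, added) in self.entries.items()
            if now - added >= self.ttl
        ]
        for token in expired:
            del self.entries[token]
        return expired

    def count(self, token, now=None):
        # Expired tokens are purged first, so they read as a count of zero.
        self.purge(now)
        entry = self.entries.get(token)
        return entry[0] if entry else 0
```

Passing `now` explicitly keeps the sketch deterministic for testing; in normal use the arguments can be omitted and the wall clock is used.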

12. A method for managing a message delivery, comprising:
- receiving a message;
  
  determining a plurality of tokens from the received message based in part on a text body of the received message;
  
  analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is associated with each of the plurality of message classes;
  
  selecting a message class for the received message based on a comparison of the assigned probability values;
  
  determining a number of tokens in the plurality of tokens that result in a respective token count for each token in a list of tokens for the selected message class exceeding a token frequency threshold; and
  
  based on a comparison between the number of tokens that exceed the token frequency threshold to a matched token threshold for the message class, identifying the received message as a spam message or a non-spam message.
- - 13. The method of claim 12, wherein analyzing and selecting are performed by a complement naïve Bayesian classifier.
  - 14. The method of claim 12, wherein determining a number of tokens further comprises employing a maximum multiplier to limit a token count.
  - 15. The method of claim 12, wherein the tokens in the list of tokens are configured to expire over time.
  - 16. The method of claim 12, wherein a statistical classifier that is used to assign probability values is trained based on messages of known message classes, and wherein the training includes employing at least one of a length normalization for the messages, term frequency transformation, or an inverse message transformation.
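
The training transformations named in claim 16 are not defined in the claims themselves. As an illustration only, the sketch below uses forms that are common in text classification: a logarithmic term frequency transform, an inverse message (document) frequency weight, and L2 length normalization. The function name `transform_message` and its parameters are hypothetical.

```python
import math
from collections import Counter


def transform_message(tokens, message_freq, num_messages):
    # tokens: tokens of one training message.
    # message_freq: {token: number of training messages containing it}.
    # num_messages: total number of training messages.
    weights = {}
    for token, f in Counter(tokens).items():
        # Term frequency transform: dampen repeated occurrences.
        w = math.log(1 + f)
        # Inverse message frequency: down-weight tokens seen in many messages.
        mf = message_freq.get(token, 0)
        if mf:
            w *= math.log(num_messages / mf)
        weights[token] = w

    # Length normalization: scale so long messages do not dominate training.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm:
        weights = {token: w / norm for token, w in weights.items()}
    return weights
```

The resulting weights would replace raw token counts when training the statistical classifier; a token that appears in every training message receives a weight of zero under this particular inverse frequency form.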

17. A system for enabling communications over a network, comprising:
- a statistical classifier that is configured to perform actions, including:
  
  determining a plurality of tokens from the received message based in part on a text body of the received message;
  
  analyzing the plurality of tokens to assign probability values that the received message is classifiable as one of a plurality of message classes, wherein a probability value is assigned to each of the plurality of message classes; and
  
  selecting a message class for the received message based on a comparison of the determined plurality of probability values; and
  
  a token frequency component that is configured to perform actions, including:
  
  receiving the message class selected, a list of tokens and associated count of each token in the list of tokens, a token frequency threshold, and the plurality of tokens;
  
  determining a number of tokens for the received message that result in a respective token count in the list of tokens for the selected message class exceeding the token frequency threshold; and
  
  providing the number of tokens exceeding the token frequency threshold to a client component, such that the client component can perform a comparison between the number of tokens to a matched token threshold for the message class to identify the received message as a spam message or a non-spam message.
- - 18. The system of claim 17, wherein the statistical classifier comprises at least one of a Bayesian classifier, a Support-Vector machine, logistic regression classifier, perceptron, a Markovian discrimination classifier, a neural network, or a decision tree.
  - 19. The system of claim 17, wherein determining the count of the number of matched tokens further comprises:
    - identifying tokens in the received plurality of tokens associated with the received message that substantially match a token in the list of tokens;
      
      for each matching token, incrementing the count for the respective token in the list of tokens; and
      
      if the count for the token exceeds the token frequency threshold, incrementing the number of tokens.
  - 20. The system of claim 19, wherein the tokens in the list of tokens are configured to expire over time for the selected message class.


Current Assignee
Yahoo Assets LLC
Original Assignee
Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors
Wei, Stanley Ke; Narayan, Sharat; Kundu, Anirban; Zheng, Lei; Ramarao, Vishwanath Tumkur; Risher, Mark E.

Granted Patent

US 8,364,766 B2
US Class Current

706/52
CPC Class Codes

G06N 7/01 Probabilistic graphical mod...

H04L 51/212 using filtering or selectiv...
