Learning framework for online applications

US 7,996,897 B2
Filed: 01/23/2008
Issued: 08/09/2011
Est. Priority Date: 01/23/2008
Status: Active Grant

Abstract

Learning to detect, and detecting, spam messages using a multi-stage combination of probability calculations based on individual and aggregate training sets of previously identified messages. During a preliminary phase, classifiers are trained, and lower and upper limit probabilities and a combined probability threshold are iteratively determined, using a multi-stage combination of probability calculations based on minor and major subsets of messages previously categorized as valid or spam. During a live phase, a first stage classifier uses only a particular subset, and a second stage classifier uses a master set of previously categorized messages. If a newly received message cannot be categorized with certainty by the first stage classifier, and a computed first stage probability is within the previously determined lower and upper limits, the first and second stage probabilities are combined. If the combined probability is greater than the previously determined combined probability threshold, the received message is marked as spam.
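
The live-phase decision described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `classify_message`, its parameters, and the callables `second_stage` and `combine` are hypothetical. The abstract does not say what happens to a message whose first stage probability falls below the lower limit, or exactly on a limit, so the handling of those cases here is an assumption.

```python
from typing import Callable


def classify_message(
    p_first: float,
    second_stage: Callable[[], float],
    combine: Callable[[float, float], float],
    lower_limit: float,
    upper_limit: float,
    threshold: float,
) -> str:
    """Return 'spam' or 'valid' for one received message (illustrative only)."""
    if lower_limit < p_first < upper_limit:
        # Within the limits: combine the first and second stage probabilities
        # and compare the result against the combined probability threshold.
        p_combined = combine(p_first, second_stage())
        return "spam" if p_combined > threshold else "valid"
    if p_first >= upper_limit:
        # Above the upper limit the message is marked as spam without
        # combining. Treating equality the same way is an assumption.
        return "spam"
    # At or below the lower limit the probabilities are not combined.
    # Returning 'valid' here is an assumption of this sketch.
    return "valid"
```

Passing the second stage as a callable means the master-set classifier is only consulted when the first stage probability lies between the two limits, which mirrors the order of steps in the abstract.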

16 Claims

1. A computer implemented method for detecting spam messages, comprising:
- determining a first stage probability of whether a received message is a spam message, wherein the first stage probability is determined by evaluating the received message in relation to a subset of test messages, wherein each subset test message in the subset of test messages was previously identified as either valid or spam;
  
  receiving an indication that a first stage classifier is unsure, based on the first stage probability, as to whether the received message is a spam message;
  
  determining that the first stage probability is greater than a lower limit for combining probabilities and is less than an upper limit for combining probabilities, wherein the lower limit for combining probabilities indicates a probability value below which the first stage probability will not be combined with a second stage probability to determine whether the received message is a spam message, and wherein the upper limit for combining probabilities indicates a probability value above which the received message is marked as a spam message without combining the first stage probability with the second stage probability, and wherein the determining the lower limit is determined by:
  
  setting the lower limit to an initial value;
  
  counting correctly identified spam messages from a randomized test set of messages;
  
  counting correctly identified valid messages from the randomized test set of messages;
  
  counting incorrectly identified spam messages from the randomized test set of messages;
  
  counting incorrectly identified valid messages from the randomized test set of messages;
  
  calculating, for each of multiple incremental values of the lower limit, a lower limit classification ratio as a ratio of:
  
  a first sum of:
  
  the count of the correctly identified spam messages; and
  
  the count of the correctly identified valid messages;
  
  over a second sum of:
  
  the count of the incorrectly identified spam messages;
  
  the count of the incorrectly identified valid messages; and
  
  one; and
  
  selecting the incremental value of the lower limit that corresponds to the highest value of the lower limit classification ratio;
  
  determining a second stage probability of whether the received message is a spam message, wherein the second stage probability is determined by evaluating the received message in relation to a subset-specific master set of test messages, which includes the subset of test messages, wherein each subset-specific master set test message in the subset-specific master set of test messages was previously identified as either valid or spam;
  
  computing a combined probability based on the first stage probability and the second stage probability;
  
  determining that the combined probability is greater than a threshold probability at which a threshold classification ratio is highest, wherein the classification ratio comprises a ratio of correctly identified spam messages over incorrectly identified spam messages.
- - 2. The method of claim 1, wherein the subset of test messages comprises subset test messages that were each identified as either valid or spam by a particular message recipient.
  - 3. The method of claim 1, wherein the subset-specific master set of test messages comprises subset-specific master set test messages that were identified as either valid or spam by a plurality of message recipients, including a particular message recipient.
  - 4. The method of claim 1, wherein at least one of the first stage classifier and the second stage classifier comprise a Bayesian filter.
  - 5. The method of claim 1, wherein the first stage classifier is unsure if the first stage probability is:
    - greater than an empirically determined lower threshold; and
      
      less than an empirically determined upper threshold.
  - 6. The method of claim 1, wherein counting correctly identified spam messages comprises:
    - using the first stage classifier to determine a preliminary phase user probability that a randomized test set message is a spam message;
      
      using the second stage classifier to determine a preliminary phase master probability that the randomized test set message is a spam message;
      
      computing a preliminary phase combined probability based on the preliminary phase user probability and the preliminary phase master probability;
      
      determining that the preliminary phase combined probability is greater than the threshold probability; and
      
      determining that the randomized test set message was previously identified as spam.
  - 7. The method of claim 1, further comprising determining the upper limit for combining probabilities by performing the steps of claim 5, except replacing the lower limit with the upper limit;
    - replacing the initial value with the lower limit; and
      
      replacing the lower limit classification ratio with an upper limit classification ratio.
  - 8. The method of claim 1, wherein the combined probability is computed according to the following formula:
    - (Pspam1+UCF*Pspam2)/2, wherein:
      
      Pspam1 is the first stage probability;
      
      Pspam2 is the second stage probability; and
      
      UCF is a user confidence factor for weighting the second stage probability.
  - 9. The method of claim 1, wherein prior to determining the first stage probability, further comprising:
    - training the second stage classifier based on a randomized master set of test messages, each of which is pre-classified as either valid or spam;
      
      determining the threshold probability based on a threshold classification ratio of correct spam counts and incorrect spam counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      determining the lower limit for combining probabilities based on a lower limit classification ratio as a function of correct spam counts, correct valid message counts, incorrect spam counts, and incorrect valid message counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      determining the upper limit for combining probabilities based on an upper limit classification ratio as a function of correct spam counts, correct valid message counts, incorrect spam counts, and incorrect valid message counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      training the first stage classifier based on the subset of test messages; and
      
      retraining the second stage classifier based on a subset-specific master set of test messages that includes the subset of test messages.
  - 10. A computer readable storage medium that is not a signal, storing computer readable instructions that cause a computing device to perform the steps of claim 1.
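
The lower limit selection recited in claim 1 can be sketched as a sweep over candidate values. This is a minimal sketch under stated assumptions: `classification_ratio`, `select_lower_limit`, and the `predict_is_spam` callable are hypothetical names, and the claims do not say how ties between candidates are resolved (this sketch keeps the first highest value). "Incorrectly identified spam" is read here as a valid message classified as spam. Because both incorrect counts are summed in the denominator, the opposite reading gives the same ratio.

```python
from typing import Callable, Iterable, Sequence, Tuple, TypeVar

Message = TypeVar("Message")


def classification_ratio(
    correct_spam: int, correct_valid: int, incorrect_spam: int, incorrect_valid: int
) -> float:
    """(correct spam + correct valid) / (incorrect spam + incorrect valid + 1)."""
    # The added one keeps the denominator nonzero when nothing is misclassified.
    return (correct_spam + correct_valid) / (incorrect_spam + incorrect_valid + 1)


def select_lower_limit(
    candidates: Iterable[float],
    labeled_test_set: Sequence[Tuple[Message, bool]],
    predict_is_spam: Callable[[Message, float], bool],
) -> Tuple[float, float]:
    """Return (best lower limit, its ratio) over the candidate values."""
    best_limit, best_ratio = None, float("-inf")
    for limit in candidates:
        correct_spam = correct_valid = incorrect_spam = incorrect_valid = 0
        for message, is_spam in labeled_test_set:
            predicted_spam = predict_is_spam(message, limit)
            if predicted_spam and is_spam:
                correct_spam += 1
            elif not predicted_spam and not is_spam:
                correct_valid += 1
            elif predicted_spam:
                incorrect_spam += 1   # valid message classified as spam
            else:
                incorrect_valid += 1  # spam message classified as valid
        ratio = classification_ratio(
            correct_spam, correct_valid, incorrect_spam, incorrect_valid
        )
        if ratio > best_ratio:  # strict comparison keeps the first maximum
            best_limit, best_ratio = limit, ratio
    if best_limit is None:
        raise ValueError("candidates must not be empty")
    return best_limit, best_ratio
```

In the claimed method the per-message prediction would come from the two-stage classifier run with the candidate limit. Here that step is abstracted behind `predict_is_spam` so the sweep itself stays visible.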

11. A system for detecting spam messages, comprising:
- a processor;
  
  a communication interface in communication with the processor and in communication with an electronic network;
  
  a memory in communication with the processor and storing computer readable instructions that cause the processor to perform a plurality of operations, including:
  
  determining a first stage probability of whether a received message is a spam message, wherein the first stage probability is determined by evaluating the received message in relation to a subset of test messages, wherein each subset test message in the subset of test messages was previously identified as either valid or spam;
  
  receiving an indication that a first stage classifier is unsure, based on the first stage probability, as to whether the received message is a spam message;
  
  determining that the first stage probability is greater than a lower limit for combining probabilities and is less than an upper limit for combining probabilities, wherein the lower limit for combining probabilities indicates a probability value below which the first stage probability will not be combined with a second stage probability to determine whether the received message is a spam message, and wherein the upper limit for combining probabilities indicates a probability value above which the received message is marked as a spam message without combining the first stage probability with the second stage probability, and wherein the determining the lower limit is determined by:
  
  setting the lower limit to an initial value;
  
  counting correctly identified spam messages from a randomized test set of messages;
  
  counting correctly identified valid messages from the randomized test set of messages;
  
  counting incorrectly identified spam messages from the randomized test set of messages;
  
  counting incorrectly identified valid messages from the randomized test set of messages;
  
  calculating, for each of multiple incremental values of the lower limit, a lower limit classification ratio as a ratio of:
  
  a first sum of:
  
  the count of the correctly identified spam messages; and
  
  the count of the correctly identified valid messages;
  
  over a second sum of:
  
  the count of the incorrectly identified spam messages;
  
  the count of the incorrectly identified valid messages; and
  
  one; and
  
  selecting the incremental value of the lower limit that corresponds to the highest value of the lower limit classification ratio;
  
  determining a second stage probability of whether the received message is a spam message, wherein the second stage probability is determined by evaluating the received message in relation to a master set of test messages, which includes the subset of test messages, wherein each master set test message in the master set of test messages was previously identified as either valid or spam;
  
  computing a combined probability based on the first stage probability and the second stage probability;
  
  determining that the combined probability is greater than a threshold probability at which a threshold classification ratio is highest, wherein the classification ratio comprises a ratio of correctly identified spam messages over incorrectly identified spam messages.
- - 12. The system of claim 11, wherein the subset of test messages comprises subset test messages that were each identified by a message recipient as either valid or spam.
  - 13. The system of claim 11, wherein the master set of test messages comprises master set test messages that were identified by a plurality of message recipients as either valid or spam.
  - 14. The system of claim 11, wherein the computer readable instructions cause the processor to perform further operations for counting correctly identified spam messages, including:
    - using the first stage classifier to determine a preliminary stage user probability that a statistical test set message is a spam message;
      
      using the second stage classifier to determine a preliminary stage master probability that the statistical test set message is a spam message;
      
      computing a preliminary stage combined probability based on the preliminary stage user probability and the preliminary stage master probability;
      
      determining that the preliminary stage combined probability is greater than the threshold probability; and
      
      determining that the statistical test set message was previously identified as spam.
  - 15. The system of claim 11, wherein the combined probability is computed according to the following formula:
    - (Pspam1+UCF*Pspam2)/2, wherein:
      
      Pspam1 is the first stage probability;
      
      Pspam2 is the second stage probability; and
      
      UCF is a user confidence factor for weighting the second stage probability.
  - 16. The system of claim 11, wherein prior to determining the first stage probability, the computer readable instructions cause the processor to perform further operations, including:
    - training the second stage classifier with randomly sampled messages, each of which is pre-classified as either valid or spam;
      
      determining the threshold probability based on a threshold classification ratio of correct spam counts and incorrect spam counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      determining the lower limit for combining probabilities based on a lower limit classification ratio as a function of correct spam counts, correct valid message counts, incorrect spam counts, and incorrect valid message counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      determining the upper limit for combining probabilities based on an upper limit classification ratio as a function of correct spam counts, correct valid message counts, incorrect spam counts, and incorrect valid message counts determined through an evaluation of the subset of test messages with the trained second stage classifier and the first stage classifier;
      
      training the first stage classifier based on the subset of test messages; and
      
      re-training the second stage classifier with the master set of test messages.
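
The combined probability formula recited in claims 8 and 15 translates directly into code. The function and parameter names below are hypothetical. The claims do not state a range for UCF, so this sketch applies no clamping; with a UCF greater than one the result can exceed 1.

```python
def combined_probability(p_spam1: float, p_spam2: float, ucf: float) -> float:
    """(Pspam1 + UCF * Pspam2) / 2, as written in claims 8 and 15."""
    # p_spam1: first stage probability
    # p_spam2: second stage probability
    # ucf: user confidence factor weighting the second stage probability
    return (p_spam1 + ucf * p_spam2) / 2
```

With a UCF of 1 the formula reduces to the plain average of the two stage probabilities.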

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Jeyaraman, Raghav; Pandey, Abhishek Kumar; Ramarao, Vishwanth Tumkur
Primary Examiner(s): Pearson, David J
Application Number: US12/011,114
Publication Number: US 20090187987A1
Time in Patent Office: 1,294 Days
Field of Search: None
US Class Current: 726/22
CPC Class Codes: H04L 51/212 using filtering or selectiv...
