Learning framework for online applications
First Claim
1. A computer implemented method for detecting spam messages, comprising:
determining a first stage probability of whether a received message is a spam message, wherein the first stage probability is determined by evaluating the received message in relation to a subset of test messages, wherein each subset test message in the subset of test messages was previously identified as either valid or spam;
receiving an indication that a first stage classifier is unsure, based on the first stage probability, as to whether the received message is a spam message;
determining that the first stage probability is greater than a lower limit for combining probabilities and is less than an upper limit for combining probabilities, wherein the lower limit for combining probabilities indicates a probability value below which the first stage probability will not be combined with a second stage probability to determine whether the received message is a spam message, and wherein the upper limit for combining probabilities indicates a probability value above which the received message is marked as a spam message without combining the first stage probability with the second stage probability, and wherein the determining the lower limit is determined by:
setting the lower limit to an initial value;
counting correctly identified spam messages from a randomized test set of messages;
counting correctly identified valid messages from the randomized test set of messages;
counting incorrectly identified spam messages from the randomized test set of messages;
counting incorrectly identified valid messages from the randomized test set of messages;
calculating, for each of multiple incremental values of the lower limit, a lower limit classification ratio as a ratio of:
a first sum of:
the count of the correctly identified spam messages; and
the count of the correctly identified valid messages;
over a second sum of:
the count of the incorrectly identified spam messages;
the count of the incorrectly identified valid messages; and
one; and
selecting the incremental value of the lower limit that corresponds to the highest value of the lower limit classification ratio;
determining a second stage probability of whether the received message is a spam message, wherein the second stage probability is determined by evaluating the received message in relation to a subset-specific master set of test messages, which includes the subset of test messages, wherein each subset-specific master set test message in the subset-specific master set of test messages was previously identified as either valid or spam;
computing a combined probability based on the first stage probability and the second stage probability;
determining that the combined probability is greater than a threshold probability at which a threshold classification ratio is highest, wherein the classification ratio comprises a ratio of correctly identified spam messages over incorrectly identified spam messages.
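The lower limit selection recited in the claim can be sketched in Python. This is only an illustrative reading under stated assumptions, not the patented implementation: the names `Sample`, `predict_spam`, and `select_lower_limit` are hypothetical, the claim does not say how the two probabilities are combined (a plain average is assumed here), and "incorrectly identified spam" is read as a valid message marked as spam.

```python
from typing import Iterable, Sequence, Tuple

# One labelled test message: (first stage probability, second stage probability, is_spam).
Sample = Tuple[float, float, bool]


def classification_ratio(correct_spam: int, correct_valid: int,
                         incorrect_spam: int, incorrect_valid: int) -> float:
    """(correct spam + correct valid) / (incorrect spam + incorrect valid + 1)."""
    return (correct_spam + correct_valid) / (incorrect_spam + incorrect_valid + 1)


def predict_spam(p1: float, p2: float, lower: float, upper: float,
                 threshold: float) -> bool:
    """Hypothetical decision rule; the averaging step is an assumption."""
    if p1 <= lower:   # assumption: at or below the lower limit counts as valid
        return False
    if p1 >= upper:   # at or above the upper limit: spam without combining
        return True
    return (p1 + p2) / 2 > threshold


def select_lower_limit(samples: Sequence[Sample], candidates: Iterable[float],
                       upper: float, threshold: float) -> Tuple[float, float]:
    """Return (best lower limit, its classification ratio) over the candidates."""
    best_limit, best_ratio = None, -1.0
    for lower in candidates:
        correct_spam = correct_valid = incorrect_spam = incorrect_valid = 0
        for p1, p2, is_spam in samples:
            predicted = predict_spam(p1, p2, lower, upper, threshold)
            if predicted and is_spam:
                correct_spam += 1
            elif not predicted and not is_spam:
                correct_valid += 1
            elif predicted:
                incorrect_spam += 1    # valid message marked as spam
            else:
                incorrect_valid += 1   # spam message marked as valid
        ratio = classification_ratio(correct_spam, correct_valid,
                                     incorrect_spam, incorrect_valid)
        if ratio > best_ratio:         # ties keep the earliest candidate
            best_limit, best_ratio = lower, ratio
    if best_limit is None:
        raise ValueError("candidates must not be empty")
    return best_limit, best_ratio
```

The "plus one" in the denominator keeps the ratio finite when a candidate value produces no misclassifications on the test set.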
Abstract
Learning to detect, and detecting, spam messages using a multi-stage combination of probability calculations based on individual and aggregate training sets of previously identified messages. During a preliminary phase, classifiers are trained, and lower and upper limit probabilities and a combined probability threshold are iteratively determined using a multi-stage combination of probability calculations based on minor and major subsets of messages previously categorized as valid or spam. During a live phase, a first stage classifier uses only a particular subset of previously categorized messages, and a second stage classifier uses a master set of previously categorized messages. If a newly received message cannot be categorized with certainty by the first stage classifier, and a computed first stage probability is within the previously determined lower and upper limits, the first and second stage probabilities are combined. If the combined probability is greater than the previously determined combined probability threshold, the received message is marked as spam.
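The live phase described in the abstract might look like the following Python sketch. It rests on assumptions the text does not settle: `detect_spam`, `Decision`, and the `Classifier` type are hypothetical names, the default `combine` function is a plain average, and a first stage probability at or below the lower limit is treated as valid. The second stage classifier is only called when the first stage probability falls between the two limits.

```python
from dataclasses import dataclass
from typing import Callable

# A classifier maps message text to an estimated probability of spam.
Classifier = Callable[[str], float]


@dataclass
class Decision:
    is_spam: bool
    stage: str          # "first" or "combined": which path produced the decision
    probability: float  # the probability the decision was based on


def average(p1: float, p2: float) -> float:
    """Assumed combination rule; the source does not specify one."""
    return (p1 + p2) / 2


def detect_spam(message: str,
                first_stage: Classifier,
                second_stage: Classifier,
                lower: float,
                upper: float,
                threshold: float,
                combine: Callable[[float, float], float] = average) -> Decision:
    p1 = first_stage(message)
    if p1 >= upper:
        # At or above the upper limit: marked as spam without combining.
        return Decision(True, "first", p1)
    if p1 <= lower:
        # Assumption: at or below the lower limit the message is kept as valid.
        return Decision(False, "first", p1)
    # First stage is unsure: consult the master-set classifier and combine.
    p2 = second_stage(message)
    combined = combine(p1, p2)
    return Decision(combined > threshold, "combined", combined)
```

Passing `combine` as a parameter keeps the sketch neutral about the combination rule, which would be chosen during the preliminary phase together with the limits and the threshold.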
16 Claims
11. A system for detecting spam messages, comprising:
a processor;
a communication interface in communication with the processor and in communication with an electronic network;
a memory in communication with the processor and storing computer readable instructions that cause the processor to perform a plurality of operations, including:
determining a first stage probability of whether a received message is a spam message, wherein the first stage probability is determined by evaluating the received message in relation to a subset of test messages, wherein each subset test message in the subset of test messages was previously identified as either valid or spam;
receiving an indication that a first stage classifier is unsure, based on the first stage probability, as to whether the received message is a spam message;
determining that the first stage probability is greater than a lower limit for combining probabilities and is less than an upper limit for combining probabilities, wherein the lower limit for combining probabilities indicates a probability value below which the first stage probability will not be combined with a second stage probability to determine whether the received message is a spam message, and wherein the upper limit for combining probabilities indicates a probability value above which the received message is marked as a spam message without combining the first stage probability with the second stage probability, and wherein the determining the lower limit is determined by:
setting the lower limit to an initial value;
counting correctly identified spam messages from a randomized test set of messages;
counting correctly identified valid messages from the randomized test set of messages;
counting incorrectly identified spam messages from the randomized test set of messages;
counting incorrectly identified valid messages from the randomized test set of messages;
calculating, for each of multiple incremental values of the lower limit, a lower limit classification ratio as a ratio of:
a first sum of:
the count of the correctly identified spam messages; and
the count of the correctly identified valid messages;
over a second sum of:
the count of the incorrectly identified spam messages;
the count of the incorrectly identified valid messages; and
one; and
selecting the incremental value of the lower limit that corresponds to the highest value of the lower limit classification ratio;
determining a second stage probability of whether the received message is a spam message, wherein the second stage probability is determined by evaluating the received message in relation to a master set of test messages, which includes the subset of test messages, wherein each master set test message in the master set of test messages was previously identified as either valid or spam;
computing a combined probability based on the first stage probability and the second stage probability;
determining that the combined probability is greater than a threshold probability at which a threshold classification ratio is highest, wherein the classification ratio comprises a ratio of correctly identified spam messages over incorrectly identified spam messages.
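The final limitation of the independent claims picks the threshold probability at which the ratio of correctly identified spam to incorrectly identified spam is highest. A minimal Python sketch of such a sweep follows, with assumptions marked: `select_threshold` is a hypothetical name, "incorrectly identified spam" is read as a valid message marked as spam, and one is added to the denominator (borrowed from the lower limit ratio) so that the ratio stays defined when no valid message is flagged. The claims do not state how that case is handled.

```python
from typing import Iterable, Sequence, Tuple

# One labelled test message: (combined probability, is_spam).
Scored = Tuple[float, bool]


def select_threshold(scored: Sequence[Scored],
                     candidates: Iterable[float]) -> Tuple[float, float]:
    """Return (best threshold, its ratio) over the candidate thresholds."""
    best_threshold, best_ratio = None, -1.0
    for threshold in candidates:
        # A message is marked as spam when its combined probability exceeds the threshold.
        correct_spam = sum(1 for p, is_spam in scored if p > threshold and is_spam)
        incorrect_spam = sum(1 for p, is_spam in scored if p > threshold and not is_spam)
        # Assumption: add one to the denominator to avoid division by zero.
        ratio = correct_spam / (incorrect_spam + 1)
        if ratio > best_ratio:  # ties keep the earliest candidate
            best_threshold, best_ratio = threshold, ratio
    if best_threshold is None:
        raise ValueError("candidates must not be empty")
    return best_threshold, best_ratio
```

Because this ratio ignores spam that slips through as valid, a very high threshold that flags almost nothing can still score well; a practical sweep would likely bound the candidate range.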
Specification