Classifier tuning based on data similarities
First Claim
1. A machine readable medium storing one or more programs that implement an e-mail classifier for determining whether at least one received e-mail should be classified as spam, the one or more programs comprising instructions for causing one or more processing devices to perform the following operations:
obtain feature data for the received e-mail by determining whether the received e-mail has a predefined set of features;
train a scoring classifier using a set of unique training e-mails;
provide a classification output, using the scoring classifier, based on the obtained feature data, wherein the classification output is indicative of whether or not the received e-mail is spam;
compare the provided classification output to a classification threshold, wherein the received e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the received e-mail is spam;
determine at least one similarity rate for at least one e-mail, wherein the at least one similarity rate is the rate at which e-mails, which are substantially similar to the at least one e-mail, are received by the e-mail classifier;
select and set a value for the classification threshold, wherein selecting and setting the value for the classification threshold includes:
selecting and setting an initial value for the classification threshold that reduces misclassification costs based on a set of unique evaluation e-mails; and
selecting and setting a new value for the classification threshold that reduces the misclassification costs based at least on the determined at least one similarity rate.
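The first operations of the claim (obtaining feature data, producing a classification output, and comparing it to the threshold) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the feature tests in FEATURES, the fixed WEIGHTS and BIAS that stand in for a scoring classifier trained on unique training e-mails, the logistic scoring function, and the choice of "greater than or equal" as the comparison. The claim itself does not specify any of these.

```python
import math

# Hypothetical predefined feature set: each feature is a yes/no test on the message text.
FEATURES = {
    "has_free": lambda text: "free" in text.lower(),
    "has_url": lambda text: "http://" in text.lower() or "https://" in text.lower(),
    "many_exclamations": lambda text: text.count("!") >= 3,
}

# Hypothetical weights standing in for a scoring classifier that would
# normally be trained on a set of unique training e-mails.
WEIGHTS = {"has_free": 1.2, "has_url": 0.8, "many_exclamations": 1.5}
BIAS = -2.0


def extract_features(text):
    """Return a dict of booleans: which predefined features the e-mail has."""
    return {name: test(text) for name, test in FEATURES.items()}


def score(features):
    """Map feature data to a classification output in (0, 1) with a logistic function."""
    z = BIAS + sum(WEIGHTS[name] for name, present in features.items() if present)
    return 1.0 / (1.0 + math.exp(-z))


def is_spam(text, threshold):
    """Classify as spam when the classification output meets or exceeds the threshold (assumed direction)."""
    return score(extract_features(text)) >= threshold
```

For example, is_spam("FREE prizes!!! http://x.example", 0.5) evaluates to True under these assumed weights, while is_spam("Meeting at noon", 0.5) evaluates to False.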
Abstract
A probabilistic classifier is used to classify data items in a data stream. The probabilistic classifier is trained, and an initial classification threshold is set, using unique training and evaluation data sets (i.e., data sets that do not contain duplicate data items). Unique data sets are used for training and in setting the initial classification threshold so as to prevent the classifier from being improperly biased as a result of similarity rates in the training and evaluation data sets that do not reflect similarity rates encountered during operation. During operation, information regarding the actual similarity rates of data items in the data stream is obtained and used to adjust the classification threshold such that misclassification costs are minimized given the actual similarity rates.
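The two-stage threshold selection described above can be sketched in Python as follows. This is only one plausible reading, not the patented procedure: it assumes misclassification cost is a weighted sum of false positives and false negatives, that the similarity rate of each evaluation item is applied as a per-item weight, and that candidate thresholds are searched on a fixed grid. The function names, cost values, and sample data are all hypothetical.

```python
def misclassification_cost(threshold, scored_items, fp_cost, fn_cost, weights=None):
    """Total cost at a threshold.

    scored_items: list of (classifier_output, is_spam_label) pairs.
    weights: optional per-item weights (assumed here to be similarity rates);
             defaults to 1 per item, which models a unique evaluation set.
    """
    if weights is None:
        weights = [1.0] * len(scored_items)
    cost = 0.0
    for (output, label), weight in zip(scored_items, weights):
        predicted_spam = output >= threshold
        if predicted_spam and not label:
            cost += weight * fp_cost  # legitimate e-mail flagged as spam
        elif not predicted_spam and label:
            cost += weight * fn_cost  # spam let through
    return cost


def select_threshold(scored_items, fp_cost, fn_cost, weights=None):
    """Return the lowest grid threshold (0.00 to 1.01) with minimal cost."""
    candidates = [i / 100 for i in range(102)]
    return min(
        candidates,
        key=lambda t: misclassification_cost(t, scored_items, fp_cost, fn_cost, weights),
    )


# Hypothetical unique evaluation set: (classifier output, true label).
evaluation = [(0.9, True), (0.7, True), (0.6, False), (0.4, True), (0.2, False)]

# Stage 1: initial threshold from the unique evaluation set.
initial = select_threshold(evaluation, fp_cost=5, fn_cost=1)

# Stage 2: new threshold once observed similarity rates are known.
# Here the spam item scored 0.4 is assumed to arrive 20 times as often as the rest.
rates = [1, 1, 1, 20, 1]
adjusted = select_threshold(evaluation, fp_cost=5, fn_cost=1, weights=rates)
```

With this sample data the initial threshold lands just above 0.6, and the adjusted threshold drops to just above 0.2, because missing the frequently repeated spam item becomes more expensive than the single false positive it takes to catch it.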
27 Claims
1. A machine readable medium storing one or more programs that implement an e-mail classifier for determining whether at least one received e-mail should be classified as spam, the one or more programs comprising instructions for causing one or more processing devices to perform the following operations:
obtain feature data for the received e-mail by determining whether the received e-mail has a predefined set of features;
train a scoring classifier using a set of unique training e-mails;
provide a classification output, using the scoring classifier, based on the obtained feature data, wherein the classification output is indicative of whether or not the received e-mail is spam;
compare the provided classification output to a classification threshold, wherein the received e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the received e-mail is spam;
determine at least one similarity rate for at least one e-mail, wherein the at least one similarity rate is the rate at which e-mails, which are substantially similar to the at least one e-mail, are received by the e-mail classifier;
select and set a value for the classification threshold, wherein selecting and setting the value for the classification threshold includes:
selecting and setting an initial value for the classification threshold that reduces misclassification costs based on a set of unique evaluation e-mails; and
selecting and setting a new value for the classification threshold that reduces the misclassification costs based at least on the determined at least one similarity rate.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method for determining whether at least one received e-mail should be classified as spam, the method comprising:
obtaining feature data for the received e-mail by determining whether the received e-mail has a predefined set of features;
training a scoring classifier using a set of unique training e-mails;
providing a classification output, using the scoring classifier, based on the obtained feature data, wherein the classification output is indicative of whether or not the received e-mail is spam;
comparing the provided classification output to a classification threshold, wherein the received e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the received e-mail is spam;
determining at least one similarity rate for at least one e-mail, wherein the at least one similarity rate is the rate at which e-mails, which are substantially similar to the at least one e-mail, are received by an e-mail classifier;
selecting and setting a value for the classification threshold, wherein selecting and setting the value for the classification threshold includes:
selecting and setting an initial value for the classification threshold that reduces misclassification costs based on a set of unique evaluation e-mails; and
selecting and setting a new value for the classification threshold that reduces the misclassification costs based at least on the determined at least one similarity rate.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An e-mail server that determines whether at least one received e-mail should be classified as spam, the e-mail server comprising:
one or more processing devices configured to implement the following operations:
obtain feature data for the received e-mail by determining whether the received e-mail has a predefined set of features;
train a scoring classifier using a set of unique training e-mails;
provide a classification output, using the scoring classifier, based on the obtained feature data, wherein the classification output is indicative of whether or not the received e-mail is spam;
compare the provided classification output to a classification threshold, wherein the received e-mail is classified as spam when the comparison of the classification output to the classification threshold indicates the received e-mail is spam;
determine at least one similarity rate for at least one e-mail, wherein the at least one similarity rate is the rate at which e-mails, which are substantially similar to the at least one e-mail, are received by an e-mail classifier;
select and set a value for the classification threshold, wherein selecting and setting the value for the classification threshold includes:
selecting and setting an initial value for the classification threshold that reduces misclassification costs based on a set of unique evaluation e-mails; and
selecting and setting a new value for the classification threshold that reduces the misclassification costs based at least on the determined at least one similarity rate.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
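Each independent claim recites determining a similarity rate, that is, the rate at which substantially similar e-mails are received. The claims do not say how similarity is detected, so the Python sketch below makes its own assumptions: two messages count as "substantially similar" when they match after lowercasing and stripping digits and punctuation, and the rate is a simple count per hour over one observation window. The names signature and similarity_rates are hypothetical.

```python
import hashlib
import re
from collections import Counter


def signature(text):
    """Crude near-duplicate key: lowercase, keep only letters and spaces, collapse whitespace."""
    normalized = re.sub(r"[^a-z\s]", "", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity_rates(emails, window_hours):
    """Return {signature: e-mails per hour} for the messages seen in one observation window."""
    counts = Counter(signature(body) for body in emails)
    return {sig: count / window_hours for sig, count in counts.items()}


# Hypothetical two-hour window of received messages.
received = ["Win $100 now!", "win $200 NOW", "Lunch tomorrow?"]
rates_per_hour = similarity_rates(received, window_hours=2)
```

In this sample the two "win ... now" variants share one signature and yield a rate of 1.0 per hour, while the unrelated message yields 0.5 per hour. A production system would likely need a more robust similarity measure than exact matching on normalized text.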
Specification