Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set
Abstract
A technique, specifically a method and an apparatus that implements the method, which uses a probabilistic classifier (370) to detect, for a given recipient, electronic mail (e-mail) messages in an incoming message stream that the recipient is likely to consider "junk". Specifically, the invention discriminates message content for that recipient through a probabilistic classifier (e.g., a support vector machine) trained on prior content classifications. The classifier produces a quantitative probability measure, i.e., an output confidence level, for each message; this measure is compared against a predefined threshold, and the message is classified as either, e.g., spam or legitimate mail and then, e.g., stored in a corresponding folder (223, 227) for subsequent retrieval by and display to the recipient. Based on the probability measure, the message can alternatively be classified into one of a number of different folders, depicted in a pre-defined visually distinctive manner, or simply discarded in its entirety.
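The threshold comparison and the multi-folder alternative described in the abstract can be sketched as follows. This is only an illustration: the function names, the folder names, the default threshold of 0.5, and the band boundaries in `BANDS` are assumptions and do not come from the patent.

```python
from typing import Sequence, Tuple

def route_by_threshold(confidence: float, threshold: float = 0.5) -> str:
    """Two-way routing: compare the output confidence level to one threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability in [0, 1]")
    return "junk" if confidence >= threshold else "legitimate"

# Hypothetical confidence bands as (upper bound, folder) pairs in ascending order.
BANDS: Sequence[Tuple[float, str]] = (
    (0.5, "inbox"),
    (0.9, "possible junk"),
    (1.0, "junk"),
)

def route_by_bands(confidence: float,
                   bands: Sequence[Tuple[float, str]] = BANDS) -> str:
    """Multi-way routing: return the folder of the first band containing the value."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability in [0, 1]")
    for upper, folder in bands:
        if confidence <= upper:
            return folder
    raise ValueError("bands must end with an upper bound of 1.0")
```

Raising the threshold makes the filter more conservative: fewer legitimate messages are misfiled as junk, at the cost of more junk reaching the inbox.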
65 Claims
1. A method of classifying an incoming electronic message, as a function of content of the message, into one of a plurality of predefined classes, the method comprising the steps of:
determining whether each one of a pre-defined set of N features (where N is a predefined integer) is present in the incoming message so as to yield feature data associated with the message;
applying the feature data to a probabilistic classifier so as to yield an output confidence level for the incoming message which specifies a probability that the incoming message belongs to said one class, wherein the classifier has been trained, on past classifications of message content for a plurality of messages that form a training set and belong to said one class, to recognize said N features in the training set;
classifying, in response to a magnitude of the output confidence level, the incoming message as a member of said one class of messages;
automatically updating the training set to include classification of message content for an incoming message which has been classified by a user in another one of the predefined classes other than said one class specified by the classifier so as to form an updated training set; and
automatically re-training the classifier based on the updated training set so as to adapt the operation of the classifier to changes in either message content that affect message classification or in user perceptions of the content of incoming messages.
Dependent claims: 2-33.
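The steps of claim 1 (feature presence test, probabilistic scoring, threshold classification, training-set update on user disagreement, and re-training) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the patented implementation: a small Bernoulli naive Bayes model is swapped in for the probabilistic classifier (the abstract names a support vector machine as one example), and the feature list `FEATURES`, the class `AdaptiveFilter`, and all other names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical feature set (N = 4); the claim only requires N predefined features.
FEATURES = ["free", "money", "meeting", "!!!"]

def extract_features(message: str) -> List[int]:
    """Binary presence vector: 1 if the feature string occurs in the message."""
    text = message.lower()
    return [1 if f in text else 0 for f in FEATURES]

class BernoulliNB:
    """Tiny Bernoulli naive Bayes, a stand-in for the probabilistic classifier."""

    def fit(self, X: Sequence[Sequence[int]], y: Sequence[int]) -> "BernoulliNB":
        self.prior, self.p = {}, {}
        for c in (0, 1):  # 0 = legitimate, 1 = junk
            rows = [x for x, label in zip(X, y) if label == c]
            # Laplace smoothing keeps every estimate strictly between 0 and 1.
            self.prior[c] = (len(rows) + 1) / (len(X) + 2)
            self.p[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                         for j in range(len(FEATURES))]
        return self

    def confidence(self, x: Sequence[int]) -> float:
        """Return P(junk | x), the output confidence level."""
        log_score = {}
        for c in (0, 1):
            s = math.log(self.prior[c])
            for xj, pj in zip(x, self.p[c]):
                s += math.log(pj if xj else 1.0 - pj)
            log_score[c] = s
        m = max(log_score.values())  # subtract the max for numerical stability
        e0, e1 = math.exp(log_score[0] - m), math.exp(log_score[1] - m)
        return e1 / (e0 + e1)

class AdaptiveFilter:
    """Classifies messages and re-trains when the user disagrees."""

    def __init__(self, training_set: Sequence[Tuple[str, int]],
                 threshold: float = 0.5) -> None:
        self.training_set = list(training_set)  # (message, label) pairs
        self.threshold = threshold
        self._retrain()

    def _retrain(self) -> None:
        X = [extract_features(m) for m, _ in self.training_set]
        y = [label for _, label in self.training_set]
        self.model = BernoulliNB().fit(X, y)

    def classify(self, message: str) -> int:
        conf = self.model.confidence(extract_features(message))
        return 1 if conf >= self.threshold else 0

    def user_feedback(self, message: str, user_label: int) -> bool:
        """Add the message and re-train only if the user's class differs."""
        if user_label == self.classify(message):
            return False
        self.training_set.append((message, user_label))
        self._retrain()
        return True
```

A caller would seed `AdaptiveFilter` with a few labelled messages, call `classify` on each incoming message, and call `user_feedback` whenever the recipient moves a message to a different folder. Re-training from scratch on every correction is the simplest choice for a sketch; a real system would more plausibly batch corrections before re-training.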
34. Apparatus for classifying an incoming electronic message, as a function of content of the message, into one of a plurality of predefined classes, the apparatus comprising:
a processor;
a memory having computer executable instructions stored therein;
wherein, in response to the stored instructions, the processor:
determines whether each one of a pre-defined set of N features (where N is a predefined integer) is present in the incoming message so as to yield feature data associated with the message;
applies the feature data to a probabilistic classifier so as to yield an output confidence level for the incoming message which specifies a probability that the incoming message belongs to said one class, wherein the classifier has been trained, on past classifications of message content for a plurality of messages that form a training set and belong to said one class, to recognize said N features in the training set;
classifies, in response to a magnitude of the output confidence level, the incoming message as a member of said one class of messages;
automatically updates the training set to include classification of message content for an incoming message which has been classified by a user in another one of the predefined classes other than said one class specified by the classifier so as to form an updated training set; and
automatically re-trains the classifier based on the updated training set so as to adapt the operation of the classifier to changes in either message content that affect message classification or in user perceptions of the content of incoming messages.
Dependent claims: 35-65.
Specification