Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set

US 6,161,130 A
Filed: 06/23/1998
Issued: 12/12/2000
Est. Priority Date: 06/23/1998
Status: Expired due to Term

Abstract

A technique, specifically a method and apparatus that implements the method, which through a probabilistic classifier (370) and, for a given recipient, detects electronic mail (e-mail) messages, in an incoming message stream, which that recipient is likely to consider "junk". Specifically, the invention discriminates message content for that recipient, through a probabilistic classifier (e.g., a support vector machine) trained on prior content classifications. Through a resulting quantitative probability measure, i.e., an output confidence level, produced by the classifier for each message and subsequently compared against a predefined threshold, that message is classified as either, e.g., spam or legitimate mail, and, e.g., then stored in a corresponding folder (223, 227) for subsequent retrieval by and display to the recipient. Based on the probability measure, the message can alternatively be classified into one of a number of different folders, depicted in a pre-defined visually distinctive manner or simply discarded in its entirety.
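The threshold comparison described in the abstract can be illustrated with a minimal Python sketch. It assumes the classifier emits a raw score that is passed through a sigmoid to obtain a confidence level, which is then compared against a threshold to choose a folder. The function names, the `slope` and `offset` parameters, the 0.5 default threshold, and the folder names are hypothetical choices made for illustration; none of them come from the text above.

```python
import math


def confidence_from_score(raw_score: float, slope: float = 1.0, offset: float = 0.0) -> float:
    """Map a raw classifier score to a confidence level in [0, 1] with a sigmoid.

    slope and offset are illustrative assumptions, not values from the patent.
    """
    x = slope * raw_score + offset
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def route_message(confidence: float, threshold: float = 0.5) -> str:
    """Pick a folder: 'junk' when the confidence equals or exceeds the threshold."""
    return "junk" if confidence >= threshold else "inbox"


# Example: a strongly positive score is routed to the junk folder,
# a strongly negative one stays in the inbox.
high = route_message(confidence_from_score(2.0))
low = route_message(confidence_from_score(-3.0))
```

Raising the threshold trades missed junk for fewer legitimate messages misfiled, which is why the sketch keeps it as a parameter rather than a constant.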

1256 Citations

65 Claims

1. A method of classifying an incoming electronic message, as a function of content of the message, into one of a plurality of predefined classes, the method comprising the steps of:
- determining whether each one of a pre-defined set of N features (where N is a predefined integer) is present in the incoming message so as to yield feature data associated with the message;
  
  applying the feature data to a probabilistic classifier so as to yield an output confidence level for the incoming message which specifies a probability that the incoming message belongs to said one class, wherein the classifier has been trained, on past classifications of message content for a plurality of messages that form a training set and belong to said one class, to recognize said N features in the training set;
  
  classifying, in response to a magnitude of the output confidence level, the incoming message as a member of said one class of messages;
  
  automatically updating the training set to include classification of message content for an incoming message which has been classified by a user in another one of the predefined classes other than said one class specified by the classifier so as to form an updated training set; and
  
  automatically re-training the classifier based on the updated training set so as to adapt the operation of the classifier to changes in either message content that affect message classification or in user perceptions of the content of incoming messages.
- - 2. The method in claim 1 wherein the classes comprise first and second classes for first and second predefined categories of messages, respectively.
  - 3. The method in claim 2 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 4. The method in claim 2 further comprising the steps of:
    - comparing the output confidence level for the incoming message to a predefined probabilistic threshold value so as to yield a comparison result; and
      
      distinguishing said incoming message, in a predefined manner associated with the first class, from messages associated with the second class if the comparison result indicates that the output confidence level equals or exceeds the threshold level.
  - 5. The method in claim 3 wherein the predefined manner comprises storing the first and second classes of messages in separate corresponding folders, or providing a predefined visual indication that said incoming message is a member of the first class.
  - 6. The method in claim 5 wherein said indication is a predefined color coding of all or a portion of the incoming message.
  - 7. The method in claim 6 wherein a color of said color coding varies with the confidence level that the incoming message is a member of the first class.
  - 8. The method in claim 4 further comprising the steps of:
    - detecting whether each of a first group of predefined handcrafted features exists in the incoming message so as to yield first output data;
      
      analyzing text in the incoming message so as to break the text into a plurality of constituent tokens;
      
      ascertaining, using a word-oriented indexer and in response to said tokens, whether each of a second group of predefined word-oriented features exists in the incoming message so as to yield second output data, said first and second groups collectively defining an n-element feature space (where n is an integer greater than N);
      
      forming, in response to the first and second output data, an N-element feature vector which specifies whether each of said N features exists in the incoming message; and
      
      applying the feature vector as input to the probabilistic classifier so as to yield the output confidence level for the incoming message.
  - 9. The method in claim 8 wherein the feature space comprises both word-based and handcrafted features.
  - 10. The method in claim 8 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 11. The method in claim 8 wherein the message is an electronic mail (e-mail) message and said first and second classes are non-legitimate and legitimate messages, respectively.
  - 12. The method in claim 9 wherein the handcrafted features comprise features correspondingly related to formatting, authoring, delivery or communication attributes that characterize a message as belonging to the first class.
  - 13. The method in claim 12 wherein the formatting attributes comprises whether a predefined word in the text of the incoming message is capitalized, or whether the text of the incoming message contains a series of predefined punctuation marks.
  - 14. The method in claim 12 wherein the delivery attributes comprise whether the incoming message contains an address of a single recipient or addresses of plurality of recipients, or a time at which the incoming message was transmitted.
  - 15. The method in claim 12 wherein the authoring attributes comprise whether the incoming message contains an address of a single recipient, or contains addresses of plurality of recipients or contains no sender at all, or a time at which the incoming message was transmitted.
  - 16. The method in claim 12 wherein the communication attributes comprise whether the incoming message has an attachment, or whether the message was sent from a predefined domain type.
  - 17. The method in claim 8 wherein the probabilistic classifier comprises a Naive Bayesian classifier, a limited dependence Bayesian classifier, a Bayesian network classifier, a decision tree, a support vector machine, or is implemented through use of content matching.
  - 18. The method in claim 17 wherein:
    - the feature data applying step comprises the step of yielding the output confidence level for said incoming message through a support vector machine; and
      
      the comparing step comprises the step of thresholding the output confidence level through a predefined sigmoid function to produce the comparison result for the incoming message.
  - 19. The method in claim 4 further comprises a training phase having the steps of:
    - detecting whether each one of a plurality of predetermined features exists in each message of a training set of m messages belonging to the first class so as to yield a feature matrix containing feature data for all of the training messages, wherein the plurality of predetermined features defines a predefined n-element feature space and each of the training messages has been previously classified as belonging to the first class;
      
      reducing the feature matrix in size to yield a reduced feature matrix having said N features (where n, N and m are integers with n > N); and
      
      applying the reduced feature matrix and the known classifications of each of said training messages to the classifier and training the classifier to recognize the N features in the m-message training set.
  - 20. The method in claim 19 wherein said indication is a predefined color coding of all or a portion of the incoming message.
  - 21. The method in claim 20 wherein a color of said color coding varies with the confidence level that the incoming message is a member of the first class.
  - 22. The method of claim 19 further comprising the step of utilizing messages in the first class as the training set.
  - 23. The method in claim 19 wherein the reducing step comprises the steps of:
    - eliminating all features from the feature matrix, that occur less than a predefined amount in the training set, so as to yield a partially reduced feature matrix;
      
      determining a mutual information measure for all remaining features in the partially reduced feature matrix;
      
      selecting, from all the remaining features in the partially reduced matrix, the N features that have highest corresponding quantitative mutual information measures; and
      
      forming the reduced feature matrix containing an associated data value for each of the N features and for each of the m training messages.
  - 24. The method in claim 19 wherein the feature space comprises both word-oriented and handcrafted features.
  - 25. The method in claim 19 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 26. The method in claim 24 wherein the message is an electronic mail (e-mail) message and said first and second classes are non-legitimate and legitimate messages, respectively.
  - 27. The method in claim 26 wherein the handcrafted features comprise features correspondingly related to formatting, authoring, delivery or communication attributes that characterize an e-mail message as belonging to the first class.
  - 28. The method in claim 27 wherein the formatting attributes comprises whether a predefined word in the text of the incoming message is capitalized, or whether the text of the incoming message contains a series of predefined punctuation marks.
  - 29. The method in claim 27 wherein the delivery attributes comprise whether the incoming message contains an address of a single recipient or addresses of plurality of recipients, or a time at which the incoming message was transmitted.
  - 30. The method in claim 27 wherein the authoring attributes comprise whether the incoming message contains an address of a single recipient, or contains addresses of plurality of recipients or contains no sender at all, or a time at which the incoming message was transmitted.
  - 31. The method in claim 27 wherein the communication attributes comprise whether the incoming message has an attachment, or whether the message was sent from a predefined domain type.
  - 32. The method in claim 8 further comprising the step of updating, from a remote server, the probabilistic classifier and definitions of features associated with the first class.
  - 33. A computer readable medium having computer executable instructions stored therein for performing the steps of claim 1.
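The feature-reduction steps recited in claim 23 (drop rare features, score the rest by mutual information, keep the N highest) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: it assumes binary (0/1) features, assumes the training messages carry labels for two classes so that mutual information is meaningful, and uses hypothetical names (`mutual_information`, `select_features`, `min_count`, `n_keep`) and made-up example data.

```python
import math
from collections import Counter


def mutual_information(feature_column, labels):
    """Mutual information (in bits) between one feature column and the class labels."""
    m = len(labels)
    joint = Counter(zip(feature_column, labels))
    f_counts = Counter(feature_column)
    c_counts = Counter(labels)
    mi = 0.0
    for (f, c), n_fc in joint.items():
        p_fc = n_fc / m
        mi += p_fc * math.log2(p_fc / ((f_counts[f] / m) * (c_counts[c] / m)))
    return mi


def select_features(matrix, labels, min_count, n_keep):
    """Reduce an m x n binary feature matrix to its n_keep most informative columns.

    Returns (kept_column_indices, reduced_matrix).
    """
    n = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n)]
    # Step 1: eliminate features that occur fewer than min_count times.
    frequent = [j for j in range(n) if sum(columns[j]) >= min_count]
    # Steps 2 and 3: score the remaining features and keep the highest scores.
    ranked = sorted(frequent, key=lambda j: mutual_information(columns[j], labels), reverse=True)
    kept = sorted(ranked[:n_keep])
    # Step 4: form the reduced matrix with one value per kept feature per message.
    reduced = [[row[j] for j in kept] for row in matrix]
    return kept, reduced


# Made-up example: 8 messages, 4 candidate features, label 1 = first class.
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
example_matrix = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
kept, reduced = select_features(example_matrix, example_labels, min_count=2, n_keep=2)
```

In this example, feature 1 is discarded for occurring only once, feature 2 is independent of the label and scores zero, and features 0 and 3 are retained.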

34. Apparatus for classifying an incoming electronic message, as a function of content of the message, into one of a plurality of predefined classes, the apparatus comprising:
- a processor;
  
  a memory having computer executable instructions stored therein;
  
  wherein, in response to the stored instructions, the processor:
  
  determines whether each one of a pre-defined set of N features (where N is a predefined integer) is present in the incoming message so as to yield feature data associated with the message;
  
  applies the feature data to a probabilistic classifier so as to yield an output confidence level for the incoming message which specifies a probability that the incoming message belongs to said one class, wherein the classifier has been trained, on past classifications of message content for a plurality of messages that form a training set and belong to said one class, to recognize said N features in the training set;
  
  classifies, in response to a magnitude of the output confidence level, the incoming message as a member of said one class of messages;
  
  automatically updates the training set to include classification of message content for an incoming message which has been classified by a user in another one of the predefined classes other than said one class specified by the classifier so as to form an updated training set; and
  
  automatically re-trains the classifier based on the updated training set so as to adapt the operation of the classifier to changes in either message content that affect message classification or in user perceptions of the content of incoming messages.
- - 35. The apparatus in claim 34 wherein the classes comprise first and second classes for first and second predefined categories of messages, respectively.
  - 36. The apparatus in claim 35 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 37. The apparatus in claim 35 wherein the processor, in response to the stored instructions:
    - compares the output confidence level for the incoming message to a predefined probabilistic threshold value so as to yield a comparison result; and
      
      distinguishes said incoming message, in a predefined manner associated with the first class, from messages associated with the second class if the comparison result indicates that the output confidence level equals or exceeds the threshold level.
  - 38. The apparatus in claim 36 wherein the processor, in response to the stored instructions, implements the predefined manner by storing the first and second classes of messages in separate corresponding folders, or providing a predefined visual indication that said incoming message is a member of the first class.
  - 39. The apparatus in claim 38 wherein said indication is a predefined color coding of all or a portion of the incoming message.
  - 40. The apparatus in claim 39 wherein a color of said color coding varies with the confidence level that the incoming message is a member of the first class.
  - 41. The apparatus in claim 37 wherein the processor, in response to the stored instructions:
    - detects whether each of a first group of predefined handcrafted features exists in the incoming message so as to yield first output data;
      
      analyzes text in the incoming message so as to break the text into a plurality of constituent tokens;
      
      ascertains, using a word-oriented indexer and in response to said tokens, whether each of a second group of predefined word-oriented features exists in the incoming message so as to yield second output data, said first and second groups collectively defining an n-element feature space (where n is an integer greater than N);
      
      forms, in response to the first and second output data, an N-element feature vector which specifies whether each of said N features exists in the incoming message; and
      
      applies the feature vector as input to the probabilistic classifier so as to yield the output confidence level for the incoming message.
  - 42. The apparatus in claim 41 wherein the feature space comprises both word-based and handcrafted features.
  - 43. The apparatus in claim 41 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 44. The apparatus in claim 41 wherein the message is an electronic mail (e-mail) message and said first and second classes are non-legitimate and legitimate messages, respectively.
  - 45. The apparatus in claim 42 wherein the handcrafted features comprise features correspondingly related to formatting, authoring, delivery or communication attributes that characterize a message as belonging to the first class.
  - 46. The apparatus in claim 45 wherein the formatting attributes comprises whether a predefined word in the text of the incoming message is capitalized, or whether the text of the incoming message contains a series of predefined punctuation marks.
  - 47. The apparatus in claim 45 wherein the delivery attributes comprise whether the incoming message contains an address of a single recipient or addresses of plurality of recipients, or a time at which the incoming message was transmitted.
  - 48. The apparatus in claim 45 wherein the authoring attributes comprise whether the incoming message contains an address of a single recipient, or contains addresses of plurality of recipients or contains no sender at all, or a time at which the incoming message was transmitted.
  - 49. The apparatus in claim 45 wherein the communication attributes comprise whether the incoming message has an attachment, or whether the message was sent from a predefined domain type.
  - 50. The apparatus in claim 41 wherein the probabilistic classifier comprises a Naive Bayesian classifier, a limited dependence Bayesian classifier, a Bayesian network classifier, a decision tree, a support vector machine, or is implemented through use of content matching.
  - 51. The apparatus in claim 50 wherein the processor, in response to the stored instructions:
    - yields the output confidence level for said incoming message through a support vector machine; and
      
      thresholds the output confidence level through a predefined sigmoid function to produce the comparison result for the incoming message.
  - 52. The apparatus in claim 37 further comprises a training phase wherein the processor, in response to the stored instructions:
    - detects whether each one of a plurality of predetermined features exists in each message of a training set of m messages belonging to the first class so as to yield a feature matrix containing feature data for all of the training messages, wherein the plurality of predetermined features defines a predefined n-element feature space and each of the training messages has been previously classified as belonging to the first class;
      
      reduces the feature matrix in size to yield a reduced feature matrix having said N features (where n, N and m are integers with n > N); and
      
      applies the reduced feature matrix and the known classifications of each of said training messages to the classifier and training the classifier to recognize the N features in the m-message training set.
  - 53. The apparatus in claim 52 wherein said indication is a predefined color coding of all or a portion of the incoming message.
  - 54. The apparatus in claim 53 wherein a color of said color coding varies with the confidence level that the incoming message is a member of the first class.
  - 55. The apparatus of claim 52 further wherein the processor, in response to the stored instructions, utilizes messages in the first class as the training set.
  - 56. The apparatus in claim 52 wherein the processor, in response to the stored instructions:
    - eliminates all features from the feature matrix, that occur less than a predefined amount in the training set, so as to yield a partially reduced feature matrix;
      
      determines a mutual information measure for all remaining features in the partially reduced feature matrix;
      
      selects, from all the remaining features in the partially reduced matrix, the N features that have highest corresponding quantitative mutual information measures; and
      
      forms the reduced feature matrix containing an associated data value for each of the N features and for each of the m training messages.
  - 57. The apparatus in claim 52 wherein the feature space comprises both word-oriented and handcrafted features.
  - 58. The apparatus in claim 52 wherein the classes comprise a plurality of sub-classes and said one class is one of said sub-classes.
  - 59. The apparatus in claim 57 wherein the message is an electronic mail (e-mail) message and said first and second classes are non-legitimate and legitimate messages, respectively.
  - 60. The apparatus in claim 59 wherein the handcrafted features comprise features correspondingly related to formatting, authoring, delivery or communication attributes that characterize an e-mail message as belonging to the first class.
  - 61. The apparatus in claim 60 wherein the formatting attributes comprises whether a predefined word in the text of the incoming message is capitalized, or whether the text of the incoming message contains a series of predefined punctuation marks.
  - 62. The apparatus in claim 60 wherein the delivery attributes comprise whether the incoming message contains an address of a single recipient or addresses of plurality of recipients, or a time at which the incoming message was transmitted.
  - 63. The apparatus in claim 60 wherein the authoring attributes comprise whether the incoming message contains an address of a single recipient, or contains addresses of plurality of recipients or contains no sender at all, or a time at which the incoming message was transmitted.
  - 64. The apparatus in claim 60 wherein the communication attributes comprise whether the incoming message has an attachment, or whether the message was sent from a predefined domain type.
  - 65. The apparatus in claim 41 wherein the processor, in response to the stored instructions, updates, from a remote server, the probabilistic classifier and definitions of features associated with the first class.
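The update-and-retrain behavior common to independent claims 1 and 34 can be illustrated with a small Python sketch. It assumes a Bernoulli Naive Bayes model with Laplace smoothing (a Naive Bayesian classifier is one of the options the dependent claims list), binary feature vectors, and a 0.5 threshold. The class name `AdaptiveJunkFilter`, its methods, and the example data are hypothetical and chosen only for illustration.

```python
import math


class AdaptiveJunkFilter:
    """Toy classifier that retrains whenever a user disagrees with its output."""

    def __init__(self, n_features, training_set=(), threshold=0.5):
        self.n_features = n_features
        self.threshold = threshold
        self.training_set = list(training_set)  # (feature_vector, is_junk) pairs
        self.retrain()

    def retrain(self):
        """Re-estimate smoothed priors and per-feature probabilities from scratch."""
        counts = {True: [1] * self.n_features, False: [1] * self.n_features}
        totals = {True: 2, False: 2}  # Laplace smoothing keeps probabilities off 0 and 1
        for features, is_junk in self.training_set:
            totals[is_junk] += 1
            for j, present in enumerate(features):
                counts[is_junk][j] += present
        grand_total = totals[True] + totals[False]
        self.prior = {c: totals[c] / grand_total for c in (True, False)}
        self.cond = {c: [k / totals[c] for k in counts[c]] for c in (True, False)}

    def confidence(self, features):
        """Estimated probability that the message is junk."""
        log_odds = math.log(self.prior[True]) - math.log(self.prior[False])
        for j, present in enumerate(features):
            p_junk = self.cond[True][j] if present else 1.0 - self.cond[True][j]
            p_ham = self.cond[False][j] if present else 1.0 - self.cond[False][j]
            log_odds += math.log(p_junk) - math.log(p_ham)
        log_odds = max(-700.0, min(700.0, log_odds))  # guard against exp overflow
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, features):
        return self.confidence(features) >= self.threshold

    def record_user_label(self, features, user_says_junk):
        """Add the message to the training set and retrain only on disagreement."""
        if user_says_junk == self.classify(features):
            return False
        self.training_set.append((list(features), user_says_junk))
        self.retrain()
        return True


# Made-up example with two features: junk initially correlates with feature 0 only.
initial = [([1, 0], True), ([1, 0], True), ([1, 1], True),
           ([0, 0], False), ([0, 1], False), ([0, 0], False)]
junk_filter = AdaptiveJunkFilter(n_features=2, training_set=initial)
```

Retraining from the full training set on every correction is the simplest reading of the claim language; an incremental update would be an equally plausible design for a larger mailbox.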

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Dumais, Susan T.; Platt, John C.; Horvitz, Eric; Heckerman, David E.; Sahami, Mehran
Primary Examiner(s): Luu, Le Hien
Assistant Examiner(s): Kang, Paul H.
Application Number: US09/102,837
Time in Patent Office: 903 Days
Field of Search: 707/5, 707/6, 707/205, 709/246, 709/202, 709/206, 709/240, 709/205, 709/207, 395/200.01, 347/327
US Class Current: 709/206
CPC Class Codes:
G06F 16/353   into predefined classes
G06F 18/2411   based on the proximity to a...
G06Q 10/107   Computer-aided management o...
H04L 51/212   using filtering or selectiv...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
