Trees of classifiers for detecting email spam
Abstract
Decision trees populated with classifier models are leveraged to provide enhanced spam detection, using separate email classifiers for each feature of an email. Tailoring each classifier model in this way raises the probability of detecting spam, because spam is determined more accurately on a feature-by-feature basis. Classifiers can be constructed from linear models such as logistic-regression models and/or support vector machines (SVMs). The classifiers can also be constructed from decision trees. “Compound features” based on internal and/or external nodes of a decision tree can also be used to provide linear classifier models. When training data is sparse, spam detection results can be smoothed by using classifier models from other nodes within the decision tree. This forms a base model for branches of the decision tree that may not have received substantial training data.
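The smoothing idea in the abstract can be illustrated with a minimal sketch. It assumes that every node of the tree stores its own logistic-regression style model together with a count of the training emails that reached it, and that a sparsely trained leaf falls back to the nearest ancestor with enough data. The names `LinearModel`, `Node`, `select_model` and the `min_train` cutoff are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Features = Dict[str, float]


@dataclass
class LinearModel:
    """Logistic-regression style scorer: sigmoid(bias + sum(w_i * x_i))."""
    weights: Dict[str, float]
    bias: float = 0.0
    n_train: int = 0  # assumed bookkeeping: training emails seen at this node

    def spam_probability(self, x: Features) -> float:
        z = self.bias + sum(w * x.get(name, 0.0) for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Node:
    model: LinearModel
    test: Optional[Callable[[Features], bool]] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def select_model(root: Node, x: Features, min_train: int = 100) -> LinearModel:
    """Walk from the root to a leaf and return the deepest model on that
    path with at least min_train training emails (the root is the default)."""
    node, best = root, root.model
    while node is not None:
        if node.model.n_train >= min_train:
            best = node.model
        if node.test is None:
            break
        node = node.yes if node.test(x) else node.no
    return best
```

With this layout, an email routed to a leaf that saw only a handful of training messages is scored by a broader model higher up the path. This is one possible reading of "a base model for branches that may not have received substantial training data", not the only one.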
20 Claims
1. A system that facilitates classification of electronic mail, comprising:
a memory having stored therein computer executable components; and
a processor that executes the computer executable components that comprise:
a feature detection component configured to detect a feature of an obtained email, wherein the feature detected includes length of the obtained email;
an email classifier component configured to preprocess the obtained email by at least one of removing commented text that is invisible when reading the email or removing characters that are invisible when reading the email; and
a first decision tree including a classification component configured to classify the obtained email, the first decision tree including the classification component comprising:
a plurality of distinct email classifiers; and
a second decision tree including a feature classifier configured to select a distinct email classifier of the plurality of distinct email classifiers based at least on the feature detected in the obtained email and a subset of features from the feature classifier included in the second decision tree such that the distinct email classifier selected from the plurality of distinct email classifiers comprises an optimum email classifier for classifying the obtained email in comparison with other distinct email classifiers of the plurality of distinct email classifiers;
internal nodes of the first decision tree including the classification component correspond to the feature classifier; and
leaf nodes of the first decision tree including the classification component represent probability values of the feature associated with the plurality of distinct email classifiers compared with one or more spam confidence thresholds.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
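As an illustration of the preprocessing, length-detection and threshold elements recited in claim 1, the following is a minimal sketch and not the patented implementation. It assumes that "commented text" can be approximated by HTML comments, that "invisible characters" are a small, non-exhaustive set of zero-width code points, and that length is measured after preprocessing. The threshold values and the action names (`delete`, `junk`, `inbox`) are hypothetical.

```python
import re

# Assumption: commented text is approximated by HTML comments.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Assumption: a small, non-exhaustive set of zero-width code points.
# dict.fromkeys maps each code point to None, which str.translate deletes.
INVISIBLE_CHARS = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def preprocess(body: str) -> str:
    """Strip text a reader would not see, so obfuscated words re-join."""
    without_comments = HTML_COMMENT.sub("", body)
    return without_comments.translate(INVISIBLE_CHARS)


def detect_features(body: str) -> dict:
    """Detect the one feature claim 1 names explicitly: email length."""
    return {"length": len(body)}


def action_for(probability: float,
               thresholds=((0.99, "delete"), (0.90, "junk"))) -> str:
    """Compare a spam probability with confidence thresholds, highest first."""
    for cutoff, action in thresholds:
        if probability >= cutoff:
            return action
    return "inbox"
```

For example, `preprocess("V<!-- xyz -->iag\u200bra")` yields `"Viagra"`, so a word split by a hidden comment and a zero-width space is restored before any classifier sees it.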
9. A method for facilitating classification of electronic mail, wherein the method is performed by a processor that executes acts comprising:
obtaining an email from a source as an obtained email;
preprocessing the obtained email by at least one of removing commented text that is invisible when reading the email or removing characters that are invisible when reading the email;
classifying the obtained email, via a first decision tree including a classification component comprising a decision tree of at least one feature classifier; and
selecting, using a second decision tree of classifier models including a feature classifier, a distinct email classifier of a plurality of distinct email classifiers for detecting whether the obtained email is spam, the selecting being based at least on detected features of the obtained email and a subset of the features from the feature classifier included in the second decision tree, the distinct email classifier of the plurality of distinct email classifiers being tailored as an optimum email classifier for the obtained email in comparison with other distinct email classifiers of the plurality of distinct email classifiers, and the features of the obtained email comprising at least length of the obtained email.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 17, 18
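The select-then-classify flow of claim 9 might be sketched as follows. The sketch omits the preprocessing act and assumes a selection tree stored as nested dictionaries that splits on email length, with leaves that are plain functions returning a spam probability. The two toy classifiers, the 200-character split, and all names here are hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Dict, Union

Classifier = Callable[[str], float]  # returns a spam probability in [0, 1]


# Hypothetical classifiers, each standing in for a model tuned to one kind of email.
def short_mail_classifier(text: str) -> float:
    return 0.9 if "click here" in text.lower() else 0.1


def long_mail_classifier(text: str) -> float:
    hits = text.lower().count("free")
    return min(1.0, 0.2 * hits)


# Selection tree: interior nodes are dicts that split on a feature,
# leaves are the classifiers themselves.
Tree = Union[Classifier, dict]
SELECTION_TREE: Tree = {
    "feature": "length",
    "threshold": 200,
    "below": short_mail_classifier,
    "at_or_above": long_mail_classifier,
}


def select_classifier(tree: Tree, features: Dict[str, float]) -> Classifier:
    """Descend the tree until a leaf (a callable classifier) is reached."""
    while isinstance(tree, dict):
        value = features[tree["feature"]]
        tree = tree["below"] if value < tree["threshold"] else tree["at_or_above"]
    return tree


def classify(text: str) -> float:
    """Detect features, select the tailored classifier, then score the email."""
    features = {"length": len(text)}
    return select_classifier(SELECTION_TREE, features)(text)
```

The design choice illustrated is that routing and scoring are separate: the tree only decides which classifier applies, and the chosen classifier alone produces the spam probability.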
19. One or more computer readable media, wherein the computer readable media is not a signal and has physical structure, the computer readable media having stored therein computer executable components, the computer executable components upon execution by a processor configuring a computer to perform operations comprising:
detecting features of an obtained email, wherein at least one feature of the features of the obtained email detected includes content encoding;
classifying the obtained email, via a first decision tree including a classification component comprising a decision tree of at least one feature classifier;
selecting a distinct email classifier of a plurality of distinct email classifiers via a second decision tree including a feature classifier, the selecting being based at least on the feature detected in the obtained email and a subset of features from the feature classifier included in the second decision tree; and
classifying the obtained email based at least on the features of the obtained email and the distinct email classifier of the plurality of distinct email classifiers selected via the second decision tree, wherein:
root and interior nodes of the second decision tree comprise tests on the features of the at least one obtained email;
leaf nodes of the second decision tree are associated with the plurality of distinct email classifiers; and
the distinct email classifier of the plurality of distinct email classifiers selected via the second decision tree is tailored as an optimum email classifier for the obtained email in comparison with other distinct email classifiers of the plurality of distinct email classifiers.
Dependent claim: 20
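Claim 19 names content encoding as a detected feature and places feature tests at the root and interior nodes, with classifiers at the leaves. A minimal sketch using Python's standard `email` package is shown below. It assumes that "content encoding" can be read from the top-level `Content-Transfer-Encoding` header and the declared charset, which is only one interpretation of the claim. The classifier names returned at the leaves are hypothetical placeholders.

```python
from email import message_from_string


def detect_features(raw: str) -> dict:
    """Read encoding-related features from the top-level headers only
    (multipart bodies are not inspected in this sketch)."""
    msg = message_from_string(raw)
    # Assumption: a missing header is treated as the MIME default, "7bit".
    encoding = (msg.get("Content-Transfer-Encoding") or "7bit").strip().lower()
    charset = msg.get_content_charset() or "us-ascii"
    return {"content_encoding": encoding, "charset": charset}


def select_classifier_name(features: dict) -> str:
    """Hand-written two-level tree: each `if` is a node test on a feature,
    each returned name stands for the classifier at a leaf."""
    if features["content_encoding"] == "base64":   # root node test
        return "base64_classifier"
    if features["charset"] != "us-ascii":           # interior node test
        return "non_ascii_classifier"
    return "default_classifier"
```

In a trained system the tests and their order would be learned from data rather than written by hand; the fixed `if` chain here only shows the shape of the structure the claim describes.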
Specification