Spam email detection based on n-grams with feature selection
First Claim
1. A computer implemented method for identifying spam email messages, the method comprising the steps of:
tokenizing an email message into a collection of overlapping n-grams;
comparing the collection of n-grams to n-grams of known artifacts found in email messages due to how the email messages were produced and transmitted, wherein the known artifacts comprise machine-generated text artifacts included in the email messages by email service providers;
removing n-grams that match an n-gram of a known artifact from the collection;
comparing the remaining n-grams in the collection to n-grams of known spam email messages; and
determining whether the email message comprises spam based on results of the second comparing step.
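The first three steps of the claim (tokenizing, comparing against known artifacts, and removing matches) can be sketched as follows. This is only an illustration under stated assumptions: the claim does not say whether the n-grams are character or word based, so character n-grams are assumed here, and the value n = 8, the function names, and the sample artifact text ("ExampleMail") are all hypothetical.

```python
def char_ngrams(text, n=8):
    """Return the set of overlapping character n-grams in text.

    Assumption: lowercasing and whitespace collapsing are applied first;
    the claim itself does not prescribe any normalization.
    """
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def strip_artifacts(message_grams, artifact_grams):
    """Drop every n-gram that matches an n-gram of a known artifact."""
    return message_grams - artifact_grams


# Hypothetical machine-generated footer added by an email service provider.
ARTIFACT = "Sent from my ExampleMail account"
message = "Cheap pills online now. Sent from my ExampleMail account"

artifact_grams = char_ngrams(ARTIFACT)
message_grams = char_ngrams(message)
remaining = strip_artifacts(message_grams, artifact_grams)
```

Only the n-grams in `remaining` would then be compared against n-grams of known spam, so the provider footer by itself cannot make two unrelated messages look alike.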
Abstract
A similarity measurement manager uses n-gram analysis to identify spam email messages. The similarity measurement manager tokenizes an email message into a plurality of overlapping n-grams, wherein n is large enough to identify uniqueness of artifacts. The similarity measurement manager employs feature selection by comparing the created n-grams to n-grams of known artifacts, which were created according to the same methodology. Created n-grams that match an n-gram of a known artifact are ignored. The similarity measurement manager compares the remaining created n-grams to pluralities of n-grams of known spam email messages, the n-grams of the known spam email messages being themselves created by executing the same steps. The similarity measurement manager determines whether the email message comprises spam based on whether or not the n-gram comparison indicates that it is substantially similar to a known spam email message.
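The abstract leaves "substantially similar" undefined. One possible reading, sketched below, scores a message by the fraction of its remaining n-grams that also occur in a known spam message and applies a threshold. The containment measure, the 0.5 threshold, n = 8, character-level n-grams, and all names and sample texts are assumptions made for illustration, not details taken from the patent.

```python
def char_ngrams(text, n=8):
    """Overlapping character n-grams of text (normalization is assumed)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def containment(candidate, reference):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


def is_spam(message, known_spam, artifacts, n=8, threshold=0.5):
    """Return True if the message resembles any known spam message.

    Known artifacts are removed from both the message and the known spam,
    mirroring the abstract's statement that the known spam n-grams are
    created by executing the same steps.
    """
    artifact_grams = set().union(*(char_ngrams(a, n) for a in artifacts))
    grams = char_ngrams(message, n) - artifact_grams
    for spam in known_spam:
        spam_grams = char_ngrams(spam, n) - artifact_grams
        if containment(grams, spam_grams) >= threshold:
            return True
    return False


# Hypothetical sample data.
artifacts = ["Sent from my ExampleMail account"]
known_spam = [
    "Cheap pills online now, click here today. Sent from my ExampleMail account"
]
spam_like = "Cheap pills online now, click here today!"
ham = "Lunch at noon tomorrow? Sent from my ExampleMail account"
```

With this toy data, `is_spam(spam_like, known_spam, artifacts)` is true and `is_spam(ham, known_spam, artifacts)` is false. Passing an empty artifact list makes the legitimate message cross the threshold purely because of the shared footer, which is the kind of false match the feature selection step is meant to prevent.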
19 Claims
1. A computer implemented method for identifying spam email messages, the method comprising the steps of:
tokenizing an email message into a collection of overlapping n-grams;
comparing the collection of n-grams to n-grams of known artifacts found in email messages due to how the email messages were produced and transmitted, wherein the known artifacts comprise machine-generated text artifacts included in the email messages by email service providers;
removing n-grams that match an n-gram of a known artifact from the collection;
comparing the remaining n-grams in the collection to n-grams of known spam email messages; and
determining whether the email message comprises spam based on results of the second comparing step.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A non-transitory computer readable storage medium containing an executable computer program product for identifying spam email messages, the computer program product comprising:
program code, when executed by a computer processor, causing the computer processor to tokenize an email message into a collection of overlapping n-grams;
program code, when executed by a computer processor, causing the computer processor to compare the collection of n-grams to n-grams of known artifacts found in email messages due to how the email messages were produced and transmitted, wherein the known artifacts comprise machine-generated text artifacts included in the email messages by email service providers;
program code, when executed by a computer processor, causing the computer processor to remove n-grams that match an n-gram of a known artifact from the collection;
program code, when executed by a computer processor, causing the computer processor to compare the remaining n-grams in the collection to n-grams of known spam email messages; and
program code, when executed by a computer processor, causing the computer processor to determine whether the email message comprises spam based on results of the second comparing step.
Dependent claims: 12, 13, 14, 15
16. A computer system for identifying spam email messages, the computer system comprising:
a computer processor for executing computer program instructions; and
a non-transitory computer-readable storage medium having executable computer program instructions tangibly embodied thereon, the executable computer program instructions comprising program instructions, when executed by the computer processor, causing the computer processor to:
tokenize an email message into a collection of overlapping n-grams;
compare the collection of n-grams to n-grams of known artifacts found in email messages due to how the email messages were produced and transmitted, wherein the known artifacts comprise machine-generated text artifacts included in the email messages by email service providers;
remove n-grams that match an n-gram of a known artifact from the collection;
compare the remaining n-grams in the collection to n-grams of known spam email messages; and
determine whether the email message comprises spam based on results of the second comparing step.
Dependent claims: 17, 18, 19
Specification