Linguistic nonsense detection for undesirable message classification
5 Assignments
0 Petitions
Abstract
Nonsense words are removed from incoming emails and visually similar (look-alike) characters are replaced with the actual, corresponding characters, so that the emails can be more accurately analyzed to see if they are spam. More specifically, an incoming email stream is filtered, and the emails are normalized to enable more accurate spam detection. In some embodiments, the normalization comprises the removal of nonsense words and/or the replacement of look-alike characters according to a set of rules. In other embodiments, more and/or different normalization techniques are utilized. In some embodiments, the language in which an email is written is identified in order to aid in the normalization. Once incoming emails are normalized, they are then analyzed to detect spam or other forms of undesirable email, such as phishing emails.
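The look-alike replacement step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LOOKALIKES` table and the function names are hypothetical, and the source does not list which substitutions are actually used.

```python
# Hypothetical look-alike table (an assumption for illustration only);
# the abstract does not enumerate specific character substitutions.
LOOKALIKES = {
    "0": "o",
    "1": "i",
    "3": "e",
    "$": "s",
}


def replace_lookalikes(word: str) -> str:
    """Replace look-alike characters in a token that also contains letters."""
    if not any(ch.isalpha() for ch in word):
        # Leave pure numbers or symbols (for example "2024") untouched.
        return word
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)


def normalize(text: str) -> str:
    """Apply look-alike replacement to every whitespace-separated token."""
    return " ".join(replace_lookalikes(word) for word in text.split())


print(normalize("V1agra is ch3ap in 2024"))
```

With this table, the call prints `Viagra is cheap in 2024`. A naive table like this one would also rewrite legitimate mixed tokens such as product names containing digits, so a practical system would presumably need language or context information, as the abstract suggests.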
56 Citations
18 Claims
1. A computer implemented method for identifying undesirable electronic messages by a computer, the method comprising the steps of:
identifying, by the computer, incoming electronic messages;
normalizing, by the computer, identified electronic messages according to a plurality of rules for distinguishing non-legitimate words from legitimate words, said normalizing further comprising identifying non-legitimate words obfuscating electronic messages according to the plurality of rules, and deleting the identified non-legitimate words from the electronic messages;
wherein said plurality of rules for distinguishing non-legitimate words from legitimate words comprises at least three rules from a group of rules consisting of;
a rule specifying a maximum number of consecutive vowels in a legitimate word;
a rule specifying a maximum number of consecutive consonants in a legitimate word;
a rule specifying a maximum number of consecutive uses of any single character in a legitimate word;
a rule specifying a maximum number of transitions between upper case letters and lower case letters in a legitimate word;
a rule specifying a maximum length of a legitimate word containing numbers without punctuation;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters and numbers;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters, numbers and punctuation;
a rule specifying a minimum number of vowels in a legitimate word;
a rule specifying a minimum number of consonants in a legitimate word;
a rule specifying a minimum ratio of vowels to consonants in a legitimate word; and
a rule specifying a maximum ratio of vowels to consonants in a legitimate word; and
analyzing, by the computer, normalized electronic message to identify undesirable electronic messages.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
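As an illustration of how three of the rules recited in claim 1 might be applied, the sketch below checks each word against a maximum run of consecutive vowels, a maximum run of consecutive consonants, and a maximum number of upper/lower case transitions, then deletes words that fail. The thresholds and all function names are assumptions; the claim names the rule types but gives no values, and the sketch treats only ASCII `aeiou` as vowels.

```python
VOWELS = set("aeiouAEIOU")

# Hypothetical thresholds (assumptions): the claim does not specify values.
MAX_CONSECUTIVE_VOWELS = 4
MAX_CONSECUTIVE_CONSONANTS = 5
MAX_CASE_TRANSITIONS = 2


def is_vowel(ch: str) -> bool:
    return ch in VOWELS


def is_consonant(ch: str) -> bool:
    return ch.isalpha() and ch not in VOWELS


def longest_run(word: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = run = 0
    for ch in word:
        run = run + 1 if predicate(ch) else 0
        best = max(best, run)
    return best


def case_transitions(word: str) -> int:
    """Count switches between upper and lower case among the word's letters."""
    letters = [ch for ch in word if ch.isalpha()]
    return sum(1 for a, b in zip(letters, letters[1:]) if a.isupper() != b.isupper())


def is_legitimate(word: str) -> bool:
    """A word is kept only if it satisfies all three illustrative rules."""
    return (
        longest_run(word, is_vowel) <= MAX_CONSECUTIVE_VOWELS
        and longest_run(word, is_consonant) <= MAX_CONSECUTIVE_CONSONANTS
        and case_transitions(word) <= MAX_CASE_TRANSITIONS
    )


def remove_nonsense(message: str) -> str:
    """Delete words flagged as non-legitimate, keeping the rest in order."""
    return " ".join(word for word in message.split() if is_legitimate(word))


print(remove_nonsense("Buy xkqzvbtr cheap aeiouu meds qWeRtY now"))
```

Under these assumed thresholds the call prints `Buy cheap meds now`: `xkqzvbtr` fails the consonant rule, `aeiouu` fails the vowel rule, and `qWeRtY` fails the case-transition rule. The cleaned message would then be passed to whatever spam analysis follows.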
8. At least one non-transitory medium storing a computer program product for identifying undesirable electronic messages, the computer program product comprising:
program code for identifying incoming electronic messages;
program code for normalizing identified electronic messages according to a plurality of rules for distinguishing non-legitimate words from legitimate words, said normalizing further comprising identifying non-legitimate words obfuscating electronic messages according to the plurality of rules and deleting the identified non-legitimate words from the electronic messages;
wherein said plurality of rules for distinguishing non-legitimate words from legitimate words comprises at least three rules from a group of rules consisting of;
a rule specifying a maximum number of consecutive vowels in a legitimate word;
a rule specifying a maximum number of consecutive consonants in a legitimate word;
a rule specifying a maximum number of consecutive uses of any single character in a legitimate word;
a rule specifying a maximum number of transitions between upper case letters and lower case letters in a legitimate word;
a rule specifying a maximum length of a legitimate word containing numbers without punctuation;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters and numbers;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters, numbers and punctuation;
a rule specifying a minimum number of vowels in a legitimate word;
a rule specifying a minimum number of consonants in a legitimate word;
a rule specifying a minimum ratio of vowels to consonants in a legitimate word; and
a rule specifying a maximum ratio of vowels to consonants in a legitimate word; and
program code for analyzing normalized electronic message to identify undesirable electronic messages.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
15. A computer system for identifying undesirable electronic messages, the computer system comprising:
means for identifying incoming electronic messages;
means for normalizing identified electronic messages according to a plurality of rules for distinguishing non-legitimate words from legitimate words, said normalizing further comprising identifying non-legitimate words obfuscating electronic messages according to the plurality of rules, and deleting the identified non-legitimate words from the electronic messages;
wherein said plurality of rules for distinguishing non-legitimate words from legitimate words comprises at least three rules from a group of rules consisting of;
a rule specifying a maximum number of consecutive vowels in a legitimate word;
a rule specifying a maximum number of consecutive consonants in a legitimate word;
a rule specifying a maximum number of consecutive uses of any single character in a legitimate word;
a rule specifying a maximum number of transitions between upper case letters and lower case letters in a legitimate word;
a rule specifying a maximum length of a legitimate word containing numbers without punctuation;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters and numbers;
a rule specifying a maximum length of a legitimate word containing upper case letters, lower case letters, numbers and punctuation;
a rule specifying a minimum number of vowels in a legitimate word;
a rule specifying a minimum number of consonants in a legitimate word;
a rule specifying a minimum ratio of vowels to consonants in a legitimate word; and
a rule specifying a maximum ratio of vowels to consonants in a legitimate word; and
means for analyzing normalized electronic message to identify undesirable electronic messages.
- View Dependent Claims (16, 17, 18)
Specification