Phonetic filtering of undesired email messages
First Claim
1. A method comprising:
- training an email system for determining spam, where training includes at least the following:
  - tokenizing at least a portion of a first email message to create a first token;
  - creating a second token comprising a phonetic equivalent of the first token, wherein creating the phonetic equivalent of the first token comprises:
    - identifying a string of characters within the first token, the string of characters including a non-alphabetic character; and
    - removing the non-alphabetic character from the string of characters;
  - determining, from each token created, a spam probability for the first email message;
  - determining whether each token created is present in a database of tokens;
  - in response to a determination that a token created is not present in the database of tokens, assigning a probability value for the token created as spam and adding the token created and the probability value to the database of tokens; and
  - in response to a determination that the token created is present in the database of tokens, updating an assigned probability value for the token present to reflect contribution of the token created; and
- filtering a second email message according to the training.
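The training and filtering steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how tokens are delimited or how a probability value is computed, so whitespace tokenization, the count-based ratio, the averaged message score, the 0.5 default for unknown tokens, and every function name below are assumptions chosen for illustration.

```python
import re

NON_ALPHA = re.compile(r"[^A-Za-z]")

def tokenize(text):
    """Split a message into lowercased, whitespace-delimited tokens (assumed scheme)."""
    return text.lower().split()

def phonetic_equivalent(token):
    """Remove every non-alphabetic character from the token, as the claim recites."""
    return NON_ALPHA.sub("", token)

def tokens_for(message):
    """Return each first token, followed by its phonetic equivalent when it differs."""
    created = []
    for first in tokenize(message):
        created.append(first)
        second = phonetic_equivalent(first)
        if second and second != first:
            created.append(second)
    return created

def train(database, message, is_spam):
    """Add unseen tokens to the database; update the probability of known ones."""
    for token in tokens_for(message):
        if token not in database:
            # Token not present: assign an initial value and add it.
            database[token] = {"spam": int(is_spam), "total": 1}
        else:
            # Token present: update to reflect this occurrence.
            database[token]["spam"] += int(is_spam)
            database[token]["total"] += 1
        entry = database[token]
        # Assumed formula: fraction of occurrences seen in spam.
        entry["probability"] = entry["spam"] / entry["total"]

def score(database, message, unknown=0.5):
    """Average the stored token probabilities (a simplification, not from the claim)."""
    probs = [database[t]["probability"] if t in database else unknown
             for t in tokens_for(message)]
    return sum(probs) / len(probs) if probs else unknown

db = {}
train(db, "cheap v.i.a.g.r.a now", True)
train(db, "meeting agenda now", False)
print(tokens_for("v.i.a.g.r.a"))
print(score(db, "viagra"))
```

Because the obfuscated token and its stripped form are both stored during training, a later message containing either spelling matches an entry in the database.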
Abstract
Several embodiments, among others, provided in the present disclosure teach a filtering of email messages for spam based on phonetic equivalents of words found in the email message. In some embodiments, an email message having a word is received, and a phonetic equivalent of the word is generated. Thereafter, the phonetic equivalent of the word is tokenized to generate a token representative of the phonetic equivalent. The generated token is then used to determine a spam probability.
79 Citations
17 Claims
1. A method comprising the training and filtering steps recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
- a memory that stores:
  - first tokenize logic configured to tokenize at least a portion of a received email message to create a first token;
  - second tokenize logic configured to tokenize a phonetic equivalent of the first token, the second tokenize logic comprising:
    - string-identification logic configured to identify a string of characters in the first token, the string of characters including a non-alphabetic character; and
    - character-removal logic configured to remove the non-alphabetic character from the string of characters;
  - third tokenize logic configured to tokenize an attachment of the received email message;
  - spam-determination logic configured to determine a spam probability value from the generated tokens; and
  - sorting logic configured to sort generated tokens in accordance with the corresponding determined spam probability value.
Dependent claims: 9, 10, 11, 12
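The sorting and spam-determination logic in the system claim can be sketched as follows. The claim states only that tokens are sorted by their spam probability values and that a spam probability is determined from the generated tokens. The descending order, the naive-Bayes style combination, the clamping bounds, and the function names are assumptions borrowed from common Bayesian spam filters, not details taken from the claim.

```python
from math import prod

def sort_tokens(probabilities):
    """Return (token, probability) pairs, most spam-like first (assumed order)."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

def combined_probability(values, low=0.01, high=0.99):
    """Combine per-token values with a naive-Bayes style formula (assumption)."""
    # Clamp so a single 0.0 or 1.0 cannot zero out both products.
    clamped = [min(max(v, low), high) for v in values]
    spam = prod(clamped)
    ham = prod(1.0 - v for v in clamped)
    return spam / (spam + ham)

token_probabilities = {"viagra": 0.99, "meeting": 0.05, "now": 0.5}
ranked = sort_tokens(token_probabilities)
print(ranked)

# Combine the two highest-ranked tokens into one message-level value.
top_values = [value for _, value in ranked[:2]]
print(combined_probability(top_values))
```

Sorting first makes it straightforward to base the message-level value on only the highest-ranked tokens, which is one plausible reason for a sorting step.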
13. A computer-readable medium that includes a program that, when executed by a computer, causes the computer to perform at least the following:
- tokenize at least a portion of a received email message to create a first token;
- generate a phonetic equivalent of the first token, wherein generating the phonetic equivalent of the first token comprises:
  - identifying a string of characters within the first token, the string of characters including a non-alphabetic character; and
  - removing the non-alphabetic character from the string of characters;
- tokenize the phonetic equivalent of the word to create a second token;
- determine a spam probability from each token created; and
- sort each token created in accordance with the corresponding determined spam probability value.
Dependent claims: 14, 15, 16, 17
Specification