Email analysis using fuzzy matching of text
Abstract
Translation of text or messages produces a message that can be analyzed more reliably or efficiently, for example to detect spam in email messages. One translation process takes into account statistics of erroneous and intentional misspellings. Another process identifies and removes characters or character codes that do not generate visible symbols in a message displayed to a user. Another process detects symbols such as periods, commas, and dashes that are interspersed in text such that the symbols do not unduly interfere with, or prevent, a user from perceiving a spam message. Another process can detect use of foreign language symbols and terms. Still other processes and techniques are presented to counter obfuscating spammer tactics and to provide for efficient and accurate analysis of message content. Groups of similar content items (e.g., words, phrases, images, ASCII text) are correlated, and analysis can proceed after substituting items in a group with other items in the same group, so that a more accurate detection of “sameness” of content can be achieved. Dictionaries are used for spam or ham words or phrases. Other features are described.
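As a rough illustration of three of the processes named in the abstract (removing non-visible characters, removing interspersed symbols, and substituting items within a group of similar items), a minimal Python sketch follows. It is not the patent's implementation. The names `LOOKALIKES`, `INTERSPERSED`, `strip_invisible`, and `normalize` are hypothetical, the look-alike table is an invented example of a substitution group, and "non-visible" is approximated here by the Unicode format category `Cf`.

```python
import re
import unicodedata

# Hypothetical substitution group: look-alike characters mapped to one
# canonical letter. The table is an assumption made for this sketch.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

# A single symbol sitting directly between two letters, as in "v.i.a.g.r.a".
INTERSPERSED = re.compile(r"(?<=[A-Za-z])[.,\-_*](?=[A-Za-z])")


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), e.g. zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def normalize(text: str) -> str:
    """Return a lowercased, de-noised form of the text for later comparison."""
    text = strip_invisible(text)
    text = INTERSPERSED.sub("", text)
    words = []
    for token in text.split():
        # Only map look-alikes in tokens that contain a letter, so that
        # plain numbers such as "100" are left untouched.
        if any(ch.isalpha() for ch in token):
            token = token.translate(LOOKALIKES)
        words.append(token.lower())
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("V\u200bi\u200ba.g-r.a"))
    print(normalize("fr33 m0ney"))
```

The sketch is deliberately aggressive: it would also collapse legitimate hyphenated words, domain names, and email addresses, so a practical filter would need to protect such tokens before normalizing.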
191 Citations
20 Claims
1. A method for analyzing character codes in text of a message to determine a probability that the message is spam, the method comprising:
(a) receiving a message that includes character codes in text of the message;
(b) identifying character codes of the text that are likely being used to obfuscate a word or phrase of the text;
(c) deobfuscating each word or phrase of the text that is identified at step (b) as likely being obfuscated, to produce deobfuscated text; and
(d) determining an extent that the character codes of the text are identified as likely being used to obfuscate a word or phrase of the text by determining one or more of the following: a quantity of words or phrases identified at step (b) as likely being obfuscated; which particular words or phrases are identified at step (b) as likely being obfuscated, and
(e) analyzing the deobfuscated text by comparing the deobfuscated text to text of one or more other messages known to be spam;
(f) determining a probability that the message is spam based on both (i) results of the analyzing the deobfuscated text, by comparing the deobfuscated text to text of one or more other messages known to be spam, performed at step (e), and (ii) results of the determining the extent that the character codes of the text are likely being used to obfuscate a word or phrase of the text, performed at step (d);
wherein at least steps (b), (c), (d), (e) and (f) are performed by one or more processors.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
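Steps (a) through (f) of claim 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed method itself: the claim does not specify how obfuscation is detected, how texts are compared, or how the two results are combined. Here the detection heuristic (`looks_obfuscated`), the look-alike table, the use of the standard library's `difflib.SequenceMatcher` as the comparison, and the weighted sum with weights `w_text` and `w_obf` are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Hypothetical look-alike table; an assumption made for this sketch.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def looks_obfuscated(word: str) -> bool:
    """Step (b), simplified: letters mixed with digits or symbols in one word."""
    core = word.strip(".,;:!?\"'()")
    has_alpha = any(ch.isalpha() for ch in core)
    # Apostrophes and hyphens are tolerated so "don't" is not flagged.
    has_other = any(not ch.isalpha() and ch not in "'-" for ch in core)
    return has_alpha and has_other


def deobfuscate_word(word: str) -> str:
    """Step (c), simplified: undo look-alikes, then drop non-alphanumerics."""
    mapped = word.translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch.isalnum()).lower()


def spam_probability(message, known_spam, w_text=0.7, w_obf=0.3):
    """Return (probability, flagged_words, deobfuscated_text)."""
    words = message.split()                                  # step (a)
    flagged = [w for w in words if looks_obfuscated(w)]      # step (b)
    clean = " ".join(                                        # step (c)
        deobfuscate_word(w) if w in flagged else w.lower() for w in words
    )
    # Step (d): extent of obfuscation as the share of flagged words;
    # the flagged list itself records which particular words were flagged.
    extent = len(flagged) / len(words) if words else 0.0
    # Step (e): best fuzzy match against messages known to be spam.
    similarity = max(
        (SequenceMatcher(None, clean, s.lower()).ratio() for s in known_spam),
        default=0.0,
    )
    # Step (f): combine both results; the weighted sum is an assumption.
    probability = w_text * similarity + w_obf * extent
    return probability, flagged, clean


if __name__ == "__main__":
    spam_corpus = ["buy cheap viagra now"]
    print(spam_probability("Buy ch3ap v.1.a.g.r.a now", spam_corpus))
    print(spam_probability("Lunch at noon tomorrow?", spam_corpus))
```

Because the weights sum to 1 and both inputs lie between 0 and 1, the result stays in that range. The point of step (f) is visible in the design: a message that matches known spam only after deobfuscation scores higher than the same match would without any obfuscation, since the extent of obfuscation is itself treated as evidence.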
13. A machine-readable storage medium including instructions executable by one or more processors to analyze character codes in text of a message to determine a probability that the message is spam, the machine-readable storage medium comprising:
instructions to receive a message that includes character codes in text of the message;
instructions to identify character codes of the text that are likely being used to obfuscate a word or phrase of the text;
instructions to deobfuscate each word or phrase of the text that is identified as likely being obfuscated, to produce deobfuscated text;
instructions to determine an extent that the character codes of the text are identified as likely being used to obfuscate a word or phrase of the text by determining one or more of the following: a quantity of words or phrases identified as likely being obfuscated, and which particular words or phrases are identified as likely being obfuscated;
instructions to analyze the deobfuscated text by comparing the deobfuscated text to text of one or more other messages known to be spam; and
instructions to determine a probability that the message is spam based on both (i) results of analysis of the deobfuscated text by comparing the deobfuscated text to text of one or more other messages known to be spam, and (ii) results of determining the extent that the character codes of the text are likely being used to obfuscate a word or phrase of the text.
Dependent claims: 14, 15, 16, 17, 18, 19
20. A system to analyze character codes in text of a message to determine a probability that the message is spam, comprising:
at least one processor;
machine-readable storage medium including instructions that are executable by the at least one processor, the instructions including
instructions to receive a message that includes character codes in text of the message;
instructions to identify character codes of the text that are likely being used to obfuscate a word or phrase of the text;
instructions to deobfuscate each word or phrase of the text that is identified as likely being obfuscated, to produce deobfuscated text;
instructions to determine an extent that the character codes of the text are identified as likely being used to obfuscate a word or phrase of the text by determining one or more of the following: a quantity of words or phrases identified as likely being obfuscated, and which particular words or phrases are identified as likely being obfuscated;
instructions to analyze the deobfuscated text by comparing the deobfuscated text to text of one or more other messages known to be spam; and
instructions to determine a probability that the message is spam based on both (i) results of analysis of the deobfuscated text, by comparing the deobfuscated text to text of one or more other messages known to be spam, and (ii) results of determining the extent that the character codes of the text are likely being used to obfuscate a word or phrase of the text.
Specification