Systems and Methods for Spam Detection Using Character Histograms
First Claim
1. A method comprising:
employing a computer system to receive a target string forming a part of an electronic communication;
in response to receiving the target string, employing the computer system to determine a string eligibility criterion according to the target string;
employing the computer system to pre-filter a corpus of reference strings according to the string eligibility criterion, to produce a plurality of candidate strings;
in response to selecting the candidate strings, employing the computer system to perform a first comparison between a character histogram of the target string and a character histogram of a candidate string of the plurality of candidate strings, and a second comparison between a timestamp of the electronic communication and a timestamp of the candidate string; and
employing the computer system to determine whether the electronic communication is spam or non-spam according to a result of the first comparison and the second comparison.
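The steps of the claim can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the claim does not say what the eligibility criterion is, how histograms are compared, or how the two comparison results are combined. The length-window criterion, the normalized L1 histogram distance, the `Reference` record, and all threshold values below are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reference:
    """Hypothetical corpus record: a previously seen string and its metadata."""
    text: str
    timestamp: float  # seconds since epoch (assumed unit)
    is_spam: bool


def histogram(s: str) -> Counter:
    """Count occurrences of each distinct character in s."""
    return Counter(s)


def histogram_distance(a: Counter, b: Counter) -> float:
    """Normalized L1 distance between two character histograms, in [0, 1]."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    diff = sum(abs(a[ch] - b[ch]) for ch in set(a) | set(b))
    return diff / total


def classify(target: str, target_ts: float, corpus: list,
             length_tolerance: float = 0.2,
             max_distance: float = 0.1,
             max_age: float = 3600.0) -> bool:
    """Return True if the target looks like spam under the assumed thresholds."""
    # Eligibility criterion derived from the target: a length window (assumption).
    lo = len(target) * (1 - length_tolerance)
    hi = len(target) * (1 + length_tolerance)
    # Pre-filter the corpus so that only eligible candidates are compared.
    candidates = [r for r in corpus if lo <= len(r.text) <= hi]

    target_hist = histogram(target)
    for ref in candidates:
        # First comparison: character histograms.
        similar = histogram_distance(target_hist, histogram(ref.text)) <= max_distance
        # Second comparison: timestamps.
        recent = abs(target_ts - ref.timestamp) <= max_age
        if ref.is_spam and similar and recent:
            return True
    return False
```

Pre-filtering by length keeps the number of histogram comparisons small, and the timestamp check restricts matches to references from the same short-lived wave.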
Abstract
The described spam detection techniques, which include string identification, pre-filtering, and character histogram and timestamp comparison steps, facilitate accurate, computationally efficient detection of rapidly changing spam that arrives in short-lasting waves. In some embodiments, a computer system extracts a target character string from an electronic communication such as a blog comment, transmits the string to an anti-spam server, and receives from that server an indicator of whether the electronic communication is spam or non-spam. The anti-spam server makes this determination according to certain features of the character histogram of the target string. Some embodiments also perform unsupervised clustering of incoming target strings, wherein all members of a cluster have similar character histograms.
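As an illustration of the clustering idea, the sketch below groups incoming strings with a greedy "leader" scheme: each string joins the first cluster whose first member has a sufficiently similar character histogram, and otherwise starts a new cluster. The abstract does not name a clustering algorithm, so the leader scheme, the distance function, and the threshold value are all assumptions chosen for brevity.

```python
from collections import Counter


def hist_distance(a: Counter, b: Counter) -> float:
    """Normalized L1 distance between two character histograms, in [0, 1]."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return sum(abs(a[c] - b[c]) for c in set(a) | set(b)) / total


def cluster_strings(strings, threshold=0.15):
    """Greedy leader clustering; returns a list of clusters (lists of strings)."""
    leaders = []   # histogram of the first member of each cluster
    clusters = []  # clusters[i] holds the strings assigned to leaders[i]
    for s in strings:
        h = Counter(s)
        for i, leader in enumerate(leaders):
            if hist_distance(h, leader) <= threshold:
                clusters[i].append(s)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(h)
            clusters.append([s])
    return clusters
```

With this scheme every member of a cluster lies within `threshold` of that cluster's first member, which is one simple way to keep histograms within a cluster similar.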
28 Claims
1. A method comprising:
employing a computer system to receive a target string forming a part of an electronic communication;
in response to receiving the target string, employing the computer system to determine a string eligibility criterion according to the target string;
employing the computer system to pre-filter a corpus of reference strings according to the string eligibility criterion, to produce a plurality of candidate strings;
in response to selecting the candidate strings, employing the computer system to perform a first comparison between a character histogram of the target string and a character histogram of a candidate string of the plurality of candidate strings, and a second comparison between a timestamp of the electronic communication and a timestamp of the candidate string; and
employing the computer system to determine whether the electronic communication is spam or non-spam according to a result of the first comparison and the second comparison.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A computer system comprising at least a processor programmed to:
receive a target string forming a part of an electronic communication;
in response to receiving the target string, determine a string eligibility criterion according to the target string;
pre-filter a corpus of reference strings according to the string eligibility criterion, to produce a plurality of candidate strings;
in response to selecting the candidate strings, perform a first comparison between a character histogram of the target string and a character histogram of a candidate string of the plurality of candidate strings, and a second comparison between a timestamp of the electronic communication and a timestamp of the candidate string; and
determine whether the electronic communication is spam or non-spam according to a result of the first comparison and the second comparison.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
27. A method comprising:
employing a computer system to receive an electronic communication;
in response to receiving the electronic communication, employing the computer system to extract a target string from the electronic communication;
employing the computer system to transmit the target string to an anti-spam server; and
in response to transmitting the target string, receiving a target label indicative of whether the electronic communication is spam or non-spam, wherein the target label is determined at the anti-spam server and wherein determining the target label comprises:
employing the anti-spam server to determine an eligibility criterion according to the target string;
employing the anti-spam server to pre-filter a corpus of reference strings according to the criterion condition, to produce a plurality of candidate strings;
in response to selecting the candidate strings, employing the anti-spam server to perform a first comparison between a character histogram of the target string and a character histogram of a candidate string of the plurality of candidate strings, and a second comparison between a timestamp of the electronic communication and a timestamp of the candidate string; and
employing the anti-spam server to determine the target label according to a result of the first comparison and the second comparison.
28. A method comprising:
employing a computer system to receive a target string forming a part of an electronic communication;
in response to receiving the target string, employing the computer system to determine a string eligibility criterion according to the target string;
employing the computer system to pre-filter a corpus of reference strings according to the string eligibility criterion, to produce a plurality of candidate strings;
in response to selecting the candidate strings, employing the computer system to determine an inter-string distance separating the target string from a candidate string of the plurality of candidate strings, the inter-string distance determined according to a count of occurrences of a selected character within the target string and a count of occurrences of the selected character within the candidate string; and
employing the computer system to determine whether the electronic communication is spam or non-spam according to the inter-string distance.
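One possible reading of the inter-string distance in claim 28 is sketched below: for each character in a selected set, count its occurrences in the target and in the candidate, sum the absolute differences, and normalize. The claim fixes neither the selected characters nor the formula, so the `SELECTED` alphabet, the case folding, and the normalization are assumptions made for illustration.

```python
import string

# Assumed set of selected characters: lowercase ASCII letters and digits.
SELECTED = string.ascii_lowercase + string.digits


def inter_string_distance(target: str, candidate: str, selected: str = SELECTED) -> float:
    """Distance in [0, 1] based on per-character occurrence counts."""
    t, c = target.lower(), candidate.lower()
    # Sum of count differences over the selected characters.
    diff = sum(abs(t.count(ch) - c.count(ch)) for ch in selected)
    # Total number of selected characters in both strings, used to normalize.
    norm = sum(t.count(ch) + c.count(ch) for ch in selected)
    return diff / norm if norm else 0.0
```

A distance of 0 means the two strings contain the selected characters in identical quantities, regardless of order; a distance of 1 means they share none of them. A spam decision would then compare this value against a threshold, which the claim leaves open.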
Specification