Systems and methods for providing a spam database and identifying spam communications
First Claim
1. A computer-implemented system of identifying an incoming e-mail as a spam e-mail, the system comprising:
at least one processor;
a spam database which stores a plurality of known spam e-mails;
a hardware server which performs offline processing, the offline processing comprising;
accessing a known spam e-mail from the spam database;
creating a first set of tokens from the known spam e-mail;
calculating a first total as a number of tokens in first set of tokens;
storing the first set of tokens and the first total;
storing a third count for each known spam e-mail stored in the spam database, wherein the third count represents a number of times the incoming e-mail was identified as spam based on an easy signature computed using the first set of tokens and the first total corresponding to the known spam e-mail;
computing an average count between the first count and a third count based on a minimum of the first count and the third count, the third count being a count of the unique token in the third set of tokens for a predetermined time period; and
removing the known spam e-mail from the spam database when the average count is less than a predetermined threshold;
a client which performs online processing, the online processing comprising;
receiving the incoming e-mail;
creating a second set of tokens from the incoming e-mail;
calculating a second total as a number of tokens in the second set of tokens;
accessing the first set of tokens and the first total corresponding to one of the plurality of known spam e-mails in the spam database;
determining a number of common tokens based on a minimum of a first count and a second count, the first count being a count of each unique token in the first set of tokens and the second count being a count of the each unique token in the second set of tokens;
computing an easy signature as a ratio of the number of common tokens and the sum of the first total and the second total; and
designating the incoming e-mail as spam when the easy signature exceeds a predetermined threshold; and
wherein when the easy signature does not exceed the predetermined threshold the server determines whether there are additional known spam e-mails in the spam database; and
the client designates the incoming e-mail as not spam when there are no additional known spam e-mails in the spam database.
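The online processing recited in the claim can be illustrated with a minimal sketch. It is not the patentee's implementation. The tokenizer, the function names (`tokenize`, `index_known_spam`, `easy_signature`, `is_spam`) and the threshold value of 0.4 are assumptions introduced only for illustration; the claim does not specify how tokens are formed or what the threshold is. The offline step that removes rarely matched spam e-mails from the database is omitted.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: tokens are lowercase runs of letters, digits and apostrophes.
    # The claim does not define the tokenization rule.
    return re.findall(r"[a-z0-9']+", text.lower())


def index_known_spam(known_spam_emails):
    # Offline step: for each known spam e-mail, store its per-token counts
    # (the "first set of tokens") and its token total (the "first total").
    index = []
    for email in known_spam_emails:
        tokens = tokenize(email)
        index.append((Counter(tokens), len(tokens)))
    return index


def easy_signature(first_counts, first_total, second_counts, second_total):
    # Common tokens: for each unique token, the minimum of its count in the
    # known spam e-mail and its count in the incoming e-mail, summed.
    common = sum(min(count, second_counts[token])
                 for token, count in first_counts.items())
    denominator = first_total + second_total
    # Ratio of common tokens to the sum of the two totals, as claimed.
    return common / denominator if denominator else 0.0


def is_spam(incoming_email, index, threshold=0.4):
    # Online step: compare the incoming e-mail with each known spam e-mail in
    # turn; designate it as not spam once no known spam e-mails remain.
    second_tokens = tokenize(incoming_email)
    second_counts = Counter(second_tokens)
    second_total = len(second_tokens)
    for first_counts, first_total in index:
        score = easy_signature(first_counts, first_total,
                               second_counts, second_total)
        if score > threshold:
            return True
    return False


index = index_known_spam(["Buy cheap pills now today"])
print(is_spam("buy cheap pills now", index))       # 4 common tokens / 9 total
print(is_spam("meeting at noon tomorrow", index))  # no common tokens
```

Because the denominator is the sum of both totals, two identical e-mails score 0.5 under the formula as claimed, so any workable threshold in this sketch lies below 0.5.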
Abstract
Systems and methods are provided for identifying unsolicited or unwanted electronic communications, such as spam. The disclosed embodiments also encompass systems and methods for selecting content items from a content item database. Consistent with certain embodiments, computer-implemented systems and methods may use a clustering-based statistical content matching anti-spam algorithm to identify and filter spam. Such an anti-spam algorithm may be implemented to determine a degree of similarity between an incoming e-mail and a collection of one or more spam e-mails stored in a database. If the degree of similarity exceeds a predetermined threshold, the incoming e-mail may be classified as spam. Further, in accordance with other embodiments, systems and methods may be provided to determine a degree of similarity between a query or search string from a user and content items stored in a database. If the degree of similarity exceeds a predetermined threshold, the content item from the database may be identified as a content item that matches the query or search string provided by the user.
54 Citations
7 Claims
Specification