Systems and methods for providing a spam database and identifying spam communications

US 9,407,463 B2
Filed: 07/11/2011
Issued: 08/02/2016
Est. Priority Date: 07/11/2011
Status: Active Grant

First Claim

1. A computer-implemented system of identifying an incoming e-mail as a spam e-mail, the system comprising:

at least one processor;

a spam database which stores a plurality of known spam e-mails;

a hardware server which performs offline processing, the offline processing comprising:

accessing a known spam e-mail from the spam database;

creating a first set of tokens from the known spam e-mail;

calculating a first total as a number of tokens in the first set of tokens;

storing the first set of tokens and the first total;

storing a third count for each known spam e-mail stored in the spam database, wherein the third count represents a number of times the incoming e-mail was identified as spam based on an easy signature computed using the first set of tokens and the first total corresponding to the known spam e-mail;

computing an average count between the first count and a third count based on a minimum of the first count and the third count, the third count being a count of the unique token in the third set of tokens for a predetermined time period; and

removing the known spam e-mail from the spam database when the average count is less than a predetermined threshold;

a client which performs online processing, the online processing comprising:

receiving the incoming e-mail;

creating a second set of tokens from the incoming e-mail;

calculating a second total as a number of tokens in the second set of tokens;

accessing the first set of tokens and the first total corresponding to one of the plurality of known spam e-mails in the spam database;

determining a number of common tokens based on a minimum of a first count and a second count, the first count being a count of each unique token in the first set of tokens and the second count being a count of each unique token in the second set of tokens;

computing an easy signature as a ratio of the number of common tokens and the sum of the first total and the second total; and

designating the incoming e-mail as spam when the easy signature exceeds a predetermined threshold; and

wherein when the easy signature does not exceed the predetermined threshold, the server determines whether there are additional known spam e-mails in the spam database; and

the client designates the incoming e-mail as not spam when there are no additional known spam e-mails in the spam database.
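
The online comparison recited in claim 1 can be sketched in a few lines of Python. This is only an illustration of one literal reading of the claim language, not the patentee's implementation: the names `easy_signature` and `is_spam`, the use of plain token lists, and any threshold value are assumptions introduced here. Under this reading, two e-mails with identical tokens score 0.5, because the common-token count is divided by the sum of both totals.

```python
from collections import Counter
from typing import Iterable, Sequence


def easy_signature(first_tokens: Sequence[str], second_tokens: Sequence[str]) -> float:
    """Ratio of the number of common tokens to (first total + second total)."""
    first_counts = Counter(first_tokens)    # first count, per unique token
    second_counts = Counter(second_tokens)  # second count, per unique token
    # Common tokens: for each unique token, take the minimum of the two counts.
    common = sum(min(n, second_counts[tok]) for tok, n in first_counts.items())
    total = len(first_tokens) + len(second_tokens)
    return common / total if total else 0.0


def is_spam(
    incoming_tokens: Sequence[str],
    known_spam_token_lists: Iterable[Sequence[str]],
    threshold: float,  # hypothetical value; the claim only says "predetermined"
) -> bool:
    """Compare the incoming e-mail against each stored spam e-mail in turn."""
    for spam_tokens in known_spam_token_lists:
        if easy_signature(spam_tokens, incoming_tokens) > threshold:
            return True
    # No additional known spam e-mails remain, so designate as not spam.
    return False
```

For instance, with whitespace-split words standing in for tokens (an assumption made only for this example), `easy_signature("buy cheap pills now".split(), "buy cheap pills today".split())` evaluates to 3 / 8 = 0.375.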

Abstract

Systems and methods are provided for identifying unsolicited or unwanted electronic communications, such as spam. The disclosed embodiments also encompass systems and methods for selecting content items from a content item database. Consistent with certain embodiments, computer-implemented systems and methods may use a clustering-based statistical content matching anti-spam algorithm to identify and filter spam. Such an anti-spam algorithm may be implemented to determine a degree of similarity between an incoming e-mail and a collection of one or more spam e-mails stored in a database. If the degree of similarity exceeds a predetermined threshold, the incoming e-mail may be classified as spam. Further, in accordance with other embodiments, systems and methods may be provided to determine a degree of similarity between a query or search string from a user and content items stored in a database. If the degree of similarity exceeds a predetermined threshold, the content item from the database may be identified as a content item that matches the query or search string provided by the user.

7 Claims

- 2. The computer-implemented system of claim 1, wherein known spam e-mail comprises undesired, unsolicited, or duplicative e-mail sent indiscriminately to a plurality of users.
- 3. The computer-implemented system of claim 1, wherein creating a first set of tokens comprises:
  - processing the known spam e-mail by changing an upper-case letter into a lower-case letter and removing a space; and
  - creating the first set of tokens from the processed spam e-mail, each token having a predetermined length and overlapping a previous token by including one or more characters from the previous token.
- 4. The computer-implemented system of claim 3, wherein the processing is applied to the subject of the e-mail.
- 5. The computer-implemented system of claim 3, wherein the processing is applied to the body of the e-mail.
- 6. The computer-implemented system of claim 3, wherein the predetermined length of each token is three.
- 7. The computer-implemented system of claim 1, wherein the offline processing further comprises:
  - computing a vector wherein an element of the vector represents a first count of a token selected from the first set of tokens;
  - applying a K-means algorithm to identify a plurality of clusters of e-mails in the spam database based on the computed vector;
  - identifying a representative spam e-mail for each cluster; and
  - storing the first set of tokens and the first total corresponding to the identified representative spam e-mail.
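
The tokenization of claims 3 and 6 and the clustering of claim 7 might be sketched as below. This is a rough Python illustration under several assumptions that the claims do not state: the window advances one character at a time, the K-means seeding simply takes the first `k` vectors, and the representative of a cluster is taken to be the member closest to the centroid. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def make_tokens(text: str, length: int = 3) -> List[str]:
    """Lower-case the text, remove spaces, and emit overlapping tokens."""
    cleaned = text.lower().replace(" ", "")
    # Assumed step of 1: each token shares length - 1 characters with the previous one.
    return [cleaned[i:i + length] for i in range(len(cleaned) - length + 1)]


def count_vectors(token_lists: Sequence[Sequence[str]]) -> List[List[float]]:
    """One vector per e-mail; each element is the count of one token."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append([float(counts[tok]) for tok in vocab])
    return vectors


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(
    vectors: Sequence[Sequence[float]], k: int, iterations: int = 20
) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's algorithm; returns (label per vector, centroids)."""
    centroids = [list(v) for v in vectors[:k]]  # naive seeding, an assumption
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    centroids: Sequence[Sequence[float]],
) -> Dict[int, int]:
    """Map each non-empty cluster to the index of its member nearest the centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: sq_dist(vectors[i], centroid))
    return reps
```

With these helpers, `make_tokens("Buy Now")` yields `["buy", "uyn", "yno", "now"]`. Only the token sets and totals of the e-mails indexed by `representatives` would then need to be stored for the online comparison, which is the apparent purpose of the clustering step.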

Current Assignee: Verizon Media, Inc. (Verizon Communications Inc.), Yahoo Assets LLC
Original Assignee: AOL Inc. (Apollo Global Management, Inc.)
Inventors: Nigam, Rakesh; Selvaraj, Senthil Kumar Sellaiya; Chandrasekharappa, Santhosh Baramasagara; Ekambaram, Sivakumar; Sargent, James; Moortgat, Jean-Jacques
Primary Examiner(s): Le, Miranda
Application Number: US13/179,863
Publication Number: US 20130018906A1
Time in Patent Office: 1,849 Days
Field of Search: 707/758, 707/999.009, 707/709, 707/737, 707/754, 707/554, 707/755
US Class Current: 1/1
CPC Class Codes:
G06F 16/90344   by using string matching te...
H04L 51/212   using filtering or selectiv...
H04L 51/216   Handling conversation histo...
