Systems and Methods for Providing a Spam Database and Identifying Spam Communications

US 20130018906A1
Filed: 07/11/2011
Published: 01/17/2013
Est. Priority Date: 07/11/2011
Status: Active Grant

First Claim

1. A computer-implemented method of identifying an incoming electronic communication as spam, the method comprising:

accessing the incoming electronic communication from a memory device;

creating a first set of tokens from the incoming electronic communication;

accessing a second set of tokens, wherein the second set of tokens corresponds to an electronic communication stored in a spam database;

determining a degree of similarity based on a count of unique tokens appearing in both the first set of tokens and the second set of tokens; and

identifying the incoming electronic communication as spam if the degree of similarity exceeds a predetermined threshold.
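
As a rough illustration of the steps in this claim, the following minimal sketch counts the unique tokens shared by an incoming message and each stored spam message, then compares that count with a threshold. The claim does not fix a tokenization scheme at this level, so the whitespace word tokens used here are an assumption, as are the hypothetical names `word_tokens`, `shared_unique_tokens`, and `is_spam`, the list-backed spam database, and the use of the raw count as the degree of similarity.

```python
def word_tokens(message: str) -> list[str]:
    """Split a message into lower-case word tokens (an assumed tokenization)."""
    return message.lower().split()


def shared_unique_tokens(first: list[str], second: list[str]) -> int:
    """Count the unique tokens that appear in both token sets."""
    return len(set(first) & set(second))


def is_spam(incoming: str, spam_db: list[str], threshold: int) -> bool:
    """Flag the message if any stored spam shares more than `threshold` unique tokens."""
    first_set = word_tokens(incoming)
    for stored in spam_db:
        second_set = word_tokens(stored)
        if shared_unique_tokens(first_set, second_set) > threshold:
            return True
    return False


if __name__ == "__main__":
    spam_db = ["win a free prize now"]
    print(is_spam("Win a FREE prize today", spam_db, threshold=3))
    print(is_spam("meeting moved to noon", spam_db, threshold=3))
```

The threshold is a plain integer here only because the sketch uses a raw count; a normalized score would take a fractional threshold instead.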

8 Assignments

0 Petitions

Abstract

Systems and methods are provided for identifying unsolicited or unwanted electronic communications, such as spam. The disclosed embodiments also encompass systems and methods for selecting content items from a content item database. Consistent with certain embodiments, computer-implemented systems and methods may use a clustering-based statistical content matching anti-spam algorithm to identify and filter spam. Such an anti-spam algorithm may be implemented to determine a degree of similarity between an incoming e-mail and a collection of one or more spam e-mails stored in a database. If the degree of similarity exceeds a predetermined threshold, the incoming e-mail may be classified as spam. Further, in accordance with other embodiments, systems and methods may be provided to determine a degree of similarity between a query or search string from a user and content items stored in a database. If the degree of similarity exceeds a predetermined threshold, the content item from the database may be identified as a content item that matches the query or search string provided by the user.

44 Citations

23 Claims

1. A computer-implemented method of identifying an incoming electronic communication as spam, the method comprising:
- accessing the incoming electronic communication from a memory device;
  
  creating a first set of tokens from the incoming electronic communication;
  
  accessing a second set of tokens, wherein the second set of tokens corresponds to an electronic communication stored in a spam database;
  
  determining a degree of similarity based on a count of unique tokens appearing in both the first set of tokens and the second set of tokens; and
  
  identifying the incoming electronic communication as spam if the degree of similarity exceeds a predetermined threshold.
- - 2. The computer-implemented method of claim 1, wherein spam comprises undesired, unsolicited, or duplicative electronic communication sent indiscriminately to a plurality of users.
  - 3. The computer-implemented method of claim 1, wherein the incoming electronic communication comprises an e-mail, and wherein creating the first set of tokens comprises:
    - processing the e-mail by changing an upper-case letter into a lower-case letter and removing a space;
      
      creating the first set of tokens from the processed e-mail, each token having a predetermined length and overlapping a previous token by including one or more characters from the previous token;
      
      calculating a first total as a number of tokens in the first set of tokens; and
      
      storing the first set of tokens and the first total.
  - 4. The computer-implemented method of claim 3, wherein determining the degree of similarity comprises computing an easy signature by performing the steps of:
    - determining a number of common tokens based on a minimum of a first count of each unique token in the first set of tokens and a second count of the same unique token in the second set of tokens;
      
      calculating a second total as a number of tokens in the second set of tokens; and
      
      determining the easy signature as a ratio of the number of common tokens and a sum of the first total and the second total.
  - 5. The computer-implemented method of claim 3, wherein determining the degree of similarity comprises computing a randomized easy signature by performing the steps of:
    - selecting a set of most frequent tokens from the second set of tokens;
      
      computing a second total as a number of tokens in the selected set of most frequent tokens;
      
      randomly selecting a sub-set of tokens from the selected set of most frequent tokens;
      
      determining a number of common tokens based on a minimum of a first count of each unique token in the first set of tokens and a second count of the same unique token in the randomly selected sub-set of tokens; and
      
      determining the randomized easy signature as a ratio of the number of common tokens and a sum of the first total and the second total.
  - 6. The computer-implemented method of claim 5, wherein determining the degree of similarity comprises computing an average randomized easy signature by performing the steps of:
    - computing a plurality of randomized easy signatures; and
      
      averaging the plurality of randomized easy signatures.
  - 7. The computer-implemented method of claim 3, wherein the processing step is applied to the subject of the e-mail.
  - 8. The computer-implemented method of claim 3, wherein the processing step is applied to the body of the e-mail.
  - 9. The computer-implemented method of claim 3, wherein the predetermined length of each token is three.
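
The tokenization of claim 3 and the easy signature of claim 4 can be sketched as follows. This is one reading of the claim language, not a definitive implementation: the function names `make_tokens` and `easy_signature` are hypothetical, the default token length of three comes from claim 9, and "removing a space" is assumed to mean stripping all whitespace. The number of common tokens is taken as the sum, over each unique token, of the smaller of its two counts.

```python
from collections import Counter


def make_tokens(text: str, length: int = 3) -> list[str]:
    """Lower-case the text, drop whitespace, and cut overlapping fixed-length tokens."""
    processed = "".join(text.lower().split())
    # Each token starts one character after the previous one, so consecutive
    # tokens overlap by (length - 1) characters.
    return [processed[i:i + length] for i in range(len(processed) - length + 1)]


def easy_signature(first_tokens: list[str], second_tokens: list[str]) -> float:
    """Ratio of common tokens to the sum of both token totals."""
    first_counts = Counter(first_tokens)
    second_counts = Counter(second_tokens)
    # For every unique token, take the smaller of its two counts.
    common = sum(min(count, second_counts[token])
                 for token, count in first_counts.items())
    total = len(first_tokens) + len(second_tokens)
    return common / total if total else 0.0


if __name__ == "__main__":
    incoming = make_tokens("Buy Now")
    stored = make_tokens("buy now")
    print(incoming)
    print(easy_signature(incoming, stored))
```

Because the denominator is the sum of both totals, this reading gives two identical messages a score of 0.5 rather than 1.0, so any threshold would be chosen on a 0 to 0.5 scale.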

10. A computer-implemented system of identifying an incoming e-mail as a spam e-mail, the system comprising:
- a spam database which stores a plurality of spam e-mails;
  
  a server which performs offline processing, the offline processing comprising:
  
  accessing a spam e-mail from the spam database;
  
  creating a first set of tokens from the spam e-mail;
  
  calculating a first total as a number of tokens in the first set of tokens; and
  
  storing the first set of tokens and the first total; and
  
  a client which performs online processing, the online processing comprising:
  
  receiving the incoming e-mail;
  
  creating a second set of tokens from the incoming e-mail;
  
  calculating a second total as a number of tokens in the second set of tokens;
  
  accessing the first set of tokens and the first total corresponding to one of the plurality of spam e-mails in the spam database;
  
  determining a number of common tokens based on a minimum of a first count of each unique token in the first set of tokens and a second count of the same unique token in the second set of tokens;
  
  computing an easy signature as a ratio of the number of common tokens and the sum of the first total and the second total; and
  
  designating the incoming e-mail as spam when the easy signature exceeds a predetermined threshold.
- - 11. The computer-implemented system of claim 10, wherein spam e-mail comprises undesired, unsolicited, or duplicative e-mail sent indiscriminately to a plurality of users.
  - 12. The computer-implemented system of claim 10, wherein creating a first set of tokens comprises:
    - processing the spam e-mail by changing an upper-case letter into a lower-case letter and removing a space; and
      
      creating the first set of tokens from the processed spam e-mail, each token having a predetermined length and overlapping a previous token by including one or more characters from the previous token.
  - 13. The computer-implemented system of claim 10, wherein the offline processing further comprises:
    - computing a vector wherein an element of the vector represents a first count of a token selected from the first set of tokens;
      
      applying a K-means algorithm to identify a plurality of clusters of e-mails in the spam database based on the computed vector;
      
      identifying a representative spam e-mail for each cluster; and
      
      storing the first set of tokens and the first total corresponding to the identified representative spam e-mail.
  - 14. The computer-implemented system of claim 10, wherein the offline processing further comprises:
    - storing a second count for each spam e-mail stored in the spam database, wherein the second count represents a number of times the incoming e-mail was identified as spam based on the easy signature computed using the first set of tokens and the first total corresponding to the spam e-mail;
      
      computing an average count for a predetermined time period; and
      
      removing the spam e-mail from the spam database when the average count is less than a predetermined threshold.
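
The clustering step of claim 13 can be sketched as below. The claim only says that a K-means algorithm is applied to token-count vectors and that a representative spam e-mail is identified for each cluster. Everything else here is an assumption made for illustration: the hypothetical names `count_vector`, `k_means`, and `representatives`, squared Euclidean distance, a fixed iteration count, seeding the centroids with the first `k` vectors, and choosing the cluster member nearest the centroid as the representative.

```python
from collections import Counter


def count_vector(tokens: list[str], vocabulary: list[str]) -> list[float]:
    """One element per vocabulary token, holding that token's count."""
    counts = Counter(tokens)
    return [float(counts[token]) for token in vocabulary]


def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors: list[list[float]], k: int, iterations: int = 10):
    """Plain K-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        for i, vector in enumerate(vectors):
            distances = [squared_distance(vector, c) for c in centroids]
            assignment[i] = distances.index(min(distances))
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def representatives(vectors, assignment, centroids) -> dict[int, int]:
    """Map each non-empty cluster to the index of the member nearest its centroid."""
    chosen = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            chosen[c] = min(members,
                            key=lambda i: squared_distance(vectors[i], centroid))
    return chosen
```

Only the token set and token total of each representative would then need to be stored for the online comparison, which is the point of clustering the database first.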

15. A computer program product comprising executable instructions tangibly embodied in a non-transitory computer-readable medium which, when executed by a processor, perform a method of identifying an electronic communication as spam, the method comprising:
- accessing the electronic communication from a memory device;
  
  creating a first set of tokens from the electronic communication;
  
  accessing a second set of tokens, corresponding to spam communication stored in a spam database;
  
  determining a degree of similarity based on a count of unique tokens appearing in both the first set of tokens and the second set of tokens; and
  
  identifying the electronic communication as spam if the degree of similarity exceeds a predetermined threshold.
- - 16. The computer program product of claim 15, wherein spam comprises undesired, unsolicited, or duplicative electronic communication sent indiscriminately to a plurality of users.
  - 17. The computer program product of claim 15, wherein creating the first set of tokens comprises:
    - processing the electronic communication by changing an upper-case letter into a lower-case letter and removing a space;
      
      creating the first set of tokens from the processed electronic communication, each token having a predetermined length and overlapping a previous token by including one or more characters from the previous token;
      
      calculating a first total as a number of tokens in the first set of tokens; and
      
      storing the first set of tokens and the first total.
  - 18. The computer program product of claim 15, wherein determining the degree of similarity comprises computing an easy signature by performing the steps of:
    - accessing a second total from the memory device, wherein the second total represents a number of tokens in the second set of tokens;
      
      determining a number of common tokens based on a minimum of a first count of each unique token in the first set of tokens and a second count of the same unique token in the second set of tokens; and
      
      determining the easy signature as a ratio of the number of common tokens and a sum of the first total and the second total.
  - 19. The computer program product of claim 15, wherein determining the degree of similarity comprises computing a randomized easy signature by performing the steps of:
    - selecting a set of most frequent tokens from the second set of tokens;
      
      computing a second total as a number of tokens in the selected set of most frequent tokens;
      
      randomly selecting a sub-set of tokens from the selected set of most frequent tokens;
      
      determining a number of common tokens based on a minimum of a first count of each unique token in the first set of tokens and a second count of the same unique token in the randomly selected sub-set of tokens; and
      
      determining the randomized easy signature as a ratio of the number of common tokens and a sum of the first total and the second total.
  - 20. The computer program product of claim 19, wherein determining the degree of similarity comprises computing an average randomized easy signature by performing the steps of:
    - computing a plurality of randomized easy signatures; and
      
      averaging the plurality of randomized easy signatures.
  - 21. The computer program product of claim 17, wherein the processing step is applied to the subject of the electronic communication.
  - 22. The computer program product of claim 17, wherein the processing step is applied to the body of the electronic communication.
  - 23. The computer program product of claim 17, wherein the electronic communication comprises at least one of an e-mail, an instant message, a chat message, a text message, an SMS message, and a paging communication.
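
The randomized easy signature of claims 5 and 19, and its averaged form in claims 6 and 20, can be sketched as follows. The claims leave several details open, so this is one possible reading under stated assumptions: the "set of most frequent tokens" is taken to be the `top_k` most frequent distinct tokens with their counts, the second total is the sum of those counts, and the random sub-set is drawn over distinct tokens. The parameters `top_k`, `sample_size`, `rounds`, and `seed`, and both function names, are hypothetical.

```python
import random
from collections import Counter


def randomized_easy_signature(first_tokens: list[str], second_tokens: list[str],
                              top_k: int, sample_size: int,
                              rng: random.Random) -> float:
    """Easy signature computed against a random sub-set of the most frequent tokens."""
    first_counts = Counter(first_tokens)
    # (token, count) pairs for the most frequent tokens of the stored message.
    frequent = Counter(second_tokens).most_common(top_k)
    second_total = sum(count for _, count in frequent)
    subset = rng.sample(frequent, min(sample_size, len(frequent)))
    # For each sampled token, take the smaller of its two counts.
    common = sum(min(first_counts[token], count) for token, count in subset)
    denominator = len(first_tokens) + second_total
    return common / denominator if denominator else 0.0


def average_randomized_easy_signature(first_tokens: list[str],
                                      second_tokens: list[str],
                                      top_k: int, sample_size: int,
                                      rounds: int, seed: int = 0) -> float:
    """Average several randomized easy signatures drawn with one seeded generator."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    rng = random.Random(seed)
    scores = [randomized_easy_signature(first_tokens, second_tokens,
                                        top_k, sample_size, rng)
              for _ in range(rounds)]
    return sum(scores) / len(scores)
```

Averaging over several draws smooths out the variance introduced by sampling, at the cost of repeating the comparison `rounds` times.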

Current Assignee
Verizon Media, Inc. (Verizon Communications Inc.), Yahoo Assets LLC
Original Assignee
AOL Inc. (Apollo Global Management, Inc.)
Inventors
Chandrasekharappa, Santhosh Baramasagara; Ekambaram, Sivakumar; Sargent, James; Moortgat, Jean-Jacques; Nigam, Rakesh; Selvaraj, Senthil Kumar Sellaiya

Granted Patent

US 9,407,463 B2
US Class Current

707/758
CPC Class Codes

G06F 16/90344   by using string matching te...

H04L 51/212   using filtering or selectiv...

H04L 51/216   Handling conversation histo...
