Method and system for classifying electronic text messages and spam messages

US 7,836,061 B1
Filed: 12/29/2007
Issued: 11/16/2010
Est. Priority Date: 12/29/2007
Status: Active Grant

Abstract

Techniques for classifying electronic text messages include creating a hierarchical list of message categories, composing databases of key terms and sample phrases for each of such categories, and, based on a number and features of the key terms detected in an analyzed text message, determining if the text message is associated with at least one message category of interest. Variants of the key terms or sample phrases can be produced using fuzzy text objects generation algorithms. Weight factors for the key terms and similarity scores of a text message compared to previously identified sample messages for a particular message category are calculated based on properties of the key terms detected in the text message, such as a frequency of use, location, or appearance in the text message, or a number of words in the respective key terms.
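The weight-factor idea in the abstract can be illustrated with a minimal sketch. The patent text here does not give a formula, so everything below is an assumption made for illustration: the `KeyTerm` class, the `base_weight` field, and the way frequency, location, and word count are multiplied together are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyTerm:
    text: str           # one or more space-separated words
    base_weight: float  # hypothetical per-term weight, not from the patent


def term_weight(term, tokens):
    """Weight of one key term in a tokenized message (illustrative formula)."""
    words = term.text.split()
    n = len(words)
    # Start indices at which the whole key term occurs in the message.
    positions = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]
    if not positions:
        return 0.0
    frequency = len(positions)
    # Assumed location factor: 1.0 at the start, falling toward 0.5 at the end.
    location = 1.0 - 0.5 * positions[0] / len(tokens)
    # Assumed combination: frequency, location, and number of words multiply.
    return term.base_weight * frequency * location * n


def message_weight(terms, tokens, normalize=True):
    """Sum, or length-normalized sum, of the key-term weights."""
    total = sum(term_weight(t, tokens) for t in terms)
    return total / len(tokens) if normalize and tokens else total


def exceeds(weight, threshold):
    """A message qualifies for a category when its weight exceeds the threshold."""
    return weight > threshold
```

A message would then be associated with a category when `exceeds(message_weight(terms, tokens), threshold)` is true for that category's key terms; the threshold value itself would have to be chosen per category.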


20 Claims

1. A method of identifying electronic text messages as spam, the method comprising:
- (a) creating a hierarchic list of spam message categories and sub-categories, wherein the hierarchic list defines properties of key terms within the spam message categories and sub-categories;
  
  (b) composing a database of the key terms and a database of sample messages in a human language for each of the spam message categories and message templates for sub-categories, wherein the key terms are identified using human language-specific variants of a combination of separate words in a particular human language;
  
  (c) defining at least one spam message category from the hierarchic list of the spam message categories for which (i) a weight factor of a morphologically transformed text message exceeds a first pre-determined threshold or (ii) a similarity score of the text message exceeds a second pre-determined threshold, wherein the weight factor value and the similarity score value are compared against the respective threshold values using a precise matching comparison; and
  
  (d) associating with the at least one spam message category the text message having (i) the weight factor value exceeding the first threshold or (ii) the similarity score value exceeding the second threshold, wherein the properties of the key terms within the spam message categories are any of;
  
  a frequency of occurrence of the key term within the message;
  
  a location of the key term within the message; and
  
  a number of separate words in the key term.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 18, 19, 20):
  - 2. The method of claim 1, wherein the text messages are electronically acquired text documents or portions thereof.
  - 3. The method of claim 1, wherein the text messages are electronic mail (email) messages or portions thereof.
  - 4. The method of claim 1, wherein each of the key terms comprises at least one separate word in a particular human language.
  - 5. The method of claim 1, wherein step (b) further comprises producing variants of the key terms or sample spam messages using at least one fuzzy text objects generation algorithm.
  - 6. The method of claim 1, wherein step (c) comprises:
    - identifying human language(s) of the text message; and
      
      ignoring, in the text message, pre-selected separate words in the identified human language.
  - 7. The method of claim 6, further comprising at least one of:
    - calculating the weight factor of the text message as a sum or a normalized sum of weight factors of the key terms identified in the text message based on the identified human language; and
      
      calculating the similarity score of the text message as a sum or a normalized sum of similarity scores relative to a plurality of previously identified sample messages in the identified human language.
  - 8. The method of claim 6, further comprising:
    - determining the weight factors of the key terms based on their frequency of use or location in the text message, or a number of words in a key term; and
      
      identifying characteristic key terms that correspond to a particular message category from the hierarchic list of spam message categories.
  - 9. The method of claim 1, wherein step (d) further comprises isolating or deleting the text message or portion thereof associated with a pre-selected category from the hierarchic list of spam message categories.
  - 10. The method of claim 1, wherein morphological transforming of the text message reduces the words to their dictionary variations and removes noise words.
  - 11. The method of claim 1, further comprising identifying a primary language of the text message, and normalizing the letters in the text message to their English encoding.
  - 12. A non-transitory computer useable recording medium storing computer executable program logic that, when executed by a processor, causes a computer system to perform the steps of the method of claim 1.
  - 18. The non-transitory computer useable recording medium of claim 12, wherein the recording medium contains software code for defining the at least one message category and executes the steps of:
    - identifying a human language(s) of the text message;
      
      ignoring in the text message pre-selected words, symbols, or combinations thereof;
      
      calculating the weight factor of the text message as a sum or a normalized sum of weight factors of the key terms identified in the text message;
      
      calculating the similarity score of the text message as a sum or a normalized sum of similarity scores of the sample messages identified as similar to the text message; and
      
      isolating or deleting the text message or portion thereof associated with a pre-selected category from the hierarchic list of message categories.
  - 19. The non-transitory computer useable recording medium of claim 18, wherein the weight factors of the key terms are determined based on their frequency of use, location, or appearance in the text message, or a number of words in a key term, a sample phrase, or the text message.
  - 20. The method of claim 1, wherein step (c) comprises:
    - identifying human language(s) of the text message; and
      
      ignoring, in the text message pre-selected symbols, or combinations of symbols.
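The preprocessing described in claims 10 and 11 (reducing words to dictionary forms, removing noise words, and normalizing letters) could look roughly like the sketch below. It is only a stand-in: the look-alike map, the noise-word list, and the crude suffix stripping are hypothetical choices, and a real system would presumably use per-language dictionaries rather than this naive `stem` function.

```python
import re

# Hypothetical look-alike map: Cyrillic letters replaced by similar Latin ones.
LOOKALIKES = str.maketrans(
    {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "х": "x"}
)
# Hypothetical noise words for English; a real list would be per language.
NOISE_WORDS = {"the", "a", "an", "of", "to", "and"}
# Crude suffix stripping as a stand-in for dictionary-based lemmatization.
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def transform(text):
    """Lowercase, normalize look-alike letters, drop noise words, then stem."""
    text = text.lower().translate(LOOKALIKES)
    # Only Latin letters and digits survive; unmapped characters split words.
    words = re.findall(r"[a-z0-9]+", text)
    return [stem(w) for w in words if w not in NOISE_WORDS]
```

For example, `transform("Buying the CHEAP pills")` yields `["buy", "cheap", "pill"]`, and a word spelled with mixed Cyrillic and Latin letters is mapped to its plain Latin form before matching.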

13. A system for classifying electronic text messages, the system comprising:
- a processor; and
  
  a non-transitory memory device storing instructions of an operating system and an application program that, when executed by the processor, is adapted to provide;
  
  (a) a hierarchic list of spam message categories and sub-categories, wherein the hierarchic list defines properties of key terms within the spam message categories;
  
  (b) a database of the key terms and a database of sample messages in a human language for each of the spam message categories and message templates for sub-categories, wherein the key terms are identified using human language-specific variants of a combination of separate words in a particular human language;
  
  (c) wherein the system defines at least one spam message category from the hierarchic list of the spam message categories for which (i) a weight factor of a morphologically transformed text message exceeds a first pre-determined threshold or (ii) a similarity score of the text message exceeds a second pre-determined threshold, wherein the weight factor value and the similarity score value are compared against the respective threshold values using precise matching comparison; and
  
  (d) the at least one spam message category is associated with the text message having (i) the weight factor value exceeding the first threshold or (ii) the similarity score value exceeding the second threshold, wherein the properties of the key terms within the spam message categories are any of;
  
  a frequency of occurrence of the key term within the message;
  
  a location of the key term within the message; and
  
  a number of separate words in the key term.
- Dependent claims (14, 15, 16, 17):
  - 14. The system of claim 13, wherein the text messages are electronically acquired text documents, electronic mail (email) messages, or portions thereof.
  - 15. The system of claim 13, wherein at least one of the databases comprise variants of the key terms or sample phrases, the variants produced using at least one fuzzy text objects generation algorithm.
  - 16. The system of claim 13, wherein:
    - the weight factor of the text message is a sum or a normalized sum of weight factors of the key terms identified in the text message; and
      
      the similarity score of the text message is a sum or a normalized sum of similarity scores relative to a plurality of previously identified sample messages.
  - 17. The system of claim 16, wherein:
    - the weight factors of the key terms or the similarity scores of the sample phrases are determined based on their frequency of use, location, or appearance in the text message, or a number of words in a key term, a sample phrase, or the text message; and
      
      specific key terms or sample phrases uniquely identify a particular message category from the hierarchic list of message categories.
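Claim 16 describes the similarity score as a sum or normalized sum over previously identified sample messages, and claim 13 associates a category when either threshold is exceeded. The sketch below shows one way this could fit together. The claims do not name a similarity measure, so Jaccard overlap of token sets is an assumption, as are the function names and the use of `(category, sub-category)` tuples as keys.

```python
def jaccard(a, b):
    """Jaccard overlap of two token lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_score(tokens, samples, normalize=True):
    """Sum, or normalized sum, of similarities to a category's sample messages."""
    total = sum(jaccard(tokens, s) for s in samples)
    return total / len(samples) if normalize and samples else total


def classify(tokens, samples_by_category, weights_by_category,
             weight_threshold, similarity_threshold):
    """Return every category for which either threshold is exceeded.

    samples_by_category maps a hypothetical (category, sub-category) tuple to
    tokenized sample messages; weights_by_category holds weight factors that
    were already computed for the message, one per category.
    """
    matched = []
    for category, samples in samples_by_category.items():
        weight = weights_by_category.get(category, 0.0)
        score = similarity_score(tokens, samples)
        if weight > weight_threshold or score > similarity_threshold:
            matched.append(category)
    return matched
```

Keeping the two tests joined by `or` mirrors the claim language: a message with no strong key terms can still be assigned to a category if it closely resembles known samples, and vice versa.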


Current Assignee: Kaspersky Lab ZAO
Original Assignee: Kaspersky Lab ZAO
Inventors: Zorky, Kirill P.
Primary Examiner(s): Wong, Don
Assistant Examiner(s): Nguyen, Kim T
Application Number: US11/967,144
Time in Patent Office: 1,053 Days
Field of Search: 707/749, 707/750
US Class Current: 707/749
CPC Class Codes: G06F 16/353 into predefined classes
