Identifying undesired email messages having attachments
Abstract
Several embodiments in the present disclosure provide for tokenizing portions of an email message that previously were not tokenized. Tokenizing these portions generates tokens that are representative of them. The generated tokens are used to determine whether or not the email message is spam. In some embodiments, the tokenized portions may include attachments in email messages. In other embodiments, the tokenized portions may include a simple mail transfer protocol (SMTP) email address and a domain name corresponding to the SMTP email address.
187 Citations
38 Claims
-
1. A method comprising the steps of:
-
(A) receiving an email message from a simple mail transfer protocol (SMTP) server, the email message comprising:
(A1) a 32-bit string indicative of the length of the email message;
(A2) a text body;
(A3) an SMTP email address;
(A4) a domain name corresponding to the SMTP email address;
(A5) an attachment;
(B) tokenizing the text body to generate tokens representative of words in the text;
(C) tokenizing the SMTP email address to generate a token representative of the SMTP email address;
(D) tokenizing the domain name to generate a token that is representative of the domain name;
(E) tokenizing the attachment to generate a token that is representative of the attachment, the tokenizing step comprising the steps of:
(E1) generating a 128-bit MD5 hash of the attachment;
(E2) appending the 32-bit string to the generated MD5 hash to produce a 160-bit number; and
(E3) UUencoding the 160-bit number to generate the token representative of the attachment;
(F) determining a probability value for each of the generated tokens;
(G) selecting a predefined number of interesting tokens, the interesting tokens being the generated tokens having the greatest non-neutral probability values;
(H) performing a Bayesian analysis on the selected interesting tokens to generate a spam probability; and
(I) categorizing the email message as a function of the generated spam probability.
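Steps (F) through (I) can be illustrated with a short sketch. The claim does not state a combining formula, so the code below assumes a naive-Bayes combination that treats tokens as independent, reads "greatest non-neutral probability values" as "farthest from a neutral value of 0.5", and uses a hypothetical token count of 15 and threshold of 0.9. All function names and constants are illustrative assumptions, not taken from the patent.

```python
from math import prod

NEUTRAL = 0.5  # assumed neutral probability; not specified in the claim


def select_interesting(token_probs, count=15):
    """Step (G): keep up to `count` non-neutral tokens farthest from NEUTRAL."""
    candidates = [(t, p) for t, p in token_probs.items() if p != NEUTRAL]
    candidates.sort(key=lambda tp: abs(tp[1] - NEUTRAL), reverse=True)
    return dict(candidates[:count])


def spam_probability(probs):
    """Step (H): combine token probabilities, assuming independence."""
    # Clamp so that a single 0.0 or 1.0 cannot make the denominator zero.
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


def categorize(score, threshold=0.9):
    """Step (I): map the spam probability to a category."""
    return "spam" if score >= threshold else "not spam"


# Step (F) is assumed to have produced this per-token table already.
token_probs = {"viagra": 0.99, "hello": 0.5, "meeting": 0.1, "free": 0.8}
interesting = select_interesting(token_probs, count=2)
score = spam_probability(interesting.values())
label = categorize(score)
```

Only the two most extreme tokens contribute to the score here, so the neutral token "hello" has no influence on the outcome.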
-
-
2. A method comprising the steps of:
-
receiving an email message comprising a text body having non-displaying characters;
removing the non-displaying characters from the text body to generate a displayable text body; and
tokenizing the words in the displayable text body to generate tokens representative of the displayable text body.
Dependent claims: 3, 4, 5
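A minimal sketch of the method of claim 2 follows. The claim does not define "non-displaying characters", so the code assumes they are HTML comments, HTML tags, and Unicode control or format characters such as zero-width spaces. The regular expressions and function names are illustrative assumptions.

```python
import re
import unicodedata

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)  # HTML comments
TAG = re.compile(r"<[^>]+>")                    # remaining HTML tags
WORD = re.compile(r"[A-Za-z0-9$'-]+")           # assumed word pattern


def displayable_text(body):
    """Strip content that a mail client would not render as visible text."""
    body = COMMENT.sub("", body)
    body = TAG.sub("", body)
    # Drop control (Cc) and format (Cf) characters, but keep whitespace.
    return "".join(
        ch for ch in body
        if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def tokenize_words(body):
    """Tokenize the words of the displayable text body."""
    return [w.lower() for w in WORD.findall(displayable_text(body))]


tokens = tokenize_words("Buy vi<!-- x -->ag<b></b>ra\u200b now")
```

Removing the hidden markup before tokenizing means a word split by invisible content is seen as the single word a reader would see.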
-
-
6. A method comprising the steps of:
-
receiving an email message comprising a text body, an SMTP email address, and a domain name corresponding to the SMTP email address;
tokenizing the SMTP email address to generate a token representative of the SMTP email address;
tokenizing the domain name to generate a token representative of the domain name; and
determining a spam probability from the generated tokens.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14
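The address and domain tokenizing of claim 6 might look like the sketch below. It assumes the address is taken from a header value such as From, and it prefixes each token with `addr:` or `domain:` so that it cannot collide with a body word. The prefixes and the function name are hypothetical, not drawn from the patent.

```python
from email.utils import parseaddr


def address_tokens(header_value):
    """Return one token for the SMTP email address and one for its domain."""
    _, addr = parseaddr(header_value)
    addr = addr.lower()
    if "@" not in addr:
        return []  # no usable address found
    domain = addr.rsplit("@", 1)[1]
    return [f"addr:{addr}", f"domain:{domain}"]


tokens = address_tokens("Alice <Alice@Example.COM>")
```

Each token can then be looked up in the same probability table as the body tokens when the spam probability is determined.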
-
-
15. A method comprising the steps of:
-
receiving an email message comprising an attachment;
tokenizing the attachment to generate a token representative of the attachment; and
determining a spam probability from the generated token.
Dependent claims: 16, 17, 18, 19, 20, 21, 22
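One way to tokenize an attachment is spelled out in steps (E1) through (E3) of claim 1: a 128-bit MD5 hash, followed by the 32-bit message length, then UUencoding of the resulting 160-bit number. The sketch below follows those steps. The big-endian byte order of the length field and the function name are assumptions, since the claim does not specify them.

```python
import binascii
import hashlib
import struct


def attachment_token(attachment: bytes, message_length: int) -> str:
    """Build a token for an attachment per steps (E1)-(E3) of claim 1."""
    digest = hashlib.md5(attachment).digest()                   # (E1) 16 bytes
    length_field = struct.pack(">I", message_length & 0xFFFFFFFF)  # 32 bits
    number = digest + length_field                              # (E2) 160 bits
    # (E3) b2a_uu emits one uuencoded line ending in a newline; drop it.
    return binascii.b2a_uu(number).decode("ascii").rstrip("\n")


token = attachment_token(b"hello", 1234)
```

Identical attachments in messages of the same length yield the same token, so the token can be counted like any word token.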
-
-
23. A system comprising:
-
email receive logic configured to receive an email message comprising an SMTP email address and a domain name corresponding to the SMTP email address;
tokenize logic configured to tokenize the SMTP email address to generate a token representative of the SMTP email address;
tokenize logic configured to tokenize the domain name to generate a token representative of the domain name; and
analysis logic configured to determine a spam probability from the generated tokens.
-
-
24. A system comprising:
-
means for receiving an email message comprising an SMTP email address and a domain name corresponding to the SMTP email address;
means for tokenizing the SMTP email address to generate a token representative of the SMTP email address;
means for tokenizing the domain name to generate a token representative of the domain name; and
means for determining a spam probability from the generated tokens.
-
-
25. A computer-readable medium comprising:
-
computer-readable code adapted to instruct a programmable device to receive an email message comprising an SMTP email address and a domain name corresponding to the SMTP email address;
computer-readable code adapted to instruct a programmable device to tokenize the SMTP email address to generate a token representative of the SMTP email address;
computer-readable code adapted to instruct a programmable device to tokenize the domain name to generate a token representative of the domain name; and
computer-readable code adapted to instruct a programmable device to determine a spam probability from the generated tokens.
Dependent claims: 26, 27, 28, 29
-
-
30. A system comprising:
-
email receive logic configured to receive an email message comprising an attachment;
tokenize logic configured to tokenize the attachment to generate a token representative of the attachment; and
analysis logic configured to determine a spam probability from the generated token.
-
-
31. A system comprising:
-
means for receiving an email message comprising an attachment;
means for tokenizing the attachment to generate a token representative of the attachment; and
means for determining a spam probability from the generated token.
-
-
32. A computer-readable medium comprising:
-
computer-readable code adapted to instruct a programmable device to receive an email message comprising an attachment;
computer-readable code adapted to instruct a programmable device to tokenize the attachment to generate a token representative of the attachment; and
computer-readable code adapted to instruct a programmable device to determine a spam probability from the generated token.
Dependent claims: 33, 34, 35, 36, 37, 38
-
Specification