(More) advanced spam detection features
Abstract
The present invention involves a system and method that facilitate extracting data from messages for spam filtering. The extracted data can be in the form of features, which can be employed in connection with machine learning systems to build improved filters. Data associated with the subject line, timestamps, and the message body can be extracted and employed to generate one or more features. In particular, subject lines and message bodies can be examined for consecutive, repeating characters, blobs, the association or distance between such characters, blobs and non-blob portions of the message. The values or counts obtained can be broken down into one or more ranges corresponding to a degree of spaminess. Presence and type of attachments to messages, percentage of non-white-space and non-numeric characters of a message, and determining message delivery times can be used to identify spam. A time-based delta can be computed to facilitate determining the delivery time.
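The range idea in the abstract can be illustrated with a minimal Python sketch. It assumes the "count" is the length of the longest run of one repeated character in the subject line, which is only one possible reading. The names `longest_repeat_run`, `repeat_range_feature`, and the boundaries in `REPEAT_RANGES` are hypothetical and do not come from the patent.

```python
from itertools import groupby

# Hypothetical range boundaries (inclusive); the text does not give actual values.
REPEAT_RANGES = [(1, 1), (2, 3), (4, 7), (8, None)]


def longest_repeat_run(text):
    """Return the length of the longest run of one character repeated back to back."""
    return max((sum(1 for _ in run) for _, run in groupby(text)), default=0)


def repeat_range_feature(count, ranges=REPEAT_RANGES):
    """Map a repeat count onto the name of the range that contains it."""
    for low, high in ranges:
        if count >= low and (high is None or count <= high):
            return f"repeat_run_{low}_{'plus' if high is None else high}"
    return "repeat_run_0"  # empty subject: no characters at all


subject = "FREE!!!!! offer"
run = longest_repeat_run(subject)       # 5, from "!!!!!"
print(run, repeat_range_feature(run))   # 5 repeat_run_4_7
```

Emitting a range name rather than the raw count gives a learning system a small set of discrete features, each of which can receive its own weight.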
16 Claims
1. A system that facilitates extracting data in connection with spam processing, comprising:
a processing unit; and
a memory for storing computer-executable instructions that when executed by the processing unit executes;
a component comprising software that receives a message and extracts a set of features associated with some part, content or content type of a message; and
an analysis component comprising software that examines (1) consecutiveness of characters within a subject line of the message, wherein the analysis component establishes ranges of consecutive, repeating characters, the ranges corresponding to varying degrees of spaminess, whereby messages can be sorted by their respective individual count of consecutive repeating characters and, (2) a content type of the message for spam in connection with building a filter, wherein the content type describes a type of data contained within a body of the message, the content type being case-sensitive and comprising a primary content-type, a secondary-content type, or a combination thereof, the primary content-type and the secondary-content type comprising at least one of a text, a multipart, a message, an image, an audio, a video, or an application, wherein the analysis component compares the content type of the message to stored content types of a plurality of other messages to facilitate determining whether the message is spam.
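The case-sensitive content-type comparison recited in claim 1 could look roughly like the following Python sketch. The stored data, the function names `content_type_feature` and `spam_fraction`, and the use of a simple spam fraction as the comparison are all assumptions made for illustration; the claim does not specify how the comparison is performed.

```python
def content_type_feature(header_value):
    """Return 'primary/secondary' exactly as written (case preserved), without parameters."""
    return header_value.split(";", 1)[0].strip()


# Hypothetical stored content types of other messages: (content type, was it spam?)
STORED = [
    ("text/plain", False),
    ("text/html", True),
    ("text/html", False),
    ("TEXT/HTML", True),
    ("TEXT/HTML", True),
]


def spam_fraction(content_type, stored=STORED):
    """Fraction of stored messages with this exact content type that were spam."""
    labels = [is_spam for stored_type, is_spam in stored if stored_type == content_type]
    if not labels:
        return None  # content type never seen before
    return sum(labels) / len(labels)


feature = content_type_feature("TEXT/HTML; charset=us-ascii")
print(feature, spam_fraction(feature))   # TEXT/HTML 1.0
print(spam_fraction("text/html"))        # 0.5
```

Because the match is exact, "TEXT/HTML" and "text/html" are tracked as different features, so an unusual capitalization can carry its own evidence.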
14. A method for evaluating spam as a function of message content, comprising:
employing a processor executing computer readable instructions stored on a computer readable storage medium to implement the following;
parsing a message to extract a set of features associated with a content type of the message, wherein the content type describes the type of data contained within a body of the message, the content type being case-sensitive and comprising a primary content-type, a secondary content-type, or a combination thereof;
examining the extracted set of features to identify a frequency of consecutiveness of repeating characters within a subject line of the message and to identify a distance of white-space characters between at least one alpha-numeric character and a blob comprising a random sequence of characters, numbers, punctuation, or a combination thereof to classify the message as spam or not spam;
establishing ranges of consecutive, repeating characters, the ranges correspond to various degrees of spaminess, wherein each range comprises a number range of frequencies of the consecutive, repeating characters within the subject line of the message;
employing the ranges to sort the message by the frequency of consecutive repeating characters within the subject line of the message;
comparing the content type of the message to stored content types of a plurality of other messages to facilitate determining whether the message is spam; and
processing the message as a function of the classification.
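The white-space distance recited in claim 14 can be sketched in Python as follows. The sketch assumes the blob, if present, is the last white-space-delimited token of the subject line and does not try to decide whether that token is actually random. The function name `blob_gap` and this simplification are hypothetical.

```python
import re

# An alpha-numeric character, then a run of white space, then a final token.
_TRAILING_TOKEN = re.compile(r"[A-Za-z0-9](\s+)(\S+)\s*$")


def blob_gap(subject):
    """Return (white-space count, trailing token), or None if the pattern is absent."""
    match = _TRAILING_TOKEN.search(subject)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


subject = "Cheap meds" + " " * 10 + "xq7z!k"
print(blob_gap(subject))   # (10, 'xq7z!k')
```

A large gap before the final token suggests the token was pushed out of view, and the gap size can be bucketed into ranges in the same way as the repeat counts.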
16. One or more computer-readable storage devices having computer-executable instructions embodied thereon that, when executed, perform a method for facilitating extracting data in connection with spam processing, comprising:
receiving a message;
determining a particular portion of a body of the message to analyze;
extracting a set of features associated with some part, content or content type of the message;
examining consecutiveness of characters within a subject line of the message and identifying a distance comprising a number of white-space characters between at least one alpha-numeric character and a blob comprising a random sequence of characters, numbers, punctuation, or a combination thereof;
examining a content type of the message for spam in connection with building a filter, wherein the content type describes data contained within the body of the message, the content type being case-sensitive to capture a variation of a primary content-type, a secondary-content type, or a combination thereof, each of the primary content-type and the secondary-content type comprising one of a text, a multipart, a message, an image, an audio, a video, or an application;
comparing the content type of the message to stored content types of a plurality of other messages to facilitate determining whether the message is spam;
determining a percentage of white space to non-white space in the message and a percentage of non-white space and nonnumeric characters that are not letters in the message;
calculating a delivery time for the message using a first timestamp associated with origination of the message and a second timestamp associated with receipt of the message; and
categorizing the delivery time into one of a plurality of ranges comprising a range of amounts of time for delivering messages, the ranges corresponding to various degrees of spaminess.
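The character percentages and the time-based delta recited in claim 16 can be illustrated with a short Python sketch. It assumes both timestamps are RFC 2822 date strings that include a time zone, reads "percentage of white space to non-white space" as a ratio of the two counts, and uses made-up range boundaries. The function names and `DELIVERY_RANGES` are hypothetical.

```python
from email.utils import parsedate_to_datetime

# Hypothetical delivery-time ranges in seconds: low inclusive, high exclusive.
DELIVERY_RANGES = [(0, 60), (60, 3600), (3600, None)]


def char_percentages(text):
    """Return (white space as % of non-white space, symbols as % of all characters)."""
    white = sum(c.isspace() for c in text)
    non_white = len(text) - white
    # Symbols: neither white space, nor digits, nor letters.
    symbols = sum(not (c.isspace() or c.isdigit() or c.isalpha()) for c in text)
    white_pct = 100 * white / non_white if non_white else 0.0
    symbol_pct = 100 * symbols / len(text) if text else 0.0
    return white_pct, symbol_pct


def delivery_seconds(origin_header, receipt_header):
    """Time-based delta between the origination and receipt timestamps."""
    origin = parsedate_to_datetime(origin_header)
    receipt = parsedate_to_datetime(receipt_header)
    return (receipt - origin).total_seconds()


def delivery_range_feature(seconds, ranges=DELIVERY_RANGES):
    """Map a delivery time onto the name of the range that contains it."""
    if seconds < 0:
        return "delivery_negative"  # receipt stamped before origination
    for low, high in ranges:
        if seconds >= low and (high is None or seconds < high):
            return f"delivery_{low}_{'plus' if high is None else high}"


delta = delivery_seconds("Mon, 01 Mar 2004 10:00:00 -0800",
                         "Mon, 01 Mar 2004 18:05:00 +0000")
print(delta, delivery_range_feature(delta))   # 300.0 delivery_60_3600
print(char_percentages("ab !!"))              # (25.0, 40.0)
```

Comparing time-zone-aware datetimes keeps the delta correct when the two timestamps use different offsets; a negative delta is kept as its own feature rather than discarded.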
Specification