(More) advanced spam detection features
Abstract
The present invention involves a system and method that facilitate extracting data from messages for spam filtering. The extracted data can be in the form of features, which can be employed in connection with machine learning systems to build improved filters. Data associated with the subject line, timestamps, and the message body can be extracted and employed to generate one or more features. In particular, subject lines and message bodies can be examined for consecutive, repeating characters, blobs, the association or distance between such characters, blobs and non-blob portions of the message. The values or counts obtained can be broken down into one or more ranges corresponding to a degree of spaminess. Presence and type of attachments to messages, percentage of non-white-space and non-numeric characters of a message, and determining message delivery times can be used to identify spam. A time-based delta can be computed to facilitate determining the delivery time.
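The consecutive-character feature and the range breakdown described in the abstract can be illustrated with a minimal Python sketch. The function names and the range boundaries below are hypothetical choices made for illustration; the abstract does not specify concrete thresholds.

```python
from itertools import groupby

# Hypothetical upper bounds for the ranges; the source does not give values.
RANGE_UPPER_BOUNDS = (1, 3, 6)


def longest_repeat_run(text: str) -> int:
    """Return the length of the longest run of one character repeated
    consecutively (whitespace runs count too). Empty text gives 0."""
    return max((len(list(group)) for _, group in groupby(text)), default=0)


def run_length_bucket(run: int) -> int:
    """Map a run length to the index of the first range whose upper bound
    it does not exceed; longer runs fall into a final open-ended range."""
    for index, bound in enumerate(RANGE_UPPER_BOUNDS):
        if run <= bound:
            return index
    return len(RANGE_UPPER_BOUNDS)


subject = "FREE!!!!! offer"
run = longest_repeat_run(subject)   # the five '!' characters form the longest run
bucket = run_length_bucket(run)     # falls in the range with upper bound 6
```

The bucket index, rather than the raw count, would then be handed to a learning system as a discrete feature.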
173 Citations
41 Claims
1. A system that facilitates extracting data in connection with spam processing, comprising:
a component that receives a message and extracts a set of features associated with some part, content or content type of a message; and
an analysis component that at least examines consecutiveness of characters within a subject line of the message in connection with building a filter.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A system that facilitates extracting data in connection with spam processing, comprising:
a component that receives an item and extracts a set of features associated with a message; and
an analysis component that determines whether an embedded message or attachment is associated with the message.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27, 28
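As an illustration of the attachment and embedded-message check, the sketch below uses Python's standard `email` package. The function name `attachment_features` and the shape of the returned dictionary are assumptions made for this example, not part of the claims.

```python
from email.message import EmailMessage, Message


def attachment_features(msg: Message) -> dict:
    """Report whether a message carries attachments or an embedded message,
    and which content types the attachments have."""
    attachment_types = []
    has_embedded_message = False
    for part in msg.walk():  # walk() yields the message itself and every sub-part
        content_type = part.get_content_type()
        if content_type == "message/rfc822":
            has_embedded_message = True
        # Treat a part as an attachment if it is marked as one or has a filename.
        if part.get_content_disposition() == "attachment" or part.get_filename():
            attachment_types.append(content_type)
    return {
        "has_attachment": bool(attachment_types),
        "has_embedded_message": has_embedded_message,
        "attachment_types": sorted(set(attachment_types)),
    }


# Usage: a message with one binary attachment.
example = EmailMessage()
example["Subject"] = "invoice"
example.set_content("See attached.")
example.add_attachment(b"abc", maintype="application",
                       subtype="octet-stream", filename="a.exe")
features = attachment_features(example)
```

The attachment types are returned alongside the boolean flags so that a filter could weigh, for example, executable attachments differently from images.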
29. A method that facilitates spam detection and prevention comprising:
receiving a plurality of messages, the plurality comprising at least a first and a second message;
extracting at least a subset of information from the plurality of messages, the information being from at least one of a subject line, a content-type header, a received header, and a message body; and
analyzing the subset of information to generate one or more features to facilitate training a filter.
Dependent claims: 30, 31, 32, 33, 34, 35, 36, 37, 38, 39
40. A computer-readable medium having stored thereon the following computer executable components:
a component that receives a message and extracts a set of features associated with some part, content or content type of a message;
an analysis component that examines at least consecutiveness of characters within a subject line of the message in connection with building a filter;
a component that determines whether an embedded message or attachment is associated with the message; and
a component that determines a percentage or a number of consecutive lines of a message body to examine and that examines the message body for the presence of at least one blob or consecutive, repeating characters.
41. A system that facilitates printing from a web page comprising:
means for receiving a plurality of messages, the plurality comprising at least a first and a second message;
means for extracting at least a subset of information from the plurality of messages, the information being from at least one of a subject line, a content-type header, a received header, and a message body; and
means for analyzing the subset of information to generate one or more features to facilitate training a filter, the means for analyzing the subset of information comprising:
means for determining a number of consecutive repeating characters within the subject line or the message body of the message;
means for determining a delta between a first time stamp and a last time stamp associated with the message;
means for determining whether an embedded message or an attachment exists in the message and identifying a type of embedded message or attachment to facilitate predicting whether the message is spam; and
means for determining a percentage or a number of consecutive lines of a message body to examine at least one of:
a percentage of white space to non-white space in the subject line of the message and a percentage of non-white space and non-numeric characters that are not letters in the subject line of the message.
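The time-stamp delta and the subject-line character percentages can be sketched as follows. This is one possible reading, under stated assumptions: the time stamps are taken from `Received` headers (text after the last semicolon), the first listed header is treated as the most recent hop, and the second percentage is computed over all characters in the subject. The function names are hypothetical.

```python
from email import message_from_string
from email.message import Message
from email.utils import parsedate_to_datetime


def received_delta_seconds(msg: Message):
    """Seconds between the last-listed (oldest) and first-listed (newest)
    Received time stamps, or None if fewer than two can be parsed."""
    stamps = []
    for header in msg.get_all("Received", []):
        _, separator, date_text = header.rpartition(";")
        if not separator:
            continue
        try:
            stamp = parsedate_to_datetime(date_text.strip())
        except (TypeError, ValueError):
            continue  # skip headers whose date cannot be parsed
        if stamp.tzinfo is not None:  # keep only stamps with a known UTC offset
            stamps.append(stamp)
    if len(stamps) < 2:
        return None
    return (stamps[0] - stamps[-1]).total_seconds()


def subject_char_percentages(subject: str) -> dict:
    """Whitespace as a percentage of non-whitespace characters, and
    characters that are neither whitespace, digits, nor letters as a
    percentage of all characters."""
    whitespace = sum(ch.isspace() for ch in subject)
    non_whitespace = len(subject) - whitespace
    symbols = sum(
        not (ch.isspace() or ch.isdigit() or ch.isalpha()) for ch in subject
    )
    return {
        "whitespace_to_non_whitespace_pct":
            100 * whitespace / non_whitespace if non_whitespace else 0.0,
        "symbol_pct": 100 * symbols / len(subject) if subject else 0.0,
    }


raw = (
    "Received: from b.example by c.example; Mon, 01 Jan 2024 10:05:00 +0000\n"
    "Received: from a.example by b.example; Mon, 01 Jan 2024 10:00:00 +0000\n"
    "Subject: A B!!\n"
    "\n"
    "body\n"
)
message = message_from_string(raw)
delta = received_delta_seconds(message)
percentages = subject_char_percentages(message["Subject"])
```

Both values are raw measurements; as with the other features, they could be bucketed into ranges before being used to train a filter.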
Specification