System and method for spam filtering using shingles
Abstract
Disclosed are a system and methods for detecting spam using shingles. In one aspect, the system receives an electronic message including at least a text portion. The system identifies insignificant text portions in the received message and removes them to generate an abridged message. The system then generates a set of shingles from the abridged message and identifies in that set one or more shingles that occur only in messages not containing spam. The system removes the identified shingles from the generated set to produce a reduced set of shingles, and then performs spam filtering of the reduced set to determine whether the received message contains spam.
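The shingle-generation step named in the abstract can be illustrated with a minimal sketch. It assumes word-level shingles of three words, each fingerprinted with MD5; neither the window size nor the hash function is specified in this section, and the function name `make_shingles` is hypothetical.

```python
import hashlib


def make_shingles(text, size=3):
    """Return the set of hashed word-level shingles of `text`.

    Assumptions (not taken from the patent text): shingles are overlapping
    windows of `size` words, and MD5 serves only as a stable fingerprint.
    """
    words = text.split()
    if not words:
        return set()
    if len(words) < size:
        # Too short for a full window: treat the whole text as one shingle.
        windows = [" ".join(words)]
    else:
        windows = [
            " ".join(words[i:i + size])
            for i in range(len(words) - size + 1)
        ]
    return {hashlib.md5(w.encode("utf-8")).hexdigest() for w in windows}
```

Two messages that share a run of three consecutive words then share at least one shingle, which is what makes the later set operations meaningful.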
15 Claims
1. A computer-implemented method for detecting spam, the method comprising:
receiving an electronic message;
identifying in the received message one or more insignificant text portions based on a text pattern database storing a plurality of defined insignificant text patterns not containing spam, each defined insignificant text pattern comprising a text pattern, text identification information and a usage frequency;
removing the one or more identified insignificant text portions from the message to generate an abridged message upon detecting the one or more identified insignificant text portions matching at least one of the plurality of defined insignificant text patterns;
canonizing text of the abridged message;
generating a set of shingles from the abridged and canonized message;
identifying in the generated set of shingles one or more shingles based on a shingles database storing a plurality of defined insignificant shingles that occur only in messages not containing spam, each defined insignificant shingle comprising a hash, a shingle pattern, text identification information corresponding to the shingle pattern, and a usage frequency;
removing one or more identified shingles from the generated set of shingles to generate a reduced set of shingles upon detecting the one or more identified shingles matching at least one of the plurality of defined shingles; and
performing spam filtering of the reduced set of shingles to determine whether the received message contains spam.
Dependent claims: 2, 3, 4, 5
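The text-removal and canonization steps of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the record type `TextPattern`, the use of regular expressions to encode a text pattern, the choice to increment the usage frequency on each match, and the canonization rules (lowercasing, dropping punctuation, collapsing whitespace) are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class TextPattern:
    """One entry of a text pattern database (field names are assumed)."""
    pattern: str              # regular expression matching insignificant text
    text_id: str              # text identification information
    usage_frequency: int = 0  # how often the pattern has matched so far


def abridge(message, patterns):
    """Remove every portion of `message` that matches a defined pattern."""
    for p in patterns:
        message, hits = re.subn(p.pattern, " ", message, flags=re.IGNORECASE)
        # Assumption: the usage frequency counts matches seen so far.
        p.usage_frequency += hits
    return message


def canonize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())
```

For example, with a single pattern `TextPattern(r"sent from my \w+", "mobile-signature")`, the message `"Buy CHEAP pills now!!! Sent from my iPhone"` is abridged and canonized to `"buy cheap pills now"`.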
6. A computer system for detecting spam, the system comprising:
a processor configured to:
receive an electronic message;
identify in the received message one or more insignificant text portions based on a text pattern database storing a plurality of defined insignificant text patterns not containing spam, each defined insignificant text pattern comprising a text pattern, text identification information and a usage frequency;
remove the one or more identified insignificant text portions from the message to generate an abridged message upon detecting the one or more identified insignificant text portions matching at least one of the plurality of defined insignificant text patterns;
canonize text of the abridged message;
generate a set of shingles from the abridged and canonized message;
identify in the generated set of shingles one or more shingles based on a shingles database storing a plurality of defined insignificant shingles that occur only in messages not containing spam, each defined insignificant shingle comprising a hash, a shingle pattern, text identification information corresponding to the shingle pattern, and a usage frequency;
remove one or more identified shingles from the generated set of shingles to generate a reduced set of shingles upon detecting the one or more identified shingles matching at least one of the plurality of defined shingles; and
perform spam filtering of the reduced set of shingles to determine whether the received message contains spam.
Dependent claims: 7, 8, 9, 10
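The shingles database and the reduction step recited in claim 6 might be modeled as below. The sketch assumes the database is an in-memory dictionary keyed by shingle hash and that a match increments the usage frequency; the names `InsignificantShingle` and `reduce_shingles` are hypothetical, and the claim does not prescribe any of these choices.

```python
from dataclasses import dataclass


@dataclass
class InsignificantShingle:
    """One shingles database entry, with the four fields named in the claim."""
    hash: str                 # hash of the shingle
    pattern: str              # shingle pattern (the shingle text)
    text_id: str              # text identification information
    usage_frequency: int = 0  # assumed to count how often the shingle matched


def reduce_shingles(shingles, database):
    """Drop every shingle hash that is present in `database`.

    `shingles` is a set of hash strings; `database` maps a hash string to
    its InsignificantShingle record (an assumed storage layout).
    """
    reduced = set()
    for h in shingles:
        record = database.get(h)
        if record is None:
            reduced.add(h)  # not a known insignificant shingle: keep it
        else:
            record.usage_frequency += 1  # assumption: count each match
    return reduced
```

Keying the database by hash keeps each lookup constant-time, so the reduction cost grows with the number of shingles in the message rather than with the size of the database.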
11. A computer program product stored on a non-transitory computer-readable storage medium, the computer program product comprising computer-executable instructions for detecting spam, including instructions for:
receiving an electronic message;
identifying in the received message one or more insignificant text portions based on a text pattern database storing a plurality of defined insignificant text patterns not containing spam, each defined insignificant text pattern comprising a text pattern, text identification information and a usage frequency;
removing the one or more identified insignificant text portions from the message to generate an abridged message upon detecting the one or more identified insignificant text portions matching at least one of the plurality of defined insignificant text patterns;
canonizing text of the abridged message;
generating a set of shingles from the abridged and canonized message;
identifying in the generated set of shingles one or more shingles based on a shingles database storing a plurality of defined insignificant shingles that occur only in messages not containing spam, each defined insignificant shingle comprising a hash, a shingle pattern, text identification information corresponding to the shingle pattern, and a usage frequency;
removing one or more identified shingles from the generated set of shingles to generate a reduced set of shingles upon detecting the one or more identified shingles matching at least one of the plurality of defined shingles; and
performing spam filtering of the reduced set of shingles to determine whether the received message contains spam.
Dependent claims: 12, 13, 14, 15
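The independent claims do not say how the final spam-filtering step uses the reduced set of shingles. One common approach, shown here purely as an assumption, is to compare the reduced set against shingle sets of known spam messages using Jaccard similarity and a threshold; the function names and the threshold value of 0.5 are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_spam(reduced, known_spam_sets, threshold=0.5):
    """Flag a message whose reduced shingle set resembles any known spam.

    Assumption: `known_spam_sets` is an iterable of shingle-hash sets taken
    from previously identified spam messages.
    """
    return any(jaccard(reduced, s) >= threshold for s in known_spam_sets)
```

Because insignificant shingles were already removed, shared boilerplate such as signatures or greetings no longer inflates the similarity between a legitimate message and a spam sample.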
Specification