Systems and methods for generating signatures for electronic communication classification
First Claim
1. A computer system comprising a memory storing instructions which, when executed, cause the computer system to form:
- a message aggregator configured to assign messages of a spam message corpus to a plurality of spam message clusters, the plurality of spam message clusters including a first and a second spam message cluster, wherein the message aggregator is configured to compute a hyperspace representation of a message of the spam message corpus, and to assign the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
- a pattern extractor connected to the message aggregator and configured to, in response to assigning the messages to the plurality of spam message clusters, extract a first set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
- a spam identification signature builder connected to the pattern extractor and configured to combine a first subset of the first set of cluster-specific spam identification text patterns into a first set of spam identification signatures for the first spam message cluster, wherein each spam identification signature of the first set of spam identification signatures includes a predetermined conjunction of at least two spam identification text patterns of the first subset of the first set of cluster-specific spam identification text patterns.
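The aggregation step recited in the claim (compute a hyperspace representation of a message, then assign it to a cluster by its distance to the cluster center) can be illustrated with a minimal Python sketch. The claim does not fix a feature space or a metric, so the word-count vector over a small vocabulary, the Euclidean distance, and the names `VOCAB`, `to_vector`, and `assign_to_cluster` are all assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import List, Sequence

# Hypothetical vocabulary: each word is one axis of the hyperspace (an assumption).
VOCAB = ["free", "pills", "loan", "click", "winner"]

def to_vector(message: str) -> List[float]:
    """Map a message to a point in hyperspace: one word count per vocabulary axis."""
    counts = Counter(message.lower().split())
    return [float(counts[word]) for word in VOCAB]

def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, used here as an assumed hyperspace distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_cluster(message: str, centers: Sequence[Sequence[float]]) -> int:
    """Return the index of the cluster whose center is nearest to the message."""
    vec = to_vector(message)
    return min(range(len(centers)), key=lambda i: distance(vec, centers[i]))

# Two hypothetical cluster centers, one per spam wave.
centers = [
    [1.0, 1.0, 0.0, 1.0, 0.0],  # e.g. a pharmacy-themed wave
    [0.0, 0.0, 1.0, 0.0, 1.0],  # e.g. a loan or lottery-themed wave
]
```

With these centers, `assign_to_cluster("click for free pills", centers)` selects the first cluster, because that message's vector coincides with the first center.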
Abstract
In some embodiments, fully-automated spam identification is facilitated by accelerating a signature extraction process, allowing the use of a relatively large number of signatures finely tailored to individual spam waves, rather than a smaller number of highly-accurate signatures generated under human supervision. The signature extraction process is performed in a distributed manner. A message corpus is classified into a plurality of message clusters. Cluster-specific spam identification text patterns are extracted selectively from members of each cluster, and the text patterns are combined into cluster-specific spam identification signatures. A cluster may represent an individual spam wave. Genetic algorithms are used to optimize the set of spam identification signatures by selecting the highest-performing combinations of cluster-specific spam identification text patterns. Performing signature extraction at a subclass level allows accelerating the signature extraction process, which in turn allows frequent signature updates and facilitates fully automated spam identification.
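The extraction and combination steps summarized above can be sketched roughly as follows. This is a simplified illustration, not the patented method: it treats whitespace-separated tokens as text patterns, keeps tokens that are frequent inside one cluster and absent from a set of other messages, and enumerates every pair as a candidate signature. The genetic-algorithm optimization mentioned in the abstract is not modeled here. The names `extract_patterns`, `build_signatures`, and the `min_support` threshold are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence

def extract_patterns(cluster: Sequence[str],
                     other_messages: Sequence[str],
                     min_support: float = 0.8) -> List[str]:
    """Return tokens found in at least min_support of the cluster's messages
    and in none of other_messages (assumed selection rule)."""
    token_sets = [set(m.lower().split()) for m in cluster]
    other_tokens = set()
    for m in other_messages:
        other_tokens |= set(m.lower().split())
    candidates = set().union(*token_sets)
    patterns = [
        t for t in candidates
        if t not in other_tokens
        and sum(t in s for s in token_sets) / len(token_sets) >= min_support
    ]
    return sorted(patterns)

def build_signatures(patterns: Sequence[str], size: int = 2) -> List[FrozenSet[str]]:
    """Combine patterns into candidate signatures, each a conjunction of `size` patterns."""
    return [frozenset(combo) for combo in combinations(patterns, size)]

cluster = [
    "cheap pills online now",
    "cheap pills shipped now",
    "buy cheap pills now",
]
others = ["meeting now at noon"]

patterns = extract_patterns(cluster, others)
signatures = build_signatures(patterns)
```

In this toy corpus the token `now` occurs in every cluster member but also in the non-cluster message, so it is rejected; `cheap` and `pills` survive and form a single two-pattern signature.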
207 Citations
24 Claims
1. A computer system comprising a memory storing instructions which, when executed, cause the computer system to form:
- a message aggregator configured to assign messages of a spam message corpus to a plurality of spam message clusters, the plurality of spam message clusters including a first and a second spam message cluster, wherein the message aggregator is configured to compute a hyperspace representation of a message of the spam message corpus, and to assign the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
- a pattern extractor connected to the message aggregator and configured to, in response to assigning the messages to the plurality of spam message clusters, extract a first set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
- a spam identification signature builder connected to the pattern extractor and configured to combine a first subset of the first set of cluster-specific spam identification text patterns into a first set of spam identification signatures for the first spam message cluster, wherein each spam identification signature of the first set of spam identification signatures includes a predetermined conjunction of at least two spam identification text patterns of the first subset of the first set of cluster-specific spam identification text patterns.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 24
11. A computer-implemented method comprising:
assigning messages of a spam message corpus to a plurality of spam message clusters, the plurality of spam message clusters including a first and a second spam message cluster, wherein assigning messages of the spam message corpus to the plurality of spam message clusters comprises computing a hyperspace representation of a message of the spam message corpus, and assigning the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
in response to assigning the messages to the plurality of spam message clusters, extracting a first set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
combining a first subset of the first set of cluster-specific spam identification text patterns into a first set of spam identification signatures for the first spam message cluster, wherein each spam identification signature of the first set of spam identification signatures includes a predetermined conjunction of at least two spam identification text patterns of the first subset of the first set of cluster-specific spam identification text patterns.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer-implemented spam-filtering method comprising:
receiving a set of cluster-specific spam identification signatures, wherein the cluster-specific spam identification signatures are generated by:
assigning messages of a spam message corpus to a plurality of spam message clusters including a first and second spam message cluster, wherein assigning messages of the spam message corpus to the plurality of spam message clusters comprises computing a hyperspace representation of a message of the spam message corpus, and assigning the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
in response to assigning the messages to the plurality of spam message clusters, extracting a set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
combining a subset of the set of cluster-specific spam identification text patterns into a set of cluster-specific spam identification signatures for the first spam message cluster, wherein each spam identification signature includes a predetermined conjunction of at least two spam identification text patterns; and
deciding whether an incoming message is spam or non-spam according to the cluster-specific spam identification signatures.
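As an illustration of the final deciding step, the sketch below treats each signature as a conjunction of text patterns and flags a message as spam when every pattern of at least one signature occurs in it. Case-insensitive substring matching and the names `matches` and `is_spam` are assumptions; the claim does not specify how patterns are matched or how signature hits are weighed.

```python
from typing import Iterable, Sequence

def matches(signature: Iterable[str], message: str) -> bool:
    """A signature matches only if ALL of its patterns occur in the message (conjunction)."""
    text = message.lower()
    return all(pattern in text for pattern in signature)

def is_spam(message: str, signatures: Sequence[Iterable[str]]) -> bool:
    """Assumed decision rule: spam if any one signature matches in full."""
    return any(matches(sig, message) for sig in signatures)

# Hypothetical signatures, each a conjunction of two lowercase text patterns.
signatures = [
    ("cheap", "pills"),
    ("loan", "approved"),
]
```

Under this rule, `is_spam("Get CHEAP pills today", signatures)` is true, while `is_spam("cheap flights to Rome", signatures)` is false because only one pattern of the first signature is present.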
22. A non-transitory computer-readable storage medium encoding instructions which, when executed on a computer system, cause the computer system to perform the steps of:
assigning messages of a spam message corpus to a plurality of spam message clusters, the plurality of spam message clusters including a first and a second spam message cluster, wherein assigning messages of the spam message corpus to the plurality of spam message clusters comprises computing a hyperspace representation of a message of the spam message corpus, and assigning the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
in response to assigning the messages to the plurality of spam message clusters, extracting a set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
combining a subset of the set of cluster-specific spam identification text patterns into a set of spam identification signatures for the first spam message cluster, wherein each spam identification signature of the set of spam identification signatures includes a predetermined conjunction of at least two spam identification text patterns of the subset of the set of cluster-specific spam identification text patterns.
23. A non-transitory computer-readable storage medium encoding instructions which, when executed on a computer system, cause the computer system to perform the steps of:
receiving a set of cluster-specific spam identification signatures, wherein the cluster-specific spam identification signatures are generated by:
assigning messages of a spam message corpus to a plurality of spam message clusters including a first and second spam message cluster, wherein assigning messages of the spam message corpus to the plurality of spam message clusters comprises computing a hyperspace representation of a message of the spam message corpus, and assigning the message to a selected cluster according to a hyperspace distance between the hyperspace representation and a center of the selected cluster;
in response to assigning the messages to the plurality of spam message clusters, extracting a set of cluster-specific spam identification text patterns from members of the first spam message cluster; and
combining a subset of the set of cluster-specific spam identification text patterns into a set of cluster-specific spam identification signatures for the first spam message cluster, wherein each spam identification signature includes a predetermined conjunction of at least two spam identification text patterns; and
deciding whether an incoming message is spam or non-spam according to the cluster-specific spam identification signatures.
Specification