System and method for the classification of electronic communication
Abstract
From an electronic message, we extract any destinations in selectable links, and we reduce the message to a “canonical” (standard) form that we define. It minimizes the possible variability that a spammer can introduce, to produce unique copies of a message. We then make multiple hashes. These can be compared with those from messages received by different users to objectively find bulk messages. From these, we build hash tables of bulk messages and make a list of destinations from the most frequent messages. The destinations can be used in a Real time Blacklist (RBL) against links in bodies of messages. Similarly, the hash tables can be used to identify other messages as bulk or spam. Our method can be used by a message provider or group of users (where the group can do so in a p2p fashion) independently of whether any other provider or group does so. Each user can maintain a “gray list” of bulk mail senders that she subscribes to, to distinguish between wanted bulk mail and unwanted bulk mail (spam). The gray list can be used instead of a whitelist, and is far easier for the user to maintain.
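The reduction to a canonical form and the multiple hashes described in the abstract can be illustrated with a minimal sketch. It is not the patented procedure: the specific rules used here (dropping HTML comments, scripts, and styles; lowercasing; discarding every non-alphanumeric character; hashing fixed-size blocks with SHA-1) are assumptions chosen for illustration, and the names `canonicalize`, `domain_of`, and `block_hashes` are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _VisibleText(HTMLParser):
    """Collects displayed text and link targets; comments are ignored by default."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # text a reader would see
        self.links = []    # raw href values of selectable links
        self._skip = 0     # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def domain_of(href):
    """Return the lowercased host of a link, or None if it has none."""
    try:
        return urlsplit(href).hostname
    except ValueError:
        return None


def canonicalize(body):
    """Return (canonical_text, link_domains) for an HTML message body.

    Assumed canonical form: visible text only, lowercased, with every
    non-alphanumeric character removed.
    """
    parser = _VisibleText()
    parser.feed(body)
    parser.close()
    text = re.sub(r"[^a-z0-9]+", "", "".join(parser.chunks).lower())
    domains = sorted({d for d in map(domain_of, parser.links) if d})
    return text, domains


def block_hashes(text, blocks=4):
    """Hash the canonical text in up to `blocks` consecutive pieces."""
    size = max(1, -(-len(text) // blocks))  # ceiling division
    return [
        hashlib.sha1(text[i:i + size].encode("utf-8")).hexdigest()
        for i in range(0, len(text), size)
    ]
```

Under these assumptions, two copies of a message that differ only in hidden comments, letter case, spacing, or link query strings reduce to the same canonical text, the same destination domains, and therefore the same list of hashes.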
23 Claims
1. A method for processing digital messages on an electronic communication system, each message having a header and a body, comprising:
- identifying a set of characteristics of a first message, the set including;
addresses extracted from the header and body of the message; and
a condensed representation of the message body produced by;
eliminating message content not perceptible in the normal display mode of the message;
converting the perceptible message content to a standardized format characterized by limited degeneracy;
generating a plurality of hash values which represents the converted content of the message body;
storing the set of identified characteristics of the first message in a first bulk message envelope, the first bulk message envelope including a frequency index;
identifying the same set of characteristics of a second message;
comparing the set of identified characteristics of the second message to the first bulk message envelope;
upon determining that the second message has characteristics dissimilar to those of the first bulk message envelope, storing the set of identified characteristics of the second message in a second bulk message envelope, the second bulk message envelope including a frequency index with a unitary value;
upon determining that the second message has characteristics similar to those of the first bulk message envelope, increasing the frequency index of the first bulk message envelope by a unitary increment.
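The storing, comparing, and counting steps of claim 1 can be sketched as follows. The claim does not say how similarity is decided, so the rule used here (at least `min_matching` hash values equal at the same position) is purely an assumption, as are the names `BulkMessageEnvelope`, `is_similar`, and `process`.

```python
from dataclasses import dataclass


@dataclass
class BulkMessageEnvelope:
    addresses: frozenset   # addresses extracted from header and body
    hashes: tuple          # hash values of the condensed message body
    frequency: int = 1     # frequency index, starting at a unitary value


def is_similar(envelope, hashes, min_matching=2):
    """Assumed similarity rule: enough hashes agree position by position."""
    matches = sum(1 for a, b in zip(envelope.hashes, hashes) if a == b)
    return matches >= min_matching


def process(envelopes, addresses, hashes, min_matching=2):
    """Count the message against a similar envelope, or open a new one."""
    for envelope in envelopes:
        if is_similar(envelope, hashes, min_matching):
            envelope.frequency += 1          # unitary increment
            return envelope
    envelope = BulkMessageEnvelope(frozenset(addresses), tuple(hashes))
    envelopes.append(envelope)               # new envelope, frequency index 1
    return envelope
```

For example, starting from an empty list, a first message creates an envelope with a frequency index of 1; a second message sharing two of its four hashes raises that index to 2; a third message with no hashes in common opens a second envelope.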
12. An electronic communication system comprising interconnected entities for transmission and receipt of messages, the system comprising, in at least one of said entities, a subsystem for processing messages comprising:
a unit which identifies a set of characteristics of a message;
a memory which stores the set of identified characteristics of messages in a plurality of bulk message envelopes, each bulk message envelope including a frequency index;
a unit which compares the set of identified characteristics of a message to the bulk message envelopes, and if the identified characteristics are similar to a stored bulk message envelope, increasing the frequency index of the bulk message envelope in response, and if the identified characteristics are dissimilar to any stored bulk message envelope, causing the set of identified characteristics to be stored in the memory as an additional bulk message envelope.
13. A computer program embodied on a computer-readable medium and/or memory device for providing a subsystem for processing messages comprising:
an identification segment for extracting a set of characteristics of a message;
a storage segment for storing the set of identified characteristics of a message in a bulk message envelope, each bulk message envelope including a frequency index;
a comparison segment for comparing the set of identified characteristics of a message to the bulk message envelopes, and if the identified characteristics are similar to a stored bulk message envelope, increasing the frequency index of the bulk message envelope by a unitary increment, and if the identified characteristics are dissimilar to any stored bulk message envelope, causing the set of identified characteristics to be stored in the memory as an additional bulk message envelope having a frequency index with a unitary value.
14. An article of manufacture comprising:
a machine readable medium and/or memory device that provides instructions that, if executed by a machine operatively connected to an electronic messaging system, will cause the machine to perform operations including;
identifying a set of characteristics of a first message;
storing the set of identified characteristics of the first message in a first bulk message envelope, the first bulk message envelope including a frequency index;
identifying the same set of characteristics of a second message;
comparing the set of identified characteristics of the second message to the first bulk message envelope;
upon determining that the second message has characteristics similar to those of the first bulk message envelope, increasing the frequency index of the first bulk message envelope by a unitary value;
upon determining that the second message has characteristics dissimilar to those of the first bulk message envelope, storing the set of identified characteristics of the second message in a second bulk message envelope, the second bulk message envelope including a frequency index with a unitary value.
Specification