System and method for processing a message store for near duplicate messages
First Claim
1. A system for processing a message store for near duplicate messages, comprising:
- a deduper module configured to identify near duplicate messages in a plurality of messages in a message store, comprising:
a message digester module configured to generate a message sequence taken from metadata for, and of content contained in, each of the messages and to generate an attachment sequence for at least part of at least one attachment associated with one or more of the messages;
a concatenator module configured to concatenate the message sequence and the attachment sequence into a compound digest for each message;
a comparer module configured to compare the message sequences and the compound digests of the messages in the message store; and
a message sequence marker module configured to mark each such message having a message sequence not matching the message sequence of any other such message as unique; and
mark each such message having a message sequence matching the message sequence of at least one other such message as an exact duplicate; and
a classifier module configured to group those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages, wherein the marker is further configured to designate a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate;
a compound sequence maker module to mark each exact duplicate message and each near duplicate message having a compound digest not matching any other compound digest as a unique message; and
a processor to execute each of the modules, which are stored on a computer-readable storage medium.
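As an illustration of the digester and concatenator modules recited above, the following is a minimal Python sketch. The claim does not name a digest function, so SHA-256 is an assumption here, as are the function names `message_sequence`, `attachment_sequence`, and `compound_digest`, the sorted-key metadata encoding, and the optional `limit` parameter (one possible reading of "at least part of" an attachment).

```python
import hashlib
from typing import Iterable, Mapping, Optional


def message_sequence(metadata: Mapping[str, str], content: str) -> str:
    """Digest taken from the metadata for, and the content of, one message."""
    h = hashlib.sha256()  # assumed digest function; the claim does not specify one
    for key in sorted(metadata):  # fixed field order so equal messages hash equally
        h.update(f"{key}={metadata[key]}\n".encode("utf-8"))
    h.update(content.encode("utf-8"))
    return h.hexdigest()


def attachment_sequence(data: bytes, limit: Optional[int] = None) -> str:
    """Digest over the whole attachment, or only its first `limit` bytes."""
    return hashlib.sha256(data if limit is None else data[:limit]).hexdigest()


def compound_digest(
    metadata: Mapping[str, str],
    content: str,
    attachments: Iterable[bytes],
    limit: Optional[int] = None,
) -> str:
    """Concatenate the message sequence with each attachment sequence."""
    parts = [message_sequence(metadata, content)]
    parts.extend(attachment_sequence(a, limit) for a in attachments)
    return "".join(parts)
```

Under these assumptions, two messages with identical metadata and body but different attachments share a message sequence while their compound digests differ, which is the case the compound digest comparison in the claim is meant to catch.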
Abstract
A system and method for processing a message store for near duplicate messages are provided. Metadata, content, and each attachment associated with messages are extracted. Near duplicate messages in the message store are identified. Compound digests taken of the metadata for, of the content contained in, and of each attachment associated with each of the messages in the message store are compared. Each message having a compound digest not matching the compound digest of any other message is marked as unique and each message having a compound digest matching the compound digest of at least one other message is marked as an exact duplicate. Messages remaining unmarked and having similar content are grouped into sets that each include one or more near duplicate messages. One of the near duplicate messages is designated as unique and each remaining near duplicate message in the set is designated as a near duplicate.
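The digest comparison described in the abstract can be sketched as a frequency count over digests. This is only an illustration: the function name `mark_by_digest`, the message identifiers, and the label strings are hypothetical, and the digests are assumed to have been computed already.

```python
from collections import Counter
from typing import Dict


def mark_by_digest(digests: Dict[str, str]) -> Dict[str, str]:
    """Map each message id to 'unique' or 'exact duplicate'.

    `digests` maps a message id to its compound digest. A digest seen
    exactly once marks its message as unique; a digest shared with at
    least one other message marks each holder as an exact duplicate.
    """
    counts = Counter(digests.values())
    return {
        message_id: "unique" if counts[digest] == 1 else "exact duplicate"
        for message_id, digest in digests.items()
    }
```

For example, `mark_by_digest({"m1": "aa", "m2": "aa", "m3": "bb"})` marks `m1` and `m2` as exact duplicates and `m3` as unique.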
20 Claims
1. A system for processing a message store for near duplicate messages, comprising:
a deduper module configured to identify near duplicate messages in a plurality of messages in a message store, comprising:
a message digester module configured to generate a message sequence taken from metadata for, and of content contained in, each of the messages and to generate an attachment sequence for at least part of at least one attachment associated with one or more of the messages;
a concatenator module configured to concatenate the message sequence and the attachment sequence into a compound digest for each message;
a comparer module configured to compare the message sequences and the compound digests of the messages in the message store; and
a message sequence marker module configured to mark each such message having a message sequence not matching the message sequence of any other such message as unique; and
mark each such message having a message sequence matching the message sequence of at least one other such message as an exact duplicate; and
a classifier module configured to group those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages, wherein the marker is further configured to designate a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate;
a compound sequence maker module to mark each exact duplicate message and each near duplicate message having a compound digest not matching any other compound digest as a unique message; and
a processor to execute each of the modules, which are stored on a computer-readable storage medium.
Dependent claims: 2, 3, 4, 5, 6
7. A computer-implemented method for processing a message store for near duplicate messages, comprising:
maintaining a plurality of messages in a message store;
retrieving the messages from the message store and extracting metadata for, content contained in, and each attachment associated with one or more of the messages in the message store into a memory;
generating a message sequence taken from the metadata for, and the content contained in, each of the messages and generating an attachment sequence for at least part of each attachment associated with the one or more of the messages;
concatenating the message sequence and the attachment sequence into a compound digest for each message;
identifying near duplicate messages in the message store, comprising:
comparing the message sequences and the compound digests of the messages in the message store;
marking each such message having a message sequence not matching the message sequence of any other such message as unique; and
marking each such message having a message sequence matching the message sequence of at least one other such message as an exact duplicate;
grouping those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages;
designating a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate;
marking each exact duplicate message and each near duplicate message having a compound digest not matching any other compound digest as a unique message; and
outputting the unique, exact duplicate, and near duplicate message markings to the message store.
Dependent claims: 8, 9, 10, 11, 12, 13
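The grouping and designating steps of claim 7 can be sketched as a greedy pass over message bodies. The claim does not define "similar content", so this sketch assumes a character-level similarity ratio from Python's standard `difflib.SequenceMatcher` and a hypothetical threshold of 0.9; the function name `group_near_duplicates` and the label strings are likewise illustrative only.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def group_near_duplicates(
    bodies: List[Tuple[str, str]], threshold: float = 0.9
) -> Tuple[List[List[str]], Dict[str, str]]:
    """Group (message_id, text) pairs into sets of similar content.

    The first message placed in each set is designated 'unique';
    every later member of that set is designated 'near duplicate'.
    """
    sets: List[List[str]] = []   # message ids per set, first id is the representative
    representatives: List[str] = []  # text of the first message in each set

    for message_id, text in bodies:
        for index, rep_text in enumerate(representatives):
            # Assumed similarity measure: difflib ratio against the representative.
            if SequenceMatcher(None, rep_text, text).ratio() >= threshold:
                sets[index].append(message_id)
                break
        else:
            # No existing set is similar enough, so start a new one.
            sets.append([message_id])
            representatives.append(text)

    marks: Dict[str, str] = {}
    for members in sets:
        marks[members[0]] = "unique"
        for message_id in members[1:]:
            marks[message_id] = "near duplicate"
    return sets, marks
```

Comparing each message only against the first member of a set keeps the pass simple, at the cost of making the result depend on message order; other similarity measures or clustering strategies would fit the claim language equally well.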
14. A system for classifying messages, comprising:
a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
a message deduper module configured to identify those messages having both unique metadata signatures and content signatures as unique messages; and
to identify those messages having both matching metadata signatures and content signatures as exact duplicate messages;
a message processor module configured to process any of the messages remaining unidentified, comprising:
a comparer module configured to compare the contents of those remaining messages having matching metadata signatures; and
a marker module configured to designate each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages;
an attachment signature for each attachment comprised with one or more of the messages;
a compound signature comprising a concatenation of the metadata signature, content signature, and the attachment signature for each attachment comprised with each message, wherein the deduper module is further configured to identify those exact duplicate messages and those near duplicate messages having unique compound signatures as unique messages; and
a processor to execute each of the modules, which are stored on a computer-readable storage medium.
Dependent claims: 15, 16
17. A computer-implemented method for classifying messages, comprising:
maintaining a plurality of messages in a message store;
retrieving the messages from the message store into a memory;
forming a metadata signature for the metadata describing and the content signature signifying each of the plurality of messages in the message store;
identifying those messages having both unique metadata signatures and content signatures as unique messages;
identifying those messages having both matching metadata signatures and content signatures as exact duplicate messages;
processing any of the messages remaining unidentified, comprising:
comparing the contents of those remaining messages having matching metadata signatures; and
designating each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages;
forming an attachment signature for each attachment comprised with one or more of the messages;
forming a compound signature comprising a concatenation of the metadata signature, content signature, and the attachment signature for each attachment comprised with each message;
identifying those exact duplicate messages and those near duplicate messages having unique compound signatures as unique messages; and
outputting the unique, exact duplicate, and near duplicate message identifications to the message store.
Dependent claims: 18, 19, 20
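The final identifying step of claim 17, in which exact duplicate and near duplicate messages with unique compound signatures are re-identified as unique, can be sketched as a second pass over earlier markings. The function name `promote_by_compound_signature`, the label strings, and the placeholder signature values below are hypothetical; the earlier markings and compound signatures are assumed to exist already.

```python
from collections import Counter
from typing import Dict


def promote_by_compound_signature(
    marks: Dict[str, str], compound: Dict[str, str]
) -> Dict[str, str]:
    """Re-mark duplicates whose compound signature occurs only once.

    `marks` maps a message id to 'unique', 'exact duplicate', or
    'near duplicate'. `compound` maps the same ids to compound signatures.
    Messages already marked 'unique' are left unchanged.
    """
    counts = Counter(compound.values())
    updated = dict(marks)  # leave the caller's mapping untouched
    for message_id, label in marks.items():
        is_duplicate = label in ("exact duplicate", "near duplicate")
        if is_duplicate and counts[compound[message_id]] == 1:
            updated[message_id] = "unique"
    return updated
```

In this sketch, three messages that matched on metadata and content but where only two share the same attachments end with the third promoted to unique, while the two that still match remain exact duplicates.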
Specification