System and method for identifying and categorizing messages extracted from archived message stores
Abstract
A system and method for identifying messages in a message store is provided. At least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store are encoded by generating a metadata sequence and a content sequence for each message. The messages are grouped into sets by similar metadata sequences and similar content sequences. The messages in each set are compared. Each such message not matching any other such message in the set is marked as a unique message. Each such message matching at least one other such message in the set is marked as an exact duplicate message. Each such message including a subset of at least one other such message in the set is marked as a near duplicate message.
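By way of non-limiting illustration, the encode, group, compare, and mark steps summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: SHA-256 stands in for the sequence generator, messages are grouped by metadata sequence only, and "subset" is approximated as containment of the whitespace-normalized body. The `Message` fields and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message layout; the source does not fix field names.
    msg_id: str
    sender: str
    subject: str
    body: str
    marking: str = ""


def digest(text: str) -> str:
    # Assumption: SHA-256 stands in for the unspecified sequence generator.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalize(text: str) -> str:
    # Collapse runs of whitespace so trivial layout differences do not matter.
    return " ".join(text.split())


def metadata_sequence(m: Message) -> str:
    return digest(f"{m.sender.lower()}|{m.subject.strip().lower()}")


def content_sequence(m: Message) -> str:
    return digest(normalize(m.body))


def mark_messages(messages: list) -> list:
    # Simplification: group by metadata sequence only.
    sets = defaultdict(list)
    for m in messages:
        sets[metadata_sequence(m)].append(m)

    for group in sets.values():
        for m in group:
            others = [o for o in group if o is not m]
            body = normalize(m.body)
            if any(content_sequence(o) == content_sequence(m) for o in others):
                m.marking = "exact duplicate"
            elif body and any(body in normalize(o.body) for o in others):
                # Assumption: "subset" is read as substring containment.
                m.marking = "near duplicate"
            else:
                m.marking = "unique"
    return messages
```

Calling `mark_messages` on a list of `Message` objects sets each object's `marking` field in place and returns the same list. Exact matching is checked before containment, so a message identical to another is never reported as a near duplicate.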
8 Claims
1. A computer-implemented system for identifying messages in a message store, comprising:
a digester module configured to encode at least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store by generating a metadata sequence and a content sequence for each message; and
a comparer module configured to group the messages into sets by similar metadata sequences and similar content sequences and to compare the messages in each set, comprising:
a unique marker module configured to mark each such message not matching any other such message in the set as a unique message;
an exact duplicate marker module configured to mark each such message matching at least one other such message in the set as an exact duplicate message; and
a near duplicate marker module configured to mark each such message comprising a subset of at least one other such message in the set as a near duplicate message;
an attachment digester module configured to encode at least part of at least one attachment associated with one or more of the messages by generating an attachment sequence for each attachment;
a concatenator module configured to concatenate the metadata sequence and the content sequence for the message and the attachment sequence for the at least one attachment into a compound sequence;
an attachment comparer module configured to compare the compound sequences for the messages;
an attachment marker module configured to mark each exact duplicate message and each near duplicate message having a compound sequence not matching any other compound sequence in the set as a unique message; and
a processor to execute each of the modules, which are stored on a computer-readable storage medium.
3. A computer-implemented method for identifying messages in a message store, comprising:
maintaining a plurality of messages in a message store;
retrieving the messages from the message store and extracting at least part of metadata associated with and at least part of content contained in each of the messages into a memory;
encoding the at least part of metadata associated with and the at least part of content contained in each of the messages by generating a metadata sequence and a content sequence for each message and storing the metadata sequence and content sequence in the memory;
within the memory, grouping the messages into sets by similar metadata sequences and similar content sequences and comparing the messages in each set, comprising:
marking each such message not matching any other such message in the set as a unique message;
marking each such message matching at least one other such message in the set as an exact duplicate message; and
marking each such message comprising a subset of at least one other such message in the set as a near duplicate message;
encoding at least part of at least one attachment associated with one or more of the messages by generating an attachment sequence for each attachment and storing the attachment sequence for each attachment in the memory;
concatenating the metadata sequence and the content sequence for the message and the attachment sequence for the at least one attachment in the memory into a compound sequence;
comparing the compound sequences for the messages in the memory;
within the memory, marking each exact duplicate message and each near duplicate message having a compound sequence not matching any other compound sequence in the set as a unique message; and
outputting the unique, exact duplicate, and near duplicate message markings to the message store.
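By way of non-limiting illustration, the attachment steps recited in the method above (encoding each attachment, concatenating into a compound sequence, comparing, and re-marking) can be sketched as follows. This is a sketch under stated assumptions: SHA-256 stands in for the attachment sequence generator, attachments are concatenated in the order given, only messages carrying at least one attachment are re-marked, and the dictionary keys and function names are hypothetical.

```python
import hashlib
from collections import Counter


def attachment_sequence(attachment: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified attachment encoder.
    return hashlib.sha256(attachment).hexdigest()


def compound_sequence(metadata_seq: str, content_seq: str, attachments: list) -> str:
    # Concatenate the metadata, content, and attachment sequences in that order.
    return metadata_seq + content_seq + "".join(
        attachment_sequence(a) for a in attachments
    )


def remark_unique_by_attachment(message_set: list) -> None:
    # message_set: one set of already-marked messages, each a dict with the
    # hypothetical keys "metadata_seq", "content_seq", "attachments", "marking".
    compounds = [
        compound_sequence(m["metadata_seq"], m["content_seq"], m["attachments"])
        for m in message_set
    ]
    counts = Counter(compounds)
    for m, compound in zip(message_set, compounds):
        is_duplicate = m["marking"] in ("exact duplicate", "near duplicate")
        # Re-mark only messages with attachments whose compound sequence
        # matches no other compound sequence in the set.
        if is_duplicate and m["attachments"] and counts[compound] == 1:
            m["marking"] = "unique"
```

Under this literal reading, a near duplicate that carries an attachment is re-marked as unique whenever its content sequence differs from every other message in the set. The claim language does not spell out how that case is handled, so this behavior is an assumption of the sketch.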
6. A set of stored functions for processing a plurality of messages in a message store, comprising:
a metadata digester function module configured to generate a metadata signature over at least part of message metadata;
a content digester function module configured to generate a content signature over at least part of message content;
a comparator function module configured to compare the metadata signatures and the content signatures to identify matches between messages;
a categorizer function module configured to group messages into sets based upon the matches of the metadata signatures and the content signatures;
a marker function module configured to mark messages in each set with classifications corresponding to unique, exact duplicate, and near duplicate, comprising:
a unique marker function module configured to mark each such message not matching any other such message in the set as a unique message;
an exact duplicate marker function module configured to mark each such message matching at least one other such message in the set as an exact duplicate message; and
a near duplicate marker function module configured to mark each such message comprising a subset of at least one other such message in the set as a near duplicate message;
an attachment digester function module configured to generate an attachment signature over at least part of at least one message attachment;
a concatenator function module configured to concatenate the metadata signature and the content signature for the message and the attachment signature for the at least one attachment into a compound signature, wherein the comparator function module is further configured to compare the compound signatures to identify matches between messages, and the categorizer function module is further configured to group the messages into sets based upon the matches of the compound signatures by marking each exact duplicate message and each near duplicate message having a compound signature not matching any other compound signature in the set as a unique message; and
a processor to execute each of the modules, which are stored on a computer-readable storage medium.
7. A computer-readable storage medium storing code for processing messages, the code comprising:
a structured message store comprising messages, which each comprise:
metadata describing at least part of the message; and
content contained in the message, wherein one or more of the messages further comprise at least one attachment associated with each of the messages; and
discrimination information for each of the messages, comprising:
a metadata signature for the metadata of the message;
a content signature for the content of each message;
an attachment signature for the at least one attachment of the messages having the at least one attachment; and
a compound signature concatenated from the metadata signature and the content signature for the message and the attachment signature for the at least one attachment, wherein those messages having metadata signatures matching for only at least part of the metadata and content signatures matching for only at least part of the content of the other messages comprise near duplicate messages, and wherein further those messages having the at least one attachment and having compound signatures not matching the compound signatures of the other messages comprise unique messages.
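By way of non-limiting illustration, the per-message discrimination information recited above can be pictured as a small record stored alongside each message. This is a sketch under stated assumptions: the type names `StoredMessage` and `DiscriminationInfo`, their fields, and the use of SHA-256 as the signature function are all hypothetical and are not taken from the claims.

```python
import hashlib
from dataclasses import dataclass, field


def signature(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified signature function.
    return hashlib.sha256(data).hexdigest()


@dataclass
class StoredMessage:
    # Hypothetical layout of one message in the structured message store.
    metadata: str
    content: str
    attachments: list = field(default_factory=list)  # raw attachment bytes


@dataclass(frozen=True)
class DiscriminationInfo:
    metadata_signature: str
    content_signature: str
    attachment_signatures: tuple  # one signature per attachment, possibly empty

    @property
    def compound_signature(self) -> str:
        # Concatenated from the metadata, content, and attachment signatures.
        return (
            self.metadata_signature
            + self.content_signature
            + "".join(self.attachment_signatures)
        )


def discriminate(message: StoredMessage) -> DiscriminationInfo:
    # Build the discrimination information for a single stored message.
    return DiscriminationInfo(
        metadata_signature=signature(message.metadata.encode("utf-8")),
        content_signature=signature(message.content.encode("utf-8")),
        attachment_signatures=tuple(signature(a) for a in message.attachments),
    )
```

Deriving the compound signature from the stored parts, rather than storing it separately, keeps the record consistent by construction; a store that persists the compound signature as its own field would be an equally valid reading.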
Specification