System and method for identifying and categorizing messages extracted from archived message stores
Abstract
A system and method for identifying messages in a message store is provided. At least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store are encoded by generating a metadata sequence and a content sequence for each message. The messages are grouped into sets by similar metadata sequences and similar content sequences. The messages in each set are compared. Each such message not matching any other such message in the set is marked as a unique message. Each such message matching at least one other such message in the set is marked as an exact duplicate message. Each such message including a subset of at least one other such message in the set is marked as a near duplicate message.
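The encoding step described in the abstract can be illustrated with a minimal sketch. It assumes SHA-256 as the digest and treats sender, recipients, and subject as the metadata being encoded; neither choice comes from the abstract, and the function and field names are hypothetical.

```python
import hashlib

def _digest(parts):
    """Hash a list of strings into one hex digest (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ["ab", "c"] and ["a", "bc"] hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

def metadata_sequence(message, fields=("sender", "recipients", "subject")):
    """Encode selected metadata fields, normalized for case and whitespace."""
    return _digest([" ".join(str(message.get(f, "")).lower().split()) for f in fields])

def content_sequence(message):
    """Encode the message body with runs of whitespace collapsed."""
    return _digest([" ".join(message.get("body", "").split())])

# Hypothetical messages: b differs from a only in case and spacing,
# c carries the same metadata but a longer body.
a = {"sender": "ann@example.com", "recipients": "bob@example.com",
     "subject": "Q3 report", "body": "Numbers attached.\n"}
b = dict(a, subject="  q3 REPORT ", body="Numbers   attached.")
c = dict(a, body="Numbers attached. Please review.")
```

Under these assumptions, `a` and `b` produce the same pair of sequences, while `c` shares only the metadata sequence with `a`, which is the kind of partial match the later comparison step would examine.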
29 Claims
1. A system for identifying messages in a message store, comprising:
a digester to encode at least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store by generating a metadata sequence and a content sequence for each message; and
a comparer to group the messages into sets by similar metadata sequences and similar content sequences and to compare the messages in each set, comprising:
a unique marker to mark each such message not matching any other such message in the set as a unique message;
an exact duplicate marker to mark each such message matching at least one other such message in the set as an exact duplicate message; and
a near duplicate marker to mark each such message comprising a subset of at least one other such message in the set as a near duplicate message.
Dependent claims: 2, 3

4. A method for identifying messages in a message store, comprising:
encoding at least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store by generating a metadata sequence and a content sequence for each message;
grouping the messages into sets by similar metadata sequences and similar content sequences and comparing the messages in each set, comprising:
marking each such message not matching any other such message in the set as a unique message;
marking each such message matching at least one other such message in the set as an exact duplicate message; and
marking each such message comprising a subset of at least one other such message in the set as a near duplicate message.
Dependent claims: 5, 6, 7
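
The three marking steps of claim 4 can be sketched for a single set as follows. The sketch assumes that "matching" means equal text after whitespace normalization and that "comprising a subset" means the text appears inside another message's text. Both readings, and the name `classify_set`, are illustrative assumptions, not definitions taken from the claim.

```python
def classify_set(bodies):
    """Label each body in one set as unique, exact duplicate, or near duplicate."""
    norm = [" ".join(b.split()) for b in bodies]
    labels = []
    for i, text in enumerate(norm):
        others = norm[:i] + norm[i + 1:]
        if text in others:
            # Equal to at least one other message in the set.
            labels.append("exact duplicate")
        elif any(text in other for other in others):
            # Wholly contained in a longer message, e.g. quoted in a reply.
            labels.append("near duplicate")
        else:
            labels.append("unique")
    return labels

labels = classify_set([
    "Lunch at noon?",
    "Sure.\n> Lunch at noon?",
    "Budget is due Friday.",
    "Budget is due Friday.",
])
# labels == ["near duplicate", "unique", "exact duplicate", "exact duplicate"]
```

The pairwise comparison is quadratic in the size of the set, which is tolerable here only because the earlier grouping step is assumed to keep each set small.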

8. A system for processing a message store for near duplicate messages, comprising:
a deduper to identify near duplicate messages in a message store, comprising:
a comparer to compare digests taken of metadata for and of content contained in each of the messages in the message store; and
a marker to mark each such message having digests not matching the digests of any other such message as unique; and
to mark each such message having digests matching the digests of at least one other such message as an exact duplicate; and
a classifier to group those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages, wherein the marker is further comprised to designate a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate.
Dependent claims: 9, 10, 11

12. A method for processing a message store for near duplicate messages, comprising:
identifying near duplicate messages in a message store, comprising:
comparing digests taken of metadata for and of content contained in each of the messages in the message store;
marking each such message having digests not matching the digests of any other such message as unique; and
marking each such message having digests matching the digests of at least one other such message as an exact duplicate;
grouping those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages; and
designating a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate.
Dependent claims: 13, 14, 15, 16
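
For the last two steps of claim 12 (grouping the unmarked messages and designating the first of each set), one possible sketch is shown below. It assumes character-level similarity from Python's `difflib.SequenceMatcher` with an arbitrary 0.8 threshold as the test for "similar content", and a greedy single pass over the messages. The claim does not specify a similarity measure, so these are stand-ins, and `mark_near_duplicates` is a hypothetical name.

```python
from difflib import SequenceMatcher

def mark_near_duplicates(bodies, threshold=0.8):
    """Greedily group similar bodies. The first member of each set is kept as
    'unique'; every later member of that set is marked 'near duplicate'."""
    representatives = []  # first message seen in each set
    labels = []
    for body in bodies:
        similar = any(
            SequenceMatcher(None, rep, body).ratio() >= threshold
            for rep in representatives
        )
        if similar:
            labels.append("near duplicate")
        else:
            representatives.append(body)
            labels.append("unique")
    return labels

labels = mark_near_duplicates([
    "Please review the attached contract by Friday.",
    "Please review the attached contract by Monday.",
    "Team lunch is moved to Thursday.",
])
# labels == ["unique", "near duplicate", "unique"]
```

Because the pass is greedy, which message ends up as the "first" of a set depends on input order; a real system would presumably fix that order by some criterion such as send date.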

17. A system for classifying messages, comprising:
a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
a message deduper to identify those messages having both unique metadata signatures and content signatures as unique messages; and
to identify those messages having both matching metadata signatures and content signatures as exact duplicate messages; and
a processor to process any of the messages remaining unidentified, comprising:
a comparer to compare the contents of those remaining messages having matching metadata signatures; and
a marker to designate each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages.
Dependent claims: 18, 19

20. A method for classifying messages, comprising:
forming a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
identifying those messages having both unique metadata signatures and content signatures as unique messages;
identifying those messages having both matching metadata signatures and content signatures as exact duplicate messages; and
processing any of the messages remaining unidentified, comprising:
comparing the contents of those remaining messages having matching metadata signatures; and
designating each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages.
Dependent claims: 21, 22, 23
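
The two identification steps of claim 20 can be sketched over precomputed signature pairs. The sketch assumes each message is represented by a hashable `(metadata_signature, content_signature)` tuple and leaves the later content comparison out; the label `"unidentified"` and the function name `triage` are hypothetical.

```python
from collections import Counter

def triage(signatures):
    """First pass over (metadata_signature, content_signature) pairs."""
    meta_counts = Counter(m for m, _ in signatures)
    content_counts = Counter(c for _, c in signatures)
    pair_counts = Counter(signatures)
    labels = []
    for meta, content in signatures:
        if meta_counts[meta] == 1 and content_counts[content] == 1:
            # Neither signature occurs on any other message.
            labels.append("unique")
        elif pair_counts[(meta, content)] > 1:
            # Both signatures match those of at least one other message.
            labels.append("exact duplicate")
        else:
            # Left for the content comparison step.
            labels.append("unidentified")
    return labels

labels = triage([("m1", "c1"), ("m1", "c1"), ("m2", "c2"), ("m3", "c3"), ("m3", "c4")])
# labels == ["exact duplicate", "exact duplicate", "unique", "unidentified", "unidentified"]
```

In this example the last two messages share a metadata signature but not a content signature, so they are the ones that would go on to the content comparison.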

24. A set of stored functions for processing a plurality of messages in a message store, comprising:
a metadata digester function to generate a metadata signature over at least part of message metadata;
a content digester function to generate a content signature over at least part of message content;
a comparator function to compare the metadata signatures and the content signatures to identify matches between messages;
a categorizer function to group messages into sets based upon the matches of the metadata signatures and the content signatures; and
a marker function to mark messages in each set with classifications corresponding to unique, exact duplicate, and near duplicate.
Dependent claims: 25

26. Stored messages, comprising:
a structured message store comprising messages, which each comprise:
metadata describing at least part of the message; and
content contained in the message;
discrimination information for each of the messages, comprising:
a metadata signature for the metadata of the message; and
a content signature for the content of each message;
wherein those messages having metadata signatures matching for only at least part of the metadata and content signatures matching for only at least part of the content of the other messages comprise near duplicate messages.
Dependent claims: 27, 28, 29
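
One possible in-memory shape for the stored messages of claim 26 is sketched below. It reads "matching for only at least part" as sharing some, but not all, signatures, and it assumes one signature per metadata field and one per non-empty content line. That granularity, the SHA-256 digest, and the names `StoredMessage` and `partially_matches` are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field

def _sig(text):
    """Signature of one part (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class StoredMessage:
    metadata: dict
    content: str
    # Discrimination information, derived from the fields above.
    metadata_signatures: frozenset = field(init=False)
    content_signatures: frozenset = field(init=False)

    def __post_init__(self):
        self.metadata_signatures = frozenset(
            _sig(f"{key}={value}") for key, value in self.metadata.items())
        self.content_signatures = frozenset(
            _sig(line.strip()) for line in self.content.splitlines() if line.strip())

def partially_matches(a, b):
    """True when a and b share some, but not all, signatures of each kind."""
    def partial(x, y):
        return bool(x & y) and x != y
    return (partial(a.metadata_signatures, b.metadata_signatures)
            and partial(a.content_signatures, b.content_signatures))

original = StoredMessage({"from": "ann@example.com", "subject": "Q3 report"},
                         "Numbers attached.\nThanks, Ann")
followup = StoredMessage({"from": "ann@example.com", "subject": "RE: Q3 report"},
                         "Numbers attached.\nThanks, Ann\nSee also the appendix.")
unrelated = StoredMessage({"from": "bob@example.com", "subject": "Lunch"}, "Noon?")
```

Here `partially_matches(original, followup)` holds because the two messages share the sender field and two content lines while differing elsewhere, whereas `original` and `unrelated` share nothing.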
Specification