System and method for identifying and categorizing messages extracted from archived message stores

US 20060190493A1
Filed: 04/24/2006
Published: 08/24/2006
Est. Priority Date: 03/19/2001
Status: Active Grant

Abstract

A system and method for identifying messages in a message store are provided. At least part of the metadata associated with, and at least part of the content contained in, each of a plurality of messages in a message store are encoded by generating a metadata sequence and a content sequence for each message. The messages are grouped into sets by similar metadata sequences and similar content sequences. The messages in each set are compared. Each such message not matching any other such message in the set is marked as a unique message. Each such message matching at least one other such message in the set is marked as an exact duplicate message. Each such message including a subset of at least one other such message in the set is marked as a near duplicate message.
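
The three-way marking described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `mark_set` is hypothetical, the input is assumed to be the plain-text bodies of messages already grouped into one set, "matching" is assumed to mean identical text, and "subset" is assumed to mean that one body appears verbatim inside another.

```python
def mark_set(contents):
    """Label each message body in one set (hypothetical helper, not from the patent)."""
    labels = []
    for i, body in enumerate(contents):
        others = contents[:i] + contents[i + 1:]
        if body in others:
            # List membership: an identical body exists elsewhere in the set.
            labels.append("exact duplicate")
        elif any(body in other for other in others):
            # Substring test: this body is wholly contained in another body.
            labels.append("near duplicate")
        else:
            labels.append("unique")
    return labels


example = [
    "Lunch at noon?",
    "Sure.\n> Lunch at noon?",   # reply quoting the first message
    "Budget attached.",
    "Budget attached.",
]
print(mark_set(example))
```

Under these assumptions the original note is a near duplicate of the reply that quotes it, the reply is unique, and the two identical messages are exact duplicates.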

29 Claims

1. A system for identifying messages in a message store, comprising:
- a digester to encode at least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store by generating a metadata sequence and a content sequence for each message; and
  
  a comparer to group the messages into sets by similar metadata sequences and similar content sequences and to compare the messages in each set, comprising;
  
  a unique marker to mark each such message not matching any other such message in the set as a unique message;
  
  an exact duplicate marker to mark each such message matching at least one other such message in the set as an exact duplicate message; and
  
  a near duplicate marker to mark each such message comprising a subset of at least one other such message in the set as a near duplicate message.
- View Dependent Claims (2, 3)
- - 2. A system according to claim 1, further comprising:
    - an attachment digester to encode at least part of an attachment associated with one or more of the messages by generating an attachment sequence for each attachment;
      
      an attachment comparer to compare the attachment sequences for the messages; and
      
      an attachment marker to mark each exact duplicate message and each near duplicate message having an attachment sequence not matching any other attachment sequence in the set as a unique message.
  - 3. A system according to claim 1, wherein the metadata sequences and the content sequences comprise hash code digests respectively formed over the metadata associated with and the content contained in each message.
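
As a rough illustration of the digester in claims 1 to 3, the sketch below forms hash code digests over part of the metadata, over the content, and over each attachment. The choice of SHA-256, the header fields in `METADATA_FIELDS`, the normalization (trimming and lower-casing), and the names `digest_message` and `_digest` are all assumptions made for this example; the claims only require "at least part of" the metadata and content.

```python
import hashlib

# Hypothetical selection of header fields; the claims do not name specific fields.
METADATA_FIELDS = ("from", "to", "subject")


def _digest(data: bytes) -> str:
    """Return a hex hash code for the given bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def digest_message(headers: dict, body: str, attachments=()):
    """Build metadata, content, and attachment sequences for one message."""
    # Normalize the chosen header fields so trivial case/whitespace
    # differences do not change the metadata sequence.
    meta = "\n".join(
        f"{name}:{headers.get(name, '').strip().lower()}"
        for name in METADATA_FIELDS
    )
    return {
        "metadata_sequence": _digest(meta.encode("utf-8")),
        "content_sequence": _digest(body.encode("utf-8")),
        # One sequence per attachment, taken over the raw attachment bytes.
        "attachment_sequences": [_digest(blob) for blob in attachments],
    }
```

Two messages with the same normalized headers and body would share metadata and content sequences here, while differing attachment bytes would yield differing attachment sequences, which is the condition claim 2 uses to treat a message as unique.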

4. A method for identifying messages in a message store, comprising:
- encoding at least part of metadata associated with and at least part of content contained in each of a plurality of messages in a message store by generating a metadata sequence and a content sequence for each message;
  
  grouping the messages into sets by similar metadata sequences and similar content sequences and comparing the messages in each set, comprising;
  
  marking each such message not matching any other such message in the set as a unique message;
  
  marking each such message matching at least one other such message in the set as an exact duplicate message; and
  
  marking each such message comprising a subset of at least one other such message in the set as a near duplicate message.
- View Dependent Claims (5, 6, 7)
- - 5. A method according to claim 4, further comprising:
    - encoding at least part of an attachment associated with one or more of the messages by generating an attachment sequence for each attachment;
      
      comparing the attachment sequences for the messages; and
      
      marking each exact duplicate message and each near duplicate message having an attachment sequence not matching any other attachment sequence in the set as a unique message.
  - 6. A method according to claim 4, wherein the metadata sequences and the content sequences comprise hash code digests respectively formed over the metadata associated with and the content contained in each message.
  - 7. A computer-readable storage medium holding code for performing the method according to claim 4.

8. A system for processing a message store for near duplicate messages, comprising:
- a deduper to identify near duplicate messages in a message store, comprising;
  
  a comparer to compare digests taken of metadata for and of content contained in each of the messages in the message store; and
  
  a marker to mark each such message having digests not matching the digests of any other such message as unique; and
  
  mark each such message having digests matching the digests of at least one other such message as an exact duplicate; and
  
  a classifier to group those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages, wherein the marker is further comprised to designate a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate.
- View Dependent Claims (9, 10, 11)
- - 9. A system according to claim 8, further comprising:
    - a user interface to present the near duplicate messages for user review and selection.
  - 10. A system according to claim 8, wherein one or more of the messages are selected from the group comprising reply messages and forwarded messages.
  - 11. A system according to claim 8, wherein the metadata comprises one or more fields of a header comprised as part of each message.

12. A method for processing a message store for near duplicate messages, comprising:
- identifying near duplicate messages in a message store, comprising;
  
  comparing digests taken of metadata for and of content contained in each of the messages in the message store;
  
  marking each such message having digests not matching the digests of any other such message as unique; and
  
  marking each such message having digests matching the digests of at least one other such message as an exact duplicate;
  
  grouping those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages; and
  
  designating a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate.
- View Dependent Claims (13, 14, 15, 16)
- - 13. A method according to claim 12, further comprising:
    - presenting the near duplicate messages for user review and selection.
  - 14. A method according to claim 12, wherein one or more of the messages are selected from the group comprising reply messages and forwarded messages.
  - 15. A method according to claim 12, wherein the metadata comprises one or more fields of a header comprised as part of each message.
  - 16. A computer-readable storage medium holding code for performing the process according to claim 12.
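
The last two steps of claim 12 (grouping the unmarked messages by similar content, then designating a first message in each set as unique) might be sketched as follows. This is only one possible reading. It assumes that "similar content" means one body is contained verbatim in another, as happens with replies and forwards, and that the "first" message of a set is the longest one. The function name `designate_near_duplicates` and the `(message_id, content)` input shape are hypothetical.

```python
def designate_near_duplicates(remaining):
    """remaining: list of (message_id, content) pairs left unmarked so far."""
    # Longest content first, so each set is led by its most inclusive message
    # (choosing the longest message as the "first" is an assumption).
    ordered = sorted(remaining, key=lambda pair: len(pair[1]), reverse=True)

    sets = []  # each set is a list of (message_id, content), leader first
    for msg_id, content in ordered:
        for group in sets:
            if content in group[0][1]:  # contained in the leader's content
                group.append((msg_id, content))
                break
        else:
            sets.append([(msg_id, content)])  # no leader contains it: new set

    labels = {}
    for group in sets:
        labels[group[0][0]] = "unique"
        for msg_id, _ in group[1:]:
            labels[msg_id] = "near duplicate"
    return labels


thread = [
    ("m1", "Status?"),
    ("m2", "Done.\n> Status?"),
    ("m3", "FYI\n> Done.\n> Status?"),
    ("m4", "Unrelated note"),
]
print(designate_near_duplicates(thread))
```

With this input, `m3` leads the thread and is designated unique, `m1` and `m2` are near duplicates of it, and `m4` forms a set of its own.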

17. A system for classifying messages, comprising:
- a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
  
  a message deduper to identify those messages having both unique metadata signatures and content signatures as unique messages; and
  
  to identify those messages having both matching metadata signatures and content signatures as exact duplicate messages; and
  
  a processor to process any of the messages remaining unidentified, comprising;
  
  a comparer to compare the contents of those remaining messages having matching metadata signatures; and
  
  a marker to designate each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages.
- View Dependent Claims (18, 19)
- - 18. A system according to claim 17, further comprising:
    - an attachment signature for each attachment comprised with one or more of the messages, wherein the deduper is further comprised to identify those exact duplicate messages and those near duplicate messages having unique attachment signatures as exact duplicate messages.
  - 19. A system according to claim 18, further comprising:
    - a digester to generate each of the metadata signatures, content signatures, and attachment signatures as hash codes respectively taken over at least part of the metadata, content, and attachment for the associated message.

20. A method for classifying messages, comprising:
- forming a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
  
  identifying those messages having both unique metadata signatures and content signatures as unique messages;
  
  identifying those messages having both matching metadata signatures and content signatures as exact duplicate messages; and
  
  processing any of the messages remaining unidentified, comprising;
  
  comparing the contents of those remaining messages having matching metadata signatures; and
  
  designating each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages.
- View Dependent Claims (21, 22, 23)
- - 21. A method according to claim 20, further comprising:
    - forming an attachment signature for each attachment comprised with one or more of the messages; and
      
      identifying those exact duplicate messages and those near duplicate messages having unique attachment signatures as exact duplicate messages.
  - 22. A method according to claim 21, further comprising:
    - generating each of the metadata signatures, content signatures, and attachment signatures as hash codes respectively taken over at least part of the metadata, content, and attachment for the associated message.
  - 23. A computer-readable storage medium holding code for performing the process according to claim 20.
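
A literal reading of the classification order in claim 20 can be sketched with signature counts. The sketch is an interpretation, not the patented implementation: it assumes signatures are precomputed strings, uses equality of content signatures as a stand-in for "comparing the contents", and labels any message not covered by the earlier steps as a near duplicate. The function name `classify` and the dictionary input are hypothetical.

```python
from collections import Counter


def classify(signatures):
    """signatures: dict of message_id -> (metadata_signature, content_signature)."""
    meta_counts = Counter(meta for meta, _ in signatures.values())
    content_counts = Counter(content for _, content in signatures.values())
    pair_counts = Counter(signatures.values())

    labels = {}
    for msg_id, (meta, content) in signatures.items():
        if meta_counts[meta] == 1 and content_counts[content] == 1:
            # Both signatures occur nowhere else.
            labels[msg_id] = "unique"
        elif pair_counts[(meta, content)] > 1:
            # Another message carries the same pair of signatures.
            labels[msg_id] = "exact duplicate"
        elif meta_counts[meta] > 1 and content_counts[content] == 1:
            # Metadata matches other messages, but the content is unique.
            labels[msg_id] = "unique"
        else:
            labels[msg_id] = "near duplicate"
    return labels


sigs = {
    "a": ("m1", "c1"),
    "b": ("m1", "c1"),
    "c": ("m1", "c2"),
    "d": ("m2", "c2"),
    "e": ("m3", "c3"),
    "f": ("m1", "c4"),
}
print(classify(sigs))
```

Here `a` and `b` are exact duplicates, `e` and `f` are unique, and `c` and `d` fall through to near duplicate because each shares one signature, but not both, with another message.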

24. A set of stored functions for processing a plurality of messages in a message store, comprising:
- a metadata digester function to generate a metadata signature over at least part of message metadata;
  
  a content digester function to generate a content signature over at least part of message content;
  
  a comparator function to compare the metadata signatures and the content signatures to identify matches between messages;
  
  a categorizer function to group messages into sets based upon the matches of the metadata signatures and the content signatures; and
  
  a marker function to mark messages in each set with classifications corresponding to unique, exact duplicate, and near duplicate.
- View Dependent Claims (25)
- - 25. A set of stored functions according to claim 24, further comprising:
    - an attachment digester function to generate an attachment signature over at least part of each message attachment, wherein the comparator is further comprised to compare the attachment signatures to identify matches between messages, and the categorizer is further comprised to group the messages into sets based upon the matches of the attachment signatures.

26. Stored messages, comprising:
- a structured message store comprising messages, which each comprise;
  
  metadata describing at least part of the message; and
  
  content contained in the message;
  
  discrimination information for each of the messages, comprising;
  
  a metadata signature for the metadata of the message; and
  
  a content signature for the content of each message;
  
  wherein those messages having metadata signatures matching for only at least part of the metadata and content signatures matching for only at least part of the content of the other messages comprise near duplicate messages.
- View Dependent Claims (27, 28, 29)
- - 27. Stored messages according to claim 26, wherein those messages having metadata signatures not matching for any of the metadata and content signatures not matching for any of the content of the other messages comprise unique messages.
  - 28. Stored messages according to claim 26, wherein those messages having metadata signatures matching for all of the metadata and content signatures matching for all of the content of the other messages comprise exact duplicate messages.
  - 29. Stored messages according to claim 26, wherein one or more of the messages further comprise:
    - one or more attachments associated with the message; and
      
      wherein the discrimination information for each of the messages with at least one such attachment further comprises;
      
      an attachment signature for each attachment of the message;
      
      wherein those messages having attachment signatures not matching the attachment signatures of the other messages comprise unique messages.

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; McDonald, David T.
Granted Patent: US 7,577,656 B2
US Class Current: 1/1
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 16/1748   De-duplication implemented ...

G06F 16/244   Grouping and aggregation

G06F 16/245   Query processing

G06F 16/26   Visual data mining; Browsin...

G06F 16/285   Clustering or classification

G06F 16/93   Document management systems

G06Q 10/107   Computer-aided management o...

H04L 51/08   Annexed information, e.g. a...

H04L 51/216   Handling conversation histo...

H04L 51/42   Mailbox-related aspects, e....

Y10S 707/99937   Sorting

Y10S 707/99943   Generating database or data...

Y10S 707/99944   Object-oriented database st...

Y10S 707/99945   Object-oriented database st...

Y10S 707/99948   Application of database or ...
