System and method for processing a message store for near duplicate messages

US 7,836,054 B2
Filed: 08/17/2009
Issued: 11/16/2010
Est. Priority Date: 03/19/2001
Status: Expired due to Term

Abstract

A system and method for processing a message store for near duplicate messages are provided. Metadata, content, and each attachment associated with messages are extracted. Near duplicate messages in the message store are identified. Compound digests taken of the metadata for, of the content contained in, and of each attachment associated with each of the messages in the message store are compared. Each message having a compound digest not matching the compound digest of any other message is marked as unique and each message having a compound digest matching the compound digest of at least one other message is marked as an exact duplicate. Messages remaining unmarked and having similar content are grouped into sets that each include one or more near duplicate messages. One of the near duplicate messages is designated as unique and each remaining near duplicate message in the set is designated as a near duplicate.
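
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It assumes each message's compound digest has already been computed as a string. The function name `mark_by_compound_digest`, the input shape, and the labels `"unique"` and `"exact_duplicate"` are hypothetical and do not come from the patent.

```python
from collections import Counter


def mark_by_compound_digest(digests):
    """Label each message by whether its compound digest recurs.

    `digests` maps a message id to its compound digest string
    (a hypothetical input shape chosen for illustration).
    """
    counts = Counter(digests.values())
    return {
        msg_id: "unique" if counts[digest] == 1 else "exact_duplicate"
        for msg_id, digest in digests.items()
    }


# m1 and m3 share a digest, so both are exact duplicates; m2 stands alone.
marks = mark_by_compound_digest({"m1": "aa11", "m2": "bb22", "m3": "aa11"})
print(marks)
```

Counting digests once and then labeling keeps the pass linear in the number of messages, rather than comparing every pair.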

20 Claims

1. A system for processing a message store for near duplicate messages, comprising:
- a deduper module configured to identify near duplicate messages in a plurality of messages in a message store, comprising;
  
  a message digester module configured to generate a message sequence taken from metadata for, and of content contained in, each of the messages and to generate an attachment sequence for at least part of at least one attachment associated with one or more of the messages;
  
  a concatenator module configured to concatenate the message sequence and the attachment sequence into a compound digest for each message;
  
  a comparer module configured to compare the message sequences and the compound digests of the messages in the message store; and
  
  a message sequence marker module configured to mark each such message having a message sequence not matching the message sequence of any other such message as unique; and
  
  mark each such message having a message sequence matching the message sequence of at least one other such message as an exact duplicate; and
  
  a classifier module configured to group those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages, wherein the marker is further configured to designate a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate;
  
  a compound sequence maker module to mark each exact duplicate message and each near duplicate message having a compound digest not matching any other compound digest as a unique message; and
  
  a processor to execute each of the modules, which are stored on a computer-readable storage medium.
- Dependent claims (2, 3, 4, 5, 6):
- - 2. A system according to claim 1, wherein the message sequences and the attachment sequences comprise substantially unique hash codes.
  - 3. A system according to claim 1, further comprising:
    - a user interface to present the near duplicate messages for user review and selection.
  - 4. A system according to claim 1, wherein one or more of the messages are selected from the group comprising reply messages and forwarded messages.
  - 5. A system according to claim 1, wherein the metadata comprises one or more fields of a header comprised as part of each message.
  - 6. A system according to claim 1, wherein the compound digest is calculated using a one-way function and comprises alphanumeric, numeric, and alphabetic character strings.
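
The digester and concatenator modules of claim 1, together with the hash codes of claim 2 and the one-way function of claim 6, might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: SHA-256 stands in for the unspecified one-way function, header fields are hashed in sorted order, attachments are hashed in the order given, and all function names are hypothetical.

```python
import hashlib


def message_sequence(metadata, content):
    """Hash selected header fields plus the body (SHA-256 is an assumption)."""
    h = hashlib.sha256()
    for key in sorted(metadata):  # fixed field order keeps the hash stable
        h.update(f"{key}:{metadata[key]}\n".encode("utf-8"))
    h.update(content.encode("utf-8"))
    return h.hexdigest()


def attachment_sequence(data):
    """Hash the raw bytes of one attachment."""
    return hashlib.sha256(data).hexdigest()


def compound_digest(metadata, content, attachments):
    """Concatenate the message sequence with every attachment sequence."""
    parts = [message_sequence(metadata, content)]
    parts.extend(attachment_sequence(a) for a in attachments)
    return "".join(parts)


meta = {"from": "a@example.com", "subject": "Report"}
plain = compound_digest(meta, "See attached.", [])
with_file = compound_digest(meta, "See attached.", [b"%PDF-1.4"])
```

In this sketch two messages with identical headers and body share a message sequence, while their compound digests differ as soon as one of them carries a different attachment.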

7. A computer-implemented method for processing a message store for near duplicate messages, comprising:
- maintaining a plurality of messages in a message store;
  
  retrieving the messages from the message store and extracting metadata for, content contained in, and each attachment associated with one or more of the messages in the message store into a memory;
  
  generating a message sequence taken from the metadata for, and the content contained in, each of the messages and generating an attachment sequence for at least part of each attachment associated with the one or more of the messages;
  
  concatenating the message sequence and the attachment sequence into a compound digest for each message;
  
  identifying near duplicate messages in the message store, comprising;
  
  comparing the message sequences and the compound digests of the messages in the message store;
  
  marking each such message having a message sequence not matching the message sequence of any other such message as unique; and
  
  marking each such message having a message sequence matching the message sequence of at least one other such message as an exact duplicate;
  
  grouping those messages remaining unmarked and having similar content into sets that each comprise one or more near duplicate messages;
  
  designating a first of the near duplicate messages in each of the sets as unique and each remaining near duplicate message in the set as a near duplicate;
  
  marking each exact duplicate message and each near duplicate message having a compound digest not matching any other compound digest as a unique message; and
  
  outputting the unique, exact duplicate, and near duplicate message markings to the message store.
- Dependent claims (8, 9, 10, 11, 12, 13):
- - 8. A computer-implemented method according to claim 7, further comprising:
    - wherein the message sequences and the attachment sequences comprise substantially unique hash codes.
  - 9. A computer-implemented method according to claim 7, further comprising:
    - presenting the near duplicate messages for user review and selection.
  - 10. A computer-implemented method according to claim 7, wherein one or more of the messages are selected from the group comprising reply messages and forwarded messages.
  - 11. A computer-implemented method according to claim 7, wherein the metadata comprises one or more fields of a header comprised as part of each message.
  - 12. A computer-implemented method according to claim 7, wherein the compound digest is calculated using a one-way function and comprises alphanumeric, numeric, and alphabetic character strings.
  - 13. A computer-readable storage medium holding code for performing the process according to claim 7.
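
The grouping and designating steps of claim 7 can be sketched as below. The claim does not say how "similar content" is measured, so this example assumes a similarity ratio from Python's `difflib.SequenceMatcher` with an arbitrary threshold of 0.9. The function name, the threshold, and the greedy compare-to-first-member strategy are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def group_near_duplicates(messages, threshold=0.9):
    """Greedily group messages whose bodies are similar.

    `messages` is a list of (message_id, body) pairs. The first member of
    each group is marked "unique"; later members are marked "near_duplicate".
    """
    groups = []  # each group is a list of (message_id, body); index 0 is the representative
    marks = {}
    for msg_id, body in messages:
        for group in groups:
            rep_body = group[0][1]
            if SequenceMatcher(None, rep_body, body).ratio() >= threshold:
                group.append((msg_id, body))
                marks[msg_id] = "near_duplicate"
                break
        else:
            # No existing group is similar enough, so this message starts a new set.
            groups.append([(msg_id, body)])
            marks[msg_id] = "unique"
    return marks


near_marks = group_near_duplicates([
    ("m1", "Please review the attached quarterly report by Friday."),
    ("m2", "Please review the attached quarterly report by Friday!"),
    ("m3", "Lunch at noon?"),
])
print(near_marks)
```

Comparing each message only against the first member of a set mirrors the claim's designation of "a first of the near duplicate messages" as unique, at the cost of making the result depend on message order.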

14. A system for classifying messages, comprising:
- a metadata signature for metadata describing and a content signature signifying each of a plurality of messages in a message store;
  
  a message deduper module configured to identify those messages having both unique metadata signatures and content signatures as unique messages; and
  
  to identify those messages having both matching metadata signatures and content signatures as exact duplicate messages;
  
  a message processor module configured to process any of the messages remaining unidentified, comprising;
  
  a comparer module configured to compare the contents of those remaining messages having matching metadata signatures; and
  
  a marker module configured to designate each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages;
  
  an attachment signature for each attachment comprised with one or more of the messages;
  
  a compound signature comprising a concatenation of the metadata signature, content signature, and the attachment signature for each attachment comprised with each message, wherein the deduper module is further configured to identify those exact duplicate messages and those near duplicate messages having unique compound signatures as unique messages; and
  
  a processor to execute each of the modules, which are stored on a computer-readable storage medium.
- Dependent claims (15, 16):
- - 15. A system according to claim 14, further comprising:
    - a digester module configured to generate each of the metadata signatures, content signatures, and attachment signatures as hash codes.
  - 16. A system according to claim 14, further comprising:
    - a shadow store to store the non-unique messages.
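
The first pass of the message deduper module in claim 14 can be read as a three-way split on the metadata and content signatures. The sketch below is one possible reading, not the patent's stated logic: a message is treated as an exact duplicate when another message shares both signatures, as unique when neither of its signatures occurs anywhere else, and as unidentified otherwise, leaving it for the message processor module. All names and labels are hypothetical.

```python
from collections import Counter


def classify_by_signatures(signatures):
    """First-pass classification from (metadata_sig, content_sig) pairs.

    `signatures` maps message id -> (metadata_sig, content_sig).
    Returns "unique", "exact_duplicate", or "unidentified" per message.
    """
    meta_counts = Counter(m for m, _ in signatures.values())
    content_counts = Counter(c for _, c in signatures.values())
    pair_counts = Counter(signatures.values())
    result = {}
    for msg_id, (meta_sig, content_sig) in signatures.items():
        if pair_counts[(meta_sig, content_sig)] > 1:
            result[msg_id] = "exact_duplicate"
        elif meta_counts[meta_sig] == 1 and content_counts[content_sig] == 1:
            result[msg_id] = "unique"
        else:
            # One signature matches another message but not both.
            result[msg_id] = "unidentified"
    return result


first_pass = classify_by_signatures({
    "m1": ("meta-A", "body-1"),
    "m2": ("meta-A", "body-1"),
    "m3": ("meta-B", "body-2"),
    "m4": ("meta-C", "body-3"),
    "m5": ("meta-C", "body-4"),
})
print(first_pass)
```

Here `m4` and `m5` share a metadata signature but differ in content, so they are the messages whose contents the comparer module would go on to examine.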

17. A computer-implemented method for classifying messages, comprising:
- maintaining a plurality of messages in a message store;
  
  retrieving the messages from the message store into a memory;
  
  forming a metadata signature for the metadata describing and the content signature signifying each of the plurality of messages in the message store;
  
  identifying those messages having both unique metadata signatures and content signatures as unique messages;
  
  identifying those messages having both matching metadata signatures and content signatures as exact duplicate messages;
  
  processing any of the messages remaining unidentified, comprising;
  
  comparing the contents of those remaining messages having matching metadata signatures; and
  
  designating each of the matching messages having unique content as unique messages and any remaining messages as near duplicate messages;
  
  forming an attachment signature for each attachment comprised with one or more of the messages;
  
  forming a compound signature comprising a concatenation of the metadata signature, content signature, and the attachment signature for each attachment comprised with each message;
  
  identifying those exact duplicate messages and those near duplicate messages having unique compound signatures as unique messages; and
  
  outputting the unique, exact duplicate, and near duplicate message identifications to the message store.
- Dependent claims (18, 19, 20):
- - 18. A computer-implemented method according to claim 17, further comprising:
    - generating each of the metadata signatures, content signatures, and attachment signatures as hash codes.
  - 19. A computer-implemented method according to claim 17, further comprising:
    - maintaining a shadow store to store the non-unique messages.
  - 20. A computer-readable storage medium holding code for performing the process according to claim 17.
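
The final identification step of claim 17, in which exact and near duplicates with unique compound signatures are re-identified as unique, can be sketched as follows. It assumes the earlier marks and the compound signatures are already available as dictionaries. The function name, labels, and sample signature strings are hypothetical.

```python
from collections import Counter


def promote_by_compound_signature(marks, compound_sigs):
    """Re-mark duplicates whose compound signature is one of a kind.

    `marks` maps message id -> "unique" | "exact_duplicate" | "near_duplicate".
    `compound_sigs` maps message id -> compound signature string.
    """
    counts = Counter(compound_sigs.values())
    updated = dict(marks)
    for msg_id, mark in marks.items():
        if mark != "unique" and counts[compound_sigs[msg_id]] == 1:
            updated[msg_id] = "unique"
    return updated


# m3 has the same headers and body as m1 and m2 but carries an extra attachment.
before = {"m1": "exact_duplicate", "m2": "exact_duplicate", "m3": "exact_duplicate"}
sigs = {"m1": "hdr1body1", "m2": "hdr1body1", "m3": "hdr1body1att9"}
after = promote_by_compound_signature(before, sigs)
print(after)
```

The effect in this sketch is that a message is kept as unique when its attachments set it apart, even though its headers and body duplicate another message.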

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; McDonald, David T.
Primary Examiner(s): Fleurantin, Jean B.
Assistant Examiner(s): Nguyen, Phong H
Application Number: US12/542,581
Publication Number: US 20090307630A1
Time in Patent Office: 456 Days
Field of Search: 707/7, 707/102, 707/103.R, 707/104.1, 707/626, 707/736, 707/737, 707/822
US Class Current: 707/737
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 16/1748   De-duplication implemented ...

G06F 16/244   Grouping and aggregation

G06F 16/245   Query processing

G06F 16/26   Visual data mining; Browsin...

G06F 16/285   Clustering or classification

G06F 16/93   Document management systems

G06Q 10/107   Computer-aided management o...

H04L 51/08   Annexed information, e.g. a...

H04L 51/216   Handling conversation histo...

H04L 51/42   Mailbox-related aspects, e....

Y10S 707/99937   Sorting

Y10S 707/99943   Generating database or data...

Y10S 707/99944   Object-oriented database st...

Y10S 707/99945   Object-oriented database st...

Y10S 707/99948   Application of database or ...
