System and method for identifying unique and duplicate messages
Abstract
A system and method for identifying unique and duplicate messages is provided. Messages are maintained, and a header and message body are extracted from each of the messages. A hash code is calculated for each message over at least part of the header and the body of that message. The messages with matching hash codes are grouped. One message in each group with two or more messages is randomly selected as a unique message. The remaining messages in the group are marked as exact duplicate messages.
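The hashing, grouping, and random selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the choice of SHA-256, the `message_hash` and `mark_duplicates` names, and the `(message_id, header, body)` tuple layout are all assumptions, and `header` stands in for whichever part of the header is hashed. Groups that contain only one message are left unlabeled here, since the abstract only describes selection within groups of two or more.

```python
import hashlib
import random
from collections import defaultdict


def message_hash(header: str, body: str) -> str:
    """Hash code over the chosen header text plus the message body (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    digest.update(header.encode("utf-8"))
    digest.update(b"\x00")  # separator so the header/body boundary affects the hash
    digest.update(body.encode("utf-8"))
    return digest.hexdigest()


def mark_duplicates(messages, rng=random):
    """messages: iterable of (message_id, header, body) tuples (hypothetical layout).

    Returns {message_id: "unique" | "exact duplicate"} for every message that
    falls in a group of two or more messages with matching hash codes.
    """
    groups = defaultdict(list)
    for message_id, header, body in messages:
        groups[message_hash(header, body)].append(message_id)

    labels = {}
    for ids in groups.values():
        if len(ids) < 2:
            continue  # single-message groups are not labeled in this sketch
        keeper = rng.choice(ids)  # random selection of the unique message
        for message_id in ids:
            labels[message_id] = "unique" if message_id == keeper else "exact duplicate"
    return labels
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible for testing.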
23 Claims
1. A system for identifying unique and duplicate messages, comprising:
a database of messages;
an extractor module to extract a header and a message body from each message;
a parser module to calculate a hash code for each message over at least part of the header and the body of that message and to group the messages having matching hash codes;
a deduper module to randomly select one message in each group with two or more messages as a unique message and to mark the remaining messages in the group as exact duplicate messages;
an attachment parser module to calculate a hash code over at least a portion of an attachment to two or more of the messages; and
a concatenator module to generate a compound hash code for each of the two or more messages by concatenating the hash code for that message and the hash code for the attachment; and
a processor to execute the modules.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
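The compound hash code recited in claim 1 might look like the sketch below. The names `sha256_hex` and `compound_hash`, the use of SHA-256, and the hex-string representation are assumptions made for illustration; the claim only requires that the message hash code and the attachment hash code be concatenated.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hash code over the attachment bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def compound_hash(message_hash: str, attachment: bytes) -> str:
    """Concatenate the message hash code with the attachment hash code."""
    return message_hash + sha256_hex(attachment)
```

Two messages then share a compound hash code only if both their message hash codes and their attachment hash codes match.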
10. A method for identifying unique and duplicate messages, comprising the steps of:
maintaining messages;
extracting a header and a message body from each message;
calculating a hash code for each message over at least part of the header and the body of that message;
grouping the messages having matching hash codes;
randomly selecting one message in each group with two or more messages as a unique message; and
marking the remaining messages in the group as exact duplicate messages;
calculating a hash code over at least a portion of an attachment to two or more of the messages; and
concatenating the hash code for each of the two or more messages with the hash code for the attachment into a compound hash code,
wherein the steps are performed by a suitably-programmed computer.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18.
19. A method for identifying unique and duplicate messages, comprising the steps of:
maintaining messages;
extracting a header and a message body from each message;
calculating a hash code for each message over at least part of the header and the body of that message;
grouping the messages having matching hash codes;
randomly selecting one message in each group with two or more messages as a unique message; and
marking the remaining messages in the group as exact duplicate messages;
identifying the messages that are not unique and not exact duplicates and grouping the identified messages by conversation thread; and
ordering the identified messages in order of length of the message body,
wherein the steps are performed by a suitably-programmed computer.
Dependent claims: 20, 21, 22.
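The thread grouping and length ordering in claim 19 can be illustrated with a short sketch. The claim does not say how a conversation thread is identified or in which direction the messages are ordered, so the `thread` key, the dict layout, and the longest-first ordering below are all assumptions.

```python
from collections import defaultdict


def order_threads(messages):
    """messages: iterable of dicts with hypothetical keys 'id', 'thread', 'body'.

    Returns {thread: [messages sorted by body length, longest first]}.
    """
    threads = defaultdict(list)
    for message in messages:
        threads[message["thread"]].append(message)
    for thread_messages in threads.values():
        # Longest-first is an assumption; the claim only says "in order of length".
        thread_messages.sort(key=lambda m: len(m["body"]), reverse=True)
    return dict(threads)
```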
23. A method for identifying unique and duplicate messages, comprising the steps of:
maintaining messages;
extracting a header and a message body from each message;
calculating a hash code for each message over at least part of the header and the body of that message;
grouping the messages having matching hash codes;
randomly selecting one message in each group with two or more messages as a unique message; and
marking the remaining messages in the group as exact duplicate messages;
identifying the messages that are not unique and not exact duplicates and grouping the identified messages by conversation thread;
calculating a hash code over at least a portion of an attachment to one or more of the identified messages;
ordering the identified messages in order of length of the message body;
comparing a longest identified message with each of the other identified messages in the conversation thread;
determining that the message body of at least one of the other identified messages is contained within the message body of the longest identified message;
comparing the attachment hash codes for the longest identified message and the at least one other identified message; and
marking the at least one other identified message as a near duplicate of the longest message when the attachment hash codes match,
wherein the steps are performed by a suitably-programmed computer.
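The near-duplicate test in claim 23 can be sketched for a single conversation thread as follows. The function name, the dict keys, and the use of plain substring containment are assumptions; the sketch also assumes that a message without an attachment carries `None` as its attachment hash code, so two attachment-free messages are treated as matching.

```python
def mark_near_duplicates(thread_messages):
    """thread_messages: list of dicts with hypothetical keys
    'id', 'body', and 'attachment_hash' (None when there is no attachment).

    Returns the ids of messages marked as near duplicates of the longest message.
    """
    if not thread_messages:
        return []
    ordered = sorted(thread_messages, key=lambda m: len(m["body"]), reverse=True)
    longest, others = ordered[0], ordered[1:]
    return [
        m["id"]
        for m in others
        # body must be contained in the longest body AND attachment hashes must match
        if m["body"] in longest["body"]
        and m["attachment_hash"] == longest["attachment_hash"]
    ]
```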
Specification