Computer-implemented system and method for identifying near duplicate messages
First Claim
1. A computer-implemented system for identifying near duplicate messages, comprising:
a processor coupled to a memory to execute the following modules comprising:
a message grouping module to group by conversation thread, messages each comprising a content body, wherein one or more of the messages also includes an attachment;
a message sorting module to sort the messages for each conversation thread in order of message length;
a message selection module to select for one of the threads at least one of the messages and to compare the body of the selected message with the body of one such shorter message in that thread;
a determination module to determine that the body of the shorter message is included in the body of the selected message;
a message relationship module to determine a relationship between the selected message and the shorter message by marking the shorter message as a near duplicate of the selected message if the selected message and the shorter message do not have attachments and by comparing hash codes of the attachments for the selected message and the shorter message, if the selected message and the shorter message each have attachments, and marking the shorter message as a near duplicate message of the selected message when the hash codes of the attachments match.
Abstract
A computer-implemented system and method for identifying near duplicate messages is provided. Messages each including a content body are grouped by conversation thread. One or more of the messages also includes an attachment. The messages for each conversation thread are sorted in order of message length. At least one of the messages is selected from one of the threads and the body of the selected message is compared with the body of one such shorter message in that thread. A determination is made that the body of the shorter message is included in the body of the selected message. Hash codes of the attachments for the selected message and the shorter message are compared. The shorter message is marked as a near duplicate message of the selected message when the hash codes of the attachments match.
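The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The `Message` class, the function names, the use of SHA-256 as the hash code, and the use of a plain substring test for "included in" are all hypothetical choices made for the example. The abstract does not say what happens when only one of the two messages has an attachment, so the sketch leaves such pairs unmarked.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Message:
    # Hypothetical message record; field names are assumptions for this sketch.
    msg_id: str
    thread_id: str
    body: str
    attachments: list = field(default_factory=list)  # raw attachment bytes


def attachment_hashes(msg):
    # Assumption: SHA-256 stands in for the unspecified "hash codes".
    # Sorting makes the comparison independent of attachment order.
    return sorted(hashlib.sha256(a).hexdigest() for a in msg.attachments)


def find_near_duplicates(messages):
    """Return a dict mapping each near-duplicate message id to the id of
    the longer message that contains it."""
    # Group messages by conversation thread.
    threads = defaultdict(list)
    for m in messages:
        threads[m.thread_id].append(m)

    near_dups = {}
    for thread in threads.values():
        # Sort each thread by body length, longest first.
        ordered = sorted(thread, key=lambda m: len(m.body), reverse=True)
        for i, selected in enumerate(ordered):
            for shorter in ordered[i + 1:]:
                if shorter.msg_id in near_dups:
                    continue  # already marked against a longer message
                if len(shorter.body) >= len(selected.body):
                    continue  # only strictly shorter messages are compared
                if shorter.body not in selected.body:
                    continue  # body is not included in the selected body
                if not selected.attachments and not shorter.attachments:
                    # Neither message has attachments: mark directly.
                    near_dups[shorter.msg_id] = selected.msg_id
                elif selected.attachments and shorter.attachments:
                    # Both have attachments: mark only when hashes match.
                    if attachment_hashes(selected) == attachment_hashes(shorter):
                        near_dups[shorter.msg_id] = selected.msg_id
    return near_dups
```

Comparing attachment hashes rather than raw bytes keeps the comparison cheap for large attachments, at the cost of relying on the chosen hash function being collision resistant.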
18 Claims
10. A method implemented by a computer comprising at least one processor for identifying near duplicate messages, comprising:
grouping by conversation thread, messages each comprising a content body, wherein one or more of the messages also includes an attachment;
sorting the messages for each conversation thread in order of message length;
for one of the threads, selecting at least one of the messages and comparing the body of the selected message with the body of one such shorter message in that thread;
determining that the body of the shorter message is included in the body of the selected message;
determining a relationship between the selected message and the shorter message comprising at least one of:
if the selected message and the shorter message do not have attachments, marking the shorter message as a near duplicate of the selected message; and
if the selected message and the shorter message each have attachments, comparing hash codes of the attachments for the selected message and the shorter message and marking the shorter message as a near duplicate message of the selected message when the hash codes of the attachments match.
Specification