Method and apparatus for normalizing quoting styles in electronic mail messages
First Claim
1. A system for normalizing quoting styles, comprising:
a storage comprising a plurality of messages, which are each partitioned into quoted and unquoted text, wherein the messages are structured as nodes of a hierarchical tree;
a node processor, comprising:
a message pairing module to identify a pair of the messages, which are parent and child messages;
a text removal module to remove any of the quoted text that comprises the parent message in full from the child message; and
a text addition module to add at least part of the parent message into the child message based on part of any of the quoted text in the child message being from and comprising an issue addressed by the parent message;
a tree traversal module to process each of the messages by traversing the hierarchical tree;
a vector module to obtain word vectors for each of the nodes corresponding to messages comprising at least one of quoted and unquoted text;
a node pairing module to identify pairs of the nodes, comprising one of a parent-child and sibling relationship;
a distance module to determine a lexical distance between each pair of the nodes; and
a clustering module to generate primary clusters from those pairs of the nodes having lexical distances that indicate closely-related messages.
Abstract
In the context of applications such as finding messages dealing with a particular topic, or finding inter-conversation topic groupings via centroid-based clustering methods, the essential text of a first message is adjusted to avoid vector distance distortions based on differences in quoting styles. Text is deleted from the first message if that text constitutes an entire prefixed or suffixed second message (typically a parent message), while selective quotes in the first message are included in the adjusted message because these are considered to form a logical part of the message. When the first text does not contain any quoting portions of the second text, an analysis is done to determine whether all or part of a second text constitutes a logical reference to the first message. If so, all or some parts of the essential text of the second (parent) message are included in the adjusted message.
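The adjustment described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes quoted lines carry a leading `>` prefix (a common e-mail convention that this text does not itself specify), and the names `partition`, `normalize`, and `QUOTE_PREFIX` are hypothetical. The sketch covers only the removal of a full parent quote and the retention of selective quotes; the analysis that decides when unquoted parent text should be added to the child is left out because the abstract does not detail it.

```python
import re

# Assumption: quoted lines start with one or more '>' characters.
QUOTE_PREFIX = re.compile(r"^\s*>+\s?")


def partition(body):
    """Split a message body into (quoted_lines, unquoted_lines)."""
    quoted, unquoted = [], []
    for line in body.splitlines():
        if QUOTE_PREFIX.match(line):
            quoted.append(QUOTE_PREFIX.sub("", line))
        else:
            unquoted.append(line)
    return quoted, unquoted


def _canon(lines):
    """Collapse whitespace so re-wrapped quotes still compare equal."""
    return " ".join(" ".join(lines).split())


def normalize(child_body, parent_body):
    """Drop a quote of the entire parent; keep selective quotes in place."""
    quoted, unquoted = partition(child_body)
    _, parent_own = partition(parent_body)
    if quoted and _canon(quoted) == _canon(parent_own):
        kept = unquoted  # the child quoted the parent in full
    else:
        # Selective quotes count as part of the message: strip only the prefix.
        kept = [QUOTE_PREFIX.sub("", line) for line in child_body.splitlines()]
    return "\n".join(line for line in kept if line.strip())


parent = "Should we ship on Friday?\nThe build is green."
full_quote = "> Should we ship on Friday?\n> The build is green.\n\nYes, ship it."
selective = "> Should we ship on Friday?\nYes, ship it."
print(normalize(full_quote, parent))
print(normalize(selective, parent))
```

With the full quote removed, the two replies differ only by the one selectively quoted line, which is the effect the abstract aims for before any vector distance is computed.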
13 Claims
7. A method for normalizing quoting styles, comprising:
partitioning each of a plurality of messages into quoted and unquoted text;
identifying a pair of the messages, which are parent and child messages;
removing any of the quoted text that comprises the parent message in full from the child message;
adding at least part of the parent message into the child message based on part of any of the quoted text in the child message being from and comprising an issue addressed by the parent message;
structuring the messages as nodes of a hierarchical tree;
processing each of the messages by traversing the hierarchical tree;
obtaining word vectors for each of the nodes corresponding to messages comprising at least one of quoted and unquoted text;
identifying pairs of the nodes, comprising one of a parent-child and sibling relationship;
determining a lexical distance between each pair of the nodes; and
generating primary clusters from those pairs of the nodes having lexical distances that indicate closely-related messages.
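The vector, pairing, distance, and clustering steps of claim 7 can be sketched roughly as follows. Several choices here are assumptions rather than anything the claim states: word vectors are plain word counts, the lexical distance is cosine distance, "closely-related" means a distance at or below a hypothetical `threshold` of 0.5, and close pairs are merged with a simple union-find. The function names and the child-to-parent mapping used to represent the tree are likewise illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def word_vector(text):
    """Bag-of-words count vector (an assumed vector representation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 when either vector is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def node_pairs(parent_of):
    """Yield parent-child pairs, then sibling pairs, from a child -> parent map."""
    children = {}
    for child, parent in parent_of.items():
        if parent is not None:
            yield parent, child
            children.setdefault(parent, []).append(child)
    for siblings in children.values():
        yield from combinations(siblings, 2)


def primary_clusters(texts, parent_of, threshold=0.5):
    """Merge nodes whose pairwise distance is at or below the threshold."""
    vectors = {node: word_vector(text) for node, text in texts.items()}
    leader = {node: node for node in texts}

    def find(node):
        while leader[node] != node:
            node = leader[node]
        return node

    for a, b in node_pairs(parent_of):
        if cosine_distance(vectors[a], vectors[b]) <= threshold:
            leader[find(a)] = find(b)

    groups = {}
    for node in texts:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


texts = {
    "root": "printer driver install fails",
    "r1": "printer driver install fails on windows",
    "r2": "lunch plans for friday",
}
parent_of = {"root": None, "r1": "root", "r2": "root"}
print(primary_clusters(texts, parent_of))
```

In this toy thread the first reply stays with the root message and the off-topic reply forms its own cluster. The texts passed in are assumed to be the already normalized messages, since an unremoved full quote would pull every reply toward its parent.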
Specification