UNSUPERVISED MESSAGE CLUSTERING
Abstract
Unsupervised clustering can be used for organization of micro-blog or other short length messages into message clusters. Messages can be compared with existing clusters to determine a similarity score. If at least one similarity score is greater than a threshold value, a message can be added to an existing message cluster. If a message is not similar to an existing cluster, the message can be compared against criteria for starting a new message cluster.
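The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the dictionary layout of a cluster, the default threshold of 2, and the `can_start_cluster` callback (standing in for the "criteria for starting a new message cluster") are all hypothetical and are not taken from the patent.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return Counter(text.lower().split())

def similarity(msg_vec, cluster_vec):
    # Stand-in similarity: sum of products of matching token counts.
    return sum(n * cluster_vec.get(tok, 0) for tok, n in msg_vec.items())

def assign(message, clusters, threshold=2, can_start_cluster=lambda m: True):
    """Add the message to a cluster and return that cluster's index,
    or return None if it matches nothing and may not start a cluster."""
    vec = tokenize(message)
    scores = [similarity(vec, c["tokens"]) for c in clusters]
    if scores and max(scores) > threshold:
        best = scores.index(max(scores))
        clusters[best]["messages"].append(message)
        clusters[best]["tokens"].update(vec)  # keep the cluster vector current
        return best
    if can_start_cluster(message):
        clusters.append({"messages": [message], "tokens": Counter(vec)})
        return len(clusters) - 1
    return None
```

Calling `assign` repeatedly on a stream of messages grows the `clusters` list in place; a message that neither matches an existing cluster nor satisfies the start criteria is simply left unclustered.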
20 Claims
1. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for clustering messages, comprising:
receiving a plurality of messages, each message containing about 250 characters or less;
parsing the messages to form message token vectors for the messages;
filtering the parsed messages to discard at least one message from the plurality of messages;
calculating similarity scores for the filtered plurality of messages relative to one or more message clusters, the message clusters having cluster token vectors, the similarity score being based on the message token vectors and the cluster token vectors, the similarity scores being calculated without normalization of the message token vectors relative to a length of the messages;
adding at least one message to a message cluster based on the at least one message having a similarity score greater than a similarity threshold value; and
updating the cluster token vector for the message cluster containing the added message.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
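As a reading aid for the steps of claim 1, the following Python sketch contrasts a similarity score computed without length normalization (a plain dot product of token counts) with a cosine score, which does normalize by vector length and is shown only for comparison. The tokenizer, the `min_tokens` filter rule, and all function names are assumptions made for illustration; the claim itself does not specify how messages are filtered or how the score is formed beyond being based on the two token vectors.

```python
import math
from collections import Counter

def parse(message):
    # Hypothetical parser: lowercase whitespace tokens with counts.
    return Counter(message.lower().split())

def raw_score(msg_vec, cluster_vec):
    # Plain dot product: no division by message length or vector norm.
    return sum(n * cluster_vec.get(tok, 0) for tok, n in msg_vec.items())

def cosine_score(msg_vec, cluster_vec):
    # Length-normalized score, for contrast only.
    dot = raw_score(msg_vec, cluster_vec)
    norm = (math.sqrt(sum(n * n for n in msg_vec.values()))
            * math.sqrt(sum(n * n for n in cluster_vec.values())))
    return dot / norm if norm else 0.0

def cluster_batch(messages, cluster_vec, threshold, min_tokens=2):
    """Parse, filter, score, add, and update for one existing cluster.
    Returns the messages added; cluster_vec is updated in place."""
    added = []
    for text in messages:
        if len(text) > 250:
            continue  # outside the short-message range
        vec = parse(text)
        if sum(vec.values()) < min_tokens:
            continue  # assumed filter rule: discard very short messages
        if raw_score(vec, cluster_vec) > threshold:
            added.append(text)
            cluster_vec.update(vec)  # fold the message's tokens into the cluster
    return added
```

Because the raw score is not divided by message length, a message that repeats a cluster token scores higher than one that mentions it once, whereas the cosine score would treat the two alike.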
11. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for clustering messages, comprising:
receiving a message containing about 250 characters or less;
parsing the message to form a message token vector for the message;
determining a cluster token vector for a message cluster, the cluster token vector corresponding to a number of tokens that is less than a token threshold value;
calculating a similarity score for the message relative to a message cluster based on a product of the cluster token vector and the message token vector, the similarity score being calculated without normalization of the message token vector relative to a length of the message;
adding the message to the message cluster based on the similarity score being greater than a similarity threshold value;
updating the cluster token vector based on the addition of the message to the message cluster;
matching the updated cluster token vector to a search query; and
providing the message cluster in response to the search query.
Dependent claims: 12, 13, 14
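Two elements of claim 11 lend themselves to a short sketch: limiting the cluster token vector to fewer tokens than a threshold, and matching that vector to a search query. The Python below is one possible reading, not the patented method. Keeping the highest-count tokens, and treating a query as matched when every query term appears in the vector, are both assumptions; `truncate` and `matches` are hypothetical names.

```python
from collections import Counter

def truncate(cluster_counts, token_threshold):
    # Keep strictly fewer tokens than token_threshold, highest counts first.
    keep = max(token_threshold - 1, 0)
    return dict(Counter(cluster_counts).most_common(keep))

def matches(cluster_vec, query):
    # Assumed matching rule: every query term must appear in the cluster vector.
    terms = query.lower().split()
    return bool(terms) and all(t in cluster_vec for t in terms)
```

For example, with counts `{"storm": 5, "coast": 3, "rain": 2, "lol": 1}` and a token threshold of 3, the truncated vector holds only `storm` and `coast`, so the query "storm coast" matches the cluster and the query "rain" does not.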
15. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for identifying message clusters that are responsive to a search query, comprising:
receiving a message containing about 250 characters or less;
calculating a plurality of quality feature values for the message, the plurality of quality feature values including two or more of a spam value, a message length value, a reposting value, a link value, and an authority value;
adding the message to a message cluster, the message cluster containing one or more additional messages;
calculating a cluster ranking for the message cluster based on the quality feature values for messages in the message cluster;
determining a cluster token vector for the message cluster, the cluster token vector corresponding to a number of tokens less than a token threshold value;
calculating a search ranking for the message cluster relative to a search query, the search ranking for the message cluster being based on at least the cluster ranking of the message cluster and a match ranking of the cluster token vector relative to the search query; and
providing the message cluster in response to the search query.
Dependent claims: 16, 17, 18, 19, 20
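The ranking steps of claim 15 can be illustrated with a small Python sketch. Everything numeric here is an assumption: the claim names the quality features (three of which are used below) but gives no formula, so the per-message quality score, the averaging into a cluster ranking, the term-overlap match ranking, and the `weight` used to blend them are hypothetical choices made only to show how the pieces could fit together.

```python
from dataclasses import dataclass

@dataclass
class Quality:
    spam: float       # assumed range 0..1, higher means more spam-like
    authority: float  # assumed range 0..1, higher means more authoritative
    has_link: bool    # whether the message contains a link

def message_quality(q):
    # Hypothetical scoring: penalize spam, reward authority and links.
    return (1.0 - q.spam) * (0.5 + 0.5 * q.authority) + (0.1 if q.has_link else 0.0)

def cluster_ranking(qualities):
    # Assumed aggregation: mean quality over the messages in the cluster.
    return sum(message_quality(q) for q in qualities) / len(qualities)

def match_ranking(cluster_tokens, query):
    # Fraction of query terms found among the cluster's tokens.
    terms = query.lower().split()
    if not terms:
        return 0.0
    return sum(t in cluster_tokens for t in terms) / len(terms)

def search_ranking(qualities, cluster_tokens, query, weight=0.5):
    # Blend cluster quality with query match; the weight is an assumption.
    return (weight * cluster_ranking(qualities)
            + (1.0 - weight) * match_ranking(cluster_tokens, query))
```

Clusters would then be sorted by `search_ranking` for a given query, so that a cluster of low-quality messages ranks below an equally well-matched cluster of higher-quality ones.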
Specification