UNSUPERVISED MESSAGE CLUSTERING

US 20120239650A1
Filed: 03/18/2011
Published: 09/20/2012
Est. Priority Date: 03/18/2011
Status: Active Grant

Abstract

Unsupervised clustering can be used for organization of micro-blog or other short length messages into message clusters. Messages can be compared with existing clusters to determine a similarity score. If at least one similarity score is greater than a threshold value, a message can be added to an existing message cluster. If a message is not similar to an existing cluster, the message can be compared against criteria for starting a new message cluster.
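
The flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Cluster` class, the `assign` function, the shared-token similarity, the threshold of 2, and the `may_start_cluster` criterion are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    tokens: set = field(default_factory=set)
    messages: list = field(default_factory=list)


def similarity(message_tokens, cluster):
    # Hypothetical score: number of tokens shared with the cluster.
    return len(message_tokens & cluster.tokens)


def may_start_cluster(message_tokens):
    # Hypothetical criterion for seeding a new cluster.
    return len(message_tokens) >= 3


def assign(message, clusters, threshold=2):
    """Add the message to the best existing cluster, or start a new one."""
    tokens = set(message.lower().split())
    best = max(clusters, key=lambda c: similarity(tokens, c), default=None)
    if best is not None and similarity(tokens, best) > threshold:
        best.messages.append(message)
        best.tokens |= tokens
        return best
    if may_start_cluster(tokens):
        new = Cluster(tokens=set(tokens), messages=[message])
        clusters.append(new)
        return new
    return None  # neither similar enough nor eligible to seed a cluster
```

A message that matches no cluster and fails the seeding criterion is simply left unclustered in this sketch.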

20 Claims

1. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for clustering messages, comprising:
- receiving a plurality of messages, each message containing about 250 characters or less;
  
  parsing the messages to form message token vectors for the messages;
  
  filtering the parsed messages to discard at least one message from the plurality of messages;
  
  calculating similarity scores for the filtered plurality of messages relative to one or more message clusters, the message clusters having cluster token vectors, the similarity score being based on the message token vectors and the cluster token vectors, the similarity scores being calculated without normalization of the message token vectors relative to a length of the messages;
  
  adding at least one message to a message cluster based on the at least one message having a similarity score greater than a similarity threshold value; and
  
  updating the cluster token vector for the message cluster containing the added message.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The computer-storage media of claim 1, wherein each message in the plurality of messages contains about 160 characters or less.
  - 3. The computer-storage media of claim 1, wherein the similarity scores are calculated without normalization of the message token vectors relative to a token length of the messages, the token length corresponding to a number of different tokens in a message.
  - 4. The computer-storage media of claim 1, wherein the cluster token vector corresponds to a number of tokens that is less than a token threshold value.
  - 5. The computer-storage media of claim 1, wherein filtering the plurality of messages comprises removing at least one message based on a spam score for the message.
  - 6. The computer-storage media of claim 1, wherein filtering the plurality of messages comprises removing at least one message based on a domain and/or a user identification associated with the message.
  - 7. The computer-storage media of claim 1, further comprising:
    - identifying a message from the filtered plurality of messages, the identified message having a similarity score less than the similarity threshold value relative to the one or more message clusters; and
      
      starting a new message cluster using the identified message based on the identified message satisfying one or more criteria for forming a new message cluster.
  - 8. The computer-storage media of claim 7, wherein the one or more criteria for forming a new message cluster include the presence of a link within the identified message and the presence of at least 5 tokens within the identified message.
  - 9. The computer-storage media of claim 1, further comprising removing a message from a second message cluster from the one or more message clusters, the removal of the message being based on a quality value for the removed message, an age of the removed message, or a combination thereof.
  - 10. The computer-storage media of claim 1, further comprising deleting a message cluster from the one or more message clusters, the deletion of the message cluster being based on a cluster ranking for the deleted message cluster, an age of the deleted message cluster, a size of the deleted message cluster relative to a prior size, or a combination thereof.
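
The steps recited in claims 1 through 8 can be illustrated with a short sketch. Everything below is an assumption made for illustration: the tokenizer regex, the `keep` filter, the constant names, and the cap of 12 tokens (borrowed from the figure in claims 14 and 16) are hypothetical, and the claims do not prescribe this code. The point of interest is `raw_similarity`, a plain dot product that is not divided by message length.

```python
import re
from collections import Counter

MAX_CHARS = 250   # "about 250 characters or less" (claim 1)
TOKEN_CAP = 12    # hypothetical cap on the size of a cluster token vector


def token_vector(text):
    # Token counts for a message; URLs are kept as single tokens.
    return Counter(re.findall(r"https?://\S+|\w+", text.lower()))


def keep(text, user=None, blocked_users=()):
    # Hypothetical filter: drop over-long messages and blocked users.
    return len(text) <= MAX_CHARS and user not in blocked_users


def raw_similarity(msg_vec, cluster_vec):
    # Plain dot product; deliberately NOT divided by message length.
    return sum(n * cluster_vec.get(tok, 0) for tok, n in msg_vec.items())


def update_cluster(cluster_vec, msg_vec):
    # Merge counts, then keep only the most frequent tokens.
    merged = Counter(cluster_vec) + Counter(msg_vec)
    return Counter(dict(merged.most_common(TOKEN_CAP)))


def can_seed_cluster(msg_vec):
    # Criteria of claim 8: a link and at least 5 (distinct) tokens.
    has_link = any(tok.startswith("http") for tok in msg_vec)
    return has_link and len(msg_vec) >= 5
```

For contrast, a cosine similarity would divide the dot product by the vector norms; the sketch omits any such division because claim 1 excludes normalization relative to message length.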

11. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for clustering messages, comprising:
- receiving a message containing about 250 characters or less;
  
  parsing the message to form a message token vector for the message;
  
  determining a cluster token vector for a message cluster, the cluster token vector corresponding to a number of tokens that is less than a token threshold value;
  
  calculating a similarity score for the message relative to a message cluster based on a product of the cluster token vector and the message token vector, the similarity score being calculated without normalization of the message token vector relative to a length of the message;
  
  adding the message to the message cluster based on the similarity score being greater than a similarity threshold value;
  
  updating the cluster token vector based on the addition of the message to the message cluster;
  
  matching the updated cluster token vector to a search query; and
  
  providing the message cluster in response to the search query.
- Dependent claims (12, 13, 14)
  - 12. The computer-storage media of claim 11, wherein each message in the plurality of messages contains about 160 characters or less.
  - 13. The computer-storage media of claim 11, wherein the similarity scores are calculated without normalization of the message token vectors relative to a token length of the messages, the token length corresponding to a number of different tokens in a message.
  - 14. The computer-storage media of claim 11, wherein the cluster token vector is represented by about 12 tokens or less.
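
Claim 11 ends by matching the updated cluster token vector to a search query and returning the cluster. A minimal sketch of that last step might look like the following. The function names, the dictionary of named clusters, and the scoring rule (summing the cluster weights of the query's distinct tokens) are assumptions for illustration only; the claim does not specify how the match is computed.

```python
from collections import Counter


def top_tokens(counts, limit=12):
    # Represent a cluster by at most `limit` of its most frequent tokens.
    return Counter(dict(Counter(counts).most_common(limit)))


def match_score(query, cluster_vec):
    # Hypothetical match: summed cluster weight of the query's distinct tokens.
    return sum(cluster_vec.get(tok, 0) for tok in set(query.lower().split()))


def search(query, clusters):
    """Return names of clusters whose token vector matches the query, best first."""
    scored = [(match_score(query, vec), name) for name, vec in clusters.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

Keeping the cluster vector short, as in claim 14, means each query is compared against only a handful of tokens per cluster.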

15. One or more computer-storage media storing computer-useable instructions that, when executed by a computing device, perform a method for identifying message clusters that are responsive to a search query, comprising:
- receiving a message containing about 250 characters or less;
  
  calculating a plurality of quality feature values for the message, the plurality of quality feature values including two or more of a spam value, a message length value, a reposting value, a link value, and an authority value;
  
  adding the message to a message cluster, the message cluster containing one or more additional messages;
  
  calculating a cluster ranking for the message cluster based on the quality feature values for messages in the message cluster;
  
  determining a cluster token vector for the message cluster, the cluster token vector corresponding to a number of tokens less than a token threshold value;
  
  calculating a search ranking for the message cluster relative to a search query, the search ranking for the message cluster being based on at least the cluster ranking of the message cluster and a match ranking of the cluster token vector relative to the search query; and
  
  providing the message cluster in response to the search query.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The computer-storage media of claim 15, wherein the cluster token vector is represented by about 12 tokens or less.
  - 17. The computer-storage media of claim 15, wherein the plurality of quality feature values include at least the spam value, the message length value, the reposting value, the link value, and the authority value.
  - 18. The computer-storage media of claim 15, wherein the cluster ranking for the message cluster is further based on a number of different users associated with the messages in the cluster.
  - 19. The computer-storage media of claim 15, wherein the cluster ranking for the message cluster is based on averages and/or ratios derived from the quality feature values for the messages in the message cluster.
  - 20. The computer-storage media of claim 15, wherein the cluster ranking for the message cluster is based on a number of messages in the message cluster, an average spam score for messages in the message cluster, an average token length for messages in the message cluster, a ratio of messages that have been reposted and that have not been reposted in the message cluster, a ratio of messages that contain a link and messages that do not contain a link in the message cluster, an average authority score for messages in the cluster, and a ratio of the number of different message authors versus the number of messages in the cluster.
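
The averages and ratios listed in claim 20 can be computed along the following lines. This is a sketch under assumptions: the `Message` fields, the zero-denominator fallback in `ratio`, the linear weighting in `cluster_ranking`, and the `alpha` blend in `search_ranking` are all hypothetical. The claims name the inputs to the rankings but not how they are combined.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Message:
    author: str
    spam: float        # spam score (assumed: higher means more spam-like)
    tokens: int        # token length of the message
    reposted: bool
    has_link: bool
    authority: float


def ratio(yes, no):
    # Ratio of two counts; falls back to the numerator when the denominator is 0.
    return yes / no if no else float(yes)


def cluster_features(messages):
    n = len(messages)
    reposted = sum(m.reposted for m in messages)
    linked = sum(m.has_link for m in messages)
    return {
        "size": n,
        "avg_spam": mean(m.spam for m in messages),
        "avg_tokens": mean(m.tokens for m in messages),
        "repost_ratio": ratio(reposted, n - reposted),
        "link_ratio": ratio(linked, n - linked),
        "avg_authority": mean(m.authority for m in messages),
        "author_ratio": len({m.author for m in messages}) / n,
    }


def cluster_ranking(features, weights):
    # Hypothetical linear combination; the weights are not from the patent.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def search_ranking(cluster_rank, match_rank, alpha=0.5):
    # Hypothetical blend of cluster quality and query match (claim 15).
    return alpha * cluster_rank + (1 - alpha) * match_rank
```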

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: KIM, KI YEUN; DUAN, LEI; CHUNG, SEOKKYUNG
Granted Patent: US 8,666,984 B2
US Class Current: 707/737
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
G06F 40/30 Semantic analysis
