Method and system for data mining of short message streams
First Claim
1. A computer-implemented method for summarizing a message stream, the method comprising the steps of:
defining a communications channel with one or more key words, wherein defining the communications channel comprises specifying one or more key words that are used to extract a message from the message stream, the message stream comprising at least two messages;
extracting one or more messages from the message stream based on the defined channel, wherein extracting one or more messages from the message stream based on the defined channel comprises filtering one or more messages from the message stream using the defined channel as a filter for selecting a message to be extracted for additional processing;
removing common words from the one or more extracted messages;
building a word order graph for the one or more extracted messages, the word order graph tracking sequencing of words found within each extracted message;
using an algorithm to find commonly occurring word clusters within each extracted message, wherein the algorithm reviews each extracted message for at least two-word clusters with a predetermined pair-frequency, the pair-frequency comprising a number of times that words appear together in an extracted message;
pruning the word clusters to reduce a total number of word clusters;
ranking one or more surviving clusters to determine an order of presentation;
arranging each word cluster into a natural order based on the word order graph; and
displaying the word clusters as a summary of the message stream.
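The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the stop-word list, the whitespace tokenizer, the function names, and the parameters `min_pair_frequency` and `max_clusters` are all assumptions made for illustration. The pair-frequency is approximated here as the number of extracted messages in which two words appear together, and pruning is reduced to keeping only the top-ranked clusters.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical stop-word list; the claim only says "common words".
COMMON_WORDS = {"a", "an", "the", "is", "in", "on", "at", "to", "of", "and", "for"}


def extract(stream, channel_keywords):
    """Channel filter: keep messages containing at least one key word."""
    keys = {k.lower() for k in channel_keywords}
    return [m for m in stream if keys & set(m.lower().split())]


def remove_common_words(message):
    """Tokenize on whitespace (an assumption) and drop common words."""
    return [w for w in message.lower().split() if w not in COMMON_WORDS]


def build_word_order_graph(token_lists):
    """Count how often word a precedes word b within the same message."""
    order = defaultdict(int)
    for tokens in token_lists:
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:]:
                if a != b:
                    order[(a, b)] += 1
    return order


def find_pair_clusters(token_lists, min_pair_frequency):
    """Two-word clusters that co-occur in at least min_pair_frequency messages."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(combinations(sorted(set(tokens)), 2))
    return {frozenset(p): n for p, n in counts.items() if n >= min_pair_frequency}


def natural_order(cluster, order):
    """List first the words that more often precede the others in the cluster."""
    def score(word):
        return sum(order.get((word, other), 0) - order.get((other, word), 0)
                   for other in cluster if other != word)
    return sorted(cluster, key=lambda w: (-score(w), w))


def summarize(stream, channel_keywords, min_pair_frequency=2, max_clusters=5):
    """Run the sketched pipeline and return clusters as ordered phrases."""
    tokens = [remove_common_words(m) for m in extract(stream, channel_keywords)]
    order = build_word_order_graph(tokens)
    clusters = find_pair_clusters(tokens, min_pair_frequency)
    # Rank by frequency, then prune by keeping only the top max_clusters.
    ranked = sorted(clusters.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [" ".join(natural_order(c, order)) for c, _ in ranked[:max_clusters]]


stream = [
    "flight delayed at the airport",
    "my flight delayed again",
    "airport coffee is great",
    "nice weather today",
]
print(summarize(stream, ["flight", "airport"]))  # prints ['flight delayed']
```

In this toy stream the fourth message is rejected by the channel filter, and only the pair "flight"/"delayed" reaches the assumed frequency threshold of two, so it is the single cluster displayed.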
Abstract
A method and system for summarizing messages from a message stream is disclosed in which association analysis is applied to a stream of short data messages comprising words in a spoken language, such as English. Clusters of words are identified that provide a summary of the several conversations (short data messages originating from different human sources) that are embedded in the message stream. Each word cluster may represent a set of messages that are its instances. The word clusters may collectively constitute a summary of the entire message stream. The word clusters that have been extracted from the message stream may also be grouped into topics. Also, an identity of one or more message originators may be listed based on their influence on the messages being analyzed. The short data messages may also be sorted based on a geographical location of one or more originators of messages.
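The relationship between a word cluster and the messages that are its instances can be sketched as follows. This is only an illustration under stated assumptions: a message is treated as an instance of a cluster when it contains every word of that cluster, and the helper names `cluster_instances` and `coverage` are hypothetical, not taken from the patent.

```python
def cluster_instances(cluster, messages):
    """Return the messages that contain every word of the cluster."""
    words = {w.lower() for w in cluster}
    return [m for m in messages if words <= set(m.lower().split())]


def coverage(clusters, messages):
    """Fraction of messages that are an instance of at least one cluster."""
    if not messages:
        return 0.0
    covered = {m for c in clusters for m in cluster_instances(c, messages)}
    return len(covered) / len(messages)
```

A coverage value close to 1 would suggest that the clusters collectively summarize most of the stream; the abstract does not define such a measure, so it is offered here only as one way to make that idea concrete.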
44 Citations
20 Claims
11. A computer-implemented system comprising:
means for defining a communications channel with one or more key words, wherein the means for defining the communications channel with one or more key words comprises means for specifying one or more key words that are used to extract a message from the message stream, the message stream comprising at least two messages;
means for extracting one or more messages from the message stream based on the defined channel, wherein the means for extracting the one or more messages from the message stream based on the defined channel comprises means for filtering one or more messages from the message stream using the defined channel as a filter for selecting a message to be extracted for additional processing;
means for removing common words from the one or more extracted messages;
means for building a word order graph for the one or more extracted messages, the word order graph tracking sequencing of words found within each extracted message;
means for using an algorithm to find commonly occurring word clusters within each extracted message, wherein the algorithm reviews each extracted message for at least two-word clusters with a predetermined pair-frequency, the pair-frequency comprising a number of times that words appear together in an extracted message;
means for pruning the word clusters to reduce a total number of word clusters;
means for ranking one or more surviving clusters to determine an order of presentation;
means for arranging each word cluster into a natural order based on the word order graph; and
means for displaying the word clusters as a summary of the message stream.
14. The computer-implemented system of claim 11, further comprising means for defining a communications channel based on a geographical location of one or more originators of messages.
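A channel defined by originator location, as recited in claim 14, might be sketched as a simple bounding-box filter. The `Message` and `GeoChannel` types, their field names, and the use of latitude/longitude pairs are assumptions for illustration; the claim does not say how location is represented. The sketch also ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    lat: float  # assumed latitude of the originator, in degrees
    lon: float  # assumed longitude of the originator, in degrees


@dataclass
class GeoChannel:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def accepts(self, msg: Message) -> bool:
        """True when the originator lies inside the bounding box."""
        return (self.min_lat <= msg.lat <= self.max_lat
                and self.min_lon <= msg.lon <= self.max_lon)


def extract_by_location(stream, channel):
    """Keep only messages whose originator falls within the channel's area."""
    return [m for m in stream if channel.accepts(m)]
```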
17. A computer program product comprising a tangible computer usable medium having a computer readable program code embodied therein, said tangible computer readable program code adapted to be executed to implement a method for summarizing a message stream, said method comprising:
defining a communications channel with one or more key words, wherein defining the communications channel comprises specifying one or more key words that are used to extract a message from the message stream, the message stream comprising at least two messages;
extracting one or more messages from the message stream based on the defined channel, wherein extracting one or more messages from the message stream based on the defined channel comprises filtering one or more messages from the message stream using the defined channel as a filter for selecting a message to be extracted for additional processing;
removing common words from the one or more extracted messages;
building a word order graph for the one or more extracted messages, the word order graph tracking sequencing of words found within each extracted message;
using an algorithm to find commonly occurring word clusters within each extracted message, wherein the algorithm reviews each extracted message for at least two-word clusters with a predetermined pair-frequency, the pair-frequency comprising a number of times that words appear together in an extracted message;
pruning the word clusters to reduce a total number of word clusters;
ranking one or more surviving clusters to determine an order of presentation;
arranging each word cluster into a natural order based on the word order graph; and
displaying the word clusters as a summary of the message stream.
Specification