System and method for interactive classification and analysis of data
First Claim
1. A method for interactive classification and analysis comprising the steps of:
- generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
generating a count of occurrences of each word in the dictionary within each document in the document set;
partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
generating a name for each cluster;
generating a centroid of each cluster in the space of the dictionary;
generating a cohesion score for each cluster;
generating a distinctness score for each cluster; and
displaying a table including the name of each cluster and the cohesion score and distinctness score for each cluster.
1 Assignment
0 Petitions
Accused Products
Abstract
A system, method, and computer program product for interactively classifying and analyzing data is particularly applicable to classification and analysis of textual data. It is particularly useful in identification of helpdesk inquiry and problem categories amenable to automated fulfillment or solution. A dictionary is generated based on a frequency of occurrence of words in a document set. A count of occurrences of each word in the dictionary within each document in the document set is generated. The set of documents is partitioned into a plurality of clusters. A name, a centroid, a cohesion score, and a distinctness score are generated for each cluster and displayed in a table. The documents contained in the clusters sorted based on their similarity to other documents in the cluster. The similarity may be determined by calculating the distance of the document to the centroid of the cluster and the documents may be sorted in order of ascending or descending distance of the document to the centroid of the cluster. Editing input may be received from a user and the displayed table modified based on the received editing input—clusters may be split or deleted. The helpdesk application area is only one of many areas to which the present invention may be advantageously applied. One of ordinary skill in the art would recognize that any set of text documents may be classified and subsequently analyzed using the present invention.
-
Citations
48 Claims
-
1. A method for interactive classification and analysis comprising the steps of:
-
generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
generating a count of occurrences of each word in the dictionary within each document in the document set;
partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
generating a name for each cluster;
generating a centroid of each cluster in the space of the dictionary;
generating a cohesion score for each cluster;
generating a distinctness score for each cluster; and
displaying a table including the name of each cluster and the cohesion score and distinctness score for each cluster. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
displaying, for at least one cluster, the documents contained in the at least one cluster, the documents sorted based on their similarity to other documents in the cluster.
-
-
3. The method of claim 2, wherein the similarity of a document to other documents is determined by calculating the distance of the document to the centroid of the cluster.
-
4. The method of claim 3, wherein the documents are sorted in order of descending distance of the document to the centroid of the cluster.
-
5. The method of claim 3, wherein the documents are sorted in order of ascending distance of the document to the centroid of the cluster.
-
6. The method of claim 2, further comprising the steps of:
-
receiving editing input from a user; and
modifying the displayed table based on the received editing input.
-
-
7. The method of claim 6, wherein the editing input comprises an indication of a cluster to be split and the displayed table is modified by splitting the indicated cluster.
-
8. The method of claim 6, wherein the editing input comprises an indication of a cluster to be deleted and the displayed table is modified by deleting the indicated cluster.
-
9. The method of claim 6, wherein the count generating step comprises the step of:
generating a matrix having rows and columns, each column corresponding to a word in the dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
-
10. The method of claim 9, wherein the partitioning step comprises the step of:
partitioning the set of documents into a plurality of clusters using a k-means partitioning procedure.
-
11. The method of claim 10, wherein the k-means partitioning procedure comprises the step of:
-
determining a distance between a centroid and a document vector using a distance function of;
wherein X is the centroid, Y is the document vector, and d(X,Y) is the distance between the centroid and the document vector.
-
-
12. The method of claim 9, wherein the step of generating a name for each cluster comprises the step of:
for each cluster, including in the name of the cluster at least one word, the word selected from the dictionary based on a frequency of occurrence in the cluster.
-
13. The method of claim 9, wherein the step of generating a name for each cluster comprises the step of:
for each cluster, including in the name of the cluster a plurality of words, each word selected from the dictionary based on a frequency of occurrence in the cluster.
-
14. The method of claim 9, wherein the centroid generating step comprises the step of:
generating a vector having a plurality of entries, each entry corresponding to a word in the common dictionary and having a value equal to an average of the values of the entries in the matrix corresponding to the word in the common dictionary.
-
15. The method of claim 14, wherein the step of generating a cohesion score for each cluster comprises the step of:
generating a cohesion score based on an average negative cosine distance of the centroid of the cluster to the documents contained in the cluster.
-
16. The method of claim 15, wherein the step of generating a distinctness score for each cluster comprises the step of:
generating a cohesion score based on an average negative distance of the centroid of the cluster to a closest centroid among centroids of other clusters.
-
17. The system of claim 16, wherein the means for sorting the documents based on their similarity to other documents in the cluster comprises means for calculating the distance of the document to the centroid of the cluster.
-
18. The system of claim 17, wherein the sorting means further comprises means for sorting the documents in order of descending distance of the document to the centroid of the cluster.
-
19. The system of claim 17, wherein the sorting means further comprises means for sorting the documents in order of ascending distance of the document to the centroid of the cluster.
-
20. A system for interactive classification and analysis comprising:
-
means for generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
means for generating a count of occurrences of each word in the dictionary within each document in the document set;
means for partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
means for generating a name for each cluster;
means for generating a centroid of each cluster in the space of the dictionary;
means for generating a cohesion score for each cluster;
means for generating a distinctness score for each cluster; and
means for displaying a table including the name of each cluster and the cohesion score and distinctness score for each cluster. - View Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)
means for displaying, for at least one cluster, the documents contained in the at least one cluster, comprising means for sorting the documents based on their similarity to other documents in the cluster.
-
-
22. The system of claim 21, further comprising:
-
means for receiving editing input from a user; and
means for modifying the displayed table based on the received editing input.
-
-
23. The system of claim 22, wherein the editing input comprises an indication of a cluster to be split and the displayed table is modified by splitting the indicated cluster.
-
24. The system of claim 22, wherein the editing input comprises an indication of a cluster to be deleted and the displayed table is modified by deleting the indicated cluster.
-
25. The system of claim 22, wherein the count generating means comprises:
generating a matrix having rows and columns, each column corresponding to a word in the dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
-
26. The system of claim 25, wherein the partitioning means comprises:
means for partitioning the set of documents into a plurality of clusters using a k-means partitioning procedure.
-
27. The system of claim 26, wherein the k-means partitioning means comprises:
-
means for determining a distance between a centroid and a document vector using a distance function of;
wherein X is the centroid, Y is the document vector, and d(X,Y) is the distance between the centroid and the document vector.
-
-
28. The system of claim 26, wherein the means for generating a name for each cluster comprises:
means for, for each cluster, including in the name of the cluster at least one word, the word selected from the dictionary based on a frequency of occurrence in the cluster.
-
29. The system of claim 26, wherein the means for generating a name for each cluster comprises:
means for, for each cluster, including in the name of the cluster a plurality of words, each word selected from the dictionary based on a frequency of occurrence in the cluster.
-
30. The system of claim 26, wherein the centroid generating means comprises:
means for generating a vector having a plurality of entries, each entry corresponding to a word in the common dictionary and having a value equal to an average of the values of the entries in the matrix corresponding to the word in the common dictionary.
-
31. The system of claim 30, wherein the means for generating a cohesion score for each cluster comprises:
means for generating a cohesion score based on an average negative cosine distance of the centroid of the cluster to the documents contained in the cluster.
-
32. The system of claim 31, wherein the means for generating a distinctness score for each cluster comprises:
means for generating a cohesion score based on an average negative distance of the centroid of the cluster to a closest centroid among centroids of other clusters.
-
33. A computer program product for interactive classification and analysis, comprising:
-
a computer readable medium;
computer program instructions, recorded on the computer readable medium, executable by a processor, for performing the steps of;
generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
generating a count of occurrences of each word in the dictionary within each document in the document set;
partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
generating a name for each cluster;
generating a centroid of each cluster in the space of the dictionary;
generating a cohesion score for each cluster;
generating a distinctness score for each cluster; and
displaying a table including the name of each cluster and the cohesion score and distinctness score for each cluster. - View Dependent Claims (34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
displaying, for at least one cluster, the documents contained in the at least one cluster, the documents sorted based on their similarity to other documents in the cluster.
-
-
35. The computer program product of claim 34, wherein the similarity of a document to other documents is determined by calculating the distance of the document to the centroid of the cluster.
-
36. The computer program product of claim 35, wherein the documents are sorted in order of descending distance of the document to the centroid of the cluster.
-
37. The computer program product of claim 35, wherein the documents are sorted in order of ascending distance of the document to the centroid of the cluster.
-
38. The computer program product of claim 34, wherein the computer program instructions further comprise instructions for performing the steps of:
-
receiving editing input from a user; and
modifying the displayed table based on the received editing input.
-
-
39. The computer program product of claim 38, wherein the editing input comprises an indication of a cluster to be split and the displayed table is modified by splitting the indicated cluster.
-
40. The computer program product of claim 38, wherein the editing input comprises an indication of a cluster to be deleted and the displayed table is modified by deleting the indicated cluster.
-
41. The computer program product of claim 38, wherein the count generating step comprises the step of:
generating a matrix having rows and columns, each column corresponding to a word in the dictionary, each row corresponding to a document, and each entry representing a number of occurrences of the corresponding word in the corresponding document.
-
42. The computer program product of claim 41, wherein the partitioning step comprises the step of:
partitioning the set of documents into a plurality of clusters using a k-means partitioning procedure.
-
43. The computer program product of claim 42, wherein the k-means partitioning procedure comprises the step of:
-
determining a distance between a centroid and a document vector using a distance function of;
wherein X is the centroid, Y is the document vector, and d(X,Y) is the distance between the centroid and the document vector.
-
-
44. The computer program product of claim 41, wherein the step of generating a name for each cluster comprises the step of:
for each cluster, including in the name of the cluster at least one word, the word selected from the dictionary based on a frequency of occurrence in the cluster.
-
45. The computer program product of claim 41, wherein the step of generating a name for each cluster comprises the step of:
for each cluster, including in the name of the cluster a plurality of words, each word selected from the dictionary based on a frequency of occurrence in the cluster.
-
46. The computer program product of claim 41, wherein the centroid generating step comprises the step of:
generating a vector having a plurality of entries, each entry corresponding to a word in the common dictionary and having a value equal to an average of the values of the entries in the matrix corresponding to the word in the common dictionary.
-
47. The computer program product of claim 46, wherein the step of generating a cohesion score for each cluster comprises the step of:
generating a cohesion score based on an average negative cosine distance of the centroid of the cluster to the documents contained in the cluster.
-
48. The computer program product of claim 47, wherein the step of generating a distinctness score for each cluster comprises the step of:
generating a cohesion score based on an average negative distance of the centroid of the cluster to a closest centroid among centroids of other clusters.
Specification