MODEL SELECTION FOR CLUSTER DATA ANALYSIS
Abstract
A model selection method is provided for choosing the number of clusters, or more generally the parameters of a clustering algorithm. The algorithm is based on comparing the similarity between pairs of clustering runs on sub-samples or other perturbations of the data. High pairwise similarities show that the clustering represents a stable pattern in the data. The method is applicable to any clustering algorithm, and can also detect lack of structure. We show results on artificial and real data using a hierarchical clustering algorithm.
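The pairwise comparison described in the abstract can be illustrated with a short Python sketch. It assumes each clustering run is stored as a dictionary from point id to cluster label, and it scores two runs by the normalized overlap of their co-clustered pairs, restricted to the points both sub-samples share. The function name, the data layout, and this particular similarity measure are assumptions made for illustration, not details taken from the patent.

```python
from itertools import combinations
from math import sqrt


def pairwise_similarity(labels_a, labels_b):
    """Similarity of two clusterings, computed on the points they share.

    labels_a, labels_b: dicts mapping a point id to its cluster label in
    one run (hypothetical layout). Label names need not match between
    runs; only "same cluster or not" is compared for each pair of points.
    """
    shared = list(set(labels_a) & set(labels_b))
    both = in_a = in_b = 0
    for p, q in combinations(shared, 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        in_a += same_a
        in_b += same_b
        both += same_a and same_b
    if in_a == 0 or in_b == 0:
        # One run co-clusters no shared pair; treated here as no agreement.
        return 0.0
    # Cosine of the two binary co-membership vectors: 1.0 means identical
    # partitions of the shared points, 0.0 means no co-clustered pair agrees.
    return both / sqrt(in_a * in_b)


# Two runs on overlapping sub-samples that agree on the shared points 1..4.
run_1 = {0: "a", 1: "a", 2: "b", 3: "b", 4: "a"}
run_2 = {1: "x", 2: "y", 3: "y", 4: "x", 5: "y"}
print(pairwise_similarity(run_1, run_2))
```

Repeating this over many pairs of runs for a given number of clusters yields a distribution of scores; values that stay close to 1 suggest the clustering is stable under resampling.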
30 Claims
1-21. (canceled)
22. A method for clustering data comprising a plurality of letters within text or speech, the method comprising:
(a) inputting the data into a computer system having a memory and a processor for executing a clustering algorithm;
(b) selecting a clustering algorithm based on a dissimilarity measure between pairs of the letters' principal components;
(c) randomly assigning class labels to the letters;
(d) defining a plurality of clusters of letters within each labeled class;
(e) measuring dissimilarity between each cluster of letters by measuring a residual of a fit of one cluster onto another cluster;
(f) reassigning letters to the labeled class with the most similar cluster;
(g) repeating steps (d) through (f) until assignment of letters to the labeled classes remains constant; and
(h) displaying a graph showing the letters clustered into the labeled classes.
Dependent claims: 23, 24, 25
26. A method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
selecting a plurality of granularity levels k, and for each granularity level k:
(a) inducing perturbations in the dataset to generate a modified dataset;
(b) applying the clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
(c) creating a data subset comprising the clusters identified in step (b);
(d) applying the clustering algorithm to the data subset using the same value of k clusters;
(e) determining the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and the cluster center for the cluster into which the data was assigned;
measuring fit of the data to the cluster centers for all k granularity levels;
selecting from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
generating an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
displaying a graph showing the letters of the text or speech clustered into the plurality of subsets.
Dependent claims: 27
28. A method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
randomly assigning labels to the letters in the dataset by partitioning the dataset into k subsets, wherein k has a minimum number and a maximum number;
for each value of k, beginning with the minimum value, for each pair of subsets, computing a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
displaying a histogram comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the letters according to their actual labels.
Dependent claims: 29, 30
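The final step of claim 28, reading the score distributions to pick k, can be sketched in Python as follows. The claim does not define "concentrated near the highest correlation score" numerically, so the cutoffs `near_one` and `min_fraction`, the function names, and the toy scores are all assumptions chosen for illustration; the scores are made-up values, not results.

```python
def score_histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def select_k(scores_by_k, near_one=0.9, min_fraction=0.8):
    """Return the largest k whose scores stay concentrated near 1, else None.

    near_one and min_fraction are hypothetical thresholds: a value of k is
    kept when at least min_fraction of its scores are >= near_one.
    """
    stable = [
        k
        for k, scores in scores_by_k.items()
        if scores and sum(s >= near_one for s in scores) / len(scores) >= min_fraction
    ]
    return max(stable) if stable else None


# Made-up correlation scores for illustration only.
toy_scores = {
    2: [1.0, 0.98, 0.97, 1.0, 0.95],
    3: [0.96, 0.99, 0.93, 0.97, 0.91],
    4: [0.55, 0.90, 0.62, 0.71, 0.48],
}
for k, scores in toy_scores.items():
    print(k, score_histogram(scores))
print("selected k:", select_k(toy_scores))
```

Returning `None` when no value of k passes the cutoff is one way to signal the lack of structure that the abstract says the method can detect.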
Specification