Model selection for cluster data analysis
Abstract
A model selection method is provided for choosing the number of clusters, or more generally the parameters of a clustering algorithm. The algorithm is based on comparing the similarity between pairs of clustering runs on sub-samples or other perturbations of the data. High pairwise similarities show that the clustering represents a stable pattern in the data. The method is applicable to any clustering algorithm, and can also detect lack of structure. We show results on artificial and real data using a hierarchical clustering algorithm.
7 Claims
1. A computer implemented method for clustering data comprising a plurality of letters within text or speech, the method comprising:
(a) inputting the data into a computer system having a memory and a processor for executing a clustering algorithm;
(b) selecting a clustering algorithm based on a dissimilarity measure between pairs of the letters' principal components;
(c) randomly assigning class labels to the letters;
(d) defining a plurality of clusters of letters within each labeled class;
(e) measuring dissimilarity between each cluster of letters by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
(f) reassigning letters to the labeled class with the most similar cluster;
(g) repeating steps (d) through (f) until assignment of letters to the labeled classes remains constant; and
(h) displaying a graph showing the letters clustered into the labeled classes.
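The fit residual in step (e) of claim 1 can be illustrated with a Procrustes-style fit, which is one possible reading of a fit that is invariant to translation, scaling and rotation. The sketch below is not taken from the patent: the function name `fit_residual` is hypothetical, scaling is assumed to be uniform, and the two clusters are assumed to hold the same number of points in row-wise correspondence, which the claim does not state.

```python
import numpy as np

def fit_residual(X, Y):
    """Residual of fitting cluster Y onto cluster X, invariant to translation,
    uniform scaling and rotation. Assumes rows of X and Y correspond
    (an assumption of this sketch). Returns a value in [0, 1]."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape != Y.shape:
        raise ValueError("X and Y must have the same shape")
    # Remove translation by centring both clusters.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Remove scale by normalising to unit Frobenius norm.
    nx, ny = np.linalg.norm(Xc), np.linalg.norm(Yc)
    if nx == 0.0 or ny == 0.0:
        raise ValueError("a cluster with zero spread cannot be scaled")
    Xc, Yc = Xc / nx, Yc / ny
    # The best rotation comes from the SVD of the cross-covariance matrix.
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    s = s.copy()
    if np.linalg.det(U @ Vt) < 0:
        # Exclude reflections: flip the smallest singular value.
        s[-1] = -s[-1]
    t = max(s.sum(), 0.0)
    # With the optimal scale, the squared residual is 1 - t^2.
    return float(max(0.0, 1.0 - t * t))
```

A residual near 0 means one cluster is a shifted, scaled and rotated copy of the other; values closer to 1 indicate a poor fit. A reassignment step such as (f) could then prefer the class whose cluster gives the smallest residual.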
3. A computer implemented method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
selecting a plurality of granularity levels k, and for each granularity level k;
(a) inducing perturbations in the dataset to generate a modified dataset;
(b) applying the clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
(c) creating a data subset comprising the clusters identified in step (b);
(d) applying the clustering algorithm to the data subset using the same value of k clusters;
(e) determining the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and the cluster center for the cluster into which the data was assigned;
measuring fit of the data to the cluster centers for all k granularity levels, wherein the fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
selecting from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
generating an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
displaying a graph showing the letters of the text or speech clustered into the plurality of subsets.
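Steps (a) through (e) of claim 3 can be sketched as a loop over granularity levels. The sketch below is one hedged reading, not the patented procedure: sub-sampling stands in for the perturbation, a plain k-means stands in for "the clustering algorithm", the pooled cluster centers serve as the "data subset", and plain Euclidean distance replaces the affine-invariant fit. All function names and parameters (`kmeans`, `instability`, `select_k`, `n_perturb`, `frac`) are hypothetical.

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic seeding: start at row 0, then add the farthest point each time."""
    centers = [X[0]]
    d = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, k):
        centers.append(X[np.argmax(d)])
        d = np.minimum(d, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)

def assign(X, centers):
    """Return the nearest-center label and the distance to that center per row."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd iterations; a stand-in for the clustering algorithm."""
    centers = farthest_first(X, k)
    for _ in range(n_iter):
        labels, _ = assign(X, centers)
        # Keep the old center if a cluster happens to become empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

def instability(X, k, n_perturb=10, frac=0.8, rng=None):
    """Mean distance of pooled cluster centers to their meta-center (lower = more stable)."""
    rng = np.random.default_rng(rng)
    pooled = []
    for _ in range(n_perturb):
        # (a) perturb the dataset, here by random sub-sampling
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        # (b) cluster the perturbed dataset into k clusters
        pooled.append(kmeans(X[idx], k))
    pooled = np.vstack(pooled)        # (c) data subset made of the clusters found
    meta = kmeans(pooled, k)          # (d) cluster that subset with the same k
    _, dist = assign(pooled, meta)    # (e) dissimilarity to the assigned center
    return float(dist.mean())

def select_k(X, ks, **kwargs):
    """Score every granularity level and return the most stable one with all scores."""
    scores = {k: instability(X, k, **kwargs) for k in ks}
    return min(scores, key=scores.get), scores
```

Calling `select_k(X, ks=[2, 3, 4], rng=0)` on an array `X` of shape `(n, d)` returns the level with the lowest score together with the per-level scores. The raw score is not normalized across values of k, so it is a rough indicator rather than a calibrated criterion.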
5. A computer implemented method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
randomly assigning labels to the letters in the dataset by partitioning the dataset into k subsets, wherein k has a minimum number and a maximum number;
for each value of k, beginning with the minimum value, for each pair of subsets, computing a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
displaying a histogram comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the letters according to their actual labels.
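The correlation score in claim 5 is computed only on the points that two subsets share. The claim does not fix a formula, so the sketch below assumes one common choice: the cosine similarity between the co-membership matrices of the two labelings. The function names and the index-plus-label representation of a subset are hypothetical, and the histogram is returned as bin counts rather than drawn.

```python
import numpy as np

def comembership(labels):
    """Matrix with 1 where two distinct points share a cluster label, else 0."""
    labels = np.asarray(labels)
    C = (labels[:, None] == labels[None, :]).astype(float)
    np.fill_diagonal(C, 0.0)
    return C

def correlation_score(idx_a, labels_a, idx_b, labels_b):
    """Similarity of two labelings on the intersection of their subsets.

    idx_*    : indices of each subset's points in the full dataset (unique)
    labels_* : cluster label of each of those points
    Returns a value in [0, 1]; 1 means the same partition up to label names."""
    _, pos_a, pos_b = np.intersect1d(idx_a, idx_b, return_indices=True)
    Ca = comembership(np.asarray(labels_a)[pos_a])
    Cb = comembership(np.asarray(labels_b)[pos_b])
    denom = np.sqrt((Ca * Ca).sum() * (Cb * Cb).sum())
    # If either labeling puts every shared point in its own cluster, report 0.
    return float((Ca * Cb).sum() / denom) if denom > 0 else 0.0

def score_histogram(scores, bins=10):
    """Counts of correlation scores over equal-width bins on [0, 1]."""
    return np.histogram(np.asarray(scores, dtype=float), bins=bins, range=(0.0, 1.0))
```

Collecting `correlation_score` over many pairs of subsets for each k and passing the values to `score_histogram` gives one distribution per k. Under the reading in the claim, the largest k whose distribution stays concentrated near 1 would be the one to keep.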
Specification