MODEL SELECTION FOR CLUSTER DATA ANALYSIS

US 20080140592A1
Filed: 10/30/2007
Published: 06/12/2008
Est. Priority Date: 05/18/2001
Status: Active Grant

3 Assignments

0 Petitions

Abstract

A model selection method is provided for choosing the number of clusters, or more generally the parameters of a clustering algorithm. The algorithm is based on comparing the similarity between pairs of clustering runs on sub-samples or other perturbations of the data. High pairwise similarities show that the clustering represents a stable pattern in the data. The method is applicable to any clustering algorithm, and can also detect lack of structure. We show results on artificial and real data using a hierarchical clustering algorithm.
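The pairwise comparison of clustering runs described above can be sketched as follows. This is a minimal illustration, not the patented method: the abstract does not fix a similarity measure, so the cosine similarity between co-membership indicators (restricted to the points both runs share) is an assumption, and the names `comembership` and `clustering_similarity` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def comembership(labels):
    """Return the set of point-id pairs that share a cluster.

    `labels` maps a point id to its cluster label for one clustering run.
    """
    ids = sorted(labels)
    return {(i, j) for i, j in combinations(ids, 2) if labels[i] == labels[j]}


def clustering_similarity(labels_a, labels_b):
    """Similarity of two runs, computed only on the points both runs saw.

    Assumed measure: cosine similarity between the two binary co-membership
    relations. It is 1.0 when both runs group the shared points identically
    and 0.0 when no pair is co-clustered in both runs.
    """
    shared = set(labels_a) & set(labels_b)
    a = comembership({i: labels_a[i] for i in shared})
    b = comembership({i: labels_b[i] for i in shared})
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Two runs on overlapping sub-samples; cluster names need not match.
run_1 = {0: "x", 1: "x", 2: "y", 3: "y"}
run_2 = {1: 0, 2: 1, 3: 1, 4: 0}
score = clustering_similarity(run_1, run_2)
```

Because the measure looks only at which pairs are grouped together, it is unaffected by how each run happens to name its clusters.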

22 Citations

30 Claims

1-21. (canceled)

22. A method for clustering data comprising a plurality of letters within text or speech, the method comprising:
- (a) inputting the data into a computer system having a memory and a processor for executing a clustering algorithm;
  
  (b) selecting a clustering algorithm based on a dissimilarity measure between pairs of the letters' principal components;
  
  (c) randomly assigning class labels to the letters;
  
  (d) defining a plurality of clusters of letters within each labeled class;
  
  (e) measuring dissimilarity between each cluster of letters by measuring a residual of a fit of one cluster onto another cluster;
  
  (f) reassigning letters to the labeled class with the most similar cluster;
  
  (g) repeating steps (d) through (f) until assignment of letters to the labeled classes remains constant; and
  
  (h) displaying a graph showing the letters clustered into the labeled classes.
- Dependent claims (23, 24, 25)
  - 23. The method of claim 22, wherein step (e) comprises using a fit that is invariant with respect to affine transformations.
  - 24. The method of claim 23, wherein the affine transformations comprise a combination of translation, scaling and rotation.
  - 25. The method of claim 22 wherein the clustering algorithm is a k-means algorithm.
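The iterative loop of steps (c) through (g), in the k-means variant named in claim 25, might look roughly like the sketch below. It makes two simplifying assumptions that are not in the claim: each labeled class is summarized by its mean rather than by a plurality of clusters, and squared Euclidean distance stands in for the residual of a fit. The function name `kmeans_random_labels` and its parameters are hypothetical.

```python
import random


def kmeans_random_labels(points, k, seed=0, max_iter=100):
    """k-means started from random class labels instead of random centers."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Step (c): randomly assign a class label to every item.
    labels = [rng.randrange(k) for _ in points]
    # Fallback centers, used only for a class that has no members.
    centers = [list(rng.choice(points)) for _ in range(k)]
    for _ in range(max_iter):
        # Step (d), simplified: summarize each labeled class by its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(p[d] for p in members) / len(members)
                              for d in range(dim)]
        # Steps (e)-(f): reassign each item to the most similar class
        # (assumed here: smallest squared Euclidean distance to the mean).
        new_labels = [
            min(range(k),
                key=lambda c: sum((p[d] - centers[c][d]) ** 2
                                  for d in range(dim)))
            for p in points
        ]
        # Step (g): stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Two well-separated groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans_random_labels(data, k=2)
```

Replacing the distance inside `key=` with a transformation-invariant fit residual would bring the sketch closer to claims 23 and 24; that substitution is left out here to keep the example short.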

26. A method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
- inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
  
  selecting a plurality of granularity levels k, and for each granularity level k;
  
  (a) inducing perturbations in the dataset to generate a modified dataset;
  
  (b) applying the clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
  
  (c) creating a data subset comprising the clusters identified in step (b);
  
  (d) applying the clustering algorithm to the data subset using the same value of k clusters;
  
  (e) determining the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and the cluster center for the cluster into which the data was assigned;
  
  measuring fit of the data to the cluster centers for all k granularity levels;
  
  selecting from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
  
  generating an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
  
  displaying a graph showing the letters of the text or speech clustered into the plurality of subsets.
- Dependent claims (27)
  - 27. The method of claim 26, wherein the perturbations comprise a combination of one or more of sub-sampling the dataset, changing initialization of the clustering algorithm, and adding noise to the dataset.
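The three perturbation types listed in claim 27 can be illustrated with a short sketch. Everything here is an assumption made for illustration: the helper names (`subsample`, `add_noise`, `perturb`), the default sub-sampling fraction of 0.8, the Gaussian noise model, and the idea of representing a changed initialization as a fresh random seed to hand to the clustering algorithm.

```python
import random


def subsample(data, fraction, rng):
    """Draw a random subset without replacement; return (indices, points)."""
    n = max(1, round(fraction * len(data)))
    idx = sorted(rng.sample(range(len(data)), n))
    return idx, [data[i] for i in idx]


def add_noise(data, scale, rng):
    """Add independent Gaussian noise (assumed noise model) to each coordinate."""
    return [tuple(x + rng.gauss(0.0, scale) for x in p) for p in data]


def perturb(data, rng, fraction=0.8, noise_scale=0.0):
    """Produce one perturbed dataset.

    Returns the kept indices (so two runs can later be compared on their
    intersection), the perturbed points, and a seed that a clustering
    algorithm could use to vary its initialization.
    """
    idx, pts = subsample(data, fraction, rng)
    if noise_scale > 0:
        pts = add_noise(pts, noise_scale, rng)
    init_seed = rng.randrange(2 ** 32)
    return idx, pts, init_seed


data = [(float(i), float(i)) for i in range(10)]
idx, pts, init_seed = perturb(data, random.Random(1), fraction=0.8)
```

Returning the indices alongside the points is a design choice of this sketch: it lets a later step restrict any comparison between two runs to the items that both sub-samples contain.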

28. A method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
- inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
  
  randomly assigning labels to the letters in the dataset by partitioning the dataset into k subsets, wherein k has a minimum number and a maximum number;
  
  for each value of k, beginning with the minimum value, for each pair of subsets, computing a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
  
  displaying a histogram comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the letters according to their actual labels.
- Dependent claims (29, 30)
  - 29. The method of claim 28, wherein the step of computing the correlation score comprises selecting a fraction of the letters in each subset for comparison with other subsets.
  - 30. The method of claim 28, wherein the fraction is greater than 0.5.
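The selection rule in the final step of claim 28 (pick the highest k whose score distribution stays concentrated near the top score) could be sketched as below. The claim does not quantify "concentrated", so the cut-offs `near=0.9` and `min_fraction=0.9` are assumptions, as are the function names and the premise that scores lie in the range 0 to 1.

```python
def histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * bins
    for s in scores:
        b = max(0, min(int(s * bins), bins - 1))
        counts[b] += 1
    return counts


def concentrated(scores, near=0.9, min_fraction=0.9):
    """True if at least `min_fraction` of the scores are >= `near`."""
    return sum(s >= near for s in scores) / len(scores) >= min_fraction


def select_k(scores_by_k, near=0.9, min_fraction=0.9):
    """Return the largest k whose scores remain concentrated, else None."""
    stable = [k for k, s in scores_by_k.items()
              if s and concentrated(s, near, min_fraction)]
    return max(stable) if stable else None


# Hand-made scores for illustration only; not results from the patent.
scores_by_k = {
    2: [1.0, 0.98, 0.99, 1.0],
    3: [0.97, 0.95, 1.0, 0.93],
    4: [0.5, 0.9, 0.3, 0.7],
    5: [0.4, 0.6, 0.2, 0.5],
}
best_k = select_k(scores_by_k)
```

With these illustrative scores, k = 2 and k = 3 both stay near the top of the range while k = 4 and k = 5 spread out, so the rule returns 3.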

Granted Patent

US 7,890,445 B2

US Class Current

706/12

CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 18/23   Clustering techniques

G16B 25/00   ICT specially adapted for h...

G16B 25/10   Gene or protein expression ...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis
