Model selection for cluster data analysis

US 7,890,445 B2
Filed: 10/30/2007
Issued: 02/15/2011
Est. Priority Date: 05/18/2001
Status: Expired due to Fees


Abstract

A model selection method is provided for choosing the number of clusters, or more generally the parameters of a clustering algorithm. The algorithm is based on comparing the similarity between pairs of clustering runs on sub-samples or other perturbations of the data. High pairwise similarities show that the clustering represents a stable pattern in the data. The method is applicable to any clustering algorithm, and can also detect lack of structure. We show results on artificial and real data using a hierarchical clustering algorithm.
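
The sub-sampling comparison described in the abstract can be sketched as a small driver loop. This is an illustrative sketch only, not the patented implementation: the function name `stability_scores`, its parameters (`n_pairs`, `fraction`, `seed`), and the pluggable `cluster` and `similarity` callables are all assumptions introduced here. For each candidate number of clusters k, it clusters pairs of random sub-samples and records how similar the two resulting labelings are.

```python
import random

def stability_scores(data, k_values, cluster, similarity,
                     n_pairs=20, fraction=0.8, seed=0):
    """Collect pairwise similarity scores between clustering runs on sub-samples.

    cluster(points, k)       -> list of labels, one per point (any algorithm)
    similarity(run_a, run_b) -> score; each run maps item index -> label
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    size = min(len(data), max(2, round(fraction * len(data))))
    scores = {}
    for k in k_values:
        scores[k] = []
        for _ in range(n_pairs):
            runs = []
            for _ in range(2):
                # Draw a sub-sample without replacement and cluster it.
                idx = rng.sample(range(len(data)), size)
                labels = cluster([data[i] for i in idx], k)
                runs.append(dict(zip(idx, labels)))
            # Compare the two runs; the similarity function decides how.
            scores[k].append(similarity(runs[0], runs[1]))
    return scores
```

Under these assumptions, a value of k whose scores stay consistently high would indicate a stable pattern, while uniformly low scores across all k would suggest a lack of structure.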

7 Claims

1. A computer implemented method for clustering data comprising a plurality of letters within text or speech, the method comprising:
  (a) inputting the data into a computer system having a memory and a processor for executing a clustering algorithm;
  
  (b) selecting a clustering algorithm based on a dissimilarity measure between pairs of the letters' principal components;
  
  (c) randomly assigning class labels to the letters;
  
  (d) defining a plurality of clusters of letters within each labeled class;
  
  (e) measuring dissimilarity between each cluster of letters by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
  
  (f) reassigning letters to the labeled class with the most similar cluster;
  
  (g) repeating steps (d) through (f) until assignment of letters to the labeled classes remains constant; and
  
  (h) displaying a graph showing the letters clustered into the labeled classes.

2. The method of claim 1 wherein the clustering algorithm is a k-means algorithm.
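
Step (e) of claim 1 measures dissimilarity as the residual of a fit that is invariant to translation, scaling and rotation. One way to illustrate such a residual is a least-squares fit in the plane, sketched below. The sketch makes several assumptions that the claim does not state: each cluster is represented as an ordered list of corresponding 2-D points, points are encoded as complex numbers, and the residual is normalized by the target's spread. The function name `fit_residual` is hypothetical.

```python
def fit_residual(source, target):
    """Normalized residual of the best translation + scale + rotation fit.

    source, target: equal-length lists of 2-D points as complex numbers
    (x + yj), in corresponding order. This representation is an assumption.
    Returns a value in [0, 1]; 0 means target is an exact transformed copy.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need two point lists of equal length >= 2")
    n = len(source)
    # Remove translation by centering both point sets.
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    s = [p - mean_s for p in source]
    t = [p - mean_t for p in target]
    s_energy = sum(abs(p) ** 2 for p in s)
    t_energy = sum(abs(p) ** 2 for p in t)
    if s_energy == 0 or t_energy == 0:
        raise ValueError("degenerate point set")
    # A single complex factor encodes rotation (its angle) and scale (its modulus);
    # this is the least-squares solution of t ~ a * s.
    a = sum(p.conjugate() * q for p, q in zip(s, t)) / s_energy
    residual = sum(abs(q - a * p) ** 2 for p, q in zip(s, t))
    return residual / t_energy
```

With this sketch, a unit square fitted onto a rotated, scaled and shifted copy of itself gives a residual near zero, while fitting it onto a 2-by-1 rectangle leaves a nonzero residual, because no combination of translation, scaling and rotation maps one onto the other.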

3. A computer implemented method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
  inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
  
  selecting a plurality of granularity levels k, and for each granularity level k;
  
  (a) inducing perturbations in the dataset to generate a modified dataset;
  
  (b) applying the clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
  
  (c) creating a data subset comprising the clusters identified in step (b);
  
  (d) applying the clustering algorithm to the data subset using the same value of k clusters;
  
  (e) determining the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and the cluster center for the cluster into which the data was assigned;
  
  measuring fit of the data to the cluster centers for all k granularity levels, wherein the fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
  
  selecting from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
  
  generating an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
  
  displaying a graph showing the letters of the text or speech clustered into the plurality of subsets.

4. The method of claim 3, wherein the perturbations comprise a combination of one or more of sub-sampling the dataset, changing initialization of the clustering algorithm, and adding noise to the dataset.
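
The three perturbations named in claim 4 can be illustrated with a short sketch. Everything below is an assumption made for illustration: data points are tuples of floats, noise is Gaussian, and a change of initialization is modeled as a fresh seed that the caller would pass to its clustering algorithm. The helper names `subsample`, `add_noise` and `perturb` are hypothetical.

```python
import random

def subsample(data, fraction, rng):
    """Return (indices, points) for a random sub-sample drawn without replacement."""
    size = max(1, round(fraction * len(data)))
    idx = sorted(rng.sample(range(len(data)), size))
    return idx, [data[i] for i in idx]

def add_noise(data, sigma, rng):
    """Add independent Gaussian noise (an assumed noise model) to every coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in point) for point in data]

def perturb(data, rng, fraction=0.8, sigma=0.0):
    """Combine the perturbations: sub-sample, optionally add noise, pick a new seed.

    Returns (indices, perturbed_points, init_seed). The seed stands in for
    'changing initialization'; the clustering algorithm is expected to use it.
    """
    idx, sample = subsample(data, fraction, rng)
    if sigma > 0:
        sample = add_noise(sample, sigma, rng)
    init_seed = rng.randrange(2 ** 32)
    return idx, sample, init_seed
```

Returning the sampled indices alongside the points lets a later step compare clusterings of different perturbed copies on the items they share.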

5. A computer implemented method for clustering patterns in a dataset comprising letters in text or speech, the method comprising:
  inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
  
  randomly assigning labels to the letters in the dataset by partitioning the dataset into k subsets, wherein k has a minimum number and a maximum number;
  
  for each value of k, beginning with the minimum value, for each pair of subsets, computing a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
  
  displaying a histogram comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the letters according to their actual labels.

6. The method of claim 5, wherein the step of computing the correlation score comprises selecting a fraction of the letters in each subset for comparison with other subsets.

7. The method of claim 5, wherein the fraction is greater than 0.5.
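
Claim 5 does not give a formula for the correlation score, so the sketch below assumes one plausible choice: the cosine similarity between the co-membership indicators of two labelings, restricted to the items both labelings share. It also assumes a simple reading of "remains concentrated near the highest correlation score" using two thresholds. The names `correlation_score` and `largest_stable_k`, and the defaults `near` and `share`, are hypothetical.

```python
from itertools import combinations
from math import sqrt

def correlation_score(labels_a, labels_b):
    """Similarity of two labelings on their intersection (assumed formula).

    labels_a, labels_b: dicts mapping item id -> cluster label.
    Counts item pairs placed together in both, or in only one, labeling.
    Returns 1.0 when both induce the same partition of the shared items.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    both = only_a = only_b = 0
    for i, j in combinations(shared, 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        only_a += same_a and not same_b
        only_b += same_b and not same_a
    denom = sqrt((both + only_a) * (both + only_b))
    return both / denom if denom else 0.0

def largest_stable_k(scores_by_k, near=0.9, share=0.9):
    """Largest k whose scores stay concentrated near the top (assumed rule).

    A k counts as stable when at least `share` of its scores are >= `near`.
    Returns None when no k qualifies.
    """
    stable = [k for k, s in scores_by_k.items()
              if s and sum(x >= near for x in s) / len(s) >= share]
    return max(stable) if stable else None
```

Because the score depends only on which items are grouped together, it is unaffected by how the clusters happen to be named in each run. The per-k score lists could be plotted as histograms; the thresholds here merely stand in for a visual judgment of concentration.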

Current Assignee
Curtis Anderson, Health Discovery Corporation, James Roberts, Joe Mckenzie, Jules B. Paderewski, Julian N. Stern, Memorial Health Systems Incorporated, Timothy P. O'Hayer
Original Assignee
Health Discovery Corporation
Inventors
Asa Ben Hur, Isabelle Guyon, André Elisseeff
Primary Examiner(s)
Sparks; Donald
Assistant Examiner(s)
Bharadwaj; Kalpana

Application Number
US11/929,522
Publication Number
US 20080140592A1
Time in Patent Office
1,204 Days
Field of Search
382/225, 702/19
US Class Current
706/45
CPC Class Codes
G06F 16/35   Clustering; Classification
G06F 18/23   Clustering techniques
G16B 25/00   ICT specially adapted for h...
G16B 25/10   Gene or protein expression ...
G16B 40/00   ICT specially adapted for b...
G16B 40/20   Supervised data analysis
G16B 40/30   Unsupervised data analysis
