Identification of co-regulation patterns by unsupervised cluster analysis of gene expression data
First Claim
1. A computer implemented method for identifying co-regulation patterns within gene expression data comprising gene expression levels, the method comprising:
(a) inputting the data into a computer system having a memory and a processor for executing a clustering algorithm;
(b) selecting a clustering algorithm based on a dissimilarity measure between pairs of principal components of the gene expression levels;
(c) randomly assigning class labels to the gene expression levels;
(d) defining a plurality of clusters of gene expression levels within each labeled class;
(e) measuring dissimilarity between each cluster of gene expression levels by measuring a residual of a fit of one cluster onto another cluster, wherein the residual fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
(f) reassigning gene expression levels to the labeled class with the most similar cluster;
(g) repeating steps (d) through (f) until assignment of gene expression levels to the labeled classes remains constant; and
(h) displaying a graph showing the gene expression levels clustered into the labeled classes, wherein the labeled classes correspond to co-regulation activity.
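Steps (c) through (g) can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not in the claim: each labeled class is summarized by a single mean expression profile rather than a plurality of clusters, the "fit" is a one-dimensional least-squares fit y ≈ a·x + b (invariant to translation and scaling only; rotation is not modeled), and the principal-component selection of step (b) and the display of step (h) are omitted. The names `fit_residual` and `cluster_profiles` and the toy profiles are hypothetical.

```python
import random

def mean_profile(profiles):
    """Component-wise mean of a list of equal-length expression profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def fit_residual(x, y):
    """Normalized residual of the least-squares fit y ~ a*x + b (equals 1 - r^2).

    The value is unchanged when x or y is shifted or rescaled. Note that a
    negative scale also fits, so anti-correlated profiles count as similar.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 1.0  # a constant profile carries no shape information
    return max(0.0, 1.0 - sxy * sxy / (sxx * syy))

def cluster_profiles(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in profiles]            # step (c)
    for _ in range(max_iter):
        centers = []
        for c in range(k):                                   # step (d), one center per class
            members = [p for p, l in zip(profiles, labels) if l == c]
            # Assumption: an empty class is re-seeded with a random profile.
            centers.append(mean_profile(members) if members
                           else list(rng.choice(profiles)))
        new_labels = [min(range(k), key=lambda c: fit_residual(centers[c], p))
                      for p in profiles]                     # steps (e) and (f)
        if new_labels == labels:                             # step (g)
            break
        labels = new_labels
    return labels

# Toy data: two underlying patterns, each observed at different scales and offsets.
rising = [0, 1, 2, 3]
u_shape = [1, -1, -1, 1]
profiles = ([[s * v + t for v in rising] for s, t in [(1, 0), (2, 5), (4, -3)]] +
            [[s * v + t for v in u_shape] for s, t in [(3, 1), (5, 0), (7, 2)]])
labels = cluster_profiles(profiles, k=2)
```

Because the residual ignores offset and scale, the three rescaled copies of each pattern end up sharing a label even though their raw values differ widely.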
Abstract
A method is provided for unsupervised clustering of gene expression data to identify co-regulation patterns. A clustering algorithm randomly divides the data into k different subsets and measures the similarity between pairs of datapoints within the subsets, assigning a score to the pairs based on similarity, with the greatest similarity giving the highest correlation score. A distribution of the scores is plotted for each k. The highest value of k that has a distribution that remains concentrated near the highest correlation score corresponds to the number of co-regulation patterns.
24 Claims
1. The method recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6
7. A computer implemented method for identifying co-regulation patterns within a dataset comprising gene expression levels, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
selecting a plurality of granularity levels k, and for each granularity level k;
(a) inducing perturbations in the dataset to generate a modified dataset;
(b) applying the clustering algorithm to the at least one modified dataset to produce k clusters under each of the perturbations;
(c) creating a data subset comprising the clusters identified in step (b);
(d) applying the clustering algorithm to the data subset using the same value of k clusters;
(e) determining the stability of the clusterings at each granularity level k by measuring dissimilarity between data in the data subset and the cluster center for the cluster into which the data was assigned;
measuring fit of the data to the cluster centers for all k granularity levels, wherein the fit comprises using a fit that is invariant with respect to affine transformations, wherein the affine transformations comprise a combination of translation, scaling and rotation;
selecting from among the plurality of granularity levels an optimum granularity level k corresponding to the best fit;
generating an output comprising the dataset clustered into a plurality of subsets corresponding to the optimal granularity level k; and
displaying a graph showing the gene expression levels clustered into the plurality of subsets, wherein the subsets correspond to co-regulation activity.
Dependent claims: 8, 9, 10, 11, 12
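The perturb-and-recluster loop of steps (a) through (e) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the perturbation is additive Gaussian noise, the clustering algorithm is plain Lloyd k-means, the "data subset" is taken to be the pooled cluster centers, and dissimilarity is squared Euclidean distance rather than the affine-invariant fit the claim requires. The function names and parameters (`kmeans`, `stability_scores`, `n_perturb`, `noise`) are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd k-means; returns (centers, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def stability_scores(points, ks, n_perturb=10, noise=0.1, seed=0):
    """Mean dissimilarity between pooled centers and their assigned center, per k."""
    rng = random.Random(seed)
    scores = {}
    for k in ks:
        pooled = []
        for _ in range(n_perturb):
            noisy = [[x + rng.gauss(0, noise) for x in p] for p in points]  # (a)
            centers, _ = kmeans(noisy, k, rng)                               # (b)
            pooled.extend(centers)                                           # (c)
        centers, labels = kmeans(pooled, k, rng)                             # (d)
        scores[k] = sum(sq_dist(p, centers[l])
                        for p, l in zip(pooled, labels)) / len(pooled)       # (e)
    return scores

# Toy data: two well-separated groups of 2-D points.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]]
scores = stability_scores(points, ks=[2, 3], n_perturb=5)
best_k = min(scores, key=scores.get)  # smallest dispersion taken as "best fit"
```

Taking the smallest raw dispersion as the best fit is itself an assumption; a practical version may need to normalize the score across values of k before comparing them.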
13. A computer implemented method for determining co-regulation patterns within a gene expression dataset comprising gene expression levels, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
randomly assigning labels to the gene expression levels in the dataset by partitioning the dataset into k subsets, wherein k has a minimum number and a maximum number;
for each value of k, beginning with the minimum value, for each pair of subsets, computing a correlation score on the intersection between the pair of subsets, wherein the correlation score comprises a similarity measure between the pair of subsets and the greatest similarity has the highest score; and
displaying a histogram comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the gene expression levels according to co-regulation activity.
Dependent claims: 14, 15, 16, 17, 18, 19
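The final step, reading the histograms to find the highest k whose scores stay concentrated near the top score, can be sketched as a simple rule. The claim does not quantify "concentrated near", so the thresholds `near` and `min_fraction` below are assumptions, as are the function names and the example scores, which are made up for illustration and assumed to lie in [0, 1].

```python
def histogram(scores, bins=10):
    """Counts of scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return counts

def pick_k(scores_by_k, near=0.9, min_fraction=0.9):
    """Largest k for which at least min_fraction of the scores are >= near."""
    stable = [k for k, scores in scores_by_k.items()
              if sum(s >= near for s in scores) / len(scores) >= min_fraction]
    return max(stable) if stable else None

# Hypothetical correlation scores for three values of k.
scores_by_k = {
    2: [1.0, 0.98, 0.99, 1.0],
    3: [0.97, 0.95, 1.0, 0.96],
    4: [0.45, 0.62, 0.85, 0.95],
}
for k, scores in scores_by_k.items():
    print(k, histogram(scores))
print("selected k:", pick_k(scores_by_k))
```

With these made-up scores, the distributions for k = 2 and k = 3 sit in the top bin while k = 4 spreads out, so the rule selects k = 3.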
20. A computer implemented method for identifying co-regulation patterns within a gene expression dataset comprising gene expression levels, the method comprising:
inputting the dataset into a computer system having a memory and a processor for executing a clustering algorithm;
randomly assigning labels to the gene expression levels in the dataset by partitioning the dataset into k subsamples, wherein k has a minimum number and a maximum number;
representing each subsample by a matrix;
computing a dot product between pairs of subsamples to generate a correlation score on an intersection between the pair of subsamples, wherein the correlation score comprises a similarity measure between the pair of subsamples and the greatest similarity has the highest score; and
generating a display comprising a distribution of the correlation scores for each value of k, wherein the distribution comprising the highest value of k that remains concentrated near the highest correlation score corresponds to a clustering of the gene expression levels according to co-regulation activity.
Dependent claims: 21, 22, 23, 24
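One plausible reading of the matrix representation and dot-product score is sketched below. It assumes each subsample's labeling is encoded as a 0/1 co-membership matrix restricted to the genes both subsamples share, and that the score is the dot product of the two matrices normalized to lie in [0, 1]. The claim does not spell out the matrix or the normalization, so both are assumptions, and the gene identifiers and function names are hypothetical.

```python
import math

def comembership(labels):
    """n x n 0/1 matrix: entry (i, j) is 1 when items i != j share a cluster label."""
    n = len(labels)
    return [[1 if i != j and labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]

def dot(m1, m2):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(a * b for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))

def correlation_score(labels1, labels2):
    """Similarity of two labelings (dicts: gene id -> cluster label) on shared genes."""
    shared = sorted(set(labels1) & set(labels2))   # intersection of the two subsamples
    m1 = comembership([labels1[g] for g in shared])
    m2 = comembership([labels2[g] for g in shared])
    denom = math.sqrt(dot(m1, m1) * dot(m2, m2))
    return dot(m1, m2) / denom if denom else 0.0

# Two subsamples that overlap on g2..g5; cluster names differ but the grouping agrees.
sub1 = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 0}
sub2 = {"g2": 1, "g3": 0, "g4": 0, "g5": 1, "g6": 0}
sub3 = {"g2": 0, "g3": 0, "g4": 1, "g5": 1}

print(correlation_score(sub1, sub2))  # same grouping on shared genes
print(correlation_score(sub1, sub3))  # no co-clustered pair in common
```

Working with co-membership rather than raw labels makes the score independent of how the clusters happen to be numbered in each subsample, which is why `sub1` and `sub2` agree fully despite using opposite label values.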
Specification