Significance analysis of microarrays
Abstract
Microarrays can measure the expression of thousands of genes and thus identify changes in expression between different biological states. Methods are needed to determine the significance of these changes, while accounting for the enormous number of genes. We describe a new method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene based on the change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of such genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared to FDRs of 60% and 84% using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation, and 3 in apoptosis. Surprisingly, 4 nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a heretofore unrecognized role in repairing DNA damaged by ionizing radiation.
64 Claims
1. A method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets;
deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
Dependent claims (not shown): 2-22, 24-27, 29-33, 36, 38, 40-43.
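The abstract describes the per-gene score as the change in expression relative to the standard deviation of repeated measurements. A minimal Python sketch of one way to read the "parameter" and its "adjusting" step in claim 1 follows. The two-group layout, the pooled standard error, and the additive constant `s0` are illustrative assumptions; the claim text does not fix any of them.

```python
import math
from statistics import mean


def relative_difference(group_a, group_b, s0=0.0):
    """Score one gene: difference in group means over (scatter + s0).

    group_a, group_b: repeated measurements of one gene in two states.
    s0: hypothetical small positive constant added to the scatter so that
        genes with very low scatter do not receive inflated scores.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum(
        (x - mean_b) ** 2 for x in group_b
    )
    # Standard error of the difference in means (equal-variance form).
    scatter = math.sqrt((1.0 / n_a + 1.0 / n_b) * ss / (n_a + n_b - 2))
    return (mean_b - mean_a) / (scatter + s0)


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0]
    treated = [4.0, 5.0, 6.0]
    print(relative_difference(control, treated))          # unadjusted score
    print(relative_difference(control, treated, s0=1.0))  # adjusted score
```

With `s0 = 0` this reduces to an ordinary two-sample t-like statistic; a positive `s0` shrinks the scores of genes whose measured scatter happens to be very small.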
23. A method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising:
defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
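The death-set and risk-set construction in claim 23 can be illustrated with a Cox-type score. The sketch below is an assumption-laden reading, not the claimed method itself: it uses the conventional risk set (every patient whose recorded time is at or after the death time, which includes the patients in the death set), and it computes only the numerator of such a score. The function names and the `died` flag are hypothetical.

```python
from statistics import mean


def death_risk_pairs(times, died):
    """Return (death_time, death_idx, risk_idx) for each distinct death time.

    times: follow-up time per patient.
    died:  True if the patient died at that time, False if censored.
    """
    pairs = []
    for t in sorted({t for t, d in zip(times, died) if d}):
        death_idx = [
            j for j, (tj, dj) in enumerate(zip(times, died)) if dj and tj == t
        ]
        # Assumption: conventional risk set, patients still at risk at time t.
        risk_idx = [j for j, tj in enumerate(times) if tj >= t]
        pairs.append((t, death_idx, risk_idx))
    return pairs


def cox_score_numerator(values, times, died):
    """Sum over death times of (death-set total - expected total under risk-set mean)."""
    score = 0.0
    for _, death_idx, risk_idx in death_risk_pairs(times, died):
        risk_mean = mean(values[j] for j in risk_idx)
        score += sum(values[j] for j in death_idx) - len(death_idx) * risk_mean
    return score


if __name__ == "__main__":
    times = [2, 5, 5, 8]
    died = [True, True, False, True]
    expression = [3.0, 2.0, 1.0, 0.0]  # one gene, one value per patient
    print(cox_score_numerator(expression, times, died))
```

A positive score here means that patients who died tended to have higher values of the gene than the patients still at risk at the same time.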
28. A method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets;
ranking the values of the parameter of the genes;
providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and
comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
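The rank-wise comparison in claim 28 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the per-gene statistic is a plain difference in group means used as a placeholder, the expected value at each rank is the average over random label shuffles, and `delta`, `n_perm`, and `seed` are hypothetical parameters. Random shuffles can occasionally reproduce the original grouping; this sketch does not filter those out.

```python
import random
from statistics import mean


def score(values, labels):
    """Placeholder per-gene statistic: mean of group 1 minus mean of group 0."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(g1) - mean(g0)


def expected_by_rank(data, labels, n_perm=200, seed=0):
    """Average the sorted permutation scores rank by rank."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute the sample labels
        perm_scores = sorted(score(row, shuffled) for row in data)
        for rank, s in enumerate(perm_scores):
            totals[rank] += s
    return [t / n_perm for t in totals]


def call_significant(data, labels, delta, n_perm=200, seed=0):
    """Return indices of genes whose observed score departs from the
    expected score of the same rank by more than delta."""
    observed = [score(row, labels) for row in data]
    order = sorted(range(len(data)), key=lambda i: observed[i])
    expected = expected_by_rank(data, labels, n_perm, seed)
    return [
        gene
        for rank, gene in enumerate(order)
        if abs(observed[gene] - expected[rank]) > delta
    ]


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    data = [
        [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0],  # clearly changed gene
        [5.0, 4.0, 5.0, 4.0, 5.0, 4.0, 5.0, 4.0],
        [3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 4.0, 4.0],
    ]
    print(call_significant(data, labels, delta=3.0))
```

Comparing each observed score with the expected score of the same rank, rather than with a fixed null distribution, means the reference values come from the data set itself.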
34. A method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets;
providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values;
deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values;
finding a lowest rank gene whose derived parameter value extends beyond a first threshold; and
comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
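The abstract describes the false discovery rate as the percentage of called genes identified by chance, estimated from permutations. A simplified, one-sided Python sketch of that idea follows. It assumes the cutoff is the smallest observed score among the called genes and that the per-permutation scores have already been computed; the function name and argument layout are hypothetical, and the two-threshold bookkeeping of claim 34 is not reproduced in full.

```python
from statistics import mean


def estimate_fdr(observed, perm_scores, called):
    """Estimate the fraction of called genes expected by chance.

    observed:    per-gene scores from the original data.
    perm_scores: one list of per-gene scores for each permutation.
    called:      indices of genes called significant (positive side only).
    """
    if not called:
        return 0.0
    # Cutoff: the weakest score among the genes that were called.
    cutoff = min(observed[g] for g in called)
    # Genes passing the cutoff in permuted data are counted as false calls.
    false_counts = [sum(1 for s in scores if s >= cutoff) for scores in perm_scores]
    return mean(false_counts) / len(called)


if __name__ == "__main__":
    observed = [5.0, 4.0, 1.0, 0.5]
    perm_scores = [
        [4.5, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 2.0, 3.0, 3.9],
        [4.0, 4.2, 0.0, 1.0],
    ]
    print(estimate_fdr(observed, perm_scores, called=[0, 1]))
```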
35. A method for reducing statistical error of a set of associated values of genes, wherein the method comprises:
providing a set of associated values of each gene; and
processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
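Claim 35 does not say which smooth weighting function is meant. As one possible illustration only, the Python sketch below forms a weighted average of a gene's repeated measurements, with Gaussian weights that fall off smoothly with distance from the median. The Gaussian form, the median centre, and the median-absolute-deviation scale are all assumptions introduced here.

```python
import math
from statistics import median


def representative_value(values, scale=None):
    """Weighted average with weights that decay smoothly away from the median.

    scale: hypothetical width of the weighting function; defaults to the
           median absolute deviation (or 1.0 if that is zero).
    """
    center = median(values)
    if scale is None:
        mad = median(abs(v - center) for v in values)
        scale = mad if mad > 0 else 1.0
    # Gaussian weights: near 1 close to the centre, near 0 for outliers.
    weights = [math.exp(-0.5 * ((v - center) / scale) ** 2) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    measurements = [10.0, 10.2, 9.8, 50.0]  # last value is an outlier
    print(representative_value(measurements))
```

Unlike a hard cutoff that discards outliers outright, a smooth weight changes gradually, so a small change in one measurement produces only a small change in the representative value.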
37. A method for comparing sets of associated values of genes, which comprises:
providing sets of associated values of each gene;
processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and
comparing representative values for that gene for the sets.
39. A method for comparing a first and a second set of associated values of genes, which comprises:
providing odd root values of the values in the first set, and odd root values of the values in the second set; and
comparing the odd root values of the values in the first set and the odd root values of the values in the second sets.
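An odd root, such as a cube root, is defined for negative as well as positive inputs and preserves their sign, which a logarithm does not. The Python sketch below shows one way the comparison in claim 39 might look; the choice of the cube root as the default and of a simple per-gene difference as the comparison are illustrative assumptions.

```python
import math


def odd_root(x, n=3):
    """Real n-th root for odd n, keeping the sign of x."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    return math.copysign(abs(x) ** (1.0 / n), x)


def compare_odd_roots(first, second, n=3):
    """Per-gene difference between the odd roots of the two sets."""
    return [odd_root(b, n) - odd_root(a, n) for a, b in zip(first, second)]


if __name__ == "__main__":
    first_set = [8.0, -1.0, 0.0]
    second_set = [27.0, 1.0, 0.0]
    print(compare_odd_roots(first_set, second_set))
```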
44. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets;
deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
45. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising:
defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
46. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets;
ranking the values of the parameter of the genes;
providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and
comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
47. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets;
providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values;
deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values;
finding a lowest rank gene whose derived parameter value extends beyond a first threshold; and
comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
48. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for reducing statistical error of a set of associated values of genes, wherein the method comprises:
providing a set of associated values of each gene; and
processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
49. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for comparing sets of associated values of genes, which comprises:
providing sets of associated values of each gene;
processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and
comparing representative values for that gene for the sets.
50. A computer readable storage device embodying a program of instructions executable by a computer to perform a method for comparing a first and a second set of associated values of genes, which comprises:
providing odd root values of the values in the first set, and odd root values of the values in the second set; and
comparing the odd root values of the values in the first set and the odd root values of the values in the second sets.
51. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets;
deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
52. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said method comprising:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
53. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets;
ranking the values of the parameter of the genes;
providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and
comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
54. A method for transmitting a program of instructions executable by a computer to perform a method for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the method comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets;
providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values;
deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values;
finding a lowest rank gene whose derived parameter value extends beyond a first threshold; and
comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
55. A method for transmitting a program of instructions executable by a computer to perform a method for reducing statistical error of a set of associated values of genes, wherein the method comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
providing a set of associated values of each gene; and
processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
56. A method for transmitting a program of instructions executable by a computer to perform a method for comparing sets of associated values of genes, which comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
providing sets of associated values of each gene;
processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and
comparing representative values for that gene for the sets.
57. A method for transmitting a program of instructions executable by a computer to perform a method for comparing a first and a second set of associated values of genes, which comprises:
causing a program of instructions to be transmitted to a client device, thereby enabling the client device to perform, by means of such program, the following process;
providing odd root values of the values in the first set, and odd root values of the values in the second set; and
comparing the odd root values of the values in the first set and the odd root values of the values in the second sets.
58. A computer system for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
adjusting the parameters of the plurality of genes so that the parameters are substantially independent of scatter values or average associated values of the genes over the sets;
deriving an observed value and an expected value of the adjusted parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance among the sets.
59. A computer system for analyzing a plurality of sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein the associated values correlate with patient survival time, and wherein the associated values of the genes are obtained from a number of data sources, said system comprising:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
defining pairs of death and risk sets, each pair having a corresponding patient death time, where the death set of such pair includes associated values corresponding to the death time of such pair and the risk set of such pair includes associated values corresponding to times occurring after the death time of such pair;
providing for each of the plurality of genes a parameter that contains information concerning differences in the associated values of that gene among the sets;
deriving an observed value and an expected value of the parameter for each gene from the sets of associated values; and
comparing the observed and expected values of the parameter to identify genes whose associated values differ by an amount of statistical significance.
60. A computer system for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
calculating for each gene a value for a statistical parameter indicating differences between associated values of such gene among the original sets;
ranking the values of the parameter of the genes;
providing an expected value of such parameter for each rank, wherein said providing includes permuting the associated values in the original sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values; and
comparing the calculated and expected values for the parameter of the same rank to identify genes whose associated values differ by an amount of statistical significance among the sets.
61. A computer system for analyzing a plurality of original sets of values associated with a plurality of genes to identify genes whose associated values are falsely identified to differ by an amount of statistical significance among the sets, wherein each of the sets of associated values of the genes is obtained from one of a number of data sources, wherein the system comprises:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
defining for each gene a statistical parameter indicating differences between associated values of such gene among the original sets;
providing an expected value of such parameter for each gene, wherein said providing includes permuting the associated values in the sets to arrive at sets different from the original sets for each permutation, deriving a value of such parameter for each permutation, and ranking such values;
deriving for each gene a value for the parameter for each permutation and ranking the genes by their derived parameter values;
finding a lowest rank gene whose derived parameter value extends beyond a first threshold; and
comparing the derived parameter values of other genes for permutations to the second threshold and calling each gene whose derived parameter value extends beyond the second threshold as a gene whose associated values are falsely identified to differ by an amount of statistical significance among the sets.
62. A computer system for reducing statistical error of a set of associated values of genes, wherein the system comprises:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
providing a set of associated values of each gene; and
processing said set of associated values of that gene using a smooth weighting function to yield a representative value for that gene.
63. A computer system for comparing sets of associated values of genes, which comprises:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
providing sets of associated values of each gene;
processing said sets of associated values of that gene using a smooth weighting function to obtain a representative value for that gene from each of the sets; and
comparing representative values for that gene for the sets.
64. A computer system for comparing a first and a second set of associated values of genes, comprising:
one or more computers;
one or more computer programs running on the computer(s), performing the following;
providing odd root values of the values in the first set, and odd root values of the values in the second set; and
comparing the odd root values of the values in the first set and the odd root values of the values in the second sets.
Specification