Necessary and sufficient reagent sets for chemogenomic analysis
Abstract
The present invention discloses methods of data analysis directed to diagnostic development, and in particular the development of signatures for classifying chemogenomic data. The invention provides methods for identifying and functionally characterizing a “necessary” set of information-rich variables. The invention also discloses methods for identifying a plurality of “sufficient” classifiers. The necessary set of variables may be incorporated into a single diagnostic device to provide simultaneous confirmation of a classification measurement with a plurality of independent classifiers. In the field of biological diagnostics, the invention may be used to provide a plurality of short lists of genes, referred to as “signatures,” that are “sufficient” to carry out specific classification tasks such as predicting the activity and side effects of a compound in vivo.
29 Claims
1. A method for determining the necessary set of variables for a classification question, said method comprising:
a. deriving a first linear classifier comprising a first set of variables from a full multivariate dataset, wherein said first linear classifier is capable of answering the classification question with a log odds ratio greater than or equal to a first selected threshold value;
b. removing said first set of variables from the full dataset thereby resulting in a partially depleted dataset;
c. deriving a second linear classifier comprising a second set of variables from the partially depleted dataset, wherein the second linear classifier is capable of answering a classification question with a log odds ratio greater than or equal to a second selected threshold value;
d. removing the variables of the second linear classifier from the partially depleted dataset;
e. repeating steps c and d until the second linear classifier generated is not capable of performing with a log odds ratio greater than or equal to the first selected threshold value;
wherein the combined set of variables from the derived linear classifiers constitutes the necessary set, and the remaining variables in the multivariate dataset constitute the depleted set for answering the classification question.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20
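The iterative loop of claim 1 (derive a classifier, remove its variables, repeat until performance falls below the threshold) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the helper names, the toy classifier that weights variables by the difference of class means, the 0.5 continuity correction in the log odds ratio, scoring on the training data, and the use of one threshold in place of separate first and second thresholds.

```python
import math

def log_odds_ratio(predicted, actual):
    """Log odds ratio of a 2x2 confusion table (0.5 correction is an assumption)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))

def derive_classifier(rows, labels, variables, size):
    """Hypothetical stand-in for classifier derivation: weight each variable by
    the difference of its class means and keep the `size` largest weights."""
    pos = [r for r, y in zip(rows, labels) if y]
    neg = [r for r, y in zip(rows, labels) if not y]

    def mean(group, v):
        return sum(r[v] for r in group) / len(group)

    diffs = {v: mean(pos, v) - mean(neg, v) for v in variables}
    chosen = sorted(variables, key=lambda v: abs(diffs[v]), reverse=True)[:size]
    weights = {v: diffs[v] for v in chosen}

    def score(r):
        return sum(w * r[v] for v, w in weights.items())

    # Decision boundary halfway between the mean scores of the two classes.
    bias = (sum(map(score, pos)) / len(pos) + sum(map(score, neg)) / len(neg)) / 2
    return weights, bias

def predict(classifier, row):
    weights, bias = classifier
    return sum(w * row[v] for v, w in weights.items()) > bias

def strip_necessary_set(rows, labels, threshold, size=2):
    """Steps a-e: derive, remove, repeat until the classifier falls below threshold."""
    remaining = list(rows[0])          # variables still in the dataset
    necessary, sufficient_sets = [], []
    while remaining:
        clf = derive_classifier(rows, labels, remaining, size)
        preds = [predict(clf, r) for r in rows]
        if log_odds_ratio(preds, labels) < threshold:
            break                      # step e: stop, what is left is depleted
        used = list(clf[0])
        sufficient_sets.append(used)
        necessary.extend(used)
        remaining = [v for v in remaining if v not in used]
    return necessary, remaining, sufficient_sets

# Toy data: g1-g4 separate the classes, g5 and g6 carry no class information.
labels = [True] * 4 + [False] * 4
rows = [{"g1": 5.0 if y else 1.0, "g2": 4.0 if y else 1.0,
         "g3": 3.0 if y else 1.0, "g4": 2.5 if y else 1.0,
         "g5": 1.0 + i % 2, "g6": 2.0 - i % 2}
        for i, y in enumerate(labels)]

necessary, depleted, sufficient_sets = strip_necessary_set(rows, labels, threshold=2.0)
print(necessary, depleted, sufficient_sets)
```

On this toy data the loop yields two classifiers before stopping, so the necessary set is the union of their variables and the two uninformative variables form the depleted set. A realistic implementation would estimate the log odds ratio on held-out data rather than on the samples used to derive each classifier.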
19. A method for preparing a reagent set comprising:
a. deriving a first linear classifier comprising a first set of genes from a full dataset, wherein said first linear classifier is capable of answering a classification question with a log odds ratio greater than or equal to a first selected threshold value;
b. removing said first set of genes from the full dataset thereby resulting in a partially depleted chemogenomic dataset;
c. deriving a second linear classifier comprising a second set of genes from the partially depleted dataset, wherein the second linear classifier is capable of answering a classification question with a log odds ratio greater than or equal to a second selected threshold value;
d. removing said second set of genes from the partially depleted dataset;
e. preparing a plurality of isolated polynucleotides or polypeptides, wherein each polynucleotide or polypeptide is capable of detecting at least one gene of said first and second sets of genes.
21. A reagent set for answering a classification question comprising a set of polynucleotides or polypeptides representing a plurality of genes, wherein the addition of a random selection of at least 10% of said plurality of genes to the depleted set for the classification question increases the average log odds ratio of the linear classifiers generated by the depleted set by at least 20%.
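The test recited in claim 21 can be read as a relative-gain calculation: add a random 10% of the reagent-set genes to the depleted set and compare the average log odds ratio before and after. The sketch below is one hedged reading of that test. The function name, the `average_log_odds` callback, the number of random trials, and the handling of a non-positive baseline are all assumptions; the claim does not specify them.

```python
import math
import random

def relative_gain_from_reagents(average_log_odds, depleted, reagent_genes,
                                fraction=0.10, trials=20, seed=0):
    """Mean relative gain in average log odds ratio after adding a random
    `fraction` of the reagent genes to the depleted set.

    `average_log_odds` is a hypothetical callback that derives classifiers
    from a list of variables and returns their average log odds ratio.
    """
    rng = random.Random(seed)
    baseline = average_log_odds(list(depleted))
    if baseline <= 0:
        # A percentage increase is undefined here; how to treat this case
        # is an assumption of this sketch.
        raise ValueError("baseline log odds ratio must be positive")
    n_added = max(1, math.ceil(fraction * len(reagent_genes)))
    gains = []
    for _ in range(trials):
        added = rng.sample(list(reagent_genes), n_added)
        augmented = average_log_odds(list(depleted) + added)
        gains.append((augmented - baseline) / baseline)
    return sum(gains) / len(gains)

# Toy callback with made-up numbers: any reagent gene lifts the score to 1.5.
reagent_genes = [f"r{i}" for i in range(10)]
depleted = ["d1", "d2", "d3"]

def toy_average_log_odds(variables):
    return 1.5 if any(v in reagent_genes for v in variables) else 1.0

gain = relative_gain_from_reagents(toy_average_log_odds, depleted, reagent_genes)
meets_claim_21 = gain >= 0.20
```

With the toy callback the relative gain is 50%, so the 20% criterion is met. In practice the callback would wrap a full classifier-derivation and evaluation pipeline.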
29. A method of classifying experimental data comprising:
a. providing at least two non-overlapping sufficient sets of variables useful for answering a classification question;
b. querying the experimental data with one of the at least two non-overlapping sufficient sets of variables;
c. querying the experimental data with another of the at least two non-overlapping sufficient sets of variables;
wherein the classification of the data is determined based on the answers to the queries generated by the at least two non-overlapping sets of variables.
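As an illustration of claim 29, the sketch below queries one sample with two signatures that share no variables and combines the two answers. The claim does not say how the answers are combined, so the agreement rule used here (return the shared answer, or `None` when the signatures disagree) is an assumption, as are the function names, weights, and bias values.

```python
def query(signature, sample):
    """Apply one linear signature, given as (weights, bias), to a sample."""
    weights, bias = signature
    return sum(w * sample[g] for g, w in weights.items()) > bias

def classify_with_independent_signatures(sample, signature_a, signature_b):
    """Query the sample with two non-overlapping signatures.

    Returns the shared answer when both agree, or None when they disagree
    (the combination rule is an assumption of this sketch).
    """
    if set(signature_a[0]) & set(signature_b[0]):
        raise ValueError("signatures must not share variables")
    answer_a = query(signature_a, sample)
    answer_b = query(signature_b, sample)
    return answer_a if answer_a == answer_b else None

# Hypothetical signatures over disjoint gene sets, with made-up weights.
signature_a = ({"g1": 4.0, "g2": 3.0}, 19.5)
signature_b = ({"g3": 2.0, "g4": 1.5}, 6.625)

treated = {"g1": 5.0, "g2": 4.0, "g3": 3.0, "g4": 2.5}
control = {"g1": 1.0, "g2": 1.0, "g3": 1.0, "g4": 1.0}
mixed = {"g1": 5.0, "g2": 4.0, "g3": 1.0, "g4": 1.0}

results = [classify_with_independent_signatures(s, signature_a, signature_b)
           for s in (treated, control, mixed)]
```

Because the two signatures use different variables, agreement between them acts as an independent confirmation of the classification, which is the benefit the abstract attributes to placing several sufficient classifiers on one device.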
Specification