Systems and methods for generating biomarker signatures with integrated bias correction and class prediction
First Claim
1. A computer-implemented method of classifying a data set into two or more classes, comprising:
(a) receiving, by a biomarker generator, a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
(b) receiving, by the biomarker generator, a test data set;
(c) generating, by the biomarker generator, a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
(d) generating, by the biomarker generator, a first test class set by classifying the elements in the test data set according to the first classifier;
(e) transforming, by the biomarker generator, the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
(f) for each of a plurality of iterations:
(i) transforming, by the biomarker generator, the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
(ii) generating, by the biomarker generator, a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
(iii) storing, by the biomarker generator, when the first test class set and the second test class set differ, the second test class set as the first test class set and the transformed test data set as the test data set and returning to step (i); and
(g) outputting, by the biomarker generator, when the first test class set is the same as the second test class set, the second test class set.
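The steps of claim 1 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-centroid rule stands in for both the "first" and "second machine learning technique", each class centroid is taken to be the class mean, the test class centroids are computed from the most recent test class set, and the `max_iter` cap is added only so the sketch always terminates. All function names are hypothetical.

```python
import numpy as np

def center_of_class_centroids(X, labels):
    """Mean of the per-class centroids (each centroid is a class mean here)."""
    centroids = [X[labels == c].mean(axis=0) for c in np.unique(labels)]
    return np.mean(centroids, axis=0)

def fit_nearest_centroid(X, y):
    """Assumed stand-in for a machine learning technique: one centroid per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(model, X):
    """Assign each row of X to the class with the nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def classify_with_bias_correction(X_train, y_train, X_test, max_iter=50):
    # (c)-(d): first classifier and first test class set on the raw data
    first = predict(fit_nearest_centroid(X_train, y_train), X_test)
    # (e): shift the training data by the center of its class centroids
    X_train_t = X_train - center_of_class_centroids(X_train, y_train)
    # Second classifier: trained on the transformed training data. In this
    # sketch the transformed training data never changes, so it is fit once.
    model = fit_nearest_centroid(X_train_t, y_train)
    for _ in range(max_iter):  # max_iter is an assumption, not in the claim
        # (f)(i): shift the test data by the center of its class centroids,
        # using the current test class set to define the classes (assumption)
        X_test_t = X_test - center_of_class_centroids(X_test, first)
        # (f)(ii): second test class set from the second classifier
        second = predict(model, X_test_t)
        # (g): output once the two test class sets agree
        if np.array_equal(first, second):
            return second
        # (f)(iii): store the new class set and transformed data, then repeat
        first, X_test = second, X_test_t
    return second
```

For example, with training values clustered around -1 (class 0) and +1 (class 1) and a test set offset by +1.0, the first classifier mislabels one test element, and the loop settles on the corrected labels after two passes.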
Abstract
Described herein are systems and methods for correcting a data set and classifying the data set in an integrated manner. A training data set, a training class set, and a test data set are received. A first classifier is generated for the training data set by applying a machine learning technique to the training data set and the training class set, and a first test class set is generated by classifying the elements in the test data set according to the first classifier. For each of multiple iterations, the training data set is transformed, the test data set is transformed, and a second classifier is generated by applying a machine learning technique to the transformed training data set. A second test class set is generated according to the second classifier, and the first test class set is compared to the second test class set.
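The transformation mentioned in the abstract is specified in claim 1, steps (e) and (f)(i), as a shift by the center of a set of class centroids. The short sketch below, which assumes each centroid is a class mean and uses a hypothetical helper name and made-up toy values, shows how that center differs from the plain mean of the data when the classes are imbalanced.

```python
import numpy as np

def center_of_class_centroids(X, labels):
    """Mean of the per-class centroids (each centroid is a class mean here)."""
    centroids = [X[labels == c].mean(axis=0) for c in np.unique(labels)]
    return np.mean(centroids, axis=0)

# Toy data: three elements of class "A" at 0.0 and one element of class "B" at 4.0.
X = np.array([[0.0], [0.0], [0.0], [4.0]])
labels = np.array(["A", "A", "A", "B"])

overall_mean = X.mean(axis=0)                           # pulled toward the larger class
centroid_center = center_of_class_centroids(X, labels)  # midpoint of the two class means
shifted = X - centroid_center                           # the shifted data set

print(overall_mean, centroid_center, shifted.ravel())
```

Here the overall mean is 1.0 while the center of the class centroids is 2.0, so shifting by the latter places the two classes symmetrically around zero regardless of how many elements each class contains.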
22 Claims
1. A computer-implemented method of classifying a data set into two or more classes, comprising:
(a) receiving, by a biomarker generator, a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
(b) receiving, by the biomarker generator, a test data set;
(c) generating, by the biomarker generator, a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
(d) generating, by the biomarker generator, a first test class set by classifying the elements in the test data set according to the first classifier;
(e) transforming, by the biomarker generator, the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
(f) for each of a plurality of iterations:
(i) transforming, by the biomarker generator, the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
(ii) generating, by the biomarker generator, a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
(iii) storing, by the biomarker generator, when the first test class set and the second test class set differ, the second test class set as the first test class set and the transformed test data set as the test data set and returning to step (i); and
(g) outputting, by the biomarker generator, when the first test class set is the same as the second test class set, the second test class set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed in a computerized system comprising at least one processor, cause said at least one processor to carry out operations, the operations comprising:
(a) receiving a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
(b) receiving a test data set;
(c) generating a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
(d) generating a first test class set by classifying the elements in the test data set according to the first classifier;
(e) transforming the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
(f) for each of a plurality of iterations:
(i) transforming the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
(ii) generating a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
(iii) when the first test class set and the second test class set differ, storing the second test class set as the first test class set, storing the transformed test data set as the test data set, and returning to step (i); and
(g) outputting, when the first test class set is the same as the second test class set, the second test class set.
Dependent claims: 17, 18
19. A computerized system comprising:
a biomarker generator; and
a memory coupled to the biomarker generator and having instructions stored thereon that, when executed, cause the biomarker generator to perform operations, the operations comprising:
(a) receiving a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
(b) receiving a test data set;
(c) generating a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
(d) generating a first test class set by classifying the elements in the test data set according to the first classifier;
(e) transforming the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
(f) for each of a plurality of iterations:
(i) transforming the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
(ii) generating a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
(iii) storing, when the first test class set and the second test class set differ, the second test class set as the first test class set and the transformed test data set as the test data set and returning to step (i); and
(g) outputting, when the first test class set is the same as the second test class set, the second test class set.
Dependent claims: 20, 21, 22
Specification