Systems and methods for generating biomarker signatures with integrated bias correction and class prediction

US 10,339,464 B2
Filed: 06/21/2013
Issued: 07/02/2019
Est. Priority Date: 06/21/2012
Status: Active Grant

Abstract

Described herein are systems and methods for correcting a data set and classifying the data set in an integrated manner. A training data set, a training class set, and a test data set are received. A first classifier is generated for the training data set by applying a machine learning technique to the training data set and the training class set, and a first test class set is generated by classifying the elements in the test data set according to the first classifier. For each of multiple iterations, the training data set is transformed, the test data set is transformed, and a second classifier is generated by applying a machine learning technique to the transformed training data set. A second test class set is generated according to the second classifier, and the first test class set is compared to the second test class set.
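The iterative loop summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a nearest-centroid classifier stands in for the unspecified machine learning technique, the test class centroids are computed from the currently predicted labels, and every function name (`class_centroids`, `shift_by_centroid_center`, `fit_nearest_centroid`, `correct_and_classify`) and the `max_iter` cap are hypothetical.

```python
import numpy as np

def class_centroids(X, labels):
    """Mean vector of each class that appears in `labels`."""
    return np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])

def shift_by_centroid_center(X, labels):
    """Subtract the center (mean) of the class centroids from every element."""
    return X - class_centroids(X, labels).mean(axis=0)

def fit_nearest_centroid(X, labels):
    """Assumed stand-in classifier: assign each element to the closest class centroid."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])

    def predict(Z):
        dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        return classes[dist.argmin(axis=1)]

    return predict

def correct_and_classify(X_train, y_train, X_test, max_iter=50):
    # First classifier and first test class set, from the untransformed data.
    first = fit_nearest_centroid(X_train, y_train)(X_test)
    # Transform the training data once and fit the second classifier on it.
    X_train_t = shift_by_centroid_center(X_train, y_train)
    second_clf = fit_nearest_centroid(X_train_t, y_train)
    for _ in range(max_iter):  # hypothetical cap; the source gives no limit
        # Shift the test data using centroids of the currently predicted classes.
        X_test_t = shift_by_centroid_center(X_test, first)
        second = second_clf(X_test_t)
        if np.array_equal(first, second):
            return second  # class sets agree: output the second test class set
        # Class sets differ: keep the new labels and the transformed test data.
        first, X_test = second, X_test_t
    return first

# Toy data: the test set is the training set offset by a constant bias of +3 in x.
X_train = np.array([[-2.0, 0.0], [-2.2, 0.1], [-1.8, -0.1],
                    [2.0, 0.0], [2.2, 0.1], [1.8, -0.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = X_train + np.array([3.0, 0.0])
labels = correct_and_classify(X_train, y_train, X_test)
```

In this toy input the uncorrected first pass assigns every test element to class 1 because of the offset; after one shift the predicted labels match the training labels, and the next iteration confirms they no longer change.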

22 Claims

1. A computer-implemented method of classifying a data set into two or more classes, comprising:
- (a) receiving, by a biomarker generator, a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
  
  (b) receiving, by the biomarker generator, a test data set;
  
  (c) generating, by the biomarker generator, a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
  
  (d) generating, by the biomarker generator, a first test class set by classifying the elements in the test data set according to the first classifier;
  
  (e) transforming, by the biomarker generator, the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
  
  (f) for each of a plurality of iterations:
  
  (i) transforming, by the biomarker generator, the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
  
  (ii) generating, by the biomarker generator, a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
  
  (iii) storing, by the biomarker generator, when the first test class set and the second test class set differ, the second test class set as the first test class set and the transformed test data set as the test data set and returning to step (i); and
  
  (g) outputting, by the biomarker generator, when the first test class set is the same as the second test class set, the second test class set.
- - 2. The method of claim 1, wherein the elements of the training data set represent gene expression data for a patient with a disease, for a patient resistant to the disease, or for a patient without the disease.
  - 3. The method of claim 1, wherein the training data set is formed from a random subset of samples in an aggregate data set, and the test data set is formed from a remaining subset of samples in the aggregate data set.
  - 4. The method of claim 1, wherein:
    - the test data set includes a test set of known labels, each known label identifying a class associated with each element in the test data set;
      
      the first test class set includes a set of predicted labels for the test data set; and
      
      the second test class set includes a set of predicted labels for the transformed test data set.
  - 5. The method of claim 1, wherein the shifting at step (e) includes applying a rotation, a linear transformation, or a non-linear transformation to the training data set to obtain the transformed training data set.
  - 6. The method of claim 1, wherein the shifting at step (i) includes applying a rotation, a shear, a linear transformation, or a nonlinear transformation to the test data set to obtain the transformed test data set.
  - 7. The method of claim 1, further comprising:
    - (h) comparing, by the biomarker generator, the first test class set to the second test class set for each of the plurality of iterations.
  - 8. The method of claim 1, wherein the transforming at step (e) is performed by applying the same transformation of step (i).
  - 9. The method of claim 1, further comprising:
    - (h) providing, by the biomarker generator, the second test class set to a display device, a printing device, or a storing device.
  - 10. The method of claim 1, wherein the first test class set and the second test class set differ if any element of the first test class set differs from a corresponding element of the second test class set.
  - 11. The method of claim 1, wherein the second test class set includes a set of predicted labels for the transformed test data set, the method further comprising:
    - evaluating, by the biomarker generator, the second classifier by computing a performance metric representative of a number of correct predicted labels in the second test class set divided by a total number of predicted labels.
  - 12. The method of claim 1, wherein the first machine learning technique and the second machine learning technique are the same.
  - 13. The method of claim 1, further comprising:
    - (h) determining, by the biomarker generator, based on the second class set, a set of candidate biomarkers and at least one candidate error rate.
  - 14. The method of claim 13, further comprising:
    - (i) receiving, by a biomarker consolidator in communication with the biomarker generator, the set of candidate biomarkers and the candidate error rate;
      
      (j) determining, by the biomarker generator, a performance measure and size of each set of candidate biomarkers; and
      
      (k) selecting, by the biomarker generator, based on the performance measure and size of each set of candidate biomarkers, an optimal biomarker.
  - 15. The method of claim 13, further comprising:
    - (l) controlling, by a central control unit in communication with the biomarker generator and the biomarker consolidator, at least partially the operation of the biomarker generator and the biomarker consolidator.
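Three of the dependent claims above describe small, concrete computations: the random training/test split of claim 3, the "differ" test of claim 10, and the performance metric of claim 11. The sketch below illustrates them under stated assumptions: the function names, the `train_fraction` and `seed` parameters, and the use of NumPy are all hypothetical and do not come from the patent. Computing the metric also assumes the test data set carries known labels, as in claim 4.

```python
import numpy as np

def random_split(n_samples, train_fraction=0.5, seed=0):
    """Claim 3 (sketch): random subset for training, remaining samples for test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

def class_sets_differ(first, second):
    """Claim 10 (sketch): the sets differ if any corresponding element differs."""
    return not np.array_equal(np.asarray(first), np.asarray(second))

def accuracy(predicted, known):
    """Claim 11 (sketch): correct predicted labels divided by total predicted labels."""
    predicted = np.asarray(predicted)
    known = np.asarray(known)
    return float(np.mean(predicted == known))

# Usage: indices into an aggregate data set of 10 samples, 60% used for training.
train_idx, test_idx = random_split(10, train_fraction=0.6)
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 3 of 4 labels match
```

Returning index arrays rather than copies of the data keeps the split reusable for both the data set and its label set.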

16. A non-transitory computer-readable storage medium having instructions stored thereon that, when executed in a computerized system comprising at least one processor, cause said at least one processor to carry out operations, the operations comprising:
- (a) receiving a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
  
  (b) receiving a test data set;
  
  (c) generating a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
  
  (d) generating a first test class set by classifying the elements in the test data set according to the first classifier;
  
  (e) transforming the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
  
  (f) for each of a plurality of iterations:
  
  (i) transforming the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
  
  (ii) generating a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
  
  (iii) when the first test class set and the second test class set differ, storing the second test class set as the first test class set, storing the transformed test data set as the test data set, and returning to step (i); and
  
  (g) outputting, when the first test class set is the same as the second test class set, the second test class set.
- - 17. The non-transitory computer-readable storage medium of claim 16, the operations further comprising:
    - (h) outputting, when the first test class set and the second test class set do not differ, the second test class set.
  - 18. The non-transitory computer-readable storage medium of claim 16, wherein the elements of the training data set represent gene expression data for a patient with a disease, for a patient resistant to the disease, or for a patient without the disease.

19. A computerized system comprising:
- a biomarker generator; and
  
  a memory coupled to the biomarker generator and having instructions stored thereon that, when executed, cause the biomarker generator to perform operations, the operations comprising:
  
  (a) receiving a training data set and a training class set, the training class set including a set of known labels, each known label identifying a class associated with each element in the training data set;
  
  (b) receiving a test data set;
  
  (c) generating a first classifier for the training data set by applying a first machine learning technique to the training data set and the training class set;
  
  (d) generating a first test class set by classifying the elements in the test data set according to the first classifier;
  
  (e) transforming the training data set by shifting the elements in the training data set by an amount corresponding to a center of a set of training class centroids, wherein each training class centroid is representative of a center of a subset of elements in the training data set;
  
  (f) for each of a plurality of iterations:
  
  (i) transforming the test data set by shifting the elements in the test data set by an amount corresponding to a center of a set of test class centroids, wherein each test class centroid is representative of a center of a subset of elements in the test data set;
  
  (ii) generating a second test class set by classifying the elements in the transformed test data set according to a second classifier, wherein the second classifier is generated by applying a second machine learning technique to the transformed training data set and the training class set; and
  
  (iii) storing, when the first test class set and the second test class set differ, the second test class set as the first test class set and the transformed test data set as the test data set and returning to step (i); and
  
  (g) outputting, when the first test class set is the same as the second test class set, the second test class set.
- - 20. The computerized system of claim 19, the operations further comprising:
    - (h) outputting, when the first test class set and the second test class set do not differ, the second test class set.
  - 21. The computerized system of claim 19, wherein the elements of the training data set represent gene expression data for a patient with a disease, for a patient resistant to the disease, or for a patient without the disease.
  - 22. The computerized system of claim 19, wherein the training data set is formed from a random subset of samples in an aggregate data set, and the test data set is formed from a remaining subset of samples in the aggregate data set.

Current Assignee: Philip Morris Products SA (Philip Morris Limited)
Original Assignee: Philip Morris Products SA (Philip Morris Limited)
Inventors: Martin, Florian; Xiang, Yang
Primary Examiner(s): Cassity, Robert A
Assistant Examiner(s): Coughlan, Peter D
Application Number: US14/409,681
Publication Number: US 20150178639A1
Time in Patent Office: 2,202 Days
CPC Class Codes

G06F 16/285   Clustering or classification

G06N 20/00   Machine learning

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16H 50/20   for computer-aided diagnosi...
