Systems and methods for generating biomarker signatures with integrated dual ensemble and generalized simulated annealing techniques

US 10,373,708 B2
Filed: 06/21/2013
Issued: 08/06/2019
Est. Priority Date: 06/21/2012
Status: Active Grant

Abstract

Described herein are systems and methods for classifying a data set using an ensemble classification technique. Classifiers are iteratively generated by applying machine learning techniques to a training data set, and training class sets are generated by classifying the elements in the training data set according to the classifiers. Objective values are computed based on the training class sets, and objective values associated with different classifiers are compared until a desired number of iterations is reached, and a final training class set is output.
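The loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the four threshold rules in `METHODS` are hypothetical stand-ins for real classification methods, the objective is a plain error rate (lower is better), and candidates are drawn independently at random rather than by any particular search strategy.

```python
import random

# Hypothetical stand-ins for classification methods: each maps a sample
# (a tuple of two feature values) to a predicted label, 0 or 1.
METHODS = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
    lambda x: 1,  # deliberately poor method that always votes 1
]


def ensemble_predict(method_ids, samples):
    """Majority vote of the selected methods; ties go to label 1."""
    labels = []
    for x in samples:
        votes = sum(METHODS[i](x) for i in method_ids)
        labels.append(int(2 * votes >= len(method_ids)))
    return labels


def error_rate(known, predicted):
    """Assumed objective value: fraction of mislabelled training elements."""
    return sum(k != p for k, p in zip(known, predicted)) / len(known)


def search(samples, known, n_iter, rng):
    """Keep the predicted labels of the best ensemble seen so far."""
    current = rng.sample(range(len(METHODS)), 2)
    best_labels = ensemble_predict(current, samples)
    best_obj = error_rate(known, best_labels)
    for _ in range(n_iter):
        # Draw a candidate set of methods (assumption: uniform random draw).
        size = rng.randint(1, len(METHODS))
        candidate = rng.sample(range(len(METHODS)), size)
        labels = ensemble_predict(candidate, samples)
        obj = error_rate(known, labels)
        if obj < best_obj:  # the candidate outperforms the incumbent
            best_labels, best_obj = labels, obj
    return best_labels, best_obj


samples = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.9), (0.1, 0.4)]
known = [1, 0, 1, 0]
final_labels, final_obj = search(samples, known, 50, random.Random(0))
```

The returned `final_labels` plays the role of the final training class set that is output once the desired number of iterations is reached.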

20 Claims

1. A computer-implemented method of classifying a data set into two or more classes executed by a processor, comprising:
- (a) receiving a training data set associated with the data set and having a set of known labels, wherein the data set comprises gene set data, and each gene set data corresponds to one of a plurality of biological state classes, and wherein the labels identify the biological state classes of the gene set data;
  
  (b) generating a first classifier for the training data set by applying a first machine learning technique to the training data set, wherein the first machine learning technique identifies a first set of classification methods, wherein each classification method votes on the training data set;
  
  (c) classifying elements in the training data set according to the first classifier to obtain a first set of predicted labels for the training data set;
  
  (d) computing a first objective value from the first set of predicted labels and the set of known labels;
  
  (e) for each of a plurality of iterations, performing the following steps (i)-(v);
  
  (i) generating a second classifier for the training data set by applying a second machine learning technique to the training data set, wherein the second machine learning technique identifies a second set of classification methods that is different from the first set of classification methods by at least one classification method, wherein each classification method votes on the training data set;
  
  (ii) classifying the elements in the training data set according to the second classifier to obtain a second set of predicted labels for the training data set;
  
  (iii) computing a second objective value from the second set of predicted labels and the set of known labels;
  
  (iv) comparing the first objective value to the second objective value to determine whether the second classifier outperforms the first classifier; and
  
  (v) replacing the first set of predicted labels with the second set of predicted labels and replacing the first objective value with the second objective value when the second classifier outperforms the first classifier, and return to step (i); and
  
  (f) when a desired number of iterations has been reached, outputting the first set of predicted labels.
- Dependent claims 2 to 18:
  - 2. The method of claim 1, wherein the training data set is formed by selecting a subset of training data samples from an aggregate training data set, the method further comprising bootstrapping the aggregate training data set to generate a plurality of additional training data sets, and repeating the steps (a) through (f) for each additional training data set.
  - 3. The method of claim 2, wherein the bootstrapping is performed with balanced samples.
  - 4. The method of claim 1, further comprising:
    - selecting a sample in a test data set that is different from the training data set and does not have a set of known labels; and
      
      using the identified classifier to predict a label for the selected sample.
  - 5. The method of claim 1, wherein:
    - the first set of classification methods is obtained by using a first random vector to select a subset of an aggregate set of classification methods;
      
      the first random vector includes a set of binary values corresponding to the aggregate set of classification methods;
      
      each binary value indicates whether the corresponding classification method in the aggregate set is included in the first set of classification methods; and
      
      the second set of classification methods is obtained by using a second random vector including a different set of binary values.
  - 6. The method of claim 5, wherein the first random vector includes parameters that include a flag variable indicating whether to perform balanced bootstrapping, a number of bootstraps, a list of classification methods, a list of genes, or a combination thereof.
  - 7. The method of claim 5, wherein the step of computing the second objective value comprises implementing a simulated annealing method.
  - 8. The method of claim 7, wherein the simulated annealing method comprises updating one or more values of the first random vector to obtain the second random vector.
  - 9. The method of claim 8, wherein:
    - updating the one or more values of the first random vector comprises randomly updating each element of the first random vector to obtain the second random vector.
  - 10. The method of claim 5, wherein:
    - the plurality of iterations includes a first set of iterations and a second set of iterations; and
      
      each subsequent second random vector differs from a previous second random vector by an amount that is larger for the first set of iterations than for the second set of iterations.
  - 11. The method of claim 10, wherein for each iteration in the first set of iterations and in the second set of iterations,
    - a first subset of the subsequent second random vector is selected to be the same as a corresponding first subset of the previous second random vector,
      
      a second subset of the subsequent second random vector is selected to be different from a corresponding second subset of the previous second random vector,
      
      a size of the first subset is smaller for the first set of iterations than for the second set of iterations, and
      
      a size of the second subset is larger for the first set of iterations than for the second set of iterations.
  - 12. The method of claim 11, wherein the size of the second subset for the first set of iterations is approximately 20% of the length of the second random vector, and the size of the second subset for the second set of iterations is one.
  - 13. The method of claim 1, wherein the second objective value corresponds to a Matthews correlation coefficient that is assessed from the second set of predicted labels and the set of known labels.
  - 14. The method of claim 1, further comprising:
    - determining that the second classifier outperforms the first classifier when the second objective value is less than the first objective value, and if the second objective value is greater than the first objective value, when a random value is less than a probability value that is computed from the first objective value and the second objective value.
  - 15. The method of claim 14, wherein the probability value is computed from a control parameter q, the first objective value, the second objective value, and a temperature value that is computed from a cooling formula.
  - 16. The method of claim 1, wherein the second classifier is selected from a group comprising linear discriminant analysis, support vector machine-based methods, random forest methods, and k nearest neighbor methods.
  - 17. The method of claim 1, wherein the biological state classes indicate diseased or disease-free.
  - 18. The method of claim 1, wherein the elements in the training data set are classified using a classification rule associated with the first classifier to obtain the first set of predicted labels for the training data set, and the elements in the training data set are classified using a classification rule associated with the second classifier to obtain the second set of predicted labels for the training data set.
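
The binary random vector of claim 5 and the two-phase perturbation schedule of claims 10 to 12 can be illustrated with a short sketch. The method names in `METHODS`, the helper names, and the rule of keeping at least one method in the initial draw are assumptions made for illustration; the claims specify only that each binary value marks whether a method is included, that roughly 20% of the vector changes per step in the first set of iterations, and that one element changes per step in the second.

```python
import random

# Hypothetical aggregate set of classification methods (names only).
METHODS = ["lda", "svm", "random_forest", "knn", "naive_bayes"]


def random_vector(n, rng):
    """Draw a binary vector; a 1 means the method at that position is included."""
    vec = [rng.randint(0, 1) for _ in range(n)]
    if not any(vec):
        vec[rng.randrange(n)] = 1  # assumption: start with at least one method
    return vec


def select_methods(vector, methods):
    """Return the subset of methods flagged by the binary vector."""
    return [m for bit, m in zip(vector, methods) if bit == 1]


def flips_for_iteration(iteration, n_early, length):
    """About 20% of the vector in the first set of iterations, then one element."""
    if iteration < n_early:
        return max(1, round(0.2 * length))
    return 1


def perturb(vector, n_flips, rng):
    """Return a copy of the vector with exactly n_flips positions flipped.

    The result always differs from the input; an all-zero result (an empty
    selection) is not handled here and would need a rule of its own.
    """
    new = list(vector)
    for i in rng.sample(range(len(vector)), n_flips):
        new[i] = 1 - new[i]
    return new


rng = random.Random(0)
first = random_vector(10, rng)
second = perturb(first, flips_for_iteration(0, 5, len(first)), rng)
```

Flipping a fixed number of positions makes the size of the changed subset explicit, which is what distinguishes the early, coarse iterations from the later, fine-grained ones.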

19. A computer program product comprising computer-readable instructions that, when executed in a computerized system comprising at least one processor, cause the processor to carry out a method, comprising:
- (a) receiving a training data set associated with a data set and having a set of known labels, wherein the data set comprises gene set data, and each gene set data corresponds to one of a plurality of biological state classes, and wherein the labels identify the biological state classes of the gene set data;
  
  (b) generating a first classifier for the training data set by applying a first machine learning technique to the training data set, wherein the first machine learning technique identifies a first set of classification methods, wherein each classification method votes on the training data set;
  
  (c) classifying elements in the training data set according to the first classifier to obtain a first set of predicted labels for the training data set;
  
  (d) computing a first objective value from the first set of predicted labels and the set of known labels;
  
  (e) for each of a plurality of iterations, performing the following steps (i)-(v);
  
  (i) generating a second classifier for the training data set by applying a second machine learning technique to the training data set, wherein the second machine learning technique identifies a second set of classification methods that is different from the first set of classification methods by at least one classification method, wherein each classification method votes on the training data set;
  
  (ii) classifying the elements in the training data set according to the second classifier to obtain a second set of predicted labels for the training data set;
  
  (iii) computing a second objective value from the second set of predicted labels and the set of known labels;
  
  (iv) comparing the first objective value to the second objective value to determine whether the second classifier outperforms the first classifier; and
  
  (v) replacing the first set of predicted labels with the second set of predicted labels and replacing the first objective value with the second objective value when the second classifier outperforms the first classifier, and return to step (i); and
  
  (f) when a desired number of iterations has been reached, outputting the first set of predicted labels.

20. A computerized system comprising a processing device configured with non-transitory computer-readable instructions that, when executed, cause the processing device to carry out a method comprising:
- (a) receiving a training data set associated with a data set and having a set of known labels, wherein the data set comprises gene set data, and each gene set data corresponds to one of a plurality of biological state classes, and wherein the labels identify the biological state classes of the gene set data;
  
  (b) generating a first classifier for the training data set by applying a first machine learning technique to the training data set, wherein the first machine learning technique identifies a first set of classification methods, wherein each classification method votes on the training data set;
  
  (c) classifying elements in the training data set according to the first classifier to obtain a first set of predicted labels for the training data set;
  
  (d) computing a first objective value from the first set of predicted labels and the set of known labels;
  
  (e) for each of a plurality of iterations, performing the following steps (i)-(v);
  
  (i) generating a second classifier for the training data set by applying a second machine learning technique to the training data set, wherein the second machine learning technique identifies a second set of classification methods that is different from the first set of classification methods by at least one classification method, wherein each classification method votes on the training data set;
  
  (ii) classifying the elements in the training data set according to the second classifier to obtain a second set of predicted labels for the training data set;
  
  (iii) computing a second objective value from the second set of predicted labels and the set of known labels;
  
  (iv) comparing the first objective value to the second objective value to determine whether the second classifier outperforms the first classifier;
  
  (v) replacing the first set of predicted labels with the second set of predicted labels and replacing the first objective value with the second objective value when the second classifier outperforms the first classifier, and return to step (i); and
  
  (f) when a desired number of iterations has been reached, outputting the first set of predicted labels.
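
The objective and acceptance rule recited in claims 13 to 15 can be sketched as below. The claims do not state the probability or cooling formulas, so this sketch assumes the forms commonly given in the generalized simulated annealing literature (Tsallis and Stariolo); the parameter names `q`, `qv`, and `t0`, their default values, and the choice of `1 - MCC` as the quantity to minimise are likewise assumptions for illustration.

```python
import math
import random


def mcc(known, predicted):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(known, predicted))
    tp = sum(1 for k, p in pairs if k == 1 and p == 1)
    tn = sum(1 for k, p in pairs if k == 0 and p == 0)
    fp = sum(1 for k, p in pairs if k == 0 and p == 1)
    fn = sum(1 for k, p in pairs if k == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def objective(known, predicted):
    # Assumption: minimise 1 - MCC so that a smaller value is better,
    # matching the "less than" comparison in claim 14.
    return 1.0 - mcc(known, predicted)


def temperature(step, t0=1.0, qv=2.5):
    """Assumed cooling formula; step counts from 1, so temperature(1) == t0."""
    return t0 * (2 ** (qv - 1) - 1) / ((1 + step) ** (qv - 1) - 1)


def accept(first_obj, second_obj, temp, q=1.5, rng=random):
    """Decide whether the second classifier replaces the first."""
    if second_obj < first_obj:
        return True  # strictly better: always accept
    # Otherwise accept with a probability that shrinks as the gap grows
    # and as the temperature falls (assumed form, valid for q > 1).
    delta = second_obj - first_obj
    prob = (1.0 + (q - 1.0) * delta / temp) ** (-1.0 / (q - 1.0))
    return rng.random() < prob
```

Accepting an occasional worse candidate early on, while the temperature is high, is what lets the search escape a locally good but globally poor set of classification methods.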

Current Assignee
Philip Morris Products SA (Philip Morris Limited)
Original Assignee
Philip Morris Products SA (Philip Morris Limited)
Inventors
Xiang, Yang; Hoeng, Julia; Martin, Florian
Primary Examiner(s)
Nilsson, Eric

Application Number

US14/409,679
Publication Number

US 20150154353A1
Time in Patent Office

2,237 Days
CPC Class Codes

G06N 20/00   Machine learning

G06N 5/027   Frames

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis
