Estimating the accuracy of molecular property models and predictions

US 7,194,359 B2
Filed: 06/29/2005
Issued: 03/20/2007
Est. Priority Date: 06/29/2004
Status: Active Grant

Abstract

Embodiments of the invention provide methods for evaluating the accuracy of a molecular properties model (or of predictions generated using a molecular properties model). The accuracy of a molecular properties model may be evaluated using three general approaches: (i) using the same dataset both to train the model and to estimate its accuracy, (ii) using distinct datasets to train and subsequently test a model, and (iii) using multiple models (or sets of predictions).

33 Claims

1. A method for estimating an accuracy of a molecular properties model comprising:
- selecting a dataset, wherein the dataset includes at least one molecule description in a form appropriate for the molecular properties model and a value for a molecular property;
  
  providing the dataset to the molecular properties model to obtain a prediction for each molecule represented by a molecule description in the dataset; and
  
  estimating a confidence interval or bound on the accuracy of the molecular properties model in generating a prediction for a test molecule, based on the obtained predictions, relative to a selected measure of performance, wherein the molecular properties model generates predictions related to a property of interest selected from at least one of a physiological activity, pharmacokinetic property, pharmacodynamic property, physiological or pharmacological activity, toxicity or selectivity;
  
  a chemical property including reactivity, binding affinity, pKa, or a property of a specific atom or bond in a molecule;
  
  or a physical property including melting point, solubility, a membrane permeability, or a force-field parameter.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein the dataset contains at least one molecule description corresponding to a molecule that was also included in a training dataset used to train the molecular properties model.
  - 3. The method of claim 1, wherein the selected measure of performance is selected from at least one of: the area above an ROC curve, an F-Score, or an Epsilon-insensitive loss.
  - 4. The method of claim 1, wherein the value for the molecular property, for at least one molecule description, is generated using an in silico computational simulation.
  - 5. The method of claim 1, wherein concentration inequalities are used to bound the accuracy of the molecular properties model.
  - 6. The method of claim 1, wherein the molecule descriptions are chosen to represent molecules that are maximally different from molecules included in a training dataset used to train the molecular properties model.
  - 7. The method of claim 1, wherein the dataset is chosen to include molecule descriptions representing molecules with different scaffolds than the descriptions representing molecules included in a training dataset used to train the molecular properties model.
  - 8. The method of claim 1, wherein the contributions made by the molecule descriptions in the dataset to the molecular properties model are weighted, relative to one another.
  - 9. The method of claim 1, wherein measures of model complexity or “luckiness” functions are used to bound the accuracy of the molecular properties model.
  - 10. The method of claim 1, wherein a calculated variance of the measurements of accuracy on the individual molecule descriptions in the dataset is used to provide a confidence interval for the accuracy of the molecular properties model.
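
As an illustration of claims 5 and 10, the sketch below computes two kinds of estimate from per-molecule correctness values (1 for a correct prediction, 0 otherwise). It is only a sketch under stated assumptions: Hoeffding's inequality stands in for the unspecified "concentration inequalities", a normal approximation stands in for the variance-based interval, and both assume the evaluated molecules were drawn independently of the training data. The function names and the `delta` and `z` parameters are illustrative and do not come from the patent.

```python
import math
from statistics import mean, variance


def hoeffding_lower_bound(correct, delta=0.05):
    """Lower bound on true accuracy that holds with probability >= 1 - delta.

    Uses the one-sided Hoeffding inequality for i.i.d. values in [0, 1]
    (an assumed choice of concentration inequality).
    """
    n = len(correct)
    accuracy = sum(correct) / n
    epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, accuracy - epsilon)


def normal_interval(correct, z=1.96):
    """Approximate two-sided interval from the sample variance of the
    per-molecule accuracy measurements (normal approximation)."""
    values = [float(c) for c in correct]
    n = len(values)
    accuracy = mean(values)
    standard_error = math.sqrt(variance(values) / n)
    return max(0.0, accuracy - z * standard_error), min(1.0, accuracy + z * standard_error)


if __name__ == "__main__":
    outcomes = [1] * 90 + [0] * 10  # hypothetical results: 90 of 100 correct
    print(hoeffding_lower_bound(outcomes))
    print(normal_interval(outcomes))
```

The Hoeffding bound makes no distributional assumption beyond independence and is therefore looser; the variance-based interval is tighter but only approximate for small datasets.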

11. A method for estimating the accuracy of a first molecular properties model trained using a first training dataset, comprising:
- generating a plurality of molecular properties models by repeating:
  
  (i) modifying the first training dataset to generate a modified training dataset;
  
  (ii) generating a second molecular properties model corresponding to the modified training dataset by performing a selected machine learning algorithm using the modified dataset;
  
  (iii) modifying the first training dataset to provide a test dataset to the second molecular properties model; and
  
  (iv) obtaining predictions for molecules, each represented by a molecule description, included in the test dataset;
  
  (v) estimating the accuracy of the second molecular properties model based on the predictions, relative to a selected measure of performance; and
  
  estimating a confidence interval or bound on the accuracy of the first molecular properties model in generating a prediction for a test molecule, relative to the selected measure of performance, using the estimates of the accuracy of the plurality of molecular properties models.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
  - 12. The method of claim 11, wherein the selected measure of performance is selected from at least one of: the area above an ROC curve, an F-Score, or an Epsilon-insensitive loss.
  - 13. The method of claim 11, wherein the value for the molecular property of at least one molecule included in the first training dataset is generated using in silico computational simulation.
  - 14. The method of claim 11, wherein concentration inequalities are used to bound the accuracy of the first molecular properties model.
  - 15. The method of claim 11, wherein the molecules included in the test dataset are chosen to be maximally different from the molecules included in the modified training dataset.
  - 16. The method of claim 11, wherein the test dataset is chosen to include molecules that have different scaffolds than the molecules included in the modified training dataset.
  - 17. The method of claim 11, wherein the first molecular properties model generates predictions related to a property of interest selected from at least one of a physiological or pharmacological activity, toxicity or selectivity; a chemical property including reactivity, binding affinity, pKa, or a property of a specific atom or bond in a molecule; or a physical property including melting point, solubility, a membrane permeability, or a force-field parameter.
  - 18. The method of claim 11, wherein the contributions made by the molecule descriptions in the dataset to the molecular properties model are weighted, relative to one another.
  - 19. The method of claim 11, wherein measures of model complexity or “luckiness” functions are used to bound the accuracy of the first molecular properties model.
  - 20. The method of claim 11, wherein the variance of the measurements of accuracy on the plurality of molecular properties models is used to provide a confidence interval regarding the accuracy of the first molecular properties model.
  - 21. The method of claim 11, wherein the first training dataset is modified by selecting reduced sets of molecules by cross-validation, stratified cross-validation, bootstrapping, sub-sampling, or leave-one-out.
  - 22. The method of claim 11, wherein the first training dataset is modified at step (i) by changing or permuting the values for the molecular property.
  - 23. The method of claim 11, wherein the plurality of molecular properties models is used to estimate the empirical VC-dimension, the Rademacher complexity, or another empirical estimate of the complexity of the model class from which the first molecular properties model is selected.
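
Steps (i) through (v) of claim 11, combined with the variance-based interval of claim 20 and the cross-validation option of claim 21, can be sketched roughly as follows. Everything here is an assumption made for illustration: the one-descriptor threshold classifier is a hypothetical stand-in for "a selected machine learning algorithm", plain accuracy is the chosen measure of performance, and the names `train_threshold_model` and `cross_validated_accuracy` are invented. Because fold estimates share training data and are therefore correlated, the resulting interval should be read as approximate.

```python
import math
import random
from statistics import mean, stdev


def train_threshold_model(train):
    """Hypothetical learner: predict 1 when the descriptor exceeds the
    midpoint between the two class means of the training rows."""
    positives = [x for x, y in train if y == 1]
    negatives = [x for x, y in train if y == 0]
    threshold = (mean(positives) + mean(negatives)) / 2.0
    return lambda x: 1 if x > threshold else 0


def cross_validated_accuracy(data, k=5, seed=0, z=1.96):
    """Return (mean accuracy, lower, upper) over k cross-validation folds."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        # (i) modified training dataset: every fold except fold i
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # (ii) second model trained on the modified dataset
        model = train_threshold_model(train)
        # (iii)-(v) held-out fold as test dataset, predictions, accuracy
        scores.append(mean([1.0 if model(x) == y else 0.0 for x, y in test]))
    centre = mean(scores)
    half_width = z * stdev(scores) / math.sqrt(k)
    return centre, max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical dataset of (descriptor value, activity label) pairs.
    data = [(i * 0.1, 0) for i in range(50)] + [(10 + i * 0.1, 1) for i in range(50)]
    print(cross_validated_accuracy(data))
```

Bootstrapping or sub-sampling, also listed in claim 21, would change only how the training and test rows are drawn inside the loop.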

24. A computer-readable medium containing a program which, when executed by a processor, performs operations comprising:
- receiving a dataset, wherein the dataset includes at least one molecule description in a form appropriate for the molecular properties model and a value for a molecular property;
  
  providing the dataset to the molecular properties model to obtain a prediction for each molecule represented by a molecule description in the dataset; and
  
  estimating a confidence interval or bound on the accuracy of the molecular properties model in generating a prediction for a test molecule, based on the obtained predictions, relative to a selected measure of performance, wherein the molecular properties model generates predictions related to a property of interest selected from at least one of a physiological activity, pharmacokinetic property, pharmacodynamic property, physiological or pharmacological activity, toxicity or selectivity;
  
  a chemical property including reactivity, binding affinity, pKa, or a property of a specific atom or bond in a molecule;
  
  or a physical property including melting point, solubility, a membrane permeability, or a force-field parameter.
- Dependent claims (25, 26, 27, 28, 29)
  - 25. The computer-readable medium of claim 24, wherein the dataset contains at least one molecule description corresponding to a molecule that was also included in a training dataset used to train the molecular properties model.
  - 26. The computer-readable medium of claim 24, wherein the molecule descriptions are chosen to represent molecules that are maximally different from molecules included in a training dataset used to train the molecular properties model.
  - 27. The computer-readable medium of claim 24, wherein the selected measure of performance is selected from at least one of: the area above an ROC curve, an F-Score, or an Epsilon-insensitive loss.
  - 28. The computer-readable medium of claim 24, wherein the value for the molecular property for at least one molecule description in the dataset is generated using an in silico computational simulation.
  - 29. The computer-readable medium of claim 24, wherein the dataset is chosen to include molecule descriptions representing molecules with different scaffolds than the descriptions representing molecules included in a training dataset used to train the molecular properties model.
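
The three measures of performance named in claims 3, 12 and 27 have standard textbook definitions, which the sketch below follows. The claims themselves do not define them, so the formulas here are assumptions: area above the ROC curve is taken as one minus the usual AUC, the F-Score as the F-beta score with `beta=1` by default, and the Epsilon-insensitive loss as the mean of `max(0, |error| - epsilon)`. All function and parameter names are illustrative.

```python
def area_above_roc(labels, scores):
    """1 - AUC, where AUC is the fraction of (positive, negative) pairs
    ranked correctly, counting ties as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return 1.0 - wins / (len(positives) * len(negatives))


def f_score(labels, predictions, beta=1.0):
    """F-beta score computed from true positives, false positives and
    false negatives; returns 0.0 when there are none of them."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def epsilon_insensitive_loss(targets, predictions, epsilon=0.1):
    """Mean absolute error with errors smaller than epsilon ignored."""
    losses = [max(0.0, abs(t - p) - epsilon) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(area_above_roc([1, 1, 1, 0, 0], [0.9, 0.8, 0.3, 0.4, 0.1]))
    print(f_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
    print(epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 2.0]))
```

The first two measures apply to classification-style properties (for example active versus inactive), while the Epsilon-insensitive loss applies to continuous properties such as solubility or pKa.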

30. A computer-readable medium containing a program which, when executed by a processor, performs operations for estimating the accuracy of a first molecular properties model trained using a first training dataset, comprising:
- generating a plurality of molecular properties models by repeating:
  
  (i) modifying the first training dataset to generate a modified training dataset;
  
  (ii) generating a second molecular properties model corresponding to the modified training dataset by performing a selected machine learning algorithm using the modified dataset;
  
  (iii) modifying the first training dataset to provide a test dataset to the second molecular properties model; and
  
  (iv) obtaining predictions for molecules, each represented by a molecule description, included in the test dataset;
  
  (v) estimating the accuracy of the second molecular properties model based on the predictions, relative to a selected measure of performance; and
  
  estimating a confidence interval or bound on the accuracy of the first molecular properties model in generating a prediction for a test molecule, relative to the selected measure of performance, using the estimates of the accuracy of the plurality of molecular properties models.
- Dependent claims (31, 32, 33)
  - 31. The computer-readable medium of claim 30, wherein the first training dataset is modified by selecting reduced sets of molecules by cross-validation, stratified cross-validation, bootstrapping, sub-sampling, or leave-one-out.
  - 32. The computer-readable medium of claim 30, wherein the first training dataset is modified at step (i) by changing or permuting the values for the molecular property.
  - 33. The computer-readable medium of claim 30, wherein the plurality of molecular properties models is used to estimate the empirical VC-dimension, the Rademacher complexity, or another empirical estimate of the complexity of the model class from which the first molecular properties model is selected.
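
One way to read claims 32 and 33 together is that repeatedly replacing the property values with random labels, and measuring how well the model class can fit them, gives an empirical estimate of that class's complexity. The sketch below is a Monte Carlo estimate of the empirical Rademacher complexity under strong simplifying assumptions that are not taken from the patent: each molecule is described by one number, the model class is the set of threshold classifiers on that number (in both orientations), and the best fit is found by exhaustive search instead of by running a learning algorithm. The name `empirical_rademacher` and its parameters are illustrative.

```python
import random


def empirical_rademacher(xs, n_draws=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    one-dimensional threshold classifiers on the sample xs.

    For each draw, random +/-1 labels replace the property values and the
    best achievable correlation over the model class is recorded.
    """
    rng = random.Random(seed)
    n = len(xs)
    thresholds = sorted(set(xs))
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        best = 0.0
        for t in thresholds:
            correlation = sum(s * (1 if x > t else -1) for s, x in zip(sigma, xs)) / n
            # abs() accounts for the classifier with the opposite orientation
            best = max(best, abs(correlation))
        total += best
    return total / n_draws


if __name__ == "__main__":
    descriptors = [float(i) for i in range(50)]  # hypothetical descriptor values
    print(empirical_rademacher(descriptors))
```

A value near 1 means the model class can fit arbitrary labels on this sample, so training-set accuracy says little about accuracy on new molecules; a value near 0 means it cannot.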

Current Assignee: VALO Health Incorporated
Original Assignee: PHARMIX CORP. (Numerate, Inc.)
Inventors: Duffy, Nigel P.; Lanza, Guido; Yu, Jessen
Primary Examiner(s): Raymond, Edward
Application Number: US11/172,216
Publication Number: US 20050288871A1
Time in Patent Office: 629 Days
Field of Search: 702/27, 702/28-30, 435/40.5
US Class Current: 702/27
CPC Class Codes:
G16C 20/30 Prediction of properties of...
G16C 20/70 Machine learning, data mini...