Method and system for artificial intelligence directed lead discovery through multi-domain clustering

US 6,904,423 B1
Filed: 02/18/2000
Issued: 06/07/2005
Est. Priority Date: 02/19/1999
Status: Expired due to Fees

Abstract

A system for analyzing a vast amount of data representative of chemical structure and activity information and concisely providing conclusions about structure-to-activity relationships. A computer may adaptively learn new substructure descriptors based on its analysis of the input data. The computer may then apply each substructure descriptor as a filter to establish new groups of molecules that match the descriptor. From each new group of molecules, the computer may in turn generate one or more additional new groups of molecules. A result of the analysis in an exemplary arrangement is a tree structure that reflects pharmacophoric information and efficiently establishes through lineage what effect on activity various chemical substructures are likely to have. The tree structure can then be applied as a multi-domain classifier, to help a chemist classify test compounds into structural subclasses.

72 Claims

1. A method for screening a set of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, each molecule having a feature characteristic and an activity characteristic, the method comprising, in combination:
- (a) with respect to the molecules;
  
  (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules, (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups, (iii) for each group selected in the preceding step identifying a feature set common to all molecules in the group, and (iv) for each feature set identified in the preceding step, (A) selecting from the molecules a number of molecules that exhibit the feature set, (B) establishing a new set of molecules consisting of the number of molecules, (C) deciding whether to recursively repeat the method with respect to the new set of molecules, and, if so, (D) repeating steps (i)-(iv) with respect to the new set of molecules; and
  
  (b) providing a description of at least one new set of molecules established in step (iv), the description including a first portion indicating the feature set for which the new set of molecules was established and a second portion indicating the activity characteristics of the molecules in the new set of molecules, whereby the first and second portions may cooperatively establish a correlation between molecular features and molecular activity.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
  - 2. A method as claimed in claim 1, wherein the set of molecules consists of molecules determined to have at least a designated activity characteristic.
  - 3. A method as claimed in claim 1, wherein the activity characteristic is multi-dimensional.
  - 4. A method as claimed in claim 1, wherein defining groups of the molecules based on similarity of the feature characteristics of the molecules comprises (i) establishing for each molecule a feature vector based on the feature characteristic of the molecule, and (ii) clustering the feature vectors of the molecules based on similarity of the feature vectors.
  - 5. A method as claimed in claim 4, wherein each feature vector is keyed to structural descriptors, and wherein each of the structural descriptors is selected from the group consisting of (i) a MACCS key, (ii) a BCI key, and (iii) a Daylight fingerprint key.
  - 6. A method as claimed in claim 4, wherein clustering the feature vectors comprises applying a clustering process selected from the group consisting of (i) self-organizing map, (ii) agglomerative clustering, and (iii) divisive clustering.
  - 7. A method as claimed in claim 6, wherein the clustering process uses a similarity measure selected from the group consisting of (i) a Euclidean distance, (ii) a Tanimoto distance, (iii) a Tversky coefficient, (iv) a Euclidean-Soergel product, and (v) a Euclidean-Tanimoto product.
  - 8. A method as claimed in claim 6, wherein clustering the feature vectors comprises applying a self-organizing map.
  - 9. A method as claimed in claim 6, wherein clustering the feature vectors comprises applying an agglomerative clustering process selected from the group consisting of (i) Wards, (ii) complete-link, (iii) average link, (iv) single link, and (v) centroid.
  - 10. A method as claimed in claim 6, wherein clustering the feature vectors comprises applying a divisive clustering process selected from the group consisting of (i) recursive partitioning, (ii) DIANA algorithm, and (iii) MONA algorithm.
  - 11. A method as claimed in claim 6, wherein the clustering process produces a number of clusters, and wherein each group of molecules comprises a cluster selected from the number of clusters.
  - 12. A method as claimed in claim 6, wherein the clustering process produces a number of clusters, and wherein each group of molecules comprises a metacluster derived from the number of clusters.
  - 13. A method as claimed in claim 12, further comprising selecting the metacluster by a process selected from the group consisting (i) Kelley method, (ii) point-biserial method, (iii) Hubert's Gamma method, and (iv) Fagan's method.
  - 14. A method as claimed in claim 1, wherein selecting one or more of the groups of molecules comprises selecting groups having at least a threshold concentration of a specified activity characteristic.
  - 15. A method as claimed in claim 1, wherein identifying a feature set common to all molecules in a group comprises identifying a chemical structure present in all molecules in the group.
  - 16. A method as claimed in claim 15, wherein the chemical structure comprises an arrangement of atoms and bonds.
  - 17. A method as claimed in claim 16, wherein the arrangement of atoms and bonds is a contiguous arrangement.
  - 18. A method as claimed in claim 1, wherein identifying a feature set common to all molecules in a group comprises identifying a 2D substructure common to all of the molecules in the group.
  - 19. A method as claimed in claim 18, wherein identifying a 2D substructure common to all of the molecules in the group comprises applying a process selected from the group consisting of (i) an exhaustive maximum common substructure search, (ii) a genetic algorithm common substructure search, (iii) a weighted exhaustive maximum common substructure search, and (iv) a weighted genetic algorithm maximum common substructure search.
  - 20. A method as claimed in claim 1, wherein identifying a feature set common to all molecules in a group comprises identifying a 3D substructure common to all of the molecules in the group.
  - 21. A method as claimed in claim 1, wherein identifying a feature set common to all molecules in a group comprises identifying a largest chemical substructure common to all molecules in the group.
  - 22. A method as claimed in claim 21, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises applying a genetic algorithm.
  - 23. A method as claimed in claim 21, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises exhaustively searching for and identifying common substructures among the molecules in the group and selecting the largest chemical substructure from the identified common substructures.
  - 24. A method as claimed in claim 21, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises comparing graphs of the molecules in the group.
  - 25. A method as claimed in claim 1, wherein selecting from the molecules a number of molecules that exhibit the feature set comprises selecting all of the molecules that exhibit the feature set, wherein all of the molecules is one or more molecules.
  - 26. A method as claimed in claim 1, wherein providing a description of at least one new set established in step (iv) comprises displaying a tree structure comprising a root node reflecting the data set and descendent nodes reflecting new data sets established in step (iv).
  - 27. A method as claimed in claim 1, wherein the description further comprises a third portion indicating a measure of activity differential between a pair of feature sets for which successive new data sets were established.
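
The recursive loop of claim 1, steps (i) through (iv), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: molecules are reduced to sets of hypothetical feature labels, grouping uses a simple greedy leader scheme with a Tanimoto threshold that tightens with depth (a stand-in for the clustering processes of claims 6 to 10), and the "feature set common to all molecules" is a plain set intersection rather than a common-substructure search. The names `Molecule`, `group_by_similarity`, and `screen`, and all thresholds, are invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Molecule:
    name: str
    features: frozenset  # hypothetical stand-in for the "feature characteristic"
    active: bool         # hypothetical one-dimensional "activity characteristic"

def tanimoto(a, b):
    """Tanimoto similarity of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def group_by_similarity(mols, threshold):
    """Step (i): greedy leader grouping (assumed stand-in for real clustering)."""
    groups = []
    for m in mols:
        for g in groups:
            if tanimoto(m.features, g[0].features) >= threshold:
                g.append(m)
                break
        else:
            groups.append([m])
    return groups

def active_fraction(mols):
    return sum(m.active for m in mols) / len(mols)

def screen(mols, min_active=0.5, max_depth=3, depth=0):
    """Return (depth, feature_set, member_names, active_fraction) tuples."""
    report = []
    threshold = 0.4 + 0.2 * depth  # assumption: tighter grouping at deeper levels
    for group in group_by_similarity(mols, threshold):
        # Step (ii): keep groups with enough actives; singletons are skipped (assumption).
        if len(group) < 2 or active_fraction(group) < min_active:
            continue
        # Step (iii): feature set common to all molecules in the group.
        common = frozenset.intersection(*(m.features for m in group))
        if not common:
            continue
        # Steps (iv)(A)-(B): the new set is every molecule exhibiting the feature set.
        subset = [m for m in mols if common <= m.features]
        report.append((depth, common, [m.name for m in subset], active_fraction(subset)))
        # Steps (iv)(C)-(D): recurse while the set shrinks and the depth budget remains.
        if len(subset) < len(mols) and depth + 1 < max_depth:
            report.extend(screen(subset, min_active, max_depth, depth + 1))
    return report

# Invented toy data, for illustration only.
MOLS = [
    Molecule("m1", frozenset({"ring", "amide", "F"}), True),
    Molecule("m2", frozenset({"ring", "amide", "F", "Me"}), True),
    Molecule("m3", frozenset({"ring", "amide", "OH"}), False),
    Molecule("m4", frozenset({"ring", "amide", "Cl"}), True),
    Molecule("m5", frozenset({"ester", "chain"}), False),
    Molecule("m6", frozenset({"ester", "chain", "OH"}), False),
]

report = screen(MOLS)
for depth, common, names, frac in report:
    print("  " * depth, sorted(common), names, frac)
```

Each tuple in `report` pairs a feature set (the "first portion" of step (b)) with the activity of the molecules that exhibit it (the "second portion"), so reading down the list shows how activity changes as the shared feature set grows.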

28. A method for screening a data set representing a plurality of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the method comprising, in combination:
- (a) with respect to the molecules represented by the data set;
  
  (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules, (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups, (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and (iv) for each feature set identified in the preceding step, (A) selecting from the molecules a number of molecules that exhibit the feature set, (B) establishing a new data set representing the number of molecules, (C) deciding whether to recursively repeat the method with respect to the new data set, and, if so, (D) repeating the method from step (i) with respect to the molecules represented by the new data set; and
  
  (b) providing a description of at least one new data set established in step (iv), the description including a first portion indicating the feature set for which the new data set was established and a second portion indicating the activity characteristics of the molecules represented by the new data set, whereby the first and second portion may cooperatively establish a correlation between molecular features and activity.
- View Dependent Claims (29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
  - 29. A method as claimed in claim 28, wherein the plurality of molecules represented by the data set consists of molecules determined to have at least a designated activity characteristic.
  - 30. A method as claimed in claim 28, wherein the activity characteristic is multi-dimensional.
  - 31. A method as claimed in claim 28, wherein defining groups of the molecules based on similarity of the feature characteristics of the molecules comprises (i) establishing for each molecule a feature vector based on the feature characteristic of the molecule, and (ii) clustering the feature vectors of the molecules based on similarity of the feature vectors.
  - 32. A method as claimed in claim 31, wherein clustering the feature vectors comprises applying a self-organizing-map.
  - 33. A method as claimed in claim 32, wherein each group of molecules comprises a cluster of the self-organizing map.
  - 34. A method as claimed in claim 32, wherein at least one group of molecules comprises a metacluster of the self-organizing map.
  - 35. A method as claimed in claim 31, wherein clustering the feature vectors comprises applying Wards clustering.
  - 36. A method as claimed in claim 28, wherein selecting one or more of the groups of molecules comprises selecting groups having at least a threshold concentration of a specified activity characteristic.
  - 37. A method as claimed in claim 28, wherein identifying a feature set common to all molecules in a group comprises identifying a chemical structure present in all molecules in the group.
  - 38. A method as claimed in claim 37, wherein the chemical structure comprises an arrangement of atoms and bonds.
  - 39. A method as claimed in claim 38, wherein the arrangement of atoms and bonds is a contiguous arrangement.
  - 40. A method as claimed in claim 28, wherein identifying a feature set common to all molecules in a group comprises identifying a largest chemical substructure common to all molecules in the group.
  - 41. A method as claimed in claim 40, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises applying a genetic algorithm.
  - 42. A method as claimed in claim 40, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises exhaustively searching for and identifying common substructures among the molecules in the group and selecting the largest chemical substructure from the identified common substructures.
  - 43. A method as claimed in claim 40, wherein identifying a largest chemical substructure common to all of the molecules in the group comprises comparing graphs of the molecules in the group.
  - 44. A method as claimed in claim 28, wherein selecting from the molecules a number of molecules that exhibit the feature set comprises selecting all of the molecules that exhibit the feature set, wherein all of the molecules is one or more molecules.
  - 45. A method as claimed in claim 28, wherein providing a description of at least one new data set established in step (iv) comprises displaying a tree structure comprising a root node reflecting the data set and descendent nodes reflecting new data sets established in step (iv).
  - 46. A method as claimed in claim 28, wherein the description further comprises a third segment indicating a measure of activity differential between a pair of feature sets for which successive new data sets were established.
  - 47. A computer-readable medium embodying a set of machine language instructions executable by a computer for performing the method steps of claim 28.
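
The group selection of claim 36, "at least a threshold concentration of a specified activity characteristic", might look like the sketch below. It assumes, purely for illustration, that activity is a single percent-inhibition number and that "specified activity" means meeting a cutoff; the function name `select_groups`, the cutoff of 50, and the threshold of 0.7 are all hypothetical.

```python
def select_groups(groups, is_active, threshold):
    """Keep groups whose fraction of active molecules is at least `threshold`.

    groups: dict mapping a group name to a list of activity values.
    is_active: predicate deciding whether one activity value counts as active.
    Returns a dict of group name -> concentration of actives.
    """
    selected = {}
    for name, activities in groups.items():
        concentration = sum(1 for a in activities if is_active(a)) / len(activities)
        if concentration >= threshold:
            selected[name] = concentration
    return selected

# Invented percent-inhibition values for three clusters.
GROUPS = {
    "cluster_1": [92, 85, 12, 77],
    "cluster_2": [5, 8, 64],
    "cluster_3": [71, 90],
}

selected = select_groups(GROUPS, is_active=lambda pct: pct >= 50, threshold=0.7)
print(selected)
```

Passing the activity test as a predicate keeps the sketch open to the multi-dimensional activity characteristic of claim 30, where the predicate could inspect several measurements at once.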

48. A method for screening a data set representing molecules, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the method comprising, in combination:
- (a) selecting from the data set at least one group of molecules that have similar feature characteristics and that cooperatively represent a particular activity characteristic, said group of molecules having a set of discriminating features defining similarity of the molecules in said group;
  
  (b) for each of the at least one group selected in the preceding step, identifying at least one common subset of features of the molecules in each group based at least in part on a measure of how much said at least one common subset of features participated in defining the discriminating features of said group;
  
  (c) for each common subset of features identified in the preceding step, establishing a new data set representing those molecules from the data set that include the common subset of features;
  
  (d) selecting from the new data set at least one group of molecules that have similar feature characteristics and that cooperatively represent a particular activity characteristic, each of said at least one group of molecules having a set of discriminating features defining similarity of the molecules in said group;
  
  (e) for each of the at least one group selected in the preceding step, identifying at least one common subset of features of the molecules in each group based at least in part on a measure of how much said at least one common subset of features participated in defining the discriminating features of said group; and
  
  (f) outputting data indicative of at least one common subset of features.
- View Dependent Claims (49)
  - 49. A computer-readable medium embodying a set of machine language instructions executable by a computer for performing the method steps of claim 48.

50. A processing system for screening a data set representing a plurality of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the processing system comprising, in combination:
- (a) means for performing the following method steps with respect to the molecules represented by the data set;
  
  (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules, (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups, (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and (iv) for each feature set identified in the preceding step, (A) selecting from the molecules a number of molecules that exhibit the feature set, (B), establishing a new data set representing the number of molecules, (C) deciding whether to recursively repeat the functions with respect to the new data set, and, if so, (D) repeating from step (i) with respect to the molecules represented by the new data set; and
  
  (b) means for providing a description of at least one new data set established in step (iv), the description including a first segment indicating the feature set for which the new data set was established and a second segment indicating the activity characteristics of the molecules represented by the new data set, whereby the first and second segments may cooperatively establish a correlation between molecular features and activity.

51. A computerized method of converting a set of data representing a plurality of molecules into a data structure representing pharmacophoric mechanisms, the set of data defining respectively for each molecule a structure and an activity characteristic, a node in a data storage medium representing the plurality of molecules, the method comprising, in combination:
- (a) grouping the molecules of the node into a plurality of groups based on structural similarity of the molecules;
  
  (b) selecting one or more of the groups established in element (a) based on the activity characteristics of the molecules in the groups;
  
  (c) for each group selected in element (b), identifying a common substructure among the molecules in the group, the common substructure defining a pharmacophoric mechanism;
  
  (d) for each common substructure identified in element (c), (i) selecting from the molecules of the node at least one molecule that includes the common substructure, and establishing a child node representing the at least one selected molecule;
  
  (ii) determining whether to expand the data structure from the child node, and, if so, repeating the method from step (a) with the node being the child node; and
  
  (e) outputting an indication of at least a portion of the data structure including an indication of at least one pharmacophoric mechanism.
- View Dependent Claims (52, 53, 54, 55, 56, 57, 58, 59)
  - 52. A method as claimed in claim 51, wherein selecting one or more of the groups established in element (a) comprises selecting a plurality of the groups established in element (a).
  - 53. A method as claimed in claim 51, wherein outputting an indication of at least a portion of the data structure comprises outputting a description of the data structure.
  - 54. A method as claimed in claim 53, wherein outputting a description of the data structure comprises providing an output display selected from the group consisting of a graphical display, a textual display, and a combination graphical-textual display.
  - 55. A method as claimed in claim 51, wherein outputting an indication of at least a portion of the data structure comprises outputting a description of at least one node of the data structure.
  - 56. A method as claimed in claim 55, the description of the at least one node comprises information selected from the group consisting of the molecules represented by the node, the common substructure represented by the node, and an activity characteristic measure based on the activity characteristics of the molecules represented by the node.
  - 57. A method as claimed in claim 56, wherein the at least one node comprises a child node stemming from a parent node, and wherein the description of the child node comprises an activity characteristic differential representing a difference in activity level from the parent node to the child node.
  - 58. A method as claimed in claim 51, wherein the common substructure identified in element (c) comprises a substructure selected from the group consisting of a contiguous structure of atoms and bonds and a non-contiguous structure of atoms and bonds.
  - 59. A method as claimed in claim 51, wherein the common substructure identified in element (c) comprises a non-contiguous structure of atoms and bonds.
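
One way to picture the node data structure of claims 51 to 57 is a small tree in which each node holds a common substructure and the activities of its molecules, and a textual display (claim 54) reports the activity differential from parent to child (claim 57). The sketch below is an assumption-laden illustration: the `Node` class, the use of a mean as the "activity characteristic measure", the substructure labels, and the numbers are all invented here.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    substructure: str   # hypothetical label for the node's common substructure
    activities: list    # one activity value per molecule represented by the node
    children: list = field(default_factory=list)

    @property
    def mean_activity(self):
        # Assumption: the node-level activity measure is a simple mean.
        return mean(self.activities)

def render(node, parent=None, indent=0):
    """Return indented text lines, one per node, with the parent-to-child delta."""
    line = (f"{'  ' * indent}{node.substructure}: "
            f"n={len(node.activities)}, mean={node.mean_activity:.1f}")
    if parent is not None:
        line += f", delta={node.mean_activity - parent.mean_activity:+.1f}"
    lines = [line]
    for child in node.children:
        lines.extend(render(child, node, indent + 1))
    return lines

# Invented example tree: root -> ring-amide -> ring-amide-F.
leaf = Node("ring-amide-F", [90, 80])
mid = Node("ring-amide", [90, 80, 60, 40], [leaf])
root = Node("(all molecules)", [20, 40, 90, 80, 10, 60], [mid])

lines = render(root)
print("\n".join(lines))
```

A positive delta at a child suggests that the extra substructure added along that branch is associated with higher activity in this toy data; a negative delta would suggest the opposite.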

60. A method for building a multi-domain molecular classifier, the method comprising in combination:
- (a) receiving a set of data representing a set of molecules;
  
  (b) deriving one or more pharmacophores from the set of data, each pharmacophore defining a node of a multi-domain classifier;
  
  (c) using each pharmacophore respectively as a filter to establish a new set of data representing a subset of the molecules, wherein each molecule in the subset includes the pharmacophore; and
  
  (d) deriving one or more new pharmacophores from each new set of data, each new pharmacophore defining a node of the multi-domain classifier.

61. A chemical structure classification method comprising, in combination:
- (a) receiving into a computer a set of data representing a training set of molecules, wherein each molecule of the training set has a feature characteristic and an activity characteristic;
  
  (b) using the training set of molecules to generate a chemical structure classifier by a process comprising;
  
  (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules, (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups, (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and (iv) for each feature set identified in the preceding step, (A) selecting from the molecules a number of molecules that exhibit the feature set, (B) establishing a new data set consisting of the number of molecules, (C) deciding whether to recursively repeat the method with respect to the new data set, and, if so, (D) repeating steps (i)-(iv) with respect to the molecules represented by the new data set;
  
  (c) applying the chemical structure classifier to classify given molecule into a plurality of structural classes; and
  
  (d) providing as output for presentation to a person an indication of classes into which the given molecule was classified in step (c).
- View Dependent Claims (62, 63)
  - 62. A chemical structure classification method as claimed in claim 61, wherein the chemical structure classifier generated in step (b) comprises a phylogenetic-like tree structure defining a number of nodes beginning with a root node, at least each node after the root node defining a corresponding feature set;
    - and wherein applying the chemical structure classifier to classify the given molecule comprises filtering the given molecule through the tree structure such that the given molecule passes into a given node of the tree structure if the given molecule contains the feature set defined by the given node.
  - 63. A method as claimed in claim 61, wherein applying the multi-domain chemical structure classifier to classify the given molecule comprises filtering data representative of the given molecule through the multi-domain classifier, the method further comprising providing output data indicative of classifications established by the multi-domain classifier for the given molecule.
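
The filtering described in claim 62, where a molecule passes into a node only if it contains that node's feature set, can be sketched as a short recursive walk. This is an illustrative assumption rather than the patent's code: feature sets are modeled as sets of hypothetical labels, the root is given an empty feature set so every molecule enters it, and the names `ClassNode` and `classify` are invented.

```python
from dataclasses import dataclass, field

@dataclass
class ClassNode:
    label: str
    feature_set: frozenset  # hypothetical stand-in for a learned substructure
    children: list = field(default_factory=list)

def classify(features, node, path=()):
    """Return the path (tuple of labels) to every node the molecule falls into."""
    # The molecule enters a node only if it contains the node's whole feature set.
    if not node.feature_set <= features:
        return []
    here = path + (node.label,)
    hits = [here]
    for child in node.children:
        hits.extend(classify(features, child, here))
    return hits

# Invented classifier tree with three branches under the root.
tree = ClassNode("root", frozenset(), [
    ClassNode("aryl-amide", frozenset({"ring", "amide"}), [
        ClassNode("fluoro-aryl-amide", frozenset({"ring", "amide", "F"})),
    ]),
    ClassNode("ester", frozenset({"ester"})),
    ClassNode("hydroxyl", frozenset({"OH"})),
])

test_molecule = frozenset({"ring", "amide", "F", "OH"})
classes = classify(test_molecule, tree)
for path in classes:
    print(" > ".join(path))
```

Because sibling branches are tested independently, one molecule can land in several branches at once, which is the sense in which the claims call the classifier "multi-domain".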

64. A method of identifying multiple structural classes into which a given molecule fits comprising, in combination:
- representing each of a plurality of molecules by a respective structure characteristic keyed to a set of structural descriptors;
  
  hierarchically clustering representations of the molecules based on their respective structure characteristics, to thereby establish a hierarchical tree structure defining a plurality of nodes, each node representing at least one molecule;
  
  for each of at least a plurality of nodes of the hierarchical tree structure, identifying a respective chemical substructure common to all of the at least one molecule represented by the node, each of at least a plurality of the identified chemical substructures being different than each of the structural descriptors;
  
  filtering a representation of the given molecule through the hierarchical tree structure, the representation of the given molecule thereby falling within a plurality of nodes, wherein, for each given node into which the given molecule falls, the given molecule has the chemical substructure identified for the given node; and
  
  providing as output an indication of the nodes into which the given molecule falls including an indication of the chemical substructure identified for each node into which the given molecule falls, whereby, each node into which the given molecule falls defines a structural class into which the given molecule fits.
- View Dependent Claims (65, 66, 67, 68, 69, 70, 71)
  - 65. A method as claimed in claim 64, wherein the step of identifying a respective chemical substructure for each node of the hierarchical tree is an integral part of the process of hierarchically clustering representations of the molecules to thereby establish the hierarchical tree structure.
  - 66. A method as claimed in claim 64, wherein representing each of a plurality of molecules by a respective structure characteristic keyed to a set of structural descriptors comprises representing each of the molecules in a form selected from the group consisting of (i) a descriptor vector keyed to the structural descriptors and (ii) a 2D graph.
  - 67. A method as claimed in claim 64, wherein hierarchically clustering representations of the molecules based on their respective structure characteristics comprises evaluating similarities between the structure characteristics of the molecules.
  - 68. A method as claimed in claim 67, wherein evaluating similarities between the structure characteristics of the molecules comprises identifying pairs of the molecules and, for each pair, computing a similarity measure between the molecules in the pair.
  - 69. A method as claimed in claim 67, wherein computing the similarity measure comprises computing a measure selected from the group consisting of (i) a Euclidean distance, (ii) a Tanimoto distance, (iii) a Tversky coefficient (iv) a Euclidean-Soergel product, and (v) a Euclidean-Tanimoto product.
  - 70. A method as claimed in claim 64, wherein hierarchically clustering representations of the molecules based on their respective structure characteristics comprises a process selected from the group consisting of (i) divisively clustering the representations and (ii) agglomeratively clustering the representations.
  - 71. A method as claimed in claim 64, wherein each of the structural descriptors is selected from the group consisting of (i) a MACCS key, (ii) a BCI key, and (iii) a Daylight fingerprint key.
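
Claims 64 to 70 combine a similarity measure, hierarchical clustering, and a common substructure per tree node. A compact sketch, assuming binary fingerprints stored as sets of on-bit positions, a Tanimoto distance (one of the options in claim 69), and single-link agglomerative merging (one agglomerative option under claim 70), is given below. The set intersection recorded for each merge is only a crude stand-in for a real common-substructure search, and the fingerprints, the cutoff, and the function names are invented for illustration.

```python
from itertools import combinations

def tanimoto_distance(a, b):
    """1 minus the Tanimoto similarity of two on-bit sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def single_link(fps, cutoff):
    """Merge the two closest clusters until the nearest pair exceeds `cutoff`.

    fps: dict mapping a molecule name to a set of on-bit positions.
    Returns (final clusters, merge nodes); each merge node records the member
    names and the bits shared by all members.
    """
    clusters = [[name] for name in fps]
    nodes = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: cluster distance is the closest cross-cluster pair.
            d = min(tanimoto_distance(fps[x], fps[y])
                    for x in clusters[i] for y in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        common = set.intersection(*(fps[name] for name in merged))
        nodes.append((merged, sorted(common)))
        clusters[i] = merged
        del clusters[j]  # safe because i < j
    return clusters, nodes

# Invented fingerprints: bit positions that are set for each molecule.
FPS = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {1, 2, 6},
    "d": {7, 8, 9},
    "e": {7, 8, 10},
}

clusters, nodes = single_link(FPS, cutoff=0.7)
print(clusters)
for members, common in nodes:
    print(members, "share bits", common)
```

Each recorded merge corresponds to an internal node of the hierarchical tree, and the shared bits shrink as nodes get larger, mirroring how a broader structural class is defined by a smaller common substructure.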

72. A software program that implements the method shown in FIG. 2.

Current Assignee: Chritodoulos Nicolaou, CVM Equity Fund V. Ltd., LLC, Mary Barr, Michael Grantham, Simulations Plus, Inc., Sommer Udall Hardwick Ahern & Hyatt Profit Sharing Plan LLP, Wat Limited Partnership
Original Assignee: Bioreason, Inc. (Simulations Plus, Inc.)
Inventors: Kelley, Brian P.; Nicolaou, Christodoulos A.; Bassett, Susan I.; Nutt, Ruth F.
Primary Examiner(s): Starks, Jr., Wilbert L.
Application Number: US09/506,948
Time in Patent Office: 1,936 Days
Field of Search: 702/22, 422/186, 715/532, 706/46
US Class Current: 706/46
CPC Class Codes:
G06N 5/02   Knowledge representation; S...
G16C 20/30   Prediction of properties of...
G16C 20/50   Molecular design, e.g. of d...
G16C 20/70   Machine learning, data mini...
