Method and system for artificial intelligence directed lead discovery through multi-domain clustering
Abstract
A system for analyzing a vast amount of data representative of chemical structure and activity information and concisely providing conclusions about structure-to-activity relationships. A computer may adaptively learn new substructure descriptors based on its analysis of the input data. The computer may then apply each substructure descriptor as a filter to establish new groups of molecules that match the descriptor. From each new group of molecules, the computer may in turn generate one or more additional new groups of molecules. A result of the analysis in an exemplary arrangement is a tree structure that reflects pharmacophoric information and efficiently establishes through lineage what effect on activity various chemical substructures are likely to have. The tree structure can then be applied as a multi-domain classifier, to help a chemist classify test compounds into structural subclasses.
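The recursive analysis described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: molecules are reduced to sets of named fragments, grouping uses a greedy Jaccard-similarity pass, a group is "selected" when its mean activity passes a threshold, and the learned descriptor is simply the intersection of the group's fragment sets. The names `Molecule`, `Node`, `cluster`, `grow`, and `describe` are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class Molecule:
    name: str
    features: frozenset  # hypothetical fragment names, standing in for substructure keys
    activity: float


@dataclass
class Node:
    feature_set: frozenset  # descriptor that every molecule in this node matches
    molecules: list
    children: list = field(default_factory=list)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(molecules, min_similarity):
    """Greedy grouping: a molecule joins the first group whose seed is similar enough."""
    groups = []
    for mol in molecules:
        for group in groups:
            if jaccard(mol.features, group[0].features) >= min_similarity:
                group.append(mol)
                break
        else:
            groups.append([mol])
    return groups


def grow(node, min_similarity=0.5, min_activity=0.5, max_depth=3):
    """Group, select active groups, learn a common feature set, filter, and recurse."""
    if max_depth == 0:
        return node
    seen = set()
    for group in cluster(node.molecules, min_similarity):
        if len(group) < 2 or mean(m.activity for m in group) < min_activity:
            continue  # group not selected on activity
        common = frozenset.intersection(*(m.features for m in group))
        if not common or common <= node.feature_set or common in seen:
            continue  # nothing new learned from this group
        seen.add(common)
        # Apply the learned feature set as a filter over all molecules of this node.
        subset = [m for m in node.molecules if common <= m.features]
        child = Node(common, subset)
        node.children.append(child)
        if len(subset) < len(node.molecules):  # recurse only if the set narrowed
            grow(child, min_similarity, min_activity, max_depth - 1)
    return node


def describe(node, indent=0):
    """Print each node's feature set alongside the activities of its molecules."""
    label = ", ".join(sorted(node.feature_set)) or "(all molecules)"
    activities = [m.activity for m in node.molecules]
    print(" " * indent + f"{label}: n={len(activities)}, mean activity={mean(activities):.2f}")
    for child in node.children:
        describe(child, indent + 2)


MOLECULES = [
    Molecule("m1", frozenset({"amide", "phenyl", "nitro"}), 0.9),
    Molecule("m2", frozenset({"amide", "phenyl", "chloro"}), 0.8),
    Molecule("m3", frozenset({"amide", "phenyl"}), 0.1),
    Molecule("m4", frozenset({"ester", "furan"}), 0.2),
    Molecule("m5", frozenset({"ester", "furan", "methyl"}), 0.1),
]

tree = grow(Node(frozenset(), MOLECULES))
describe(tree)
```

In this toy data the amide/phenyl group is the only one selected, so the tree gains a single child node whose lineage pairs that feature set with the activities of the molecules matching it. A real system would substitute a proper clustering method and substructure search for the stand-ins used here.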
72 Claims
1. A method for screening a set of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, each molecule having a feature characteristic and an activity characteristic, the method comprising, in combination:
(a) with respect to the molecules:
    (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules,
    (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups,
    (iii) for each group selected in the preceding step identifying a feature set common to all molecules in the group, and
    (iv) for each feature set identified in the preceding step,
        (A) selecting from the molecules a number of molecules that exhibit the feature set,
        (B) establishing a new set of molecules consisting of the number of molecules,
        (C) deciding whether to recursively repeat the method with respect to the new set of molecules, and, if so,
        (D) repeating steps (i)-(iv) with respect to the new set of molecules; and
(b) providing a description of at least one new set of molecules established in step (iv), the description including a first portion indicating the feature set for which the new set of molecules was established and a second portion indicating the activity characteristics of the molecules in the new set of molecules, whereby the first and second portions may cooperatively establish a correlation between molecular features and molecular activity.
Dependent claims: 2-27

28. A method for screening a data set representing a plurality of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the method comprising, in combination:
(a) with respect to the molecules represented by the data set:
    (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules,
    (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups,
    (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and
    (iv) for each feature set identified in the preceding step,
        (A) selecting from the molecules a number of molecules that exhibit the feature set,
        (B) establishing a new data set representing the number of molecules,
        (C) deciding whether to recursively repeat the method with respect to the new data set, and, if so,
        (D) repeating the method from step (i) with respect to the molecules represented by the new data set; and
(b) providing a description of at least one new data set established in step (iv), the description including a first portion indicating the feature set for which the new data set was established and a second portion indicating the activity characteristics of the molecules represented by the new data set, whereby the first and second portions may cooperatively establish a correlation between molecular features and activity.
Dependent claims: 29-47

48. A method for screening a data set representing molecules, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the method comprising, in combination:
(a) selecting from the data set at least one group of molecules that have similar feature characteristics and that cooperatively represent a particular activity characteristic, said group of molecules having a set of discriminating features defining similarity of the molecules in said group;
(b) for each of the at least one group selected in the preceding step, identifying at least one common subset of features of the molecules in each group based at least in part on a measure of how much said at least one common subset of features participated in defining the discriminating features of said group;
(c) for each common subset of features identified in the preceding step, establishing a new data set representing those molecules from the data set that include the common subset of features;
(d) selecting from the new data set at least one group of molecules that have similar feature characteristics and that cooperatively represent a particular activity characteristic, each of said at least one group of molecules having a set of discriminating features defining similarity of the molecules in said group;
(e) for each of the at least one group selected in the preceding step, identifying at least one common subset of features of the molecules in each group based at least in part on a measure of how much said at least one common subset of features participated in defining the discriminating features of said group; and
(f) outputting data indicative of at least one common subset of features.
Dependent claims: 49

50. A processing system for screening a data set representing a plurality of molecules, in order to assist in identifying sets of molecular features that are likely to correlate with specified activity, the data set defining, for each represented molecule, a feature characteristic and an activity characteristic, the processing system comprising, in combination:
(a) means for performing the following method steps with respect to the molecules represented by the data set:
    (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules,
    (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups,
    (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and
    (iv) for each feature set identified in the preceding step,
        (A) selecting from the molecules a number of molecules that exhibit the feature set,
        (B) establishing a new data set representing the number of molecules,
        (C) deciding whether to recursively repeat the functions with respect to the new data set, and, if so,
        (D) repeating from step (i) with respect to the molecules represented by the new data set; and
(b) means for providing a description of at least one new data set established in step (iv), the description including a first segment indicating the feature set for which the new data set was established and a second segment indicating the activity characteristics of the molecules represented by the new data set, whereby the first and second segments may cooperatively establish a correlation between molecular features and activity.

51. A computerized method of converting a set of data representing a plurality of molecules into a data structure representing pharmacophoric mechanisms, the set of data defining respectively for each molecule a structure and an activity characteristic, a node in a data storage medium representing the plurality of molecules, the method comprising, in combination:
(a) grouping the molecules of the node into a plurality of groups based on structural similarity of the molecules;
(b) selecting one or more of the groups established in element (a) based on the activity characteristics of the molecules in the groups;
(c) for each group selected in element (b), identifying a common substructure among the molecules in the group, the common substructure defining a pharmacophoric mechanism;
(d) for each common substructure identified in element (c),
    (i) selecting from the molecules of the node at least one molecule that includes the common substructure, and establishing a child node representing the at least one selected molecule;
    (ii) determining whether to expand the data structure from the child node, and, if so, repeating the method from step (a) with the node being the child node; and
(e) outputting an indication of at least a portion of the data structure including an indication of at least one pharmacophoric mechanism.
Dependent claims: 52-59

60. A method for building a multi-domain molecular classifier, the method comprising in combination:
(a) receiving a set of data representing a set of molecules;
(b) deriving one or more pharmacophores from the set of data, each pharmacophore defining a node of a multi-domain classifier;
(c) using each pharmacophore respectively as a filter to establish a new set of data representing a subset of the molecules, wherein each molecule in the subset includes the pharmacophore; and
(d) deriving one or more new pharmacophores from each new set of data, each new pharmacophore defining a node of the multi-domain classifier.
61. A chemical structure classification method comprising, in combination:
(a) receiving into a computer a set of data representing a training set of molecules, wherein each molecule of the training set has a feature characteristic and an activity characteristic;
(b) using the training set of molecules to generate a chemical structure classifier by a process comprising:
    (i) defining groups of the molecules based on similarity of the feature characteristics of the molecules,
    (ii) selecting one or more of the groups defined in the preceding step based on the activity characteristics of the molecules in the groups,
    (iii) for each group selected in the preceding step, identifying a feature set common to all molecules in the group, and
    (iv) for each feature set identified in the preceding step,
        (A) selecting from the molecules a number of molecules that exhibit the feature set,
        (B) establishing a new data set consisting of the number of molecules,
        (C) deciding whether to recursively repeat the method with respect to the new data set, and, if so,
        (D) repeating steps (i)-(iv) with respect to the molecules represented by the new data set;
(c) applying the chemical structure classifier to classify a given molecule into a plurality of structural classes; and
(d) providing as output for presentation to a person an indication of classes into which the given molecule was classified in step (c).
Dependent claims: 62, 63

64. A method of identifying multiple structural classes into which a given molecule fits comprising, in combination:
representing each of a plurality of molecules by a respective structure characteristic keyed to a set of structural descriptors;
hierarchically clustering representations of the molecules based on their respective structure characteristics, to thereby establish a hierarchical tree structure defining a plurality of nodes, each node representing at least one molecule;
for each of at least a plurality of nodes of the hierarchical tree structure, identifying a respective chemical substructure common to all of the at least one molecule represented by the node, each of at least a plurality of the identified chemical substructures being different than each of the structural descriptors;
filtering a representation of the given molecule through the hierarchical tree structure, the representation of the given molecule thereby falling within a plurality of nodes, wherein, for each given node into which the given molecule falls, the given molecule has the chemical substructure identified for the given node; and
providing as output an indication of the nodes into which the given molecule falls including an indication of the chemical substructure identified for each node into which the given molecule falls, whereby, each node into which the given molecule falls defines a structural class into which the given molecule fits.
Dependent claims: 65-71
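The filtering step of claim 64 can be pictured with a short sketch. It assumes, purely for illustration, that the hierarchical tree already exists and that each node's common substructure is modeled as a set of fragment names, so "has the substructure" becomes a subset test. `ClassNode`, `classify`, and the example tree are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ClassNode:
    label: str
    substructure: frozenset  # hypothetical fragment names standing in for a chemical substructure
    children: list = field(default_factory=list)


def classify(node, fragments):
    """Return the labels of every node the molecule falls into.

    A molecule falls into a node when it contains that node's substructure.
    Only matching nodes are descended into, and several sibling branches
    may match, so one molecule can land in multiple structural classes.
    """
    if not node.substructure <= fragments:
        return []
    hits = [node.label]
    for child in node.children:
        hits.extend(classify(child, fragments))
    return hits


# Illustrative tree: the root matches everything, with two structural branches.
ROOT = ClassNode("all molecules", frozenset(), [
    ClassNode("aryl amide", frozenset({"amide", "phenyl"}), [
        ClassNode("nitro aryl amide", frozenset({"amide", "phenyl", "nitro"})),
    ]),
    ClassNode("furan ester", frozenset({"ester", "furan"})),
])

test_molecule = frozenset({"amide", "phenyl", "nitro", "ester", "furan"})
for label in classify(ROOT, test_molecule):
    print(label)
```

Because the example molecule carries both the aryl amide and the furan ester fragments, it is reported under both branches, which is the multi-class behavior the claim describes.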
72. A software program that implements the method shown in FIG. 2.
Specification