METHODS, SYSTEMS, AND SOFTWARE FOR IDENTIFYING FUNCTIONAL BIO-MOLECULES
Abstract
The present invention generally relates to methods of rapidly and efficiently searching biologically-related data space. More specifically, the invention includes methods of identifying bio-molecules with desired properties, or which are most suitable for acquiring such properties, from complex bio-molecule libraries or sets of such libraries. The invention also provides methods of modeling sequence-activity relationships. As many of the methods are computer-implemented, the invention additionally provides digital systems and software for performing these methods.
55 Claims
1-20. (canceled)
21. A method for identifying amino acid residues for variation in a protein variant library in order to affect an activity of interest, the method comprising:
(a) receiving, for each protein variant in a training set, an amino acid sequence and the activity of interest obtained from assaying the protein variant;
(b) selecting amino acid residues and sequence positions of mutations in the training set;
(c) performing regression on the selected amino acid residues, sequence positions, and activities of the training set to produce a sequence-activity model for predicting the activity of interest as a function of multiple independent variables, the sequence-activity model comprising a plurality of linear terms and one or more non-linear terms, wherein, for each non-linear term, the non-linear term comprises a coefficient and two or more dummy independent variables, the coefficient indicates the contribution to the activity of interest by the interaction of the two or more dummy independent variables, and each of the two or more dummy independent variables specifies the presence or absence of a particular residue at a specific sequence position; and
(d) using the sequence-activity model to identify one or more amino acid residues at specific positions for varying or fixing to impact the activity of interest,
wherein operations (a)-(d) are performed by executing instructions on a computer system programmed to perform said operations.
Dependent claims: 22-46.
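A minimal sketch of steps (a) through (c) of claim 21, assuming ordinary least squares as the regression technique (the claim does not name one). The reference sequence, the training data, the 0-based positions, the choice of which pair of dummy variables gets a cross-product term, and the helper names `select_mutations` and `design_matrix` are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical training set for step (a): (amino acid sequence, assayed activity).
# The numbers are invented for illustration, not measured data.
reference = "AKLV"
training_set = [
    ("AKLV", 1.0), ("GKLV", 1.5), ("AKMV", 1.2), ("GKMV", 2.4),
    ("AKLI", 0.8), ("GKLI", 1.3), ("AKMI", 1.0), ("GKMI", 2.2),
]

def select_mutations(reference, sequences):
    """Step (b): collect (position, residue) pairs that differ from the reference."""
    found = set()
    for seq in sequences:
        for pos, (ref_res, res) in enumerate(zip(reference, seq)):
            if res != ref_res:
                found.add((pos, res))
    return sorted(found)

def design_matrix(sequences, mutations, interactions):
    """Build one row per variant: intercept, dummy variables, cross-products."""
    rows = []
    for seq in sequences:
        # Each dummy variable is 1 if the residue is present at the position, else 0.
        dummies = [1.0 if seq[pos] == res else 0.0 for pos, res in mutations]
        # Each non-linear term is the product of two dummy variables.
        cross = [dummies[i] * dummies[j] for i, j in interactions]
        rows.append([1.0] + dummies + cross)
    return np.array(rows)

sequences = [seq for seq, _ in training_set]
activities = np.array([activity for _, activity in training_set])

mutations = select_mutations(reference, sequences)
interactions = [(0, 1)]  # assumed: one cross-product term, first two mutations

# Step (c): regression of activity on the linear and non-linear terms.
X = design_matrix(sequences, mutations, interactions)
coef, *_ = np.linalg.lstsq(X, activities, rcond=None)

labels = ["intercept"] + [f"{res}{pos}" for pos, res in mutations] + ["G0*M2"]
for label, value in zip(labels, coef):
    print(f"{label:>10s}: {value:+.2f}")
```

In this sketch the coefficient on the cross-product column plays the role the claim assigns to a non-linear term: it captures the contribution that arises only when both residues are present together.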
47. A method for determining the sequence of a protein having an activity of interest, the method comprising:
(a) providing, from a client computer to a computer network, an amino acid sequence and a measurement of the activity of interest for each of a plurality of training protein variants in a training set; and
(b) receiving, at the client computer from the computer network, a residue type and a sequence position for each of one or more amino acids in at least one selected protein variant, wherein the selected protein variant was identified by a sequence-activity model to have a particular level of the activity of interest,
wherein the sequence-activity model comprises a plurality of linear terms and one or more non-linear terms, the linear and non-linear terms are separated by plus or minus signs,
each linear term comprises the product of a coefficient and a bit-value independent variable, wherein the coefficient of the linear term indicates the relative impact on activity by the bit-value independent variable, and wherein the bit-value independent variable specifies the presence or absence of only one particular amino acid residue of a specific residue type at a specific sequence position, and
each non-linear term is a cross-product term comprising the product of a coefficient and two or more bit-value independent variables, wherein the coefficient of the non-linear term indicates the relative impact on activity by the interaction of the two or more bit-value independent variables, and wherein each of the two or more bit-value independent variables specifies the presence or absence of a particular residue of a specific residue type at a specific sequence position.
Dependent claims: 48, 49.
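The model form recited in claim 47 can be written out as an equation. The following is one possible reading, restricted to pairwise cross-products for brevity. The symbols y, c_0, c_i, c_{ij}, x_i, and N are notation assumed here, not taken from the claim, and the intercept c_0 is likewise an assumption.

```latex
y = c_0 + \sum_{i=1}^{N} c_i x_i + \sum_{1 \le i < j \le N} c_{ij}\, x_i x_j ,
\qquad x_i \in \{0, 1\}
```

Here y is the predicted activity, each x_i is a bit-value variable that equals 1 when a particular residue type is present at a specific sequence position and 0 otherwise, each c_i is a linear-term coefficient, and each c_{ij} is the coefficient of a cross-product term. Terms "separated by minus signs" correspond to negative coefficients in this notation, and because the claim allows two or more variables per non-linear term, products of three or more x_i could be added in the same way.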
50. A system for performing directed evolution of a protein variant library in order to affect an activity of interest, the system comprising:
one or more memory devices configured to store sequence and activity data; and
control logic configured to:
(a) receive data characterizing a training set of a protein variant library, wherein the data provides activity and an amino acid sequence for each protein variant in the training set;
(b) from the received data, develop a sequence-activity model for predicting activity as a function of multiple independent variables, wherein the sequence-activity model comprises a plurality of linear terms and one or more non-linear terms, the linear and non-linear terms are separated by plus or minus signs, each linear term comprises the product of a coefficient and a bit-value independent variable, wherein the coefficient of the linear term indicates the relative impact on activity by the bit-value independent variable, and wherein the bit-value independent variable specifies the presence or absence of only one particular amino acid residue of a specific residue type at a specific sequence position, and each non-linear term is a cross-product term comprising the product of a coefficient and two or more bit-value independent variables, wherein the coefficient of the non-linear term indicates the relative impact on activity by the interaction of the two or more bit-value independent variables, and wherein each of the two or more bit-value independent variables specifies the presence or absence of a particular residue of a specific residue type at a specific sequence position; and
(c) use the sequence-activity model to identify one or more amino acid residues at specific positions for varying or fixing to impact the desired activity.
Dependent claims: 51, 52, 53.
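Element (c) of claim 50 does not say how the model is used to choose residues for varying or fixing. The sketch below shows one assumed heuristic: a residue is proposed for fixing when any term containing it has a coefficient at or above a threshold, and every other modeled residue is left open to variation. The coefficient values, the 0-based positions, the threshold, and the function name `propose_residues` are hypothetical.

```python
# Hypothetical fitted model: each key is a tuple of (position, residue) dummy
# variables forming one term; each value is that term's coefficient.
coefficients = {
    ((0, "G"),): 0.5,             # linear term
    ((2, "M"),): 0.2,             # linear term
    ((3, "I"),): -0.2,            # linear term
    ((0, "G"), (2, "M")): 0.7,    # cross-product (non-linear) term
}

def propose_residues(coefficients, threshold=0.3):
    """Split modeled residues into 'fix' and 'vary' sets.

    Assumed heuristic, not taken from the claims: a residue is fixed if it
    appears in at least one term whose coefficient is >= threshold; all other
    residues that appear in the model are proposed for further variation.
    """
    fix, seen = set(), set()
    for term, coef in coefficients.items():
        seen.update(term)
        if coef >= threshold:
            fix.update(term)
    vary = seen - fix
    return sorted(fix), sorted(vary)

fix, vary = propose_residues(coefficients)
print("fix: ", fix)
print("vary:", vary)
```

Under this heuristic a residue with a small linear coefficient can still be fixed because of a large cross-product coefficient, which is the kind of interaction effect the non-linear terms are meant to expose.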
54. The system of claim 50, wherein the massively parallel screening apparatus comprises biosensors for detecting reaction product(s) selected from the group consisting of antibodies with reporter properties, and those based on in vivo affinity recognition coupled with expression and activity of a reporter gene.
55. The system of claim 50, wherein receiving data characterizing the training set of the protein variant library comprises receiving data from a client computer through a computer network.
Specification