Methods for representing sequence-dependent contextual information present in polymer sequences and uses thereof
Abstract
The invention includes methods of representing polymer sequences in a way that reveals important position-specific contextual information. The representations can be used to determine a number of properties of polymers, such as protein and nucleic acid sequences, including the identification of secondary domain structures, folding rate constants, and the effects of altering (e.g., mutating) monomers. In addition, the representations can be used to compare polymers and thereby identify important structural and functional characteristics of polymers.
28 Claims
-
1. A method of representing a polymer sequence, the method comprising:
-
obtaining a position vector descriptor (PVD) for one or more positions in the polymer; and
replacing the monomer(s) with the corresponding PVD(s) in the representation of the polymer. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of predicting the effects of a change in the sequence of a protein, the method comprising:
-
obtaining a mathematical relationship that predicts the effects of a change in the sequence of a protein, wherein the input variable for the mathematical relationship is the difference between the value of a PVD element corresponding to the changed monomer and the value of a PVD element corresponding to the original monomer, and wherein the two PVD elements are from the same PVD and the PVD represents the position at which the change is located in the protein;
obtaining a PVD representing a position of interest in the protein; and
using (i) the difference between elements of the PVD representing the position of interest in the protein and (ii) the mathematical relationship to calculate the predicted effects of a change in sequence of the protein. - View Dependent Claims (8)
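The calculation in claim 7 can be sketched as follows. This is only an illustration under stated assumptions: the PVD for a position is modelled as a mapping from monomer symbol to a numeric element, and the "mathematical relationship" is a hypothetical linear function whose slope and intercept are made up for the example. Neither the data layout nor the functional form is taken from the claims.

```python
def predicted_effect(pvd, original, changed, relationship):
    """Apply a relationship to the difference between two elements of one PVD.

    pvd          -- assumed layout: dict mapping monomer symbol -> element value
    original     -- monomer present at the position before the change
    changed      -- monomer present at the position after the change
    relationship -- callable taking the element difference as its input
    """
    delta = pvd[changed] - pvd[original]
    return relationship(delta)


def linear_relationship(delta, slope=2.0, intercept=0.5):
    """Hypothetical relationship; slope and intercept are illustrative only."""
    return slope * delta + intercept


# Toy PVD for a single position of interest (values are invented).
pvd_at_position = {"A": 0.25, "G": 0.75, "L": 1.5}
effect = predicted_effect(pvd_at_position, "A", "G", linear_relationship)
print(effect)
```

In practice the relationship would be obtained separately, for example by fitting measured effects against element differences; the sketch only shows how the difference feeds into it.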
-
-
9. A method of predicting secondary structure boundaries in a protein sequence, the method comprising:
-
obtaining PVDs for some or all amino acid positions in the protein sequence;
constructing a leading monomer distribution map (LMDM) for the protein; and
dividing the LMDM into segments representing predicted units of secondary structure.
-
-
11. A method for identifying structural homologs of a protein, the method comprising:
-
obtaining PVDs for some or all amino acid positions in the protein sequence;
determining the effective primary sequence of the protein; and
searching a protein database for sequences homologous to the effective primary sequence of the protein. - View Dependent Claims (12)
-
-
13. A method of identifying positions of contextual similarity in a pair of polymers, the method comprising:
-
a) obtaining a first set of PVDs describing one or more positions in the first polymer and a second set of PVDs describing one or more positions in the second polymer;
b) calculating a difference matrix for the first set of PVDs with respect to the second set of PVDs;
c) identifying the elements in the resulting difference matrix that are within a pre-selected range; and
d) optionally, graphing the identified elements.
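Steps (a) through (c) of claim 13 can be sketched as below. The sketch assumes each PVD is a fixed-length numeric vector and uses Euclidean distance as the difference measure; both are assumptions made for illustration, since the claim does not fix the PVD layout or the measure. The function names are hypothetical.

```python
import math


def difference_matrix(pvds_a, pvds_b):
    """Entry [i][j] compares PVD i of the first polymer with PVD j of the second.

    Euclidean distance is an assumed difference measure.
    """
    return [[math.dist(a, b) for b in pvds_b] for a in pvds_a]


def elements_in_range(matrix, low, high):
    """Return (i, j, value) for every element with low <= value <= high."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if low <= value <= high
    ]


# Toy PVD sets for two short polymers (values are invented).
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[1.0, 0.0], [3.0, 4.0]]

matrix = difference_matrix(first, second)
hits = elements_in_range(matrix, 0.0, 1.5)
print([(i, j) for i, j, _ in hits])
```

The returned index pairs are the candidate positions of contextual similarity; graphing them, as in optional step (d), would be a scatter plot of first-polymer position against second-polymer position.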
-
-
14. A method of identifying positions of contextual similarity in a polymer, the method comprising:
-
a) obtaining a set of PVDs describing one or more positions in the polymer, wherein the set of PVDs has been simplified to include a reduced number of elements, X;
b) performing pair-wise comparisons of each PVD (CLXPVD) from the set of PVDs, wherein two PVDs that have a threshold number, t, of CLMs in common are identified as representing monomer positions that are contextually similar; and
c) optionally, generating a matrix (E-MAAP™) representing the results of step (b). - View Dependent Claims (15)
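The pairwise comparison in claim 14 can be sketched as below. As an assumption for illustration, each simplified PVD is represented by the collection of its X CLMs, treated simply as labels, and the result is stored as a binary matrix. Whether a position is compared with itself, and how the matrix values are encoded, are choices of this sketch rather than details stated in the claim.

```python
def common_clm_matrix(clxpvds, t):
    """Binary matrix: 1 where two simplified PVDs share at least t CLMs.

    clxpvds -- assumed layout: one sequence of X CLM labels per position
    t       -- threshold number of CLMs that must be in common
    """
    clm_sets = [set(p) for p in clxpvds]
    n = len(clm_sets)
    return [
        [1 if len(clm_sets[i] & clm_sets[j]) >= t else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy data: three positions, X = 3 labels each (labels are invented).
clxpvds = [["A", "L", "V"], ["L", "V", "G"], ["K", "E", "D"]]
similarity = common_clm_matrix(clxpvds, t=2)
for row in similarity:
    print(row)
```

With two polymers instead of one, the same comparison would run over every pair drawn from the first and second sets, giving a rectangular rather than square matrix.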
-
-
16. A method of identifying proteins that have similar structural folds, the method comprising:
-
obtaining a first scaled E-MAAP™, wherein the E-MAAP™ is scaled using amino acid cohesion energies;
obtaining a second scaled E-MAAP™, wherein the E-MAAP™ is scaled using amino acid cohesion energies, and wherein the polymer sequence of the second scaled E-MAAP™ is different from the polymer sequence of the first scaled E-MAAP™; and
determining the similarity of the second scaled E-MAAP™ with respect to the first scaled E-MAAP™. - View Dependent Claims (10, 17)
-
-
18. A method of estimating the folding rate of a protein, the method comprising:
-
obtaining a scaled E-MAAP™, wherein the E-MAAP™ is scaled using the Richardson hydrophobicity scale;
making a three-dimensional representation of the scaled E-MAAP™;
integrating the positive volume of the three-dimensional representation; and
using the value resulting from the integration to estimate the folding rate of the protein.
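The integration step of claim 18 can be sketched as below, assuming the three-dimensional representation is the scaled matrix read as a height field over a regular grid, so that the positive volume is approximated by summing the positive entries times the area of one grid cell. The grid interpretation and the `cell_area` parameter are assumptions of this sketch, and the mapping from the integrated value to a folding rate is not shown because the claim does not specify it.

```python
def positive_volume(surface, cell_area=1.0):
    """Approximate the volume under the positive part of a height field.

    surface   -- 2D list of scaled matrix values, read as heights
    cell_area -- assumed area of one grid cell (hypothetical parameter)
    """
    return sum(v for row in surface for v in row if v > 0) * cell_area


# Toy scaled matrix (values are invented).
scaled = [[0.5, -1.0], [2.0, 0.0]]
volume = positive_volume(scaled)
print(volume)
```

Negative entries are ignored entirely rather than subtracted, which is what restricts the integral to the positive volume.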
-
-
19. A method of identifying positions of contextual similarity in a pair of polymers, the method comprising:
-
a) obtaining a first set of PVDs describing one or more positions in the first polymer and a second set of PVDs describing one or more positions in the second polymer, wherein the PVDs of the first and second set of PVDs have been simplified to include a limited number of elements, X;
b) performing pairwise comparisons of each PVD (CLXPVD) from the first set of PVDs with each PVD (CLXPVD) from the second set of PVDs, wherein two PVDs that have a threshold number, t, of CLMs in common are identified as representing monomer positions that are contextually similar; and
c) optionally, generating a matrix (E-MAAP™) representing the results of step (b). - View Dependent Claims (20, 21)
-
-
22. A method of representing a polymer sequence, the method comprising:
-
obtaining a PVD representing a position in the polymer sequence; and
using the elements of the PVD to construct a Context Functional Surface (CFS) for one or more positions in the polymer sequence. - View Dependent Claims (23)
-
-
24. A method of characterizing secondary structure segments in a protein, the method comprising:
-
a) obtaining a PVD representing a particular monomer position, R, in the protein;
b) using the PVD of step a) to generate a CFS for some or all monomer positions in the polymer;
c) plotting the positive values of the CFSs of step b) on a single graph to produce a G-profile; and
d) analyzing the G-profile.
-
-
25. A method of characterizing the contextual similarity of different positions in a polymer, the method comprising:
-
a) obtaining a PVD representing a particular monomer position, R, in the polymer;
b) using the PVD to generate a set of CFSs for some or all positions in the polymer;
c) calculating a correlation matrix, rR, for the set of CFSs generated in step b);
d) repeating steps a) through c) for some or all positions, R, in the polymer; and
e) using the correlation matrices of step d) to generate a GCD for the polymer.
-
-
26. A method of identifying contextually unique positions in a polymer, the method comprising:
-
obtaining a GCD for the polymer;
identifying elements in the GCD that are greater than or equal to a predetermined threshold value; and
identifying correlated islands in the set of GCD elements identified as exceeding the threshold value.
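The thresholding and island steps of claim 26 can be sketched as below. The sketch assumes the GCD is a two-dimensional numeric matrix and interprets a "correlated island" as a group of above-threshold elements that touch horizontally or vertically. That connectivity rule is an assumption made for illustration and is not stated in the claim.

```python
from collections import deque


def find_islands(gcd, threshold):
    """Group elements >= threshold into 4-connected islands of (row, col) cells."""
    rows, cols = len(gcd), len(gcd[0])
    seen = set()
    islands = []
    for r in range(rows):
        for c in range(cols):
            if gcd[r][c] < threshold or (r, c) in seen:
                continue
            # Breadth-first search over neighbouring above-threshold cells.
            island, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                island.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and (ny, nx) not in seen
                        and gcd[ny][nx] >= threshold
                    ):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            islands.append(sorted(island))
    return islands


# Toy GCD (values are invented).
gcd = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.1], [0.1, 0.1, 0.95]]
islands = find_islands(gcd, 0.8)
print(islands)
```

Each island lists the matrix positions it covers, so isolated single-cell islands and larger blocks can be told apart by length.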
-
-
27. A method of predicting the effects of mutations on the structure of a protein, the method comprising:
-
a) obtaining a GCD for the protein;
b) identifying a position P in the GCD;
c) identifying a position R in the GCD;
d) plotting the row vector of the GCD at position P and the column vector of the GCD at position R on the same graph; and
e) identifying peaks in the graph, thereby identifying positions in the protein that are predicted to disrupt the structural stability of the protein when mutated.
-
-
28. A method of identifying positions in a nucleic acid sequence, the method comprising:
-
a) obtaining a GCD for a protein encoded by the nucleic acid sequence;
b) identifying a position P in the GCD;
c) identifying a position R in the GCD;
d) plotting the row vector of the GCD at position P and the column vector of the GCD at position R on the same graph; and
e) identifying positions in the graph corresponding to positions in the protein that are predicted to influence the structural stability of the protein; and
f) identifying regions of the nucleic acid sequence that encode the amino acids identified in step e), thereby identifying positions in the nucleic acid sequence that are likely to contain SNPs.
-
Specification