Method and apparatus using Bayesian subfamily identification for sequence analysis
First Claim
1. A system for agglomeratively estimating a phylogenetic tree for proteins from input data arrayed to form multiple sequence alignments (MSA), the system including:
a processing unit that executes a routine to estimate said phylogenetic tree;
memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system carries out the following steps;
a) creating a profile of data represented by each node in a model of said phylogenetic tree;
b) using a symmetrized form of relative entropy to measure distance among nodes between subtrees to determine, at each agglomerative step, which nodes to merge in said model of said phylogenetic tree;
wherein topology of said phylogenetic tree is estimated.
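Steps a) and b) of this claim can be illustrated with a minimal sketch. It is not the patented routine: the add-one pseudocount profile, the choice of D(p||q) + D(q||p) as the symmetrized form, the summation over columns, and all function names are assumptions made for illustration only.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def column_counts(seqs):
    """Per-column residue counts for equal-length aligned sequences (gaps skipped)."""
    counts = [dict.fromkeys(ALPHABET, 0.0) for _ in range(len(seqs[0]))]
    for s in seqs:
        for i, ch in enumerate(s):
            if ch in ALPHABET:
                counts[i][ch] += 1.0
    return counts

def profile(counts, pseudo=1.0):
    """Turn counts into per-column distributions (add-one smoothing is an assumption)."""
    prof = []
    for col in counts:
        total = sum(col.values()) + pseudo * len(ALPHABET)
        prof.append({a: (col[a] + pseudo) / total for a in ALPHABET})
    return prof

def sym_rel_entropy(p, q):
    """D(p||q) + D(q||p), which simplifies to sum((p - q) * log(p / q))."""
    return sum((p[a] - q[a]) * math.log(p[a] / q[a]) for a in ALPHABET)

def node_distance(prof_a, prof_b):
    """Distance between two nodes: symmetrized relative entropy summed over columns."""
    return sum(sym_rel_entropy(p, q) for p, q in zip(prof_a, prof_b))

def agglomerate(seqs):
    """Greedily merge the closest pair of nodes; return the topology as nested tuples."""
    nodes = [((i,), i) for i in range(len(seqs))]  # (member indices, subtree)
    while len(nodes) > 1:
        profs = [profile(column_counts([seqs[i] for i in members]))
                 for members, _ in nodes]
        a, b = min(combinations(range(len(nodes)), 2),
                   key=lambda ab: node_distance(profs[ab[0]], profs[ab[1]]))
        merged = (nodes[a][0] + nodes[b][0], (nodes[a][1], nodes[b][1]))
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0][1]

tree = agglomerate(["ACDE", "ACDE", "WYWY", "WYWF"])
```

On this toy alignment the two identical sequences merge first, then the two near-identical ones, and the final merge joins the two pairs. Rebuilding each merged node's profile from its member sequences keeps the sketch short; it ignores the estimate of independent observations mentioned in the abstract.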
Abstract
A system and method agglomeratively estimate a phylogenetic tree from MSA input data by building a model of the data represented by each tree node, first estimating the number of independent observations in the data. A distance measurement among nodes between subtrees, preferably relative entropy, determines which nodes in the model to merge at each agglomeration step. Cuts in the phylogenetic tree are made at the points in the agglomeration at which encoding cost is minimized, preferably by using Dirichlet mixture densities to assign probabilities to observed amino acids within each subfamily at each position. Using subtree data, a statistical model, e.g., a profile or hidden Markov model, may be constructed for each subfamily in a position-dependent manner, which permits identifying remote homologs in a database search. Further, the invention provides an alignment analysis to identify key functional or structural residues. Finally, the invention may be carried out in automated fashion using a computer system in which a processor unit executes a storable routine embodying the preferred methodology.
59 Citations
43 Claims
1. A system for agglomeratively estimating a phylogenetic tree for proteins from input data arrayed to form multiple sequence alignments (MSA), the system including:
a processing unit that executes a routine to estimate said phylogenetic tree;
memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system carries out the following steps;
a) creating a profile of data represented by each node in a model of said phylogenetic tree;
b) using a symmetrized form of relative entropy to measure distance among nodes between subtrees to determine, at each agglomerative step, which nodes to merge in said model of said phylogenetic tree;
wherein topology of said phylogenetic tree is estimated.
Dependent claims: 2, 3, 4, 5
6. A system to determine cuts in a phylogenetic tree, representable as data arrayed to form multiple sequence alignments (MSA), that was agglomeratively formed such that encoding cost is measurable at every point in such agglomeration, the system including:
a processing unit that executes a routine to determine cuts to be made in said phylogenetic tree; and
memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system uses Dirichlet mixture densities to assign probabilities to individual columns in said MSA to determine at which point in said agglomeration encoding cost is minimized;
wherein a cut in said phylogenetic tree is identified.
Dependent claims: 7, 8, 9, 10
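The idea of scoring candidate cuts by encoding cost can be sketched as follows. This is an illustration under stated assumptions, not the claimed routine: the column probability uses the standard Dirichlet-multinomial formula for a count vector, the cost is taken as the negative base-2 logarithm of the mixture probability summed over subfamilies and columns, any further cost terms the specification may define are not reproduced, and the two-letter alphabet, the mixture parameters, and all function names are hypothetical.

```python
import math

def log_dirichlet_multinomial(counts, alpha):
    """log P(counts | alpha) for one Dirichlet component (multinomial form)."""
    n, a = sum(counts), sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        out += math.lgamma(c + al) - math.lgamma(c + 1) - math.lgamma(al)
    return out

def column_cost_bits(counts, mixture):
    """-log2 of the mixture probability of one column's counts."""
    logs = [math.log(w) + log_dirichlet_multinomial(counts, alpha)
            for w, alpha in mixture]
    m = max(logs)  # log-sum-exp for numerical stability
    log_p = m + math.log(sum(math.exp(x - m) for x in logs))
    return -log_p / math.log(2)

def count_column(seqs, col, alphabet):
    """Count each alphabet symbol in one column of a group of sequences."""
    return [sum(1 for s in seqs if s[col] == a) for a in alphabet]

def partition_cost_bits(partition, mixture, alphabet):
    """Total cost of a partition: every column of every subfamily is encoded."""
    length = len(partition[0][0])
    return sum(column_cost_bits(count_column(sub, c, alphabet), mixture)
               for sub in partition for c in range(length))

def best_cut(candidates, mixture, alphabet):
    """Pick the candidate partition with the lowest total encoding cost."""
    return min(candidates, key=lambda p: partition_cost_bits(p, mixture, alphabet))

# Hypothetical toy setup: two-letter alphabet, two mixture components.
TOY_ALPHABET = "AB"
TOY_MIXTURE = [(0.5, (0.1, 0.1)), (0.5, (1.0, 1.0))]  # (weight, alpha vector)
one_family = [["AAAA", "AAAA", "BBBB", "BBBB"]]
two_subfamilies = [["AAAA", "AAAA"], ["BBBB", "BBBB"]]
singletons = [["AAAA"], ["AAAA"], ["BBBB"], ["BBBB"]]
chosen = best_cut([one_family, two_subfamilies, singletons],
                  TOY_MIXTURE, TOY_ALPHABET)
```

In this toy case the two-subfamily partition has the lowest cost: a single family pays for mixed columns, while singletons pay for encoding every sequence separately. In an agglomerative setting the candidates would be the partitions produced at successive merge steps.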
11. A system that constructs at least one statistical model from data representing decomposition of a phylogenetic tree into subtrees produced using encoding cost measurements, the system including:
a processing unit that executes a routine to construct a statistical model from said data; and
memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system constructs a statistical model for each subfamily in which subfamilies are equal in number to sequences in a subtree of said phylogenetic tree;
wherein said statistical model is created in a position-dependent manner such that for every column in an alignment of data used as input to said system, making a measure whether all subfamilies have chosen a same or a different component of a Dirichlet density, and selecting a component-decision analysis providing a lower encoding cost outcome.
Dependent claims: 12, 13, 14, 15
16. A system that provides position-by-position analysis from multiple sequence alignment (MSA) data, given a subfamily decomposition, the system comprising:
a processing unit that executes a routine to provide said position-by-position analysis; and
memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system provides, within each subfamily, said position-by-position analysis by computing average conservation across subfamilies at each position in said MSA.
Dependent claims: 17, 18, 19, 20, 21
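A position-by-position average of within-subfamily conservation might be computed as in the sketch below. The claim does not define the conservation measure; here it is assumed to be log2(20) minus the Shannon entropy of the observed residue frequencies in a column, with gaps ignored. The function names are hypothetical.

```python
import math
from collections import Counter

def column_conservation(residues, alphabet_size=20):
    """Assumed measure: log2(alphabet size) minus Shannon entropy, in bits."""
    counts = Counter(r for r in residues if r != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an all-gap column carries no conservation signal
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def average_conservation(subfamilies):
    """Per-position conservation, computed within each subfamily, then averaged."""
    length = len(subfamilies[0][0])
    return [
        sum(column_conservation([s[c] for s in sub]) for sub in subfamilies)
        / len(subfamilies)
        for c in range(length)
    ]

scores = average_conservation([["AC", "AD"], ["WC", "WE"]])
```

In this toy input the first position is perfectly conserved within each subfamily even though the subfamilies differ from each other, so it scores log2(20) bits; the second position varies within both subfamilies and scores one bit less.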
22. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to agglomeratively estimate a phylogenetic tree from input data arrayed to form multiple sequence alignments (MSA) by:
creating a profile of data represented by each node in a model of said phylogenetic tree; and
using a symmetrized form of relative entropy to measure distance among nodes between subtrees to determine, at each agglomerative step, which nodes to merge in said model of said phylogenetic tree.
Dependent claims: 23, 24, 25, 26
27. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to determine cuts in a phylogenetic tree, representable as data arrayed to form multiple sequence alignments (MSA), that was agglomeratively formed such that encoding cost is measurable at every point in such agglomeration by:
using Dirichlet mixture densities to assign probabilities to individual columns in said MSA to determine at which point in said agglomeration encoding cost is minimized;
wherein a cut in said phylogenetic tree is identified.
Dependent claims: 28, 29, 30, 31, 32
33. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to construct at least one statistical model from data representing decomposition of a phylogenetic tree into subtrees produced using encoding cost measurements by:
constructing a statistical model from said data, including constructing a statistical model for each subfamily in which subfamilies are equal in number to sequences in a subtree of said phylogenetic tree;
wherein said statistical model is created in a position-dependent manner such that for every column in an alignment of data used as input to said system, making a measure whether all subfamilies have chosen a same or a different component of a Dirichlet density, and selecting a component-decision analysis providing a lower encoding cost outcome.
Dependent claims: 34, 35, 36, 37
38. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to provide position-by-position analysis from multiple sequence alignment (MSA) data, given a subfamily decomposition, by:
providing, within each subfamily, said position-by-position analysis by computing average conservation across subfamilies at each position in said MSA.
Dependent claims: 39, 40, 41, 42, 43
Specification