Method and apparatus using Bayesian subfamily identification for sequence analysis

US 6,128,587 A
Filed: 01/14/1998
Issued: 10/03/2000
Est. Priority Date: 01/14/1997
Status: Expired due to Fees

Abstract

A system and method agglomeratively estimate a phylogenetic tree from MSA input data by creating a model of the data represented by each tree node, first estimating the number of independent observations in the data. A distance measurement, preferably relative entropy, made among nodes between subtrees determines which nodes in the model to merge at each agglomeration step. Cuts in the phylogenetic tree are made at the points in the agglomeration at which encoding cost is minimized, preferably by using Dirichlet mixture densities to assign probabilities to observed amino acids within each subfamily at each position. Using subtree data, a statistical model, e.g., a profile or hidden Markov model, may be constructed for each subfamily in a position-dependent manner, which permits identifying remote homologs in a database search. Further, the invention provides an alignment analysis to identify key functional or structural residues. Finally, the invention may be carried out in automated fashion using a computer system in which a processor unit executes a storable routine embodying the preferred methodology.

59 Citations

43 Claims

1. A system for agglomeratively estimating a phylogenetic tree for proteins from input data arrayed to form multiple sequence alignments (MSA), the system including:
- a processing unit that executes a routine to estimate said phylogenetic tree;
  
  memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system carries out the following steps;
  
  a) creating a profile of data represented by each node in a model of said phylogenetic tree;
  
  b) using a symmetrized form of relative entropy to measure distance among nodes between subtrees to determine, at each agglomerative step, which nodes to merge in said model of said phylogenetic tree;
  
  wherein topology of said phylogenetic tree is estimated.
- Dependent claims (2, 3, 4, 5):
  - 2. The system of claim 1, wherein step (a) of said routine creates a model of data represented by each node of said phylogenetic tree using a Bayesian and information theoretic method;
    - wherein data represented by each said node is alignment of sequences descending from said node.
  - 3. The system of claim 1, wherein in determining minimized encoding cost said system examines, for each sub-alignment corresponding to each subfamily, subfamily alignment for each sequence, and encoding for each position of said alignment.
  - 4. The system of claim 1, wherein step (a) of said routine creates a model of data represented by each node of said phylogenetic tree using a Bayesian and information theoretic method to estimate amino acid distributions at every column in said MSA to create a model of all data descending therefrom;
    - wherein data represented by each said node is alignment of sequences descending from said node.
  - 5. The system of claim 1, wherein step (a) of said routine represents data descending from said node using a Bayesian and information theoretic method employing Dirichlet mixture densities as priors over amino acid distributions to estimate posterior amino acid distributions in said model for said data represented by said node.
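
The merge criterion in claim 1, step (b), can be illustrated with a minimal Python sketch. The claim requires only "a symmetrized form of relative entropy"; averaging the two directed divergences and summing over alignment columns are assumptions made here for illustration, as are the function names and the representation of a profile as a list of per-column probability vectors with strictly positive entries.

```python
import math


def relative_entropy(p, q):
    """Directed relative entropy D(p || q) in bits.

    Assumes q[i] > 0 wherever p[i] > 0 (true for smoothed profile columns).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrized_relative_entropy(p, q):
    """Average of the two directed divergences (one possible symmetrization)."""
    return 0.5 * (relative_entropy(p, q) + relative_entropy(q, p))


def profile_distance(profile_a, profile_b):
    """Sum the column-wise symmetrized divergences of two profiles.

    Each profile is a list of columns; each column is a probability vector.
    """
    return sum(
        symmetrized_relative_entropy(p, q) for p, q in zip(profile_a, profile_b)
    )


def closest_pair(profiles):
    """Return (distance, i, j) for the two profiles that would merge next."""
    best = None
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            d = profile_distance(profiles[i], profiles[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best
```

In a full agglomeration, the two closest nodes would be merged, a profile would be built for the merged node, and the search would repeat until a single node remains.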

6. A system to determine cuts in a phylogenetic tree, representable as data arrayed to form multiple sequence alignments (MSA), that was agglomeratively formed such that encoding cost is measurable at every point in such agglomeration, the system including:
- a processing unit that executes a routine to determine cuts to be made in said phylogenetic tree; and
  
  memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system uses Dirichlet mixture densities to assign probabilities to individual columns in said MSA to determine at which point in said agglomeration encoding cost is minimized;
  
  wherein a cut in said phylogenetic tree is identified.
- Dependent claims (7, 8, 9, 10):
  - 7. The system of claim 6, wherein in determining minimized encoding cost said system uses a position-by-position measure of likelihood of subfamily alignments, given said Dirichlet mixture densities.
  - 8. The system of claim 6, wherein in determining minimized encoding cost said system examines, for each sub-alignment corresponding to each subfamily, subfamily alignment for each sequence, and encoding for each position of said alignment.
  - 9. The system of claim 6, wherein said phylogenetic tree classifies proteins.
  - 10. The system of claim 6, wherein said system computes encoding costs under a first hypothesis that each subfamily chooses a same component of a Dirichlet density, and under a second hypothesis that each subfamily is permitted to choose a different component of a same Dirichlet density independently of other subfamily choices, and selects whichever of said first and second hypothesis provides minimum encoding cost.
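
The cut selection of claims 6 and 7 can be sketched as follows. This is an illustrative reading, not the patented routine: it assumes the encoding cost of a partition is the sum, over subfamilies and columns, of the negative log2 probability of the column's residue counts under a Dirichlet mixture, using the ordered-observation form of the Dirichlet-multinomial. Any cost for encoding which sequence belongs to which subfamily is omitted. All function names and the `(weight, alpha)` mixture representation are hypothetical.

```python
import math


def log2_prob_counts(counts, alpha):
    """log2 P(counts | alpha) for one Dirichlet component (ordered observations)."""
    n, a = sum(counts), sum(alpha)
    lp = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        lp += math.lgamma(c + al) - math.lgamma(al)
    return lp / math.log(2)


def log2_mixture_prob(counts, mixture):
    """log2 of sum_j weight_j * P(counts | alpha_j), computed stably."""
    logs = [math.log2(w) + log2_prob_counts(counts, alpha) for w, alpha in mixture]
    m = max(logs)
    return m + math.log2(sum(2 ** (value - m) for value in logs))


def column_counts(seqs, col, alphabet):
    """Count each alphabet symbol in one column; other symbols are ignored."""
    return [sum(1 for s in seqs if s[col] == ch) for ch in alphabet]


def encoding_cost(partition, alphabet, mixture):
    """Bits needed to encode every column of every subfamily in a partition."""
    cost = 0.0
    for subfamily in partition:
        for col in range(len(subfamily[0])):
            counts = column_counts(subfamily, col, alphabet)
            cost -= log2_mixture_prob(counts, mixture)
    return cost


def best_cut(partitions, alphabet, mixture):
    """Return (index, costs): the agglomeration step with minimum encoding cost."""
    costs = [encoding_cost(p, alphabet, mixture) for p in partitions]
    k = min(range(len(costs)), key=costs.__getitem__)
    return k, costs
```

Here `partitions` would hold one candidate subfamily decomposition per agglomeration step, from all singletons to a single family, and the returned index marks where the tree would be cut.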

11. A system that constructs at least one statistical model from data representing decomposition of a phylogenetic tree into subtrees produced using encoding cost measurements, the system including:
- a processing unit that executes a routine to construct a statistical model from said data; and
  
  memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system constructs a statistical model for each subfamily in which subfamilies are equal in number to sequences in a subtree of said phylogenetic tree;
  
  wherein said statistical model is created in a position-dependent manner such that for every column in an alignment of data used as input to said system, making a measure whether all subfamilies have chosen a same or a different component of a Dirichlet density, and selecting a component-decision analysis providing a lower encoding cost outcome.
- Dependent claims (12, 13, 14, 15):
  - 12. The system of claim 11, wherein said statistical model is a profile.
  - 13. The system of claim 11, wherein said statistical model is a hidden Markov model.
  - 14. The system of claim 13, wherein said system further computes amino acid distribution for a corresponding node of said hidden Markov model.
  - 15. The system of claim 13, wherein each said statistical model is created using an estimation of independent counts for each subfamily;
    - wherein said estimation accounts for subfamily alignment in its entirety to weigh subfamily sequences during parameter estimation of said hidden Markov model.

16. A system that provides position-by-position analysis from multiple sequence alignment (MSA) data, given a subfamily decomposition, the system comprising:
- a processing unit that executes a routine to provide said position-by-position analysis; and
  
  memory coupled to said processing unit and storing said routine such that when said routine is executed by said processing unit said system provides, within each subfamily, said position-by-position analysis by computing average conservation across subfamilies at each position in said MSA.
- Dependent claims (17, 18, 19, 20, 21):
  - 17. The system of claim 16, wherein said execution of said routine distinguishes general conservation from sub-family specific conservation.
  - 18. The system of claim 16, wherein said routine uses Dirichlet mixture densities to identify positions at which subfamilies have chosen different components of said Dirichlet mixture density;
    - wherein so-identified positions denote variable physico-chemical constraints among subfamilies.
  - 19. The system of claim 16, wherein said routine identifies positions using a Bayesian and information theoretic analysis to yield minimum encoding costs.
  - 20. The system of claim 18, wherein said routine identifies positions using a Bayesian and information theoretic analysis to yield minimum encoding costs, wherein same components of said Dirichlet mixture density are used.
  - 21. The system of claim 18, wherein said routine identifies positions using a Bayesian and information theoretic analysis to yield minimum encoding costs, wherein different components of said Dirichlet mixture density are used.
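
The position-by-position analysis of claims 16 and 17 can be sketched in Python. The claims do not fix a conservation measure; this sketch assumes conservation is the maximum possible entropy minus the Shannon entropy of the observed residue frequencies in a column, and it ignores gaps and sequence weighting. The function names and the list-of-lists input format are hypothetical.

```python
import math
from collections import Counter


def conservation(column, alphabet_size):
    """Conservation in bits: log2(alphabet size) minus the column's entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def position_report(subfamilies, alphabet_size=20):
    """For each alignment position, return (avg_within, pooled) conservation.

    avg_within: mean of the per-subfamily conservation scores.
    pooled:     conservation computed over all sequences together.
    A large gap between the two suggests subfamily-specific conservation.
    """
    length = len(subfamilies[0][0])
    report = []
    for col in range(length):
        within = [
            conservation([seq[col] for seq in fam], alphabet_size)
            for fam in subfamilies
        ]
        pooled = conservation(
            [seq[col] for fam in subfamilies for seq in fam], alphabet_size
        )
        report.append((sum(within) / len(within), pooled))
    return report
```

A position where both scores are high is conserved across the whole family. A position where the within-subfamily average is high but the pooled score is lower is conserved within each subfamily yet differs between them.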

22. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to agglomeratively estimate a phylogenetic tree from input data arrayed to form multiple sequence alignments (MSA) by:
- creating a profile of data represented by each node in a model of said phylogenetic tree; and
  
  using a symmetrized form of relative entropy to measure distance among nodes between subtrees to determine, at each agglomerative step, which nodes to merge in said model of said phylogenetic tree.
- Dependent claims (23, 24, 25, 26):
  - 23. The medium of claim 22, wherein creating said profile is carried out by creating a model of data represented by each node of said phylogenetic tree using a Bayesian and information theoretic method;
    - wherein data represented by each said node is alignment of sequences descending from said node.
  - 24. The medium of claim 22, wherein creating said profile is carried out by creating a model of data represented by each node of said phylogenetic tree using a Bayesian and information theoretic method to estimate amino acid distributions at every column in said MSA to create a model of all data descending therefrom;
    - wherein data represented by each said node is alignment of sequences descending from said node.
  - 25. The medium of claim 22, wherein creating said profile is carried out by representing data descending from said node using a Bayesian and information theoretic method employing Dirichlet mixture densities as priors over amino acid distributions to estimate posterior amino acid distributions in said model for said data represented by said node.
  - 26. The medium of claim 22, wherein determining minimized encoding cost is carried out by examining, for each sub-alignment corresponding to each subfamily, subfamily alignment for each sequence, and encoding for each position of said alignment.
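
Claims 5 and 25 estimate posterior amino acid distributions using Dirichlet mixture densities as priors. A minimal sketch of the standard posterior-mean estimate for one alignment column follows: each component's pseudocount estimate is weighted by that component's posterior probability given the observed counts. The function names and the `(weight, alpha)` mixture representation are assumptions for illustration, and real use would need a trained 20-letter mixture, which is not reproduced here.

```python
import math


def log_prob_counts(counts, alpha):
    """Natural log of P(counts | alpha) for one Dirichlet component."""
    n, a = sum(counts), sum(alpha)
    lp = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        lp += math.lgamma(c + al) - math.lgamma(al)
    return lp


def component_posteriors(counts, mixture):
    """P(component j | counts), proportional to weight_j * P(counts | alpha_j)."""
    logs = [math.log(w) + log_prob_counts(counts, alpha) for w, alpha in mixture]
    m = max(logs)
    unnormalized = [math.exp(value - m) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def posterior_distribution(counts, mixture):
    """Posterior mean residue probabilities for one column.

    p_i = sum_j P(j | counts) * (counts_i + alpha_ji) / (|counts| + |alpha_j|)
    """
    posteriors = component_posteriors(counts, mixture)
    n = sum(counts)
    estimate = [0.0] * len(counts)
    for pj, (_, alpha) in zip(posteriors, mixture):
        a = sum(alpha)
        for i, (c, al) in enumerate(zip(counts, alpha)):
            estimate[i] += pj * (c + al) / (n + a)
    return estimate
```

Because every pseudocount is positive, the estimated probabilities are strictly positive even for residues never observed in the column, which keeps relative-entropy comparisons between node profiles finite.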

27. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to determine cuts in a phylogenetic tree, representable as data arrayed to form multiple sequence alignments (MSA), that was agglomeratively formed such that encoding cost is measurable at every point in such agglomeration by:
- using Dirichlet mixture densities to assign probabilities to individual columns in said MSA to determine at which point in said agglomeration encoding cost is minimized;
  
  wherein a cut in said phylogenetic tree is identified.
- Dependent claims (28, 29, 30, 31, 32):
  - 28. The medium of claim 27, wherein determining minimized encoding cost is carried out using a position-by-position measure of likelihood of subfamily alignments, given said Dirichlet mixture densities.
  - 29. The medium of claim 27, wherein determining minimized encoding cost is carried out by examining, for each sub-alignment corresponding to each subfamily, subfamily alignment for each sequence, and encoding for each position of said alignment.
  - 30. The medium of claim 27, wherein determining minimized encoding cost is carried out by examining, for each sub-alignment corresponding to each subfamily, subfamily alignment for each sequence, and encoding for each position of said alignment.
  - 31. The medium of claim 27, wherein said phylogenetic tree classifies proteins.
  - 32. The medium of claim 27, wherein encoding costs are computed under a first hypothesis that each subfamily chooses a same component of a Dirichlet density, and are computed under a second hypothesis that each subfamily is permitted to choose a different component of a same Dirichlet density independently of other subfamily choices, and whichever of said first and second hypothesis provides minimum encoding cost is selected.
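
The two-hypothesis comparison of claims 10 and 32 admits more than one formalization. The sketch below shows one plausible reading for a single alignment column and should be treated as an assumption throughout. Under the first hypothesis, all subfamilies share one mixture component, so the component is summed out once over the product of the subfamily likelihoods. Under the second, each subfamily sums over components independently. All names are hypothetical.

```python
import math


def log_prob_counts(counts, alpha):
    """Natural log of P(counts | alpha) for one Dirichlet component."""
    n, a = sum(counts), sum(alpha)
    lp = math.lgamma(a) - math.lgamma(n + a)
    for c, al in zip(counts, alpha):
        lp += math.lgamma(c + al) - math.lgamma(al)
    return lp


def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of natural-log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cost_same_component(subfamily_counts, mixture):
    """Bits under hypothesis 1: one component generates every subfamily."""
    terms = [
        math.log(w) + sum(log_prob_counts(c, alpha) for c in subfamily_counts)
        for w, alpha in mixture
    ]
    return -logsumexp(terms) / math.log(2)


def cost_independent_components(subfamily_counts, mixture):
    """Bits under hypothesis 2: each subfamily picks its component freely."""
    total = 0.0
    for counts in subfamily_counts:
        terms = [math.log(w) + log_prob_counts(counts, alpha) for w, alpha in mixture]
        total -= logsumexp(terms) / math.log(2)
    return total


def choose_hypothesis(subfamily_counts, mixture):
    """Return (label, cost) for whichever hypothesis encodes the column cheaper."""
    same = cost_same_component(subfamily_counts, mixture)
    different = cost_independent_components(subfamily_counts, mixture)
    return ("same", same) if same <= different else ("different", different)
```

With a toy two-letter mixture whose components each favor one letter, a column in which two subfamilies are fixed for the same letter is cheaper under the first hypothesis, while a column in which they are fixed for different letters is cheaper under the second.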

33. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to construct at least one statistical model from data representing decomposition of a phylogenetic tree into subtrees produced using encoding cost measurements by:
- constructing a statistical model from said data, including constructing a statistical model for each subfamily in which subfamilies are equal in number to sequences in a subtree of said phylogenetic tree;
  
  wherein said statistical model is created in a position-dependent manner such that for every column in an alignment of data used as input to said system, making a measure whether all subfamilies have chosen a same or a different component of a Dirichlet density, and selecting a component-decision analysis providing a lower encoding cost outcome.
- Dependent claims (34, 35, 36, 37):
  - 34. The medium of claim 33, wherein said statistical model is a profile.
  - 35. The medium of claim 33, wherein said statistical model is a hidden Markov model.
  - 36. The medium of claim 35, wherein amino acid distribution is computed for a corresponding node of said hidden Markov model.
  - 37. The medium of claim 35, wherein:
    - each said statistical model is created using an estimation of independent counts for each subfamily; and
      
      said estimation accounts for subfamily alignment in its entirety to weigh subfamily sequences during parameter estimation of said hidden Markov model.

38. A computer-readable storage medium wherein is located a computer program that causes a computer system having a processor unit to provide position-by-position analysis from multiple sequence alignment (MSA) data, given a subfamily decomposition, by:
- providing, within each subfamily, said position-by-position analysis by computing average conservation across subfamilies at each position in said MSA.
- Dependent claims (39, 40, 41, 42, 43):
  - 39. The medium of claim 38, wherein general conservation is distinguished from sub-family specific conservation.
  - 40. The medium of claim 38, wherein Dirichlet mixture densities are used to identify positions at which subfamilies have chosen different components of said Dirichlet mixture density;
    - wherein so-identified positions denote variable physico-chemical constraints among subfamilies.
  - 41. The medium of claim 38, wherein positions are identified using a Bayesian and information theoretic analysis to yield minimum encoding costs.
  - 42. The medium of claim 40, wherein positions are identified using a Bayesian and information theoretic analysis to yield minimum encoding costs, wherein same components of said Dirichlet mixture density are used.
  - 43. The medium of claim 40, wherein positions are identified using a Bayesian and information theoretic analysis to yield minimum encoding costs, wherein different components of said Dirichlet mixture density are used.

Current Assignee: California Institute of Technology
Original Assignee: Regents of the University of California (University of California)
Inventors: Sjolander, Kimmen
Primary Examiner(s): Teska, Kevin J.
Assistant Examiner(s): Sergent, Douglas W.
Application Number: US09/006,924
Time in Patent Office: 993 Days
Field of Search: 395/500.32, 395/500.23, 702/19, 702/20, 702/27, 703/2, 703/11, 382/225, 382/228
US Class Current: 703/2
CPC Class Codes: G06F 17/18 for evaluating statistical ...
