
Phasing of unphased genotype data

  • US 9,836,576 B1
  • Filed: 03/13/2013
  • Issued: 12/05/2017
  • Est. Priority Date: 11/08/2012
  • Status: Active Grant
First Claim

1. A computer-implemented method for performing out-of-sample phasing of unphased genotype data of a chromosome pair of a first individual, comprising:

  • under control of one or more computer systems configured with executable instructions,

    (a) providing a predetermined reference haplotype graph generated from phased genotype data for a set of L different polymorphic genetic markers of a chromosome pair of a plurality of reference individuals,

    wherein each different polymorphic genetic marker of the set of L different polymorphic genetic markers is located at an associated polymorphic locus on each chromosome of the chromosome pair,

    wherein L is an integer, the chromosome pair is one pair of human autosomal chromosomes or one pair of human X chromosomes, the plurality of reference individuals comprises at least 100,000 individuals, and the first individual is not included in the plurality of reference individuals, and

    wherein the predetermined reference haplotype graph comprises;

    a plurality of nodes organized into L+1 levels, the plurality of nodes comprising a first node, a plurality of intermediate nodes, and a terminal node, and

    a plurality of edges, each edge of the plurality of edges connecting two nodes of the plurality of nodes, wherein all edges that emanate from a node at a first level lead to one or more nodes at a second, next successive, level and represent one polymorphic locus at a first location on each chromosome of the chromosome pair of the plurality of reference individuals, and all edges that emanate from the one or more nodes at the second, next successive, level represent one polymorphic locus at a second location on each chromosome of the chromosome pair of the plurality of reference individuals, the second location being different from the first location and following successively the first location on each chromosome of the chromosome pair,

    wherein each edge has an associated probability of a particular allele being present at the one polymorphic locus of the chromosome pair of the plurality of reference individuals represented by each such edge;

    (b) receiving unphased genotype data of the first individual for the chromosome pair, the unphased genotype data comprising unphased genotype data for the L different polymorphic genetic markers of the set of L different polymorphic genetic markers; and

    (c) performing out-of-sample phasing on the unphased genotype data of the chromosome pair of the first individual received in (b) using the predetermined reference haplotype graph, wherein performing out-of-sample phasing comprises performing dynamic programming which comprises;

    (1) searching the predetermined reference haplotype graph for a plurality of possible paths through the predetermined reference haplotype graph, each possible path representing a possible haplotype for a chromosome of the chromosome pair of the first individual given the unphased genotype data of the first individual received in (b), wherein each possible path begins on the first node, ends on the terminal node, traverses intermediate nodes and edges between the first node and terminal node, and does not traverse any node more than once, and wherein a probability of each possible path is based on the associated probabilities of all edges in that possible path; and

    (2) identifying two possible paths of the plurality of possible paths of (c)(1) for which (i) a combination of alleles present at each of the polymorphic loci represented by the identified two paths is consistent with alleles present at each of the corresponding polymorphic loci of the unphased genotype data of the chromosome pair of the first individual and (ii) a product of the probability of each of the identified two possible paths having a combination of alleles as recited in (i) is greater than a product of the probability of each of any other two possible paths having the combination of alleles as recited in (i), wherein the identified two possible paths represent a most likely pair of haplotypes for the chromosome pair of the first individual,

    whereby the unphased genotype data of the chromosome pair of the first individual is phased.
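
For illustration only, the dynamic programming recited in steps (c)(1) and (c)(2) can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation described in the patent. It assumes biallelic markers coded 0/1, an unphased genotype coded as the count of 1-alleles per locus (0, 1, or 2), and a graph stored as a dictionary mapping each node to its outgoing edges. The names `phase_with_graph`, `TOY_EDGES`, and the node labels are hypothetical. The search tracks one node per haplotype at each level and keeps only the best-scoring pair of partial paths for each pair of nodes, so the pair of full paths with the greatest probability product is found without enumerating every path.

```python
import math

# Hypothetical toy graph with L = 3 loci, hence L + 1 = 4 levels.
# Each node maps to a list of (next_node, allele, probability) edges.
TOY_EDGES = {
    "s": [("a", 0, 0.6), ("b", 1, 0.4)],   # level 0 -> level 1 (locus 1)
    "a": [("c", 0, 0.9), ("d", 1, 0.1)],   # level 1 -> level 2 (locus 2)
    "b": [("c", 0, 0.2), ("d", 1, 0.8)],
    "c": [("t", 0, 0.7), ("t", 1, 0.3)],   # level 2 -> level 3 (locus 3)
    "d": [("t", 0, 0.5), ("t", 1, 0.5)],
}


def phase_with_graph(edges, genotype, first="s", terminal="t"):
    """Return (hap1, hap2, probability product) for the most likely pair of
    paths consistent with the unphased genotype, or None if no pair fits."""
    # State: a pair of nodes, one per haplotype. Value: the best log
    # probability of reaching that pair, plus the alleles chosen so far.
    best = {(first, first): (0.0, (), ())}
    for g in genotype:                      # one locus per graph level
        nxt = {}
        for (u, v), (score, h1, h2) in best.items():
            for nu, a1, p1 in edges.get(u, []):
                for nv, a2, p2 in edges.get(v, []):
                    # Consistency check: the two alleles must add up to
                    # the observed unphased genotype at this locus.
                    if a1 + a2 != g or p1 <= 0 or p2 <= 0:
                        continue
                    cand = score + math.log(p1) + math.log(p2)
                    if (nu, nv) not in nxt or cand > nxt[(nu, nv)][0]:
                        nxt[(nu, nv)] = (cand, h1 + (a1,), h2 + (a2,))
        best = nxt
    result = best.get((terminal, terminal))
    if result is None:
        return None
    score, h1, h2 = result
    return h1, h2, math.exp(score)
```

Calling `phase_with_graph(TOY_EDGES, [1, 1, 1])` on the toy graph resolves an individual who is heterozygous at all three loci into the haplotypes (0, 0, 0) and (1, 1, 1); which of the two is reported first is arbitrary because the pair is unordered. Log probabilities are summed rather than multiplying raw probabilities, a common choice that avoids numerical underflow when L is large.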
