Exact haplotype reconstruction of F2 populations

0Associated
Cases 
0Associated
Defendants 
0Accused
Products 
0Forward
Citations 
0
Petitions 
1
Assignment
First Claim
1. A system for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a modelfree setting, the system comprising:
 a memory;
a processor communicatively to the memory; and
a reconstruction circuit communicatively coupled to the memory and processor, the reconstruction circuit;
electronically communicating with at least one external information processing system;
electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment;
constructing, by based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies;
reducing computation time of the one or more information processing systems when reconstructing haplotypes from genotype data in a modelfree setting by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises;
electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers;
electronically accessing the set of progeny genotype data within the data structure;
identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprisingsplitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure;
determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and a second parental data structure;
constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and
programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at leastanalyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.
1 Assignment
0 Petitions
Accused Products
Abstract
A system for reconstructing haplotypes from genotype data includes a memory, a processor, and a reconstruction module. The reconstruction module is configured to access a set of progeny genotype data including n progenies encoded with m genetic markers. A first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies are identified based on at least the set of progeny genotype data. An agglomerate data structure including a collection of sets of haplotype sequences characterizing the n progenies is constructed based on the set of progeny genotype data and the first and second sets of parent haplotypes. Each set of haplotype sequences includes a number of crossovers equal to a total minimum number of observable crossovers in the n progenies.
9 Citations
No References
ALLELIC DETERMINATION  
Patent #
US 20100256917A1
Filed 06/13/2008

Current Assignee
Stephen James Leslie, Peter James Donnelly, Gilean Mcvean

Sponsoring Entity
Stephen James Leslie, Peter James Donnelly, Gilean Mcvean

Method and apparatus for deriving the genome of an individual  
Patent #
US 20080125978A1
Filed 10/11/2002

Current Assignee
International Business Machines Corporation

Sponsoring Entity
International Business Machines Corporation

System and method for cleaning noisy genetic data and determining chromosome copy number  
Patent #
US 20080243398A1
Filed 03/17/2008

Current Assignee
Natera Incorporated

Sponsoring Entity
Natera Incorporated

GENETIC VARIANTS OF HUMAN INOSITOL POLYPHOSPHATE4PHOSPHATASE, TYPE I (INPP4A) USEFUL FOR PREDICTION AND THERAPY OF IMMUNOLOGICAL DISORDER  
Patent #
US 20070243539A1
Filed 10/25/2006

Current Assignee
Council of Scientific Industrial Research

Sponsoring Entity
Council of Scientific Industrial Research

Process and system for developing a predictive model  
Patent #
US 7,043,461 B2
Filed 05/23/2001

Current Assignee
EGANMANAGED CAPITAL II L.P.

Sponsoring Entity
GENALYTICS INC.

Methods for obtaining and using haplotype data  
Patent #
US 6,931,326 B1
Filed 12/21/2001

Current Assignee
PGxHealth LLC

Sponsoring Entity
Genaissance Pharmaceuticals Incorporated

Linkage analysis using direct and indirect counting  
Patent #
US 20040138824A1
Filed 01/09/2003

Current Assignee
Regents of The University of Minnesota

Sponsoring Entity
Regents of The University of Minnesota

PRECISE SIMULATION OF PROGENY DERIVED FROM RECOMBINING PARENTS  
Patent #
US 20140136161A1
Filed 11/13/2012

Current Assignee
GlobalFoundries Inc.

Sponsoring Entity
GlobalFoundries Inc.

PRECISE SIMULATION OF PROGENY DERIVED FROM RECOMBINING PARENTS  
Patent #
US 20140136166A1
Filed 09/17/2013

Current Assignee
GlobalFoundries Inc.

Sponsoring Entity
International Business Machines Corporation

20 Claims
 1. A system for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a modelfree setting, the system comprising:
a memory; a processor communicatively to the memory; and a reconstruction circuit communicatively coupled to the memory and processor, the reconstruction circuit; electronically communicating with at least one external information processing system; electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment; constructing, by based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; reducing computation time of the one or more information processing systems when reconstructing haplotypes from genotype data in a modelfree setting by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises; electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers; electronically accessing the set of progeny genotype data within the data structure; identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprising splitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure; determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and a second parental data structure; constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at least analyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.  View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
 11. A computer program storage product for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a modelfree setting, the computer program storage product comprising instructions configured to perform a method comprising:
electronically communicating with at least one external information processing system; electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment; constructing, by based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; reducing computation time of the one or more information processing systems by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises; electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers; electronically accessing the set of progeny genotype data within the data structure; identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprising splitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure; determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and a second parental data structure; constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at least analyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.  View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
1 Specification
This application is a continuation of and claims priority from U.S. patent application Ser. No. 13/529,476, filed on Jun. 21, 2012, the entire disclosure of which is herein incorporated by reference.
The present invention generally relates to the field of computational biology, and more particularly relates to exact haplotype reconstruction of F2 populations.
An important question when studying the genetics of an individual is that of determining its haplotype organization, i.e., determining the parental origin of each region in the individual'"'"'s genome. When diploid organisms reproduce, crossovers frequently occur during meiosis. Therefore, progenies do not always receive complete copies of their parents'"'"' chromosomes. Instead, the genetic material inherited from a parent is often a combination of segments from the two chromosomes present in that parent, i.e. a combination of the two haplotypes of the parent (and similarly for material inherited from the other parent).
In practice, haplotypes are often constructed from genotype data. The genotype of an individual represents a sequence of unordered pairs of allele values associated with the diploid genome. Genotypes are often sampled as sequences of SNP (singlenucleotide polymorphism) marker values at locations spread across the genome. Determining the haplotypes for an individual at a given marker involves assigning each of the two unordered allele values to the correct parent from which it was inherited, and more precisely, to the correct haplotype of that parent.
Haplotype studies are extensively applied in the field of population genetics. For example, one can reconstruct so called ancestral founder haplotypes and represent the current population as a mosaic of those sequences. In another example, when the diploid parental genomes are known, one can separate them into haplotypes and represent the progenies as a mosaic of those haplotypes. The challenges of such applications include a) accurately collecting genomic data on the studied population, and b) constructing efficient and accurate algorithms for solving the associated combinatorially challenging problems.
In one embodiment, a system for reconstructing haplotypes from genotype data is disclosed. The system comprises a memory and a process communicatively coupled to the memory. A reconstruction module is communicatively coupled to the memory and processor. The reconstruction module is configured to perform a method comprising accessing a set of progeny genotype data comprising n progenies encoded with m genetic markers. A first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies are identified based on at least the set of progeny genotype data. A total minimum number of observable crossovers in the n progenies is determined. An agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies is constructed based on the set of progeny genotype data and the first and second sets of parent haplotypes. Each set of haplotype sequences includes a number of crossovers equal to the total minimum number of observable crossovers in the n progenies.
In another embodiment, a computer program storage product for reconstructing haplotypes from genotype data is disclosed. The computer program storage product comprises instructions configured to perform a method. The method comprises accessing a set of progeny genotype data comprising n progenies encoded with m genetic markers. Each of the m genetic markers comprises two values per individual. Also, a combination of each of the m genetic markers associated with a progeny represents a genotype sequence of the progeny. A first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies are identified based on at least the set of progeny genotype data. A total minimum number of observable crossovers in the n progenies is determined. An agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes is constructed based on the set of progeny genotype data and the first and second sets of parent haplotypes. Each set of progeny haplotype sequences comprises a number of crossovers equal to the total minimum number of observable crossovers in the n progenies.
The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention, in which:
Haplotype information can be extremely useful in determining the specific form of a gene that is associated with a phenotype such as disease susceptibility. Haplotype information can be used first to determine the genomic location of a gene associated with the phenotype and second to determine the exact forms (haplotypes) that are associated with the phenotype at that location. Haplotype information is also useful in the field of population genetics.
One problem with constructing haplotype information from genotype data is that given genotypes of n progenies on m loci, the general problem of constructing haplotypes from genotypes is NP complete under different models such as parsimony, maximum likelihood, and phylogeny. However, the duality of biallelic markers and F2 progenies (i.e., two founder sequences per haplotype), as is the case with plant breeding data, provides a compelling combinatoric puzzle. Rather than tweak generic software to adapt to the stringent conditions, embodiments of the present invention solve this problem more accurately and efficiently in a modelfree setting. This is achieved, in one embodiment, by seeking out a mathematical lower bound on the number of crossovers in the data. The focus is not just on a maximum likely or a most parsimonious solution, but on an “agglomerate” of all (equallylikely) feasible solutions. All the computations including the agglomerate can be achieved in linear time. Also, the level of preciseness of the solution is such that it not only permits study of statistical trends, such as haplotype distributions along the chromosome, association with QTLs and so on, but can also help handpick individual recombinants that can be independently confirmed through resequencing for further experimentations.
Operating Environment
As illustrated in
The bus 108 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus.
The system memory 106, in one embodiment, comprises a reconstruction module 109 configured to reconstruct haplotypes of F2 populations. As will be discussed in greater detail below, the reconstruction module 109 takes as input a set of gentoypes of n F2 progenies on m markers. Based on this input the reconstruction module 109 extracts all the equallylikely haplotypes of the progenies to produce an agglomerate structure. This agglomerate structure can be utilized by the reconstruction module 109 to compute the parameters of the distribution of each individual crossover, which makes further analysis of haplotype frequencies both more informative as well as accurate. Additionally, the reconstruction module 109 can visualize the solutions (e.g., a collection of haplotype sequences satisfying a lower bound) as mosaics of the ancestor haplotypes. It should be noted that even though
The system memory 106 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 110 and/or cache memory 112. The information processing system 102 can further include other removable/nonremovable, volatile/nonvolatile computer system storage media. By way of example only, a storage system 114 can be provided for reading from and writing to a nonremovable or removable, nonvolatile media such as one or more solid state disks and/or magnetic media (typically called a “hard drive”). A magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk such as a CDROM, DVDROM or other optical media can be provided. In such instances, each can be connected to the bus 108 by one or more data media interfaces. The memory 106 can include at least one program product having a set of program modules that are configured to carry out the functions of an embodiment of the present invention.
Program/utility 116, having a set of program modules 118, may be stored in memory 106 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 118 generally carry out the functions and/or methodologies of embodiments of the present invention.
The information processing system 102 can also communicate with one or more external devices 120 such as a keyboard, a pointing device, a display 122, etc.; one or more devices that enable a user to interact with the information processing system 102; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 102 to communicate with one or more other computing devices. Such communication can occur via I/O interfaces 124. Still yet, the information processing system 102 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 126. As depicted, the network adapter 126 communicates with the other components of information processing system 102 via the bus 108. Other hardware and/or software components can also be used in conjunction with the information processing system 102. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.
Haplotype Reconstruction
As discussed above, the reconstruction module 109 reconstructs haplotypes of F2 (the second filial generation resulting from the crossmating or selfmating of a first filial generation) progenies in a set of genotypes.
The following is a more detailed discussion on the haplotype reconstruction process shown in
The jth marker of the ith progeny, also referred to as the position ij, is which is a pair of SNPs (singlenucleotide polymorphisms), or alternatively a pair of any other genomic markers with two possible values per position. Each individual SNP can be accessed as (0) and (1). Thus if ={A,C} or simply written as AC, (0)=A and (1)=C. When (0)=(1), the position ij is said to be homozygous. Similarly, when (0)(1), the position ij is said to be heterozygous.
If d_{t }is the true number of crossovers, then 0≤d_{t}≤2(m−1)n and an estimate d_{c}≥0 is a lower bound of d_{t}, if it is no larger than d_{t}. Consider a scenario where each position ij in is heterozygous. In this case, the theoretical lower bound, on the number of crossovers in the unknown haplotypes of the progenies, is zero. Also there exists an exponentially large number (2^{mn}) of distinct equallylikely haplotype configurations of the progenies each with no crossover. While informative markers in general are chosen for their heterozygosity in data sets, it is observed that the same marker is also homozygous in many a progeny. Thus, in one embodiment, due to Mendelian inheritance, it is very unlikely to have a run of markers that are heterozygous in all the progenies. It is this random sprinkling of homozygous positions in that makes estimation of a nontrivial lower bound possible.
Simulations were conducted to study the relationship between the lower bound estimate d_{c }and extent of homozygosity in in realistic data scenarios. For an algorithmindependent estimation of d_{c }the following criteria was used for detectable crossovers. Let run r between ij and ij′ be the sequence of entries of in the row i, between columns j to j′. Further, r is a heterogenous run if it contains both homozygous and nonhomozygous values. For progeny i, let j_{1}<j_{2 }be such that there is no crossover at ij where j_{1}<j<j_{2 }and at least one of ij_{1} and ij_{2} is a crossover position. Without loss of generality let j_{1 }be a position of crossover, then the crossover at ij_{1} is considered detectable if the run between ij_{1} and ij_{2} is heterogenous. Then the total number of detectable crossovers is d_{c}, the estimate on the lower bound.
Let e denote the mean number of crossovers in a progeny. Simulations were used with values 2≤e≤15. It was observed that for this wide range of crossover profiles, the required fraction of homozygous sites in to get a good estimate of d_{t }(i.e., within 5% of the true value) was fixed. The empirical observation is summarized as follows: (Observation 1) When at least 28% of each subsample of randomly chosen positions in is homozygous, the estimated lower bound d_{c }is within 5% of the exact value d_{t}, i.e., d_{c}≥0.95d_{t}.
In practice, values of n in the range of 50 to 400 and m in the range of 30 to 600 can be encountered. It has been consistently observed that at least 50% of the positions are homozygous and the solutions, obtained using methods discussed below, displayed e≈2. With a decrease in cost and an increase in accessibility of sequencing and genotyping technologies, m can be orders of magnitude larger in the near future. Note that e is not expected to increase with m. Thus, the lower bound estimation scales well with m, since the observation above is independent of m and the algorithm discussed in the later sections is linear in m.
Since each marker is biallelic, each column in has no more than two distinct values. This can be seen in each column of the matrix shown in
When each haplotype is expected to have exactly two founders then a represents one parent and b the other. This condition holds when all the progenies are crossed from exactly one pair of parents. In particular, for a crossed population, V^{a}(0) encodes haplotype 0 of parent a and V^{a}(1) encodes haplotype 1 of parent a. Similarly, V^{b}(0) and V^{b}(1). A lower bound computation also phases the parents a and b, through the appropriate updating of V^{p}, p=a,b.
Let conjugate of z be written as {tilde over (z)}, where z is a matrix or a discrete value as defined below. Note that the conjugate of conjugate of z is z, i.e., z={tilde over ({tilde over (z)})} always holds. For parents p=a,b, if p=a, then {tilde over (p)}=b and viceversa. Similarly, if k=0, then {tilde over (k)}=1 and viceversa. Thus M^{a}={tilde over (M)}^{b }(and M^{b}={tilde over (M)}^{a}). Also, {tilde over (V)}^{p}(0)=V^{p}(1) and {tilde over (V)}^{p}(1)=V^{p}(0) for the parents p=a,b. It should be noted that for readability purposes the superscript a (or b) from M^{a }and V^{a }(or M^{b }and V^{b}) is removed wherever unambiguous. For a pair of markers j and j′ four counts, c_{k}_{1}_{k}_{2}, k_{1},k_{2}=0,1 are computed as:
c_{k}_{1}_{k}_{2}={iM_{ij}=k_{1 }and M_{ij′}=k_{2}}.
Let the four counts in decreasing order be c^{1}≥c^{2}≥c^{3}≥c^{4 }and let C_{jj′}={c^{l}1≤l≤4}. Without loss of generality, for each j, let j_{n×t }be the smallest j′>j such that Σ_{c∈C}_{jj′}c>0. Let xo(j, j_{n×t})=c^{3}+c^{4}. In fact, the minimal number of crossovers (or recombinations) between marker j and j_{n×t }is xo(j, j_{n×t}). Moreover, the following is easily verified: (Observation 2) Given genotype data on n progenies with m biallelic markers, the total number of crossovers in the progenies is ≥Σ_{j=1}^{m}xo(j, j_{n×t})=d_{c}.
In one embodiment, the lower bound can be improved by using appropriate threshold values at different stages. Thus, for a threshold, thrsh_{1}=0.3n, the marker j_{n×t }for a given j is such that the counts satisfy the condition Σc^{1}>thrsh_{1}. This modified definition where j′=j_{n×t }is used in the following discussion. Further, when the quartet of counts c^{1}, c^{2 }and c^{3}, c^{4 }are well separated, i.e., (c^{3}+c^{4})/(c^{1}+c^{2})<thrsh_{2}, say with thrsh_{2}=0.15, then C_{jj′} is referred to as being polarized. Let ={ZZ′}. Then V_{j}^{p}(1), p=a,b, can be computed based on the following observation: (Observation 3) If C_{jj′} is not polarized, then j∈J_{p }and j′∈J_{{tilde over (p)}}, for some p, where J_{p }is the set of markers with exactly one heterozygous parent p.
After V^{p }is updated each M_{ij}^{p}≠X is updated according to V^{p }and each M_{ij}^{p}=X based on
For the pair of markers j and j′, the two larger values (c^{1 }and c^{2}) are either c_{00 }and c_{11 }or c_{01 }and c_{10}. If {c^{1},c^{2}}={c_{01},c_{10}}, then the values V_{j′}(0) and V_{j′}(1) are updated by interchanging them. Recall that the number of crossovers between j and j′ is ≥(c^{3}+c^{4}).
For the pair of markers j and j′, C_{jj′} must be polarized (Observation 4). Hence, when the quartet c^{1}, c^{2 }and c^{3}, c^{4 }is not polarized, then it is reasonable to assume significant experimental errors in the genotyping of marker j′ (or j). To ensure that V_{j′} is not updated multiple times, the reconstruction module 109 begins with the leftmost j where both V_{j }are not homozygous. The marker j′ is picked as above and then at the next stage j is assigned j′ and the process continued until all j have been handled. In one embodiment, j′ on the other side of j can be selected, while taking care not to update V_{j′} multiple times. Furthermore, given , the haplotype matrices M^{a},M^{b }and the parent haplotypes V^{a},V^{b }can be constructed in (mn) time (Observation 5).
It should be noted that sometimes there may be missing values in the data. When the value is missing at position ij, the reconstruction module 109 makes the assignment ←XX. However, if after the computation of V, marker j is homozygous at parent p, then M_{ij}^{p}←H and M_{ij}^{{tilde over (p)}} is appropriately updated. It can be verified that this treatment ensures that the estimated lower bound, d_{c}, is unaltered.
For selfed progenies, not only are the parents the same but even the haplotypes of the parents are deemed identical. Given selfed progenies, one of the following conditions holds for p=a,b and a numeric k∈{0,1}:
V^{a}=V^{b }and if M_{ij}^{p}=k then M_{ij}^{{tilde over (p)}}={tilde over (k)}, for all i and j, or, (1)
V^{a}V^{b }and if M_{ij}^{p}=k then M_{ij}^{{tilde over (p)}}=k, for all i and j. (2)
Measure of Uncertainty
In one embodiment, the reconstruction module 109 determines the preciseness of the output given a genotype matrix . This preciseness measurement is quantified by a triplet of measures , , . is the deviation from the theoretical (algorithmindependent) lower bound. is the extent of dispersion in position of each crossover. is the error in the data. To compute these measures the reconstruction module 109 transforms the matrices M^{p }as follows.
A parent, p, is homozygous at marker j when V_{j}^{p}(0)=V_{j}^{p}(1) and is heterozygous otherwise. A position ij is a heterozygous trio, if the two parents at j are heterozygous and so is . Let k represent a numeric value and two functions Lt(⋅) and Rt(⋅) on an element of the matrix is defined as follows:
A set of transitions, S_→S_ that systematically transform the elements of M is now defined. Let M_{ij}=x. Then the transition S_{x}→S_ is applied to assign the value y to M_{ij }(written as M_{ij}←y) as follows.
In the above transitions e_{k }is a temporary value in matrix M that is not present in the final version of M (i.e., matrix F discussed below). For example, e_{0 }at M_{ij }denotes the case where the closest numerical value to the left and right of M_{ij }is 0 (i.e. haplotype 0 spans over this marker at individual i). Eventually e_{k }is stored as k in M_{ij }[in the S_{e}_{k}→S_{k }transition]. Also, w_{k}_{1}_{k}_{2 }is a nonnumerical value that can occur in the M and F matrices. The possible values are w_{01 }and w_{10}. For example, w_{01 }at M_{ij }represents the case where the closest numerical value to the left of M_{ij }is 0 and the closest numerical value to the right is 1 (there is a crossover at individual i whose location could be j). The values w_{k}_{1}_{k}_{2 }are considered in steps 3 and 4 (as nonnumerical values in F) discussed below with respect to the agglomerate solution R.
To estimate the running time of the algorithm, the transitions are classified as intratransitions and intertransitions, depending on whether M or {tilde over (M)} is used respectively. Thus,
is the only intertransition. The above transitions are applied to the elements of M^{a }and M^{b }till no more transition is applicable. This final form of matrix is written as F(M^{a}) or simply F^{a}. Similarly, F(M^{b}) or simply F^{b}. The permissible transitions are captured in the transition diagram shown in
Each element of F^{p }is in the final state, i.e., for each position ij, F_{ij}^{p}∈{0,1,w_{01},w_{10},−1}, p=a,b. A numeric value of −1 indicates that no information regarding the haplotypes can be deduced. The state transitions induce the following properties on F and V: (Observation 6) For a fixed i, the following hold:
 (1) If {tilde over (F)}_{ij }is not numeric but F_{ij }is, then V_{j}(0)=V_{j}(1) must hold.
 (2) If {tilde over (F)}_{ij }and F_{ij }are both not numeric, then ij, must be a heterozygous trio. The converse however is not true.
 (3) If F_{ij}=−1, for some j, then F_{ij}={tilde over (F)}_{ij}=−1, for all j.
(Observation 7)  (1) Given M^{a }and M^{b}, F^{a }and F^{b }are unique.
 (2) F^{a }and F^{b }can be constructed from M^{a }and M^{b }in (mn) time.
To show that F^{a }and F^{b }are unique, the iterations can be viewed as of two types: one where only intertransitions and the other where only intratransitions are applicable. In the very first iteration, due to the possible start states, no intertransition can be applied. In the second iteration, where no intratransition can be applied, the uniquely applicable intertransitions are applied. Thus the iterations alternate between intra and intertransitions. Hence the final forms F^{a }and F^{b }are unique. Next, since each entry can go through no more than three transitions, it is possible to obtain F^{a }and F^{b }from M^{a }and M^{b }in (mn) time using suitable list data structures.
With respect to the agglomerate of feasible solutions, suppose there are L>0 feasible distinct solutions for a given . The reconstruction module 109, in one embodiment, generates an agglomerate in a single pair of matrices R^{a }and R^{b }that captures all the L solutions. The following illustrates one embodiment how the reconstruction module 109 constructs the agglomerate (R^{a }and R^{b}) based on the encodings in F^{a }and F^{b }discussed above.
Let the conjugate of F be {tilde over (F)} defined as {tilde over (F)}^{a}=F^{b }and {tilde over (F)}^{b}=F^{a}. R_{ij}^{p }that encodes all the possible solutions, is constructed from F_{ij}^{p}, p=a,b, as follows. New symbols *, q and Q are used in addition to the numeric values (0 and 1) in R. Also, for readability purposes the superscript a and b is removed, wherever unambiguous. For each progeny i:
 1. If F_{ij}=−1 for some j, then without loss of generality R_{ij}←0 and {tilde over (R)}_{ij}←1, for all j.
 2. If F_{ij}∈{0,1}, then R_{ij}←F_{ij}.
 3. For each j, if {tilde over (F)}_{ij }is numeric and F_{ij }is not, then R_{ij}←q.
 4. For each j, where F_{ij }and {tilde over (F)}_{ij }are both not numeric, then
The nonnumeric values in R encode multiple solutions, possibly with multiple crossovers, as discussed below.
With respect to the deviation from the lower bound d_{c}, let r be a run of contiguous values in a row (progeny) of R. Further, let r be maximal i.e., if r is nonnumeric then it must be sandwiched on both sides by numeric values. Stated differently, if r is nonnumeric the first and last nonnumeric values of r are adjacent to a numeric value. Let r^{a }and r^{b }be the nonnumeric runs in the two haplotypes of the progeny and are aligned in the examples discussed below. It is possible that only one of r^{a }or r^{b }is a numeric run, and this run is referred to as asymmetric. However, if both are nonnumeric, then by construction r^{a}=r^{b }and the run is symmetric. In the following discussion, the superscripts are removed, wherever unambiguous, for readability purposes.
A switch is defined to occur between q and Q or between Q and the numeric value that sandwiches the run. The number of switches is written as sw(r). The switch detection performed by the reconstruction module 109 is described by the following two examples. In the examples shown below, the run is represented by the nonnumeric values in bold with numeric values to the left and right of the run. Each switch is marked as a numbered downarrow.
In a first example (Example 1) consider the following run:
 r=r^{a}=r^{b}= . . . 000qqQQqqQQ111 . . .
The number of switches is 4 as shown:  . . . 000qq↓_{1}QQ↓_{2}qq↓_{3}QQ↓_{4}111 . . .
The value * is a wild card and can be treated either as q or Q. Thus, the wild card may result in different positions of the same switch.
 r=r^{a}=r^{b}= . . . 000qqQQqqQQ111 . . .
In a second example (Example 2) the following run has three wild cards:
 r=r^{a}=r^{b}= . . . 000q*qQ*QQqq*Q111 . . .
The first wild card is treated as a q, the second wild card is treated as a Q, and the third wild card can be treated as a q or Q with two possible positions for the third switch.  . . . 000q*q↓_{1}Q*QQ↓_{2}qq↓_{3}*↓_{3′}Q↓_{4}111 . . .
The wild card counts of the four switches are 1, 1, 2, and 1 respectively.
 r=r^{a}=r^{b}= . . . 000q*qQ*QQqq*Q111 . . .
When sw(r)>0, the location of the switches in the run r may define additional positions ij in R that can updated from nonnumeric to numeric. These are the positions that are to the left of the leftmost switch and to the right of the rightmost switch positions. Thus, the run of Example 1 is transformed where the length of the nonnumeric run shrinks from 8 to 6.
Similarly the length of run of Example 2 shrinks from 11 to 9.
The transformed values are shown in bold numeric values above. The same process is applied to every nonnumeric run r to transform the matrices R^{a }and R^{b}.
Then (Observation 8) R is such that for each nonnumeric run r, if sw(r)>0, then (1) r begins and ends with Q and (2) the positions of the first and the last switch sandwich the nonnumeric run. To summarize: (Observation 9) Let r be a nonnumeric run and let s=sw(r). Then (1) the total number of crossovers in the run, across both the haplotypes, is exactly s and each crossover in each haplotype of the configuration is at a position of the switch. (2) If r is an asymmetric run, s is zero. In general, s is even and the number of crossovers in each haplotype of each distinct configuration is odd. However, is this not a requirement.
(Observation 10) Let s=sw(r) for a nonnumeric run r. Then if s≤2, there is no additional crossover introduced by r. If s>2, the number of additional crossovers is s−2. Let U be the set of all maximal nonnumeric runs in R. Then if denotes the deviation from the lower bound, d_{c}, the reconstruction module 109 calculates as
Regarding the dispersion index and error estimate, the dispersion index, , is a measure of the uncertainty in the position of the crossovers. This is computed by the reconstruction module 109 as an average over all predicted crossovers. Thus,
When the location of each crossover is known precisely, this index is 0. A value of 1 indicates maximum uncertainty in the location.
If R_{ij},{tilde over (R)}_{ij}∈{0,1}, then the position ij has a mismatch if {V(R_{ij}),{tilde over (V)}({tilde over (R)}_{ij})}. Note that the mismatch at this position could be flagged as an error or one of R_{ij }and {tilde over (R)}_{ij }can be modified to potentially introduce additional crossover(s). These are undetectable during the lower bound computation and do not overlap with the nonnumeric regions. Let N be the number of such mismatches. Then, , which is the error estimate in calculated by the reconstruction module 109, is defined as N/mn. In most instances the number of mismatches is very low (less than 0.01%) and the mismatches are experimental errors. Hence, the convention that such a mismatch be flagged as a potential error has been followed in one embodiment. Then, an error, if any, is at ij in F that underwent the transitions S_{X}→ . . . S_{e}_{k}→S_{k}. Note that the converse is not true. Also, an additional crossover, if any, is at ij in F that underwent the transitions
Again, the converse is not true. To summarize, the trio (, , ) define the uncertainty in the solution of the genotypes .
Haplotype Frequency Distributions
One of the important consequences of haplotype inferencing is the haplotype frequency distribution across the chromosomes. A marker j of progeny i has two SNPs, one derived either from haplotype 0 or haplotype 1 of parent a and the other either from haplotype 0 or haplotype 1 of parent b. Since R is an agglomerate data structure and also contains the nonnumeric values that encode multiple configurations. Based on the encoding in R the reconstruction module 109 estimates the expected value of the frequency count and its variance. First, the reconstruction module enumerates the feasible solutions encoded by a nonnumeric run r. For example, consider the following example (Example 3) where in each of the following the length of the run, len(r), is 3 and the number of additional crossovers is zero. However, the number of configurations is different based on the structure of the switches.
Two runs with sw(r)=0 each:
A run with sw(r)=2:
The 8 distinct configurations for Example 1 discussed above are:
Let wt(r) be the number of distinct solutions encoded by r. (Observation 11) Let s=sw(r) for a run r. Let c_{1}, c_{2}, . . . c_{s }be the wild card counts of the switch positions. Let wt(r) be the number of distinct feasible solutions due to r. Then
EQ. 3 is explained through the following example (Example 4). Consider the following r with sw(r)=6:
 000QqQqQ111⇒000←_{1}Q←_{2}q←_{3}Q←_{4}q←_{5}Q←_{6}111
Note that the 6 switches can be shared by the two haplotypes as 1 and 5 (the first column), or as 3 and 3 (second column) as follows.
 000QqQqQ111⇒000←_{1}Q←_{2}q←_{3}Q←_{4}q←_{5}Q←_{6}111
There are six distinct configurations for the first case and since the two haplotypes can be switched it gives 2×6 distinct configurations. For the second there are
distinct configuration, giving a total of 12+20 distinct solutions due to r.
Based on R, the reconstruction module 109 estimates the expected value of the frequency count of the haplotype pairs and its variance. If R_{ij }is nonnumeric, let R_{ij}=α hold. Thus, for each marker j consider the following, for x, y=0,1,α,
c_{xy}={iR_{ij}^{a}=x and R_{ij}^{b}=y}
In the absence of any other external information, each of the alternative solutions in R is equally likely. Under this assumption, the expected count, ĉ_{k}_{1}_{k}_{2}, of each haplotype pair, k_{1}, k_{2}=0,1, is estimated as follows:
ĉ_{k}_{1}_{k}_{2}=c_{k}_{1}_{k}_{2}+Δ_{k}_{1}_{k}_{2}, where
Δ_{k}_{1}_{k}_{2}=c_{αk}_{2}/2+c_{k}_{1}_{α}/2+c_{αα}/4.
Further, the variance σ_{k}_{1}_{,k}_{2}^{2 }of the count of the haplotype pair is approximated as Δ_{k}_{1}_{k}_{2}.
It should be noted that the reconstruction module 109, in one embodiment, outputs the equallylikely solutions in R as well as the haplotype distribution and variance information to a user. Additionally, the reconstruction module 109 can visualize the solutions as mosaics of the ancestor haplotypes. This information can be output to a display where the user can interact with the information.
Operational Flow Diagrams
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a nonexhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a readonly memory (ROM), an erasable programmable readonly memory (EPROM or Flash memory), an optical fiber, a portable compact disc readonly memory (CDROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user'"'"'s computer, partly on the user'"'"'s computer, as a standalone software package, partly on the user'"'"'s computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user'"'"'s computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention have been discussed above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to various embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.