Exact haplotype reconstruction of F2 populations

US 10,460,832 B2
Filed: 02/07/2013
Issued: 10/29/2019
Est. Priority Date: 06/21/2012
Status: Active Grant

Abstract

A system for reconstructing haplotypes from genotype data includes a memory, a processor, and a reconstruction module. The reconstruction module is configured to access a set of progeny genotype data including n progenies encoded with m genetic markers. A first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies are identified based on at least the set of progeny genotype data. An agglomerate data structure including a collection of sets of haplotype sequences characterizing the n progenies is constructed based on the set of progeny genotype data and the first and second sets of parent haplotypes. Each set of haplotype sequences includes a number of crossovers equal to a total minimum number of observable crossovers in the n progenies.
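
The crossover criterion in the abstract can be illustrated with a short sketch. Assume, hypothetically, that one parental data structure has already been resolved into a matrix whose entry for progeny i and marker j says which of that parent's two haplotypes (0 or 1) was inherited, with `None` for unresolved entries. The names `observable_crossovers` and `total_crossovers` and this encoding are illustrative assumptions, not taken from the patent; the sketch only counts switches between haplotypes and does not perform the reconstruction itself.

```python
from typing import List, Optional

# Assumed encoding: 0 or 1 = which parental haplotype was inherited at a marker,
# None = unresolved entry.
Origin = Optional[int]


def observable_crossovers(row: List[Origin]) -> int:
    """Count switches between parental haplotypes along one progeny's ordered markers.

    Unresolved entries are skipped, so they never add a crossover.
    """
    count = 0
    previous = None
    for origin in row:
        if origin is None:
            continue
        if previous is not None and origin != previous:
            count += 1
        previous = origin
    return count


def total_crossovers(matrix: List[List[Origin]]) -> int:
    """Sum the observable crossovers over all progenies (rows)."""
    return sum(observable_crossovers(row) for row in matrix)


# Three progenies (rows), five ordered markers (columns), one parent.
parent_a = [
    [0, 0, 1, 1, 1],     # one switch
    [1, None, 1, 0, 0],  # one switch; the unresolved entry is skipped
    [0, 0, 0, 0, 0],     # no switch
]
print(total_crossovers(parent_a))  # 2
```

Skipping unresolved entries is one simple way to avoid counting crossovers that the data do not force, which is the spirit of a "minimum number of observable crossovers".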

20 Claims

1. A system for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a model-free setting, the system comprising:
- a memory;
  
  a processor communicatively coupled to the memory; and
  
  a reconstruction circuit communicatively coupled to the memory and processor, the reconstruction circuit:
  
  electronically communicating with at least one external information processing system;
  
  electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment;
  
  constructing, based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies;
  
  reducing computation time of the one or more information processing systems when reconstructing haplotypes from genotype data in a model-free setting by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises:
  
  electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data structure, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers;
  
  electronically accessing the set of progeny genotype data within the data structure;
  
  identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprising splitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent, wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure;
  
  determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and the second parental data structure;
  
  constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and
  
  programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at least analyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.
- - 2. The system of claim 1, wherein the method further comprises:
    - determining a preciseness of the agglomerate data structure in terms of statistical measures.
  - 3. The system of claim 1, wherein the data structure is a matrix $I$ comprising rows $i$ and columns $j$, wherein each row $i$, $1 \le i \le n$, represents a progeny in the n progenies and each column $j$, $1 \le j \le m$, represents a marker in the m biallelic genetic markers, wherein splitting the first data structure comprises:
    - constructing matrices $M_{ij}^{p}$, where
  - 4. The system of claim 3, wherein $V_j^{p}(1)$, $p=a,b$, is computed based on $j \in J_p$ and $j' \in J_p$, for some $p$, where $J_p$ is a set of markers with exactly one heterozygous parent, where for parents $p=a,b$, if $p=a$, then $\bar{p}=b$ and vice-versa, and where, if $k=0$, then $\tilde{k}=1$ and vice-versa, and where $M^{a}=\tilde{M}^{b}$ (and $M^{b}=\tilde{M}^{a}$), and $V^{p}(0)=V^{p}(1)$ and $V^{p}(1)=V^{p}(0)$ for parents $p=a,b$.
  - 5. The system of claim 4, further comprising updating each $M_{ij}^{p} \ne X$ according to $V^{p}$ and each $M_{ij}^{p} = X$ based on
  - 6. The system of claim 5, wherein the method further comprises:
    - transforming each of the updated matrices $M_{ij}^{p}$ to $F(M^{a})$ and $F(M^{b})$, respectively, using a set of monotonic state transitions based on functions $Lt(M_{ij})$ and $Rt(M_{ij})$, where
  - 7. The system of claim 6, wherein $R_{ij}^{p}$ is generated based on, for each progeny $i$:
    - if $F_{ij}=-1$ for some $j$, then $R_{ij} \leftarrow 0$ and $R_{ij} \leftarrow 1$, for all $j$;
    - if $F_{ij} \in \{0,1\}$, then $R_{ij} \leftarrow F_{ij}$;
    - for each $j$, if $F_{ij}$ is numeric and $F_{ij}$ is not, then $R_{ij} \leftarrow q$; and
    - for each $j$, where $F_{ij}$ and $F_{ij}$ are both not numeric, then
  - 8. The system of claim 7, wherein the method further comprises:
    - determining an error estimate $E_I$ in $I$ as $N/mn$, where $N$ is a number of genotype data mismatches in $R_{ij}^{p}$, where a position $ij$ has a mismatch if $\{V(R_{ij}), V(R_{ij})\} \ne I_{ij}$ and $R_{ij}, R_{ij} \in \{0,1\}$.
  - 9. The system of claim 7, wherein the method further comprises:
    - determining a haplotype frequency distribution for the collection of the set of haplotype sequences in the agglomerate data structure, wherein the determining is based on $k_1,k_2=0,1$ being a haplotype pair, and if $R_{ij}$ is non-numeric, let $R_{ij}=\alpha$ hold, where for each marker $j$ for $x,y=0,1,\alpha$, $c_{xy}=|\{i \mid R_{ij}^{a}=x \text{ and } R_{ij}^{b}=y\}|$, and an expected count $\hat{c}_{k_1 k_2}$ of each haplotype pair is $\hat{c}_{k_1 k_2}=c_{k_1 k_2}+\Delta_{k_1 k_2}$, where $\Delta_{k_1 k_2}=c_{\alpha k_2}/2+c_{k_1 \alpha}/2+c_{\alpha\alpha}/4$.
  - 10. The system of claim 9, wherein the method further comprises:
    - determining a variance $\sigma_{k_1 k_2}^{2}$ of haplotype frequency distribution as $\Delta_{k_1 k_2}$.
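
The expected-count formula in claims 9 and 10 can be worked through for a single marker. The following is a minimal sketch, assuming the two columns hold each progeny's resolved haplotype index (0 or 1) for parents a and b, and that any other value stands for a non-numeric entry and is treated as α. The function name `expected_pair_counts` and the input layout are hypothetical; only the arithmetic follows the claim text.

```python
from collections import Counter
from itertools import product

ALPHA = "α"  # stands for any non-numeric entry, as in claim 9


def expected_pair_counts(col_a, col_b):
    """Return {(k1, k2): (expected_count, delta)} for one marker.

    expected_count = c[k1,k2] + delta
    delta          = c[α,k2]/2 + c[k1,α]/2 + c[α,α]/4   (also used as the variance)
    """
    def normalize(value):
        return value if value in (0, 1) else ALPHA

    # c[x, y] = number of progenies with R^a = x and R^b = y, for x, y in {0, 1, α}
    c = Counter((normalize(x), normalize(y)) for x, y in zip(col_a, col_b))

    result = {}
    for k1, k2 in product((0, 1), repeat=2):
        delta = c[(ALPHA, k2)] / 2 + c[(k1, ALPHA)] / 2 + c[(ALPHA, ALPHA)] / 4
        result[(k1, k2)] = (c[(k1, k2)] + delta, delta)
    return result


# Five progenies at one marker; "X" marks a non-numeric entry.
counts = expected_pair_counts([0, 1, "X", 0, "X"], [1, 1, 0, "X", "X"])
for pair, (expected, delta) in sorted(counts.items()):
    print(pair, expected, delta)
# (0, 0) 1.25 1.25
# (0, 1) 1.75 0.75
# (1, 0) 0.75 0.75
# (1, 1) 1.25 0.25
```

In this example the four expected counts sum to 5, the number of progenies: each partially resolved progeny is split evenly over the haplotype pairs it is compatible with.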

11. A computer program storage product for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a model-free setting, the computer program storage product comprising instructions configured to perform a method comprising:
- electronically communicating with at least one external information processing system;
  
  electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment;
  
  constructing, based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies;
  
  reducing computation time of the one or more information processing systems by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises:
  
  electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data structure, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers;
  
  electronically accessing the set of progeny genotype data within the data structure;
  
  identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprising splitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent, wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure;
  
  determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and the second parental data structure;
  
  constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and
  
  programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at least analyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.
- - 12. The computer program storage product of claim 11, wherein the method further comprises:
    - determining a preciseness of the agglomerate data structure in terms of statistical measures.
  - 13. The computer program storage product of claim 11, wherein the data structure is a matrix $I$ comprising rows $i$ and columns $j$, wherein each row $i$, $1 \le i \le n$, represents a progeny in the n progenies and each column $j$, $1 \le j \le m$, represents a marker in the m biallelic genetic markers, wherein splitting the first data structure comprises:
    - constructing matrices $M_{ij}^{p}$, where
  - 14. The computer program storage product of claim 13, wherein $V_j^{p}(1)$, $p=a,b$, is computed based on $j \in J_p$ and $j' \in J_p$, for some $p$, where $J_p$ is a set of markers with exactly one heterozygous parent, where for parents $p=a,b$, if $p=a$, then $\bar{p}=b$ and vice-versa, and where, if $k=0$, then $\tilde{k}=1$ and vice-versa, and where $M^{a}=\tilde{M}^{b}$ (and $M^{b}=\tilde{M}^{a}$), and $V^{p}(0)=V^{p}(1)$ and $V^{p}(1)=V^{p}(0)$ for parents $p=a,b$.
  - 15. The computer program storage product of claim 14, further comprising updating each $M_{ij}^{p} \ne X$ according to $V^{p}$ and each $M_{ij}^{p} = X$ based on
  - 16. The computer program storage product of claim 15, wherein the method further comprises:
    - transforming each of the updated matrices $M_{ij}^{p}$ to $F(M^{a})$ and $F(M^{b})$, respectively, using a set of monotonic state transitions based on functions $Lt(M_{ij})$ and $Rt(M_{ij})$, where
  - 17. The computer program storage product of claim 16, wherein $R_{ij}^{p}$ is generated based on, for each progeny $i$:
    - if $F_{ij}=-1$ for some $j$, then $R_{ij} \leftarrow 0$ and $R_{ij} \leftarrow 1$, for all $j$;
    - if $F_{ij} \in \{0,1\}$, then $R_{ij} \leftarrow F_{ij}$;
    - for each $j$, if $F_{ij}$ is numeric and $F_{ij}$ is not, then $R_{ij} \leftarrow q$; and
    - for each $j$, where $F_{ij}$ and $F_{ij}$ are both not numeric, then
  - 18. The computer program storage product of claim 17, wherein the method further comprises:
    - determining an error estimate $E_I$ in $I$ as $N/mn$, where $N$ is a number of genotype data mismatches in $R_{ij}^{p}$, where a position $ij$ has a mismatch if $\{V(R_{ij}), V(R_{ij})\} \ne I_{ij}$ and $R_{ij}, R_{ij} \in \{0,1\}$.
  - 19. The computer program storage product of claim 17, wherein the method further comprises:
    - determining a haplotype frequency distribution for the collection of the set of haplotype sequences in the agglomerate data structure, wherein the determining is based on $k_1,k_2=0,1$ being a haplotype pair, and if $R_{ij}$ is non-numeric, let $R_{ij}=\alpha$ hold, where for each marker $j$ for $x,y=0,1,\alpha$, $c_{xy}=|\{i \mid R_{ij}^{a}=x \text{ and } R_{ij}^{b}=y\}|$, and an expected count $\hat{c}_{k_1 k_2}$ of each haplotype pair is $\hat{c}_{k_1 k_2}=c_{k_1 k_2}+\Delta_{k_1 k_2}$, where $\Delta_{k_1 k_2}=c_{\alpha k_2}/2+c_{k_1 \alpha}/2+c_{\alpha\alpha}/4$.
  - 20. The computer program storage product of claim 19, wherein the method further comprises:
    - determining a variance $\sigma_{k_1 k_2}^{2}$ of haplotype frequency distribution as $\Delta_{k_1 k_2}$.
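
The error estimate in claims 8 and 18 can be sketched in the same way. This is an illustration under stated assumptions, not the patented procedure: `I[i][j]` is taken to be the observed genotype as a pair of allele values, `Ra` and `Rb` hold each progeny's resolved haplotype index per parent, `Va[j][k]` and `Vb[j][k]` give the allele carried at marker j by haplotype k of each parent, and N/mn is read as N divided by the product of m and n. All of these names and layouts are hypothetical.

```python
def error_estimate(I, Ra, Rb, Va, Vb):
    """Fraction of the n*m positions whose reconstructed genotype disagrees with I.

    A position counts as a mismatch only if both haplotype indices are
    resolved (0 or 1) and the implied unordered allele pair differs from I[i][j].
    """
    n, m = len(I), len(I[0])
    mismatches = 0
    for i in range(n):
        for j in range(m):
            ra, rb = Ra[i][j], Rb[i][j]
            if ra in (0, 1) and rb in (0, 1):
                predicted = sorted((Va[j][ra], Vb[j][rb]))
                if predicted != sorted(I[i][j]):
                    mismatches += 1
    return mismatches / (m * n)


# Two progenies, two markers; "X" marks an unresolved haplotype index.
Va = [(0, 1), (1, 0)]  # alleles of parent a's haplotypes 0 and 1, per marker
Vb = [(0, 1), (0, 1)]  # alleles of parent b's haplotypes 0 and 1, per marker
I = [[(0, 1), (1, 1)],
     [(0, 0), (0, 1)]]
Ra = [[0, 0], [0, "X"]]
Rb = [[1, 1], [1, 0]]
print(error_estimate(I, Ra, Rb, Va, Vb))  # 0.25
```

Here only progeny 2 at marker 1 disagrees with the observed genotype, and the unresolved position is ignored, so one mismatch over four positions gives 0.25.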

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Haiminen, Niina S.; Parida, Laxmi P.; Utro, Filippo
Primary Examiner(s): Woitach, Joseph
Application Number: US13/761,730
Publication Number: US 20130345987A1
Time in Patent Office: 2,455 Days
Field of Search: None

CPC Class Codes

G06F 17/16   Matrix or vector computatio...
G16B 20/00   ICT specially adapted for f...
G16B 20/20   Allele or variant detection...
G16B 45/00   ICT specially adapted for b...
