Exact haplotype reconstruction of F2 populations

  • US 10,460,832 B2
  • Filed: 02/07/2013
  • Issued: 10/29/2019
  • Est. Priority Date: 06/21/2012
  • Status: Active Grant
First Claim

1. A system for generating an agglomerate data structure that reduces computation time of one or more information processing systems when reconstructing haplotypes from genotype data in a model-free setting, the system comprising:

  • a memory;

    a processor communicatively coupled to the memory; and

    a reconstruction circuit communicatively coupled to the memory and processor, the reconstruction circuit:

    electronically communicating with at least one external information processing system;

    electronically receiving, based on electronically communicating with the at least one external information processing system, a set of progeny genotype data, the set of progeny genotype data comprising n progenies encoded with m biallelic genetic markers, where each of the n progenies are full siblings, wherein each of the m biallelic genetic markers comprises two values, and wherein a combination of each of the m biallelic genetic markers associated with a progeny represents a genotype sequence of the progeny comprising at least a chromosome segment;

    constructing, based on the set of progeny genotype data, an agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies;

    reducing computation time of the one or more information processing systems when reconstructing haplotypes from genotype data in a model-free setting by constructing the agglomerate data structure in linear time, wherein constructing the agglomerate data structure in linear time comprises:

    electronically transforming the set of progeny genotype data into a data structure, the data structure encoding the set of progeny genotype data where the encoding represents each of n progenies in the set of progeny genotype data as one row in a plurality of rows of the data structure and represents each of m biallelic genetic markers in the set of progeny genotype data as one column in a plurality of columns of the data, wherein the encoding further orders the plurality of columns corresponding to an order of the m biallelic genetic markers;

    electronically accessing the set of progeny genotype data within the data structure;

    identifying, based on electronically accessing the set of progeny genotype data, a first set of parent haplotypes associated with a first parent of the n progenies and a second set of parent haplotypes associated with a second parent of the n progenies, the identifying comprising splitting the data structure into a first parental data structure and a second parental data structure by transforming a set of data representing genetic contributions for each of a first parent and a second parent of the n progenies and inferring any missing genotype data for one or more of at least one of the n progenies parents, the first parent, or the second parent wherein the transforming comprises utilizing a set of monotonic state transitions that systematically transform the set of data into the first parental data structure and the second parental data structure each comprising different data than the set of data, wherein the first parental data structure encodes haplotypes of a first parent of the n progenies and the second parental data structure encodes haplotypes of a second parent of the n progenies, where the first set of parent haplotypes is identified from the first parental data structure and the second set of parent haplotypes is identified from the second parental data structure;

    determining a total minimum number of observable crossovers in the n progenies based on data within the first parental data structure and a second parental data structure;

    constructing, based on the set of progeny genotype data and the first and second sets of parent haplotypes, the agglomerate data structure comprising a collection of sets of haplotype sequences characterizing the n progenies in terms of the first and second sets of parent haplotypes, wherein each set of haplotype sequences comprises a number of observable crossovers equal to the total minimum number of observable crossovers in the n progenies; and

    programming a processor of at least one information processing system utilizing the collection of sets of haplotype sequences from the agglomerate data structure to at least analyze statistical trends regarding haplotype distribution along chromosomes to observe biases, and determine a specific form of a gene that is associated with disease susceptibility.
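
For illustration only, two of the data-handling ideas recited in the claim can be sketched in Python: the row-per-progeny, column-per-marker encoding with columns kept in marker order, and a count of observable crossovers read as switches between parental haplotype labels along a row. This is a minimal sketch under stated assumptions, not the claimed method: the function names, the use of allele pairs such as (0, 1) for a biallelic marker, and the 0/1 origin labels are all hypothetical, and the monotonic state transitions and the minimization over candidate reconstructions described in the claim are not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

Genotype = Tuple[int, int]  # assumed encoding: the two allele values at a biallelic marker


def encode_genotypes(
    calls: Dict[str, Dict[str, Genotype]],
    marker_order: Sequence[str],
) -> Tuple[List[str], List[List[Genotype]]]:
    """Build a matrix with one row per progeny and one column per marker.

    Columns follow marker_order, so column position reflects marker order.
    Rows are sorted by progeny id only to make the result deterministic.
    """
    progeny_ids = sorted(calls)
    matrix = [[calls[p][m] for m in marker_order] for p in progeny_ids]
    return progeny_ids, matrix


def count_observable_crossovers(origin_rows: Sequence[Sequence[int]]) -> int:
    """Count label switches between adjacent markers, summed over all rows.

    Each row holds hypothetical 0/1 labels saying which of a parent's two
    haplotypes a progeny carries at each marker; a change between
    neighbouring columns is treated as one observable crossover.
    """
    return sum(
        sum(1 for left, right in zip(row, row[1:]) if left != right)
        for row in origin_rows
    )


# Hypothetical toy data: two progenies typed at two ordered markers.
toy_calls = {
    "p2": {"m1": (0, 1), "m2": (1, 1)},
    "p1": {"m1": (0, 0), "m2": (0, 1)},
}
ids, genotype_matrix = encode_genotypes(toy_calls, ["m1", "m2"])

# Hypothetical origin labels for three progenies at four markers:
# the rows contain 1, 0 and 2 switches respectively.
toy_origins = [[0, 0, 1, 1], [1, 1, 1, 1], [0, 1, 0, 0]]
total_switches = count_observable_crossovers(toy_origins)
```

Both functions visit each matrix cell once, so their cost grows linearly with n times m. That matches the scale of the linear-time construction the claim recites, although the sketch makes no attempt to find the minimum number of crossovers.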
