Systems and methods for encoding genetic variation for a population
First Claim
1. A method of encoding variation data for a population, comprising using at least one computer hardware processor connected to at least one non-transitory computer-readable storage medium to perform:
receiving information describing genetic variation of a population of individuals, the information comprising a plurality of variable sites within a reference genome of the population and a plurality of genotypes of a plurality of individuals in the population with respect to those variable sites;
determining a prevalence for each variable site within the population, wherein the prevalence comprises the frequency at which alternative alleles of a given variable site occur in the population;
selecting an encoding strategy for each of the plurality of variable sites based on the determined prevalence of each variable site across the population, wherein if the prevalence for a variable site is less than 10%, less than 5%, less than 1%, or less than 0.1% of the population, the encoding strategy is a compression encoding strategy, and otherwise the encoding strategy is a bit field encoding strategy;
encoding the information according to the encoding strategy selected for each of the plurality of variable sites; and
storing the encoded information in the at least one non-transitory computer-readable storage medium.
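The prevalence-based selection step recited above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes, purely for illustration, that each genotype is reduced to 0 (reference) or 1 (carries an alternative allele), that prevalence is the fraction of individuals carrying an alternative allele, and that a single threshold (1% here) is in use. The names `prevalence`, `select_strategy`, and `select_strategies` are hypothetical.

```python
from typing import Dict, List


def prevalence(genotypes: List[int]) -> float:
    """Fraction of individuals carrying an alternative allele.

    Assumption: each genotype is 0 (reference) or non-zero (alternative).
    """
    if not genotypes:
        raise ValueError("a variable site needs at least one genotype")
    return sum(1 for g in genotypes if g) / len(genotypes)


def select_strategy(genotypes: List[int], threshold: float = 0.01) -> str:
    """Pick a compression strategy for rare sites, a bit field otherwise."""
    if prevalence(genotypes) < threshold:
        return "compression"
    return "bit_field"


def select_strategies(
    sites: Dict[str, List[int]], threshold: float = 0.01
) -> Dict[str, str]:
    """Select one strategy per variable site (site keys are hypothetical)."""
    return {site: select_strategy(g, threshold) for site, g in sites.items()}


if __name__ == "__main__":
    example_sites = {
        "site_rare": [0] * 999 + [1],   # 0.1% of individuals carry the variant
        "site_common": [0, 1] * 500,    # 50% of individuals carry the variant
    }
    print(select_strategies(example_sites))
```

With the 1% threshold assumed here, the rare site would be routed to the compression strategy and the common site to the bit field strategy.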
Abstract
In one embodiment, a method of encoding variation data for a population comprises receiving, by a variant encoding engine executing on a processor, information describing genetic variation of a population of individuals. The information comprises a plurality of variable sites within the reference genome of the population and the genotypes of a plurality of individuals in the population with respect to those variable sites. The method further comprises selecting an encoding strategy for the information based on the characteristics of the genetic variation across the population, and encoding the information according to the selected encoding strategy. In certain embodiments, selecting an encoding strategy may comprise determining the variability of a variable site within the population, and encoding information associated with the variable site based on the variability.
11 Claims
1. A method of encoding variation data for a population, comprising using at least one computer hardware processor connected to at least one non-transitory computer-readable storage medium to perform:
receiving information describing genetic variation of a population of individuals, the information comprising a plurality of variable sites within a reference genome of the population and a plurality of genotypes of a plurality of individuals in the population with respect to those variable sites;
determining a prevalence for each variable site within the population, wherein the prevalence comprises the frequency at which alternative alleles of a given variable site occur in the population;
selecting an encoding strategy for each of the plurality of variable sites based on the determined prevalence of each variable site across the population, wherein if the prevalence for a variable site is less than 10%, less than 5%, less than 1%, or less than 0.1% of the population, the encoding strategy is a compression encoding strategy, and otherwise the encoding strategy is a bit field encoding strategy;
encoding the information according to the encoding strategy selected for each of the plurality of variable sites; and
storing the encoded information in the at least one non-transitory computer-readable storage medium.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for encoding variation data for a population, comprising:
a memory, storing information describing genetic variation of a population of individuals, the information comprising a plurality of variable sites within a reference genome of the population and a plurality of genotypes of the plurality of individuals in the population with respect to those variable sites; and
a processor configured to:
for each variable site in the population:
determine the variability of the variable site in the population; and
encode the information associated with the variable site based on the frequency of alternate alleles of the variable site occurring in the population, wherein a variable site having more than 10%, more than 5%, more than 1%, or more than 0.1% frequency of alternate alleles is encoded using a bit field encoding strategy, and otherwise the variable site is encoded using a run length encoding strategy; and
store the encoded information in the memory.
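To make the two strategies named in claim 10 concrete, the following sketch shows one possible bit field encoder and one possible run length encoder for a single variable site. It is an illustration under stated assumptions, not the claimed system: genotypes are assumed to be reduced to 0/1 per individual, the bit field packs one bit per individual (least significant bit first), runs are stored as `(value, length)` pairs, and the 1% threshold is only one of the values the claim lists. All function names are hypothetical.

```python
from typing import List, Tuple, Union

Run = Tuple[int, int]  # (allele bit, run length); an assumed layout


def encode_bit_field(genotypes: List[int]) -> bytes:
    """Pack one bit per individual, least significant bit first."""
    out = bytearray((len(genotypes) + 7) // 8)
    for i, g in enumerate(genotypes):
        if g:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def encode_run_length(genotypes: List[int]) -> List[Run]:
    """Collapse consecutive identical genotypes into (value, length) runs."""
    runs: List[Run] = []
    for g in genotypes:
        bit = 1 if g else 0
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


def encode_site(
    genotypes: List[int], threshold: float = 0.01
) -> Tuple[str, Union[bytes, List[Run]]]:
    """Use a bit field above the threshold, run length encoding otherwise."""
    if not genotypes:
        raise ValueError("a variable site needs at least one genotype")
    frequency = sum(1 for g in genotypes if g) / len(genotypes)
    if frequency > threshold:
        return "bit_field", encode_bit_field(genotypes)
    return "run_length", encode_run_length(genotypes)


if __name__ == "__main__":
    print(encode_site([0] * 999 + [1]))  # rare variant
    print(encode_site([1, 0, 1, 0]))     # common variant
```

Under these assumptions, a rare variant collapses into a handful of runs, while a common variant is stored at a fixed cost of one bit per individual.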
11. A method of encoding variation data for a population, comprising using at least one computer hardware processor connected to at least one non-transitory computer-readable storage medium to perform:
receiving information describing genetic variation of a population of individuals, the information comprising a plurality of variable sites within a reference genome of the population and a plurality of genotypes of a plurality of individuals in the population with respect to those variable sites; and
for each variable site:
calculating a first number of bits required to encode the variable site and its associated genotypes according to a run length encoding strategy, wherein calculating the first number of bits required to encode the variable site according to the run length encoding strategy comprises calculating the number of run length entries required to encode the variable site;
calculating a second number of bits required to encode the variable site and its associated genotypes according to a bit field encoding strategy;
comparing the first number of bits and second number of bits; and
encoding, based on the comparison, the variable site and its corresponding genotypes using either the run length encoding strategy or the bit field encoding strategy, wherein the variable site is encoded using the bit field encoding strategy if the first number of bits exceeds the second number of bits, and otherwise the variable site is encoded using the run length encoding strategy.
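The bit-count comparison recited in claim 11 can be sketched as follows. The cost model is an assumption made for illustration only: each run length entry is taken to cost a fixed 32 bits, and the bit field is taken to cost one bit per individual. The claim itself does not fix these sizes. Genotypes are again assumed to be reduced to 0/1, and the function names are hypothetical.

```python
from typing import List


def count_runs(genotypes: List[int]) -> int:
    """Number of run length entries needed for the site."""
    runs = 0
    previous = None
    for g in genotypes:
        bit = 1 if g else 0
        if bit != previous:
            runs += 1
            previous = bit
    return runs


def run_length_bits(genotypes: List[int], bits_per_entry: int = 32) -> int:
    """First number of bits: entries times an assumed fixed entry size."""
    return count_runs(genotypes) * bits_per_entry


def bit_field_bits(genotypes: List[int], bits_per_genotype: int = 1) -> int:
    """Second number of bits: an assumed fixed cost per individual."""
    return len(genotypes) * bits_per_genotype


def choose_encoding(genotypes: List[int], bits_per_entry: int = 32) -> str:
    """Bit field only if run length encoding would need strictly more bits."""
    first = run_length_bits(genotypes, bits_per_entry)
    second = bit_field_bits(genotypes)
    return "bit_field" if first > second else "run_length"


if __name__ == "__main__":
    print(choose_encoding([0] * 999 + [1]))  # 2 runs: 64 bits vs 1000 bits
    print(choose_encoding([0, 1] * 500))     # 1000 runs: 32000 bits vs 1000 bits
```

Note that a tie goes to run length encoding, matching the "exceeds ... and otherwise" wording of the claim.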
Specification