Basecalling for stochastic sequencing processes
First Claim
1. A method of using a sequencing cell, the method comprising:
obtaining a first set of signal values measured from a nucleic acid over a first time interval for a sequencing cell, wherein the first set of signal values includes measurements for each of four cell states of the sequencing cell, the four cell states corresponding to different types of nucleotides;
creating a first histogram of the first set of signal values, the first histogram being a data structure storing a plurality of counts, each count corresponding to a number of signal values within a bin, each bin of the first histogram corresponding to different numerical values;
for each cell state of the four cell states:
determining a probability function that assigns emission probabilities of being in the cell state to the different numerical values, the probability function determined using the plurality of counts for the bins of the first histogram;
determining a transition matrix providing pairwise transition probabilities between four nucleotide states of the nucleic acid, the four nucleotide states corresponding to the different types of nucleotides;
creating a trellis diagram over T time steps, each time step corresponding to one signal value of the first set of signal values, wherein the trellis diagram at a given time step includes the four nucleotide states, each having an emission probability determined using a probability function of a corresponding cell state, and wherein nucleotide states at one time step are connected to nucleotide states at a next time step in accordance with the pairwise transition probabilities;
determining an optimal path through the trellis diagram based on the emission probabilities and the pairwise transition probabilities to identify a nucleotide state at each time step;
determining bases comprising a sequence of the nucleic acid using the nucleotide states at the T time steps; and
providing the sequence of the nucleic acid.
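The trellis and optimal-path steps recited above resemble Viterbi decoding of a hidden Markov model. The following is a minimal sketch under stated assumptions, not the patented implementation: the Gaussian emission functions, the per-state level means, the self-transition probability of 0.9, and the run-collapsing rule for turning states into bases are all illustrative choices that do not come from the claim.

```python
import math

STATES = "ACGT"

def gaussian_logpdf(x, mean, std):
    """Log of a normal density; stands in for a fitted emission function."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state index at each time step.

    log_emissions[t][s]: log emission probability of signal t under state s
    log_trans[r][s]:     log probability of moving from state r to state s
    log_init[s]:         log probability of starting in state s
    """
    n_states = len(log_init)
    # score[s] is the best log-probability of any path ending in state s
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emissions)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emissions[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def collapse_runs(path):
    """Toy base-calling rule (assumed): merge consecutive identical states.

    This cannot resolve homopolymers; it only illustrates the last step.
    """
    calls = []
    for s in path:
        if not calls or calls[-1] != s:
            calls.append(s)
    return "".join(STATES[s] for s in calls)

# Hypothetical model parameters (assumptions, not from the source)
means = [1.0, 2.0, 3.0, 4.0]   # signal level for A, C, G, T
std = 0.2
stay = 0.9                     # probability of remaining in the same state
log_trans = [[math.log(stay if r == s else (1 - stay) / 3) for s in range(4)]
             for r in range(4)]
log_init = [math.log(0.25)] * 4

signal = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9, 2.0, 2.1, 4.0, 3.9]
log_emissions = [[gaussian_logpdf(x, m, std) for m in means] for x in signal]
path = viterbi(log_emissions, log_trans, log_init)
calls = collapse_runs(path)
print(path, calls)
```

Working in log space avoids underflow when T is large, which is why the sketch adds log-probabilities rather than multiplying probabilities.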
Abstract
Techniques for measuring sequences of nucleic acids are provided. Time-based measurements (e.g., forming a histogram) particular to a given sequencing cell can be used to generate a tailored model. The model can include probability functions, each corresponding to different states (e.g., different states of a nanopore). Such probability functions can be fit to a histogram of measurements obtained for that cell. The probability functions can be updated over a sequencing run of the nucleic acid so that drifts in physical properties of the sequencing cell can be compensated. A hidden Markov model can use such probability functions as emission probabilities for determining the most likely nucleotide states over time. For sequencing cells involving a polymerase, a 2-state classification between bound and unbound states of the polymerase can be performed. The bound regions can be further analyzed by a second classifier to distinguish between states corresponding to different bound nucleotides.
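The two-stage classification mentioned at the end of the abstract can be illustrated with a small sketch. The abstract does not say which classifiers are used, so both stages here are stand-ins: stage 1 is a simple threshold around an assumed unbound baseline level, and stage 2 assigns each bound region to the nearest of four hypothetical nucleotide levels. The names `baseline`, `tolerance`, and `levels` and all numeric values are assumptions for illustration only.

```python
def split_bound_regions(signal, baseline, tolerance):
    """Stage 1: mark samples far from the baseline as bound.

    Returns a list of (start, end) index pairs, end exclusive, one per bound run.
    """
    regions, start = [], None
    for i, x in enumerate(signal):
        bound = abs(x - baseline) > tolerance
        if bound and start is None:
            start = i
        elif not bound and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(signal)))
    return regions

def classify_region(signal, region, levels):
    """Stage 2: pick the nucleotide whose level is closest to the region mean."""
    start, end = region
    mean = sum(signal[start:end]) / (end - start)
    return min(levels, key=lambda base: abs(levels[base] - mean))

# Hypothetical values (assumptions, not from the source)
baseline, tolerance = 1.0, 0.1
levels = {"A": 0.8, "C": 0.6, "G": 0.4, "T": 0.2}
signal = [1.0, 1.02, 0.61, 0.59, 0.6, 0.98, 1.0,
          0.21, 0.2, 0.19, 1.01, 0.79, 0.8, 1.0]

regions = split_bound_regions(signal, baseline, tolerance)
calls = "".join(classify_region(signal, r, levels) for r in regions)
print(regions, calls)
```

Separating the stages means the second classifier only ever sees samples already labelled as bound, so it does not need a fifth "unbound" class.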
77 Citations
19 Claims
1. A method of using a sequencing cell, the method comprising:
obtaining a first set of signal values measured from a nucleic acid over a first time interval for a sequencing cell, wherein the first set of signal values includes measurements for each of four cell states of the sequencing cell, the four cell states corresponding to different types of nucleotides;
creating a first histogram of the first set of signal values, the first histogram being a data structure storing a plurality of counts, each count corresponding to a number of signal values within a bin, each bin of the first histogram corresponding to different numerical values;
for each cell state of the four cell states:
determining a probability function that assigns emission probabilities of being in the cell state to the different numerical values, the probability function determined using the plurality of counts for the bins of the first histogram;
determining a transition matrix providing pairwise transition probabilities between four nucleotide states of the nucleic acid, the four nucleotide states corresponding to the different types of nucleotides;
creating a trellis diagram over T time steps, each time step corresponding to one signal value of the first set of signal values, wherein the trellis diagram at a given time step includes the four nucleotide states, each having an emission probability determined using a probability function of a corresponding cell state, and wherein nucleotide states at one time step are connected to nucleotide states at a next time step in accordance with the pairwise transition probabilities;
determining an optimal path through the trellis diagram based on the emission probabilities and the pairwise transition probabilities to identify a nucleotide state at each time step;
determining bases comprising a sequence of the nucleic acid using the nucleotide states at the T time steps; and
providing the sequence of the nucleic acid.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
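The histogram and probability-function steps of claim 1 can be sketched as follows. The claim does not prescribe a functional form or a fitting procedure, so this example assumes one Gaussian per cell state and fits the four Gaussians to the bin counts with expectation-maximization on the bin centers. The synthetic signal, the bin range, and the starting guesses are all assumptions made for illustration.

```python
import math
import random

def make_histogram(values, lo, hi, n_bins):
    """Count signal values per equal-width bin; returns (bin centers, counts)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def fit_states(centers, counts, means, stds, n_iter=50, min_std=1e-3):
    """Fit one Gaussian per state to the histogram counts (assumed EM scheme)."""
    k = len(means)
    weights = [1.0 / k] * k
    means, stds = list(means), list(stds)
    total_count = sum(counts)
    for _ in range(n_iter):
        # E-step: share of each bin attributed to each state
        resp = []
        for x in centers:
            p = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each state from count-weighted bin centers
        for j in range(k):
            mass = sum(c * r[j] for c, r in zip(counts, resp))
            if mass == 0:
                continue
            means[j] = sum(c * r[j] * x for c, r, x in zip(counts, resp, centers)) / mass
            var = sum(c * r[j] * (x - means[j]) ** 2
                      for c, r, x in zip(counts, resp, centers)) / mass
            stds[j] = max(math.sqrt(var), min_std)
            weights[j] = mass / total_count
    return weights, means, stds

# Synthetic signal with four hypothetical levels (assumption, not measured data)
random.seed(0)
true_levels = [1.0, 2.0, 3.0, 4.0]
values = [random.gauss(level, 0.1) for level in true_levels for _ in range(500)]

centers, counts = make_histogram(values, lo=0.0, hi=5.0, n_bins=100)
weights, means_fit, stds_fit = fit_states(
    centers, counts, means=[0.8, 2.2, 2.8, 4.3], stds=[0.3] * 4)

def emission(x, state):
    """Emission probability density of signal value x under the given state."""
    return normal_pdf(x, means_fit[state], stds_fit[state])

print([round(m, 2) for m in means_fit])
```

Fitting to bin counts rather than to raw samples keeps the cost of each iteration proportional to the number of bins, regardless of how many signal values were measured.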
13. A method of using a sequencing cell, the method comprising:
obtaining a first set of signal values measured from a nucleic acid over a first time interval for a sequencing cell, wherein the first set of signal values includes measurements for each of four cell states of the sequencing cell, the four cell states corresponding to different types of nucleotides;
creating a first histogram of the first set of signal values, the first histogram being a data structure storing a plurality of counts, each count corresponding to a number of signal values within a bin, each bin of the first histogram corresponding to different numerical values;
for each cell state of the four cell states:
obtaining an initial probability function that assigns emission probabilities of being in the cell state to the different numerical values; and
using the initial probability function and the first histogram to determine a first probability function that assigns emission probabilities of being in the cell state to the different numerical values, the first probability function corresponding to the first time interval;
determining second probability functions corresponding to a second time interval, the first probability functions and the second probability functions forming a set of time-dependent probability functions, wherein the second probability functions are determined using the first probability functions and a second histogram determined from a second set of signal values measured from the nucleic acid over the second time interval for the sequencing cell;
determining bases comprising a sequence of the nucleic acid using the set of time-dependent probability functions; and
providing the sequence of the nucleic acid.
Dependent claims: 14, 15, 16, 17, 18, 19
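The time-dependent probability functions of claim 13 can be illustrated with a short sketch in which each interval's functions are derived from the previous interval's functions and the new histogram. The claim does not fix the update rule, so this example assumes Gaussian probability functions and a hard-assignment update: each bin is assigned to the state that best explains it under the previous functions, and each state's mean and spread are then recomputed from its bins. The helper `toy_histogram`, the drift of 0.1 per interval, and all numeric values are hypothetical.

```python
import math

def log_density(x, mean, std):
    """Log of a normal density, up to an additive constant."""
    return -math.log(std) - 0.5 * ((x - mean) / std) ** 2

def update_states(prev, centers, counts, min_std=0.01):
    """Derive this interval's (mean, std) per state from the previous ones."""
    k = len(prev)
    members = [[] for _ in range(k)]
    for x, c in zip(centers, counts):
        if c:
            # Assign the bin to the state most likely under the previous functions
            j = max(range(k), key=lambda s: log_density(x, *prev[s]))
            members[j].append((x, c))
    updated = []
    for (mean, std), binned in zip(prev, members):
        mass = sum(c for _, c in binned)
        if mass == 0:
            updated.append((mean, std))  # no data: keep the previous function
            continue
        new_mean = sum(x * c for x, c in binned) / mass
        var = sum(c * (x - new_mean) ** 2 for x, c in binned) / mass
        updated.append((new_mean, max(math.sqrt(var), min_std)))
    return updated

def toy_histogram(levels, width=0.1, n_bins=60):
    """Hypothetical histogram with a small three-bin peak at each level."""
    centers = [i * width for i in range(n_bins)]
    counts = [0] * n_bins
    for level in levels:
        i = round(level / width)
        for offset, c in ((-1, 2), (0, 6), (1, 2)):
            counts[i + offset] += c
    return centers, counts

# Initial probability functions (assumed starting guesses)
params = [(1.0, 0.3), (2.0, 0.3), (3.0, 0.3), (4.0, 0.3)]
history = []  # one set of per-state functions per time interval

for interval in range(5):
    drift = 0.1 * interval  # simulated slow drift of the cell's signal levels
    centers, counts = toy_histogram([level + drift for level in (1.0, 2.0, 3.0, 4.0)])
    params = update_states(params, centers, counts)
    history.append(params)

final_means = [mean for mean, _ in history[-1]]
print([round(m, 2) for m in final_means])
```

Because each update starts from the previous interval's functions, the state labels stay attached to the right peaks as long as the drift between consecutive intervals is small compared with the spacing between levels.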
Specification