BASECALLER FOR DNA SEQUENCING USING MACHINE LEARNING
First Claim
1. A method of calling one or more bases for a nucleic acid of an organism, the method comprising:
- receiving, at a computer system, a basecalling model, the basecalling model configured to:
  - receive inputs of intensity values for bases at one or more positions on a nucleic acid, and
  - output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids;
- receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid;
- for each of N positions of the first test nucleic acid:
  - identifying intensity values corresponding to the position;
- determining, by the computer system, a first base call at a first of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer equal to or greater than 1.
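The calling step of claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the four-channel layout in `BASES` order, the `toy_model` stand-in (a real model would be trained as the claim describes), the 0.1 bleed factor, and the choice of a forward-looking window of N positions.

```python
from typing import Callable, List, Sequence

BASES = "ACGT"  # assumed channel order: one intensity value per base

# A model maps the intensity vectors of N positions to one base call.
Model = Callable[[List[Sequence[float]]], str]


def toy_model(window: List[Sequence[float]]) -> str:
    """Hypothetical stand-in for a trained basecalling model.

    Calls the base at the first position of the window, subtracting an
    assumed fraction (0.1) of the signal seen at the other positions.
    """
    target, rest = window[0], window[1:]
    scores = []
    for ch in range(len(BASES)):
        bleed = sum(pos[ch] for pos in rest)
        scores.append(target[ch] - 0.1 * bleed)
    return BASES[scores.index(max(scores))]


def call_bases(model: Model, intensities: Sequence[Sequence[float]], n: int = 1) -> str:
    """Call a base at each position from the intensities of up to n positions."""
    calls = []
    for start in range(len(intensities)):
        # Identify the intensity values for the N positions used as input.
        window = list(intensities[start:start + n])
        calls.append(model(window))
    return "".join(calls)


# Made-up intensity values for a three-position test nucleic acid.
read = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.2, 0.8, 0.1),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_bases(toy_model, read, n=2))  # prints "AGT"
```

With `n=1` the model sees only the position being called, which corresponds to the smallest N the claim allows.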
Abstract
Methods, systems, and apparatuses are provided for creating and using a machine-learning model to call a base at a position of a nucleic acid based on intensity values measured during a production sequencing run. The model can be trained using training data from training sequencing runs performed earlier. The model is trained using intensity values and assumed sequences that are determined as the correct output. The training data can be filtered to improve accuracy. The training data can be selected in a specific manner to be representative of the type of organism to be sequenced. The model can be trained to use intensity signals from multiple cycles and from neighboring nucleic acids to improve accuracy in the base calls.
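One way to picture the use of intensity signals from multiple cycles and from neighboring nucleic acids is as a single input vector per base call. The sketch below is an assumption-laden illustration, not the patent's feature layout: the names `signals`, `neighbors`, and `flank`, the zero padding at the ends of a read, and the use of same-cycle neighbor signals are all hypothetical.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float, float, float]  # assumed: one intensity per base channel
ZERO: Vec = (0.0, 0.0, 0.0, 0.0)         # assumed padding for missing cycles


def feature_vector(
    signals: Dict[str, List[Vec]],
    neighbors: Dict[str, List[str]],
    spot: str,
    cycle: int,
    flank: int = 1,
) -> List[float]:
    """Build model inputs for one base call.

    signals[spot][cycle] holds the intensities of one nucleic acid at one cycle;
    neighbors[spot] lists the ids of physically adjacent nucleic acids.
    """
    feats: List[float] = []
    n_cycles = len(signals[spot])
    # Intensities of this nucleic acid at the cycles around the target cycle.
    for c in range(cycle - flank, cycle + flank + 1):
        feats.extend(signals[spot][c] if 0 <= c < n_cycles else ZERO)
    # Same-cycle intensities of each neighboring nucleic acid.
    for nb in neighbors.get(spot, []):
        feats.extend(signals[nb][cycle])
    return feats


signals = {
    "s1": [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
    "s2": [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 0.0, 1.0)],
}
neighbors = {"s1": ["s2"]}
features = feature_vector(signals, neighbors, "s1", cycle=0)
print(len(features))  # 3 cycles x 4 channels + 1 neighbor x 4 channels = 16
```

A trained model would consume vectors like `features` in place of the single-position intensities alone.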
22 Claims
1. A method of calling one or more bases for a nucleic acid of an organism, the method comprising:
- receiving, at a computer system, a basecalling model, the basecalling model configured to:
  - receive inputs of intensity values for bases at one or more positions on a nucleic acid, and
  - output a base call for each of the one or more positions, wherein the basecalling model is trained using a statistically significant number of assumed sequences of training nucleic acids and corresponding intensity values for bases at the positions of the assumed sequences, the corresponding intensity values being obtained from one or more first sequencing processes of training nucleic acids;
- receiving, at the computer system, sequencing data of test nucleic acids from a second sequencing process that is different from any of the first sequencing processes, the sequencing data including intensity values for bases at a plurality of positions of a first test nucleic acid;
- for each of N positions of the first test nucleic acid:
  - identifying intensity values corresponding to the position;
- determining, by the computer system, a first base call at a first of the N positions using the basecalling model based on inputs of the intensity values for the N positions, where N is an integer equal to or greater than 1.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
22. A method of creating a basecalling model, the method comprising:
- receiving sequencing data of training nucleic acids from one or more sequencing processes, the sequencing data including intensity values for bases at positions of the training nucleic acids, the training nucleic acids being from one or more training samples;
- for each of a set of the training nucleic acids:
  - performing an initial base call at positions of the training nucleic acid to obtain an initial sequence based at least on the intensity values at the positions of the training nucleic acid;
- filtering the initial sequences to obtain a set of filtered sequences, the filtering removing all or a portion of at least one of the initial sequences;
- determining assumed sequences corresponding to the initial sequences, wherein an assumed sequence is assumed to be a correct sequence for the positions of the corresponding training nucleic acid; and
- generating the basecalling model using the filtered sequences and the intensity values corresponding to the filtered sequences, wherein the method is implemented using a computer system.
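The model-creation steps of claim 22 can be sketched end to end on toy data. This is only an illustration under stated assumptions, none of which come from the patent: the initial base call is a per-position maximum, the assumed sequence is the closest entry in a hypothetical `references` list by Hamming distance, filtering drops whole reads that differ too much from their assumed sequence, and the "model" is a set of per-base mean intensity vectors. For compactness the sketch decides the filter after looking up the assumed sequence.

```python
from typing import Dict, List, Sequence

BASES = "ACGT"  # assumed channel order: one intensity value per base
Read = List[Sequence[float]]


def initial_call(read: Read) -> str:
    """Initial base call: the brightest channel at each position."""
    return "".join(BASES[max(range(4), key=lambda ch: pos[ch])] for pos in read)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assumed_sequence(initial: str, references: List[str]) -> str:
    """Assumed-correct sequence: the closest hypothetical reference."""
    return min(references, key=lambda ref: hamming(initial, ref))


def train(reads: List[Read], references: List[str], max_mismatches: int = 1) -> Dict[str, List[float]]:
    """Return per-base mean intensity vectors learned from the filtered reads."""
    sums = {b: [0.0] * 4 for b in BASES}
    counts = {b: 0 for b in BASES}
    for read in reads:
        initial = initial_call(read)
        assumed = assumed_sequence(initial, references)
        if hamming(initial, assumed) > max_mismatches:
            continue  # filtering: remove this initial sequence entirely
        # Label each position's intensities with the assumed base.
        for pos, base in zip(read, assumed):
            counts[base] += 1
            for ch in range(4):
                sums[base][ch] += pos[ch]
    return {b: [s / counts[b] for s in sums[b]] for b in BASES if counts[b]}


reads = [
    [(0.9, 0.1, 0.0, 0.0), (0.1, 0.8, 0.1, 0.0)],  # initial call "AC"
    [(0.7, 0.1, 0.1, 0.1), (0.5, 0.4, 0.1, 0.0)],  # initial call "AA", assumed "AC"
    [(0.0, 0.0, 0.9, 0.1), (0.1, 0.0, 0.8, 0.1)],  # initial call "GG", filtered out
]
centroids = train(reads, references=["AC"])
print(sorted(centroids))  # prints ['A', 'C']
```

The second read shows why the assumed sequence, rather than the initial call, serves as the training label: its second position is initially called "A" but contributes to the "C" vector.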
Specification