System and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms
Abstract
A system and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms. The method includes processing a plurality of information obtained from a base calling system and creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics. A quality value is also assigned to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics. Processing comprises detecting a plurality of peaks, expanding the plurality of peaks, and resolving the plurality of expanded peaks. Resolving may include fitting the plurality of expanded peaks using a model of a peak shape. A peak resolution parameter is calculated and used in processing. The system may also be trained.
113 Citations
54 Claims
-
1. A machine readable medium having stored thereon instructions which when executed by a processor cause a machine to perform operations comprising:
-
processing a plurality of information obtained from an electropherogram output by a base calling system;
creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics;
assigning a quality value to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics; and
training by generating a training file and look-up table, wherein generating the training file comprises:
extracting and processing a group of electropherograms from at least one sample file;
determining called base sequences for each fragment in each electropherogram;
aligning the called base sequences with a corresponding consensus sequence to identify correct and erroneous base calls;
computing trace parameters for each base call of the called base sequences; and
writing the base calls as base call data and the trace parameters to the training file.
-
-
2. The machine readable medium of claim 1 wherein generating the look-up table comprises:
-
extracting the base call data and the trace parameters from the training file;
generating a set of trace parameter threshold values by partitioning the trace parameters into a plurality of bins;
populating the plurality of bins with the base calls;
defining a group of considered cuts; and
iteratively performing the following until a number of erroneous base calls for each cut of the group of considered cuts is zero:
computing quality values for each cut of the group of considered cuts;
selecting a largest quality value cut as the considered cut having the largest quality value of the group of considered cuts;
storing the largest quality value and corresponding threshold values in the look-up table;
deleting the largest quality value cut from the group of considered cuts; and
adjusting a base call count for the group of considered cuts by deleting those base call counts shared with the largest quality value cut.
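The look-up table steps listed above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the patented implementation: the names `build_lookup_table` and `in_cut` are hypothetical, a cut is assumed to be one threshold per trace parameter, a base call is assumed to fall inside a cut when each parameter is at or below its threshold, and the quality value is assumed to be a Phred-style score with add-one smoothing. Counts are recomputed by brute force here rather than by a dynamic programming algorithm.

```python
import math
from itertools import product


def in_cut(params, cut):
    # Assumed convention: a base call is inside a cut when every trace
    # parameter is at or below the cut's threshold for that parameter.
    return all(p <= t for p, t in zip(params, cut))


def build_lookup_table(calls, thresholds):
    """calls: list of (trace_params, is_correct) pairs.
    thresholds: one list of threshold values per trace parameter."""
    remaining = list(calls)
    cuts = set(product(*thresholds))  # the group of considered cuts
    table = []

    def counts(cut):
        inside = [ok for params, ok in remaining if in_cut(params, cut)]
        return len(inside), inside.count(False)

    # Iterate until no considered cut contains an erroneous base call.
    while any(counts(cut)[1] > 0 for cut in cuts):
        scored = []
        for cut in cuts:
            total, errors = counts(cut)
            if total:
                # Assumed Phred-style quality with add-one smoothing.
                q = -10.0 * math.log10((errors + 1) / (total + 1))
                scored.append((q, cut))
        best_q, best_cut = max(scored)
        table.append((best_q, best_cut))  # store quality and thresholds
        cuts.discard(best_cut)            # delete the selected cut
        # Remove calls covered by the selected cut so the remaining cuts
        # no longer count them.
        remaining = [c for c in remaining if not in_cut(c[0], best_cut)]
    return table
```

Rows come out in decreasing quality order because each pass removes the calls already explained by a better cut.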
-
-
3. The machine readable medium of claim 2 wherein adjusting the base call count comprises:
computing using a four dimensional dynamic programming algorithm the base call counts for a current cut using the base call counts of a system defined number of neighboring cuts and a current bin.
-
4. The machine readable medium of claim 1, wherein the operations further comprise:
-
determining a value corresponding to a fraction of errors within the alignment of said base-called sequence and said consensus sequence;
determining if the value exceeds a predetermined threshold; and
removing the base calls and associated trace parameters from the training file if said value exceeds said predetermined threshold.
-
-
5. The machine readable medium of claim 4 wherein the predetermined threshold is approximately 10%.
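As a rough sketch of the error-fraction filter in claims 4 and 5, the check below keeps a read for training only when its fraction of erroneous calls does not exceed the threshold. The function name `keep_for_training` and the representation of an alignment as a list of per-base correctness flags are assumptions made for illustration.

```python
def keep_for_training(is_correct_flags, max_error_fraction=0.10):
    """Return True when the read's error fraction is within the threshold.

    is_correct_flags: one boolean per base call (True = matches consensus).
    max_error_fraction: the predetermined threshold, about 10% in claim 5.
    """
    if not is_correct_flags:
        return False  # an empty read contributes nothing to training
    errors = sum(1 for ok in is_correct_flags if not ok)
    return errors / len(is_correct_flags) <= max_error_fraction
```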
-
6. The machine readable medium of claim 1, wherein the operations further comprise:
-
determining if two or more distinct regions in the consensus sequence have a fraction of errors below a specified threshold when aligned with the base-called sequence; and
deleting the sample file if at least two of said two or more distinct regions have a fraction of errors less than a specified threshold when aligned with the base-called sequence.
-
-
7. The machine readable medium of claim 6 wherein the specified threshold is approximately less than 85%.
-
8. A method for obtaining information from a base calling system, the method comprising:
-
processing a plurality of information obtained from an electropherogram output by the base calling system;
creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics; and
assigning a quality value to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics.
-
-
9. The method of claim 8 wherein the processing comprises:
-
detecting a plurality of peaks;
expanding the plurality of peaks into a plurality of expanded peaks; and
resolving the plurality of expanded peaks to create a plurality of intrinsic peaks having the plurality of intrinsic peak characteristics.
-
-
10. The method of claim 9 wherein the detecting comprises:
-
identifying a plurality of inflection points; and
selecting the plurality of peaks based on the plurality of inflection points.
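One way to picture inflection-point detection is through sign changes of the discrete second difference of a single-color trace. The sketch below is an assumption about how this could be done, not the method the patent specifies; `inflection_points` and `detect_peaks` are hypothetical names, and a peak is taken to be the span between a left inflection point and the next right inflection point.

```python
def inflection_points(signal):
    """Return (index, kind) pairs where the second difference changes sign."""
    second = [signal[i - 1] - 2 * signal[i] + signal[i + 1]
              for i in range(1, len(signal) - 1)]
    points = []
    for k in range(1, len(second)):
        # second[k] belongs to sample k + 1 of the trace.
        if second[k - 1] > 0 >= second[k]:
            points.append((k + 1, "left"))   # curvature turns concave-down
        elif second[k - 1] < 0 <= second[k]:
            points.append((k + 1, "right"))  # curvature turns concave-up
    return points


def detect_peaks(signal):
    """Pair each left inflection point with the next right inflection point."""
    peaks, left = [], None
    for index, kind in inflection_points(signal):
        if kind == "left":
            left = index
        elif left is not None:
            peaks.append((left, index))
            left = None
    return peaks
```

For the trace `[0, 1, 4, 9, 12, 9, 4, 1, 0]` this yields a single peak spanning samples 3 to 6, which brackets the maximum at sample 4.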
-
-
11. The method of claim 10 wherein the detecting further comprises:
computing a plurality of apparent peak characteristics.
-
12. The method of claim 10 wherein the selecting comprises:
ignoring those peaks that have an area under the peak that is smaller than an average per-peak area of a plurality of consecutive preceding peaks.
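The area-based selection in claim 12 might look like the following sketch. The window length and the `fraction` multiplier are hypothetical parameters that the claim does not specify; with `fraction=1.0` the comparison is against the plain average, as the claim is worded. The average is taken here over all preceding detected peaks in the window, which is also an assumption.

```python
def select_peaks(areas, window=10, fraction=1.0):
    """Return indices of peaks that are kept.

    areas: area under each detected peak, in trace order.
    A peak is ignored when its area is smaller than `fraction` times the
    average area of up to `window` consecutive preceding peaks.
    """
    kept = []
    for i, area in enumerate(areas):
        preceding = areas[max(0, i - window):i]
        if preceding and area < fraction * (sum(preceding) / len(preceding)):
            continue  # too small relative to its recent neighbors
        kept.append(i)
    return kept
```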
-
13. The method of claim 9 wherein the expanding comprises:
-
scanning to the left from a left inflection point of each peak of the plurality of peaks to determine an expanded left boundary for the peak; and
scanning to the right from a right inflection point of each peak of the plurality of peaks to determine an expanded right boundary for the peak.
-
-
14. The method of claim 13 wherein the scanning to the left comprises:
-
if a zero signal is detected, designating a location at which the zero signal is detected as the left expanded boundary for the peak;
if the beginning of a trace is detected, designating a location at which the beginning of a trace is detected as the left expanded boundary for the peak;
if a local minimum is detected, designating a location at which the local minimum is detected as the left expanded boundary for the peak;
if a right inflection point of a previous peak is detected, designating a location at which the right inflection point of the previous peak is detected as the left expanded boundary for the peak; and
if the left expanded boundary is to the left of a right expanded boundary of the previous peak, redefining the left expanded boundary of the peak and right expanded boundary of the previous peak as a midpoint between a left inflection point of the peak and the right inflection point of the previous peak.
-
-
15. The method of claim 13 wherein the scanning to the right comprises:
-
if a zero signal is detected, designating a location at which the zero signal is detected as the right expanded boundary for the peak;
if the end of a trace is detected, designating a location at which the end of the trace is detected as the right expanded boundary for the peak;
if a local minimum is detected, designating a location at which the local minimum is detected as the right expanded boundary for the peak;
if a left inflection point of a next peak is detected, designating a location at which the left inflection point of the next peak is detected as the right expanded boundary for the peak;
if the right expanded boundary is to the right of a left expanded boundary of the next peak, redefining the right expanded boundary of the peak and left expanded boundary of the next peak as a midpoint between a right inflection point of the peak and the left inflection point of the next peak.
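The boundary scans in claims 14 and 15 can be sketched as two symmetric loops that stop at the first terminating condition, plus a helper for the midpoint rule. The function names, the reason strings, and the order in which the conditions are tested are assumptions for illustration. The returned reason could also feed a single/multiple classification like the one in claim 16.

```python
def scan_left(signal, left_inflection, prev_right_inflection=None):
    """Walk left from a peak's left inflection point to its expanded boundary."""
    i = left_inflection
    while True:
        if i == 0:
            return i, "trace_start"
        if signal[i] == 0:
            return i, "zero_signal"
        if prev_right_inflection is not None and i <= prev_right_inflection:
            return i, "prev_inflection"
        if signal[i - 1] > signal[i]:
            return i, "local_minimum"  # the signal rises again further left
        i -= 1


def scan_right(signal, right_inflection, next_left_inflection=None):
    """Walk right from a peak's right inflection point to its expanded boundary."""
    i, last = right_inflection, len(signal) - 1
    while True:
        if i == last:
            return i, "trace_end"
        if signal[i] == 0:
            return i, "zero_signal"
        if next_left_inflection is not None and i >= next_left_inflection:
            return i, "next_inflection"
        if signal[i + 1] > signal[i]:
            return i, "local_minimum"  # the signal rises again further right
        i += 1


def reconcile(prev_right_boundary, left_boundary,
              prev_right_inflection, left_inflection):
    """Apply the midpoint rule when two adjacent expanded boundaries overlap."""
    if left_boundary < prev_right_boundary:
        mid = (prev_right_inflection + left_inflection) // 2
        return mid, mid
    return prev_right_boundary, left_boundary
```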
-
-
16. The method of claim 13 wherein the expanding further comprises:
classifying each of the expanded peaks as single or multiple based on the type of expanded left boundary and the type of expanded right boundary.
-
17. The method of claim 9 wherein the resolving comprises:
-
fitting the plurality of expanded peaks using a model of a peak shape; and
computing a peak resolution.
-
-
18. The method of claim 17 wherein the fitting comprises:
-
deriving an equation representing an output of an electrophoresis of a plurality of DNA fragments, the equation accounting for electromigration and diffusion;
solving the equation to obtain a peak shape expression dependent on at least one of three adjustable parameters;
for each single peak of the plurality of expanded peaks, computing the plurality of intrinsic peak characteristics and the adjustable parameters using the peak shape expression and a set of single peak measurable characteristics; and
for each multiple peak portion of the plurality of expanded peaks, inferring at least one intrinsic peak having the intrinsic peak characteristics.
-
-
19. The method of claim 18 wherein the inferring comprises:
-
computing a set of average measurable peak characteristics based on analysis of a set of preceding single peaks;
calculating a first invariant peak shape parameter and a second invariant peak shape parameter based on the set of average measurable peak characteristics; and
for each of the intrinsic peaks, iteratively computing an intrinsic peak position and an intrinsic peak height based on the first invariant peak shape parameter, the second invariant peak shape parameter, the peak shape expression, and a set of intrinsic peak measurable characteristics.
-
-
20. The method of claim 17 wherein the computing a peak resolution comprises:
dividing an area under a curve representing the absolute value of the difference between an apparent peak signal of an apparent peak of the plurality of peaks and a corresponding fitted model peak by an area of the apparent peak.
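The peak resolution ratio described in claim 20 can be approximated numerically. The sketch below assumes unit sample spacing and a simple rectangle-rule area, with both curves sampled at the same points; `peak_resolution` is a hypothetical name.

```python
def peak_resolution(apparent, fitted):
    """Area of |apparent - fitted| divided by the area of the apparent peak.

    apparent: sampled signal of the apparent peak.
    fitted:   sampled model peak at the same positions.
    """
    if len(apparent) != len(fitted):
        raise ValueError("apparent and fitted must have the same length")
    area = sum(apparent)
    if area <= 0:
        raise ValueError("apparent peak must have positive area")
    return sum(abs(a - f) for a, f in zip(apparent, fitted)) / area
```

A value of 0 means the model reproduces the apparent peak exactly; larger values indicate a poorer fit.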
-
21. The method of claim 8 wherein the creating comprises:
-
refining the plurality of original base calls using the plurality of intrinsic peak characteristics; and
inserting a plurality of newly called bases using the plurality of intrinsic peak characteristics.
-
-
22. The method of claim 21 wherein the refining comprises:
-
calling true peaks;
resolving wide peaks;
re-calling unknown bases; and
removing unmatched bases.
-
-
23. The method of claim 21 wherein the refining comprises:
-
for each known original base call, scanning a peak list of a particular color base;
if a peak is found at a location of the original base call and the peak is not assigned, calling the peak and assigning a corresponding base to the peak;
if the peak is found at the location of the original base call and the peak has been assigned, determining whether the peak is wide enough to be split;
if the peak is wide enough to be split, resolving the peak into two peaks, calling the two peaks, and reassigning the bases for the two peaks;
if the peak is not wide enough to be split or no peaks are found at the location of the original base, searching for an unassigned peak near the location of the original base call;
if the unassigned peak is found, calling the unassigned peak and assigning a base to the unassigned peak; and
if no unassigned peak is found, temporarily designating the original base call as unknown.
-
-
24. The method of claim 21 wherein the refining comprises:
-
for each unknown base call, scanning four peak lists, one peak list for each base;
if at least one peak is found at a location of an unknown base call, obtaining a best peak;
if the best peak is not assigned, replacing the unknown base call with the corresponding base of the best peak, calling the best peak and assigning a corresponding base to the best peak;
if the best peak has been assigned, determining whether the best peak is wide enough to be split;
if the best peak is wide enough to be split, resolving the best peak into two peaks, replacing the unknown base with the corresponding best peak base, calling the two peaks, and assigning corresponding bases to the two peaks;
if the best peak is not wide enough to be split or no peaks are found at the location of the unknown base call, searching for a best unassigned peak near the location of the unknown base call;
if the best unassigned peak is found near the location of the unknown base call, replacing the unknown base with the corresponding best peak base, calling the peak and assigning the corresponding base to the peak; and
if no best unassigned peak is found, rejecting the unknown base call.
-
-
25. The method of claim 21 wherein inserting the plurality of newly called bases comprises:
-
creating a multi-colored list of all peaks; and
calling those peaks in the multi-colored list of all peaks that meet all of a plurality of criteria.
-
-
26. The method of claim 25 wherein the plurality of criteria comprise:
-
whether an index of an uncalled peak is greater than a specified minimum;
whether an intrinsic height of the uncalled peak is greater than an intrinsic signal of any other peaks at the position of the uncalled peak; and
whether a spacing between two adjacent called peaks is large enough for insertion of a new base.
-
-
27. The method of claim 8 wherein assigning comprises:
-
computing a plurality of peak trace parameters for each intrinsic peak in the plurality of refined base calls; and
obtaining a quality value for each base in the plurality of refined base calls from a look-up table.
-
-
28. The method of claim 27 wherein the plurality of peak trace parameters comprises:
-
a first peak height ratio for a current peak based on a first plurality of called peaks centered at a current peak;
a second peak height ratio for the current peak based on a second plurality of called peaks centered at the current peak;
a peak spacing ratio for the current peak based on a largest peak spacing and a smallest peak spacing of the second plurality of called peaks centered at the current peak; and
a peak resolution.
-
-
29. The method of claim 28 wherein the first plurality of called peaks centered at the current peak comprises three called peaks, and wherein the second plurality of called peaks centered at the current peak comprises seven called peaks.
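A hedged sketch of the windowed trace parameters follows. The spacing ratio is taken as largest spacing divided by smallest spacing, following the claim wording. The claims do not define the height ratio formula, so the largest-to-smallest height within the window is an assumption, as are the function names and the truncation of windows at the ends of the read.

```python
def window(values, center, size):
    """Return up to `size` values centered at index `center`."""
    half = size // 2
    return values[max(0, center - half):center + half + 1]


def height_ratio(heights, center, size):
    # Assumed definition: largest over smallest called-peak height.
    w = window(heights, center, size)
    return max(w) / min(w)


def spacing_ratio(positions, center, size=7):
    # Largest over smallest spacing between adjacent called peaks.
    w = window(positions, center, size)
    gaps = [b - a for a, b in zip(w, w[1:])]
    return max(gaps) / min(gaps)
```

With three and seven called peaks as in claim 29, `height_ratio(heights, i, 3)` and `height_ratio(heights, i, 7)` would give the first and second ratios for peak `i`. The sketch does not guard against zero heights or coincident peak positions.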
-
30. The method of claim 27 wherein the obtaining comprises:
selecting the quality value from the look-up table from the row in the look-up table in which a plurality of table trace parameters in the row of the look-up table each exceed the plurality of peak trace parameters for a current peak.
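The row selection in claim 30 can be sketched as a linear scan. This assumes the table rows are ordered from highest to lowest quality value and that the first qualifying row wins; the function name and the `default` returned when no row qualifies are hypothetical.

```python
def lookup_quality(table, params, default=0):
    """Return the quality value of the first row whose thresholds all
    exceed the peak's trace parameters.

    table:  list of (quality_value, thresholds) rows, best quality first.
    params: trace parameters computed for the current peak.
    """
    for quality, thresholds in table:
        if all(t > p for t, p in zip(thresholds, params)):
            return quality
    return default
```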
-
31. The method of claim 8 further comprising:
training, wherein the training includes generating a training file and a look-up table.
-
32. The method of claim 31 wherein the training is conducted at each of a plurality of user locations based on a plurality of characteristics and a plurality of requirements of each of the plurality of user locations.
-
33. The method of claim 31 wherein generating the training file comprises:
-
extracting and processing a group of electropherograms from at least one sample file;
determining called base sequences for each fragment in each electropherogram;
aligning the called base sequences with a corresponding consensus sequence to identify correct and erroneous base calls;
computing trace parameters for each base call of the called base sequences; and
writing the base calls as base call data and the trace parameters to the training file.
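To illustrate the alignment step used when generating the training file, the sketch below labels each called base as correct or erroneous by aligning the called sequence with the consensus. Python's `difflib.SequenceMatcher` stands in for whatever aligner an actual system would use, which the claims do not name; `label_base_calls` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def label_base_calls(called, consensus):
    """Return one flag per called base: True if it aligns to a matching
    consensus base, False otherwise."""
    flags = [False] * len(called)
    matcher = SequenceMatcher(None, called, consensus, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Each block marks a run of called bases that match the consensus.
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    return flags
```

Each flag could then be written to the training file next to the trace parameters of the corresponding base call.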
-
-
34. The method of claim 33 wherein generating the look-up table comprises:
-
extracting the base call data and the trace parameters from the training file;
generating a set of trace parameter threshold values by partitioning the trace parameters into a plurality of bins;
populating the bins with the base calls;
defining a group of considered cuts; and
iteratively performing the following until a number of erroneous base calls for each cut of the group of considered cuts is zero:
computing quality values for each cut of the group of considered cuts;
selecting a largest quality value cut as the considered cut having the largest quality value of the group of considered cuts;
storing the largest quality value and corresponding threshold values in the look-up table;
deleting the largest quality value cut from the group of considered cuts; and
adjusting a base call count for the group of considered cuts by deleting those base call counts shared with the largest quality value cut.
-
-
35. The method of claim 34 wherein the adjusting comprises:
-
computing using a four dimensional dynamic programming algorithm the base call counts for a current cut using the base call counts of a system defined number of neighboring cuts and a current bin; and
for each multiple peak portion of the plurality of peaks, breaking the multiple peak portion into at least two descendent peaks, and computing the plurality of intrinsic peak characteristics for each of the descendent peaks using the peak shape expression.
-
-
36. A machine readable medium having stored thereon instructions which, when executed by a processor, cause the machine to perform operations comprising:
-
processing information obtained by a base calling system from an electropherogram output;
creating a plurality of refined base calls using a plurality of original base calls and a plurality of intrinsic peak characteristics; and
assigning a quality value to each of the plurality of refined base calls using the plurality of intrinsic peak characteristics.
-
-
37. The machine readable medium of claim 36 wherein the processing comprises:
-
detecting a plurality of peaks;
expanding the plurality of peaks into a plurality of expanded peaks; and
resolving the plurality of expanded peaks to create a plurality of intrinsic peaks having the plurality of intrinsic peak characteristics.
-
-
38. The machine readable medium of claim 37 wherein the detecting comprises:
-
identifying a plurality of inflection points; and
selecting the plurality of peaks based on the plurality of inflection points.
-
-
39. The machine readable medium of claim 38 wherein the detecting further comprises:
computing a plurality of apparent peak characteristics.
-
40. The machine readable medium of claim 37 wherein the expanding comprises:
-
scanning to the left from a left inflection point of each peak of the plurality of peaks to determine an expanded left boundary for the peak; and
scanning to the right from a right inflection point of each peak of the plurality of peaks to determine an expanded right boundary for the peak.
-
-
41. The machine readable medium of claim 37 wherein the resolving comprises:
-
fitting the plurality of expanded peaks using a model of a peak shape; and
computing a peak resolution.
-
-
42. The machine readable medium of claim 41 wherein the fitting comprises:
-
for each single peak of the plurality of expanded peaks, computing the plurality of intrinsic peak characteristics and the adjustable parameters using the peak shape expression and a set of single peak measurable characteristics; and
for each multiple peak portion of the plurality of expanded peaks, inferring at least one intrinsic peak having the intrinsic peak characteristics.
-
-
43. The machine readable medium of claim 42 wherein the inferring comprises:
-
computing a set of average measurable peak characteristics based on analysis of a set of preceding single peaks;
calculating a first invariant peak shape parameter and a second invariant peak shape parameter based on the set of average measurable peak characteristics; and
for each of the intrinsic peaks, iteratively computing an intrinsic peak position and an intrinsic peak height based on the first invariant peak shape parameter, the second invariant peak shape parameter, the peak shape expression, and a set of intrinsic peak measurable characteristics.
-
-
44. The machine readable medium of claim 41 wherein computing the peak resolution comprises:
dividing an area under a curve representing the absolute value of a difference between an apparent peak signal of an apparent peak of the plurality of peaks and a corresponding fitted model peak by an area of the apparent peak.
-
45. The machine readable medium of claim 36 wherein creating comprises:
-
refining the plurality of original base calls using the plurality of intrinsic peak characteristics; and
inserting a plurality of newly called bases using the plurality of intrinsic peak characteristics.
-
-
46. The machine readable medium of claim 45 wherein the refining comprises:
-
calling true peaks;
resolving wide peaks;
re-calling unknown bases; and
removing unmatched bases.
-
-
47. The machine readable medium of claim 45 wherein the refining comprises:
-
for each known original base call, scanning a peak list of a particular color base;
if a peak is found at a location of the original base call and the peak is not assigned, calling the peak and assigning a corresponding base to the peak;
if the peak is found at the location of the original base call and the peak has been assigned, determining whether the peak is wide enough to be split;
if the peak is wide enough to be split, resolving the peak into two peaks, calling the two peaks, and reassigning the bases for the two peaks;
if the peak is not wide enough to be split or no peaks are found at the location of the original base, searching for an unassigned peak near the location of the original base call;
if the unassigned peak is found, calling the unassigned peak and assigning a base to the unassigned peak; and
if no unassigned peak is found, temporarily designating the original base call as unknown.
-
-
48. The machine readable medium of claim 45 wherein the refining comprises:
-
for each unknown base call, scanning four peak lists, one peak list for each base;
if at least one peak is found at a location of an unknown base call, obtaining a best peak;
if the best peak is not assigned, replacing the unknown base call with the corresponding base of the best peak, calling the best peak and assigning a corresponding base to the best peak;
if the best peak has been assigned, determining whether the best peak is wide enough to be split;
if the best peak is wide enough to be split, resolving the best peak into two peaks, replacing the unknown base with the corresponding best peak base, calling the two peaks, and assigning corresponding bases to the two peaks;
if the best peak is not wide enough to be split or no peaks are found at the location of the unknown base call, searching for a best unassigned peak near the location of the unknown base call;
if the best unassigned peak is found near the location of the unknown base call, replacing the unknown base with the corresponding best peak base, calling the peak and assigning the corresponding base to the peak; and
if no best unassigned peak is found, rejecting the unknown base call.
-
-
49. The machine readable medium of claim 45 wherein the inserting the plurality of newly called bases comprises:
-
creating a multi-colored list of all peaks; and
calling those peaks in the multi-colored list of all peaks that meet all of a plurality of criteria.
-
-
50. The machine readable medium of claim 49 wherein the plurality of criteria comprises:
-
whether an index of an uncalled peak is greater than a specified minimum;
whether an intrinsic height of the uncalled peak is greater than an intrinsic signal of any other peaks at the position of the uncalled peak; and
whether a spacing between two adjacent called peaks is large enough for insertion of a new base.
-
-
51. The machine readable medium of claim 36 wherein the assigning comprises:
-
computing a plurality of peak trace parameters for each intrinsic peak in the plurality of refined base calls; and
obtaining a quality value for each base in the plurality of refined base calls from a look-up table.
-
-
52. The machine readable medium of claim 51 wherein the plurality of peak trace parameters comprises:
-
a first peak height ratio for a current peak based on a first plurality of called peaks centered at a current peak;
a second peak height ratio for the current peak based on a second plurality of called peaks centered at the current peak;
a peak spacing ratio for the current peak based on a largest peak spacing and a smallest peak spacing of the second plurality of called peaks centered at the current peak; and
a peak resolution.
-
-
53. The machine readable medium of claim 51 wherein the obtaining comprises:
selecting the quality value from the look-up table from the row in the look-up table in which a plurality of table trace parameters in the row of the look-up table each exceed the plurality of peak trace parameters for a current peak.
-
54. The machine readable medium of claim 36 containing further instructions which when executed by a processor cause the machine to perform operations comprising:
training by generating a training file and a look-up table.
Specification