
Method and apparatus for calling single-nucleotide variations and other variations

  • US 10,089,436 B2
  • Filed: 02/13/2014
  • Issued: 10/02/2018
  • Est. Priority Date: 11/01/2013
  • Status: Active Grant
First Claim

1. A method of improving accuracy for identifying a base call included in a target sequence produced by a hardware sequencing machine, comprising:

  • generating sequencing read data using the hardware sequencing machine;

    retrieving the sequencing read data generated by the hardware sequencing machine, wherein the sequencing read data include a plurality of sequencing reads;

    retrieving a reference sequence that includes a plurality of base values arranged in a particular sequence;

    selecting the plurality of sequencing reads generated by the hardware sequencing machine, wherein each sequencing read represents a different portion of the target sequence;

    each sequencing read includes a plurality of base locations arranged in a particular sequence; and

    each base location is assigned (1) an estimated base value, (2) a Phred quality score that is generated and assigned by the hardware sequencing machine and represents a likelihood that the estimated base value is accurate, and (3) a depth level which represents a total number of sequencing reads, in the plurality of sequencing reads, covering that base location;

    (A) determining base values for high-confidence base locations included in the plurality of sequencing reads, including:

    selecting a first subset of base locations from the plurality of sequencing reads in accordance with a determination that depth levels assigned to the first subset of base locations exceed a predetermined depth level;

    deeming, the first subset of base locations, the high-confidence base locations; and

    determining base values in the high-confidence locations by applying a statistical method to the estimated base values assigned to the high confidence base locations;

    (B) determining base values for low-confidence base locations included in the plurality of sequencing reads, including:

    selecting a second subset of locations out of base values included in the plurality of sequencing reads in accordance with (1) the Phred scores and (2) the depth level assigned to the second subset of locations;

    deeming, the second subset of locations, the low-confidence locations;

    separating the plurality of sequencing reads into (1) a first group of high quality reads and (2) a second group of low quality reads in accordance with whether every location within a sequencing read has an assigned Phred score that is above a predefined threshold;

    constructing a target-sequence prediction table based on high confidence locations in the reference sequence, the target-sequence prediction table having (a) a row index for four individual base values (A, C, G, T) that may occur at a location of the reference sequence and (b) a column index for ten diploid combinations (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT) that may occur at that location in one sequencing read of the plurality of sequencing reads;

    constructing a high quality read prediction table based on the first group of high quality reads, the high quality read prediction table having (a) a column index that corresponds to the ten diploid combinations that may occur at a location of the target sequence and (b) a row index that corresponds to the four individual base values that may be identified by the hardware sequencing machine for that location;

    constructing a low quality read prediction table based on the second group of low quality reads, the low quality read prediction table having (a) a column index that corresponds to the ten diploid combinations that may occur at a location of the target sequence and (b) a row index that corresponds to the four individual base values that may be identified by the hardware sequencing machine for that location;

    selecting a prior probability value from the target-sequence prediction table;

    selecting a conditional probability value from either the high quality read prediction table or the low quality read prediction table; and

    identifying a base call included in the target sequence in accordance with (1) the prior probability value from the target-sequence prediction table and (2) the conditional probability value from either the high quality read prediction table or the low quality read prediction table.
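
The final steps of the claim (splitting reads by Phred score, building 4 x 10 prediction tables, and combining a prior with a conditional probability) can be illustrated with a minimal Python sketch. It is not the patented implementation. All function names are hypothetical, and the table values are assumptions: the claim builds its tables from high-confidence locations and from the two read groups, whereas this sketch fills them from a simple symmetric error model and a fixed variant rate. Multiplying one conditional probability per covering read is likewise an assumed way of combining the evidence.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid combinations: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(p) for p in combinations_with_replacement(BASES, 2)]


def phred_to_error(q):
    """Standard Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def split_reads(reads, threshold):
    """A read is high quality only if every location scores above the threshold."""
    high, low = [], []
    for read in reads:
        is_high = all(q > threshold for q in read["quals"])
        (high if is_high else low).append(read)
    return high, low


def read_table(error):
    """table[observed_base][genotype] = P(observed base | genotype).

    ASSUMPTION: each allele is sampled with probability 1/2 and is misread
    as any one of the three other bases with probability error / 3.
    """
    return {
        base: {
            g: sum(0.5 * ((1 - error) if allele == base else error / 3)
                   for allele in g)
            for g in GENOTYPES
        }
        for base in BASES
    }


def target_table(variant_rate=0.001):
    """table[reference_base][genotype] = prior P(genotype | reference base).

    ASSUMPTION: every non-reference genotype gets the same small prior.
    """
    table = {}
    for ref in BASES:
        table[ref] = {g: variant_rate for g in GENOTYPES}
        table[ref][ref * 2] = 1 - variant_rate * (len(GENOTYPES) - 1)
    return table


def call_base(ref_base, observations, prior_tbl, high_tbl, low_tbl):
    """Return (most probable genotype, normalized posterior) for one location.

    observations: list of (observed_base, came_from_high_quality_read).
    """
    posterior = {}
    for g in GENOTYPES:
        p = prior_tbl[ref_base][g]  # prior from the target-sequence table
        for base, is_high in observations:
            table = high_tbl if is_high else low_tbl
            p *= table[base][g]  # conditional from the matching read table
        posterior[g] = p
    total = sum(posterior.values())
    posterior = {g: p / total for g, p in posterior.items()}
    return max(posterior, key=posterior.get), posterior
```

Under these assumptions, a location covered by an even mix of high quality A and G reads would be called as the heterozygous combination AG, while a single mismatching base from a low quality read carries too little weight to override a reference-matching call.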
