Method and apparatus for calling single-nucleotide variations and other variations

US 10,089,436 B2
Filed: 02/13/2014
Issued: 10/02/2018
Est. Priority Date: 11/01/2013
Status: Active Grant

First Claim

1. A method of improving accuracy for identifying a base call included in a target sequence produced by a hardware sequencing machine, comprising:

generating sequencing read data using the hardware sequencing machine;

retrieving the sequencing read data generated by the hardware sequencing machine, wherein the sequencing read data include a plurality of sequencing reads;

retrieving a reference sequence that includes a plurality of base values arranged in a particular sequence;

selecting the plurality of sequencing reads generated by the hardware sequencing machine, wherein each sequencing read represents a different portion of the target sequence;

each sequencing read includes a plurality of base locations arranged in a particular sequence; and

each base location is assigned (1) an estimated base value, (2) a Phred quality score that is generated and assigned by the hardware sequencing machine and represents a likelihood that the estimated base value is accurate, and (3) a depth level which represents a total number of sequencing reads, in the plurality of sequencing reads, covering that base location;

(A) determining base values for high-confidence base locations included in the plurality of sequencing reads, including:

selecting a first subset of base locations from the plurality of sequencing reads in accordance with a determination that depth levels assigned to the first subset of base locations exceed a predetermined depth level;

deeming the first subset of base locations to be the high-confidence base locations; and

determining base values in the high-confidence base locations by applying a statistical method to the estimated base values assigned to the high-confidence base locations;

(B) determining base values for low-confidence base locations included in the plurality of sequencing reads, including:

selecting a second subset of locations out of base values included in the plurality of sequencing reads in accordance with (1) the Phred scores and (2) the depth level assigned to the second subset of locations;

deeming the second subset of locations to be the low-confidence locations;

separating the plurality of sequencing reads into (1) a first group of high quality reads and (2) a second group of low quality reads in accordance with whether every location within a sequencing read has an assigned Phred score that is above a predefined threshold;

constructing a target-sequence prediction table based on high confidence locations in the reference sequence, the target-sequence prediction table having (a) a row index for four individual base values (A, C, G, T) that may occur at a location of the reference sequence and (b) a column index for ten diploid combinations (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT) that may occur at that location in one sequencing read of the plurality of sequencing reads;

constructing a high quality read prediction table based on the first group of high quality reads, the high quality read prediction table having (a) a column index that corresponds to the ten diploid combinations that may occur at a location of the target sequence and (b) a row index that corresponds to the four individual base values that may be identified by the hardware sequencing machine for that location;

constructing a low quality read prediction table based on the second group of low quality reads, the low quality read prediction table having (a) a column index that corresponds to the ten diploid combinations that may occur at a location of the target sequence and (b) a row index that corresponds to the four individual base values that may be identified by the hardware sequencing machine for that location;

selecting a prior probability value from the target-sequence prediction table;

selecting a conditional probability value from either the high quality read prediction table or the low quality read prediction table; and

identifying a base call included in the target sequence in accordance with (1) the prior probability value from the target-sequence prediction table and (2) the conditional probability value from either the high quality read prediction table or the low quality read prediction table.
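
The last three steps of claim 1 (selecting a prior, selecting a conditional probability, and identifying the base call) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the function names, the prior weights, and the uniform error model used to fill the read tables. The claim builds its tables from high-confidence locations and from the two read groups, not from a formula, so the tables below are stand-ins with the same row and column layout.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid combinations: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
GENOTYPES = ["".join(p) for p in combinations_with_replacement(BASES, 2)]


def read_table(error_rate):
    """Hypothetical read prediction table, indexed table[observed_base][genotype].

    Assumption: each allele is sampled with probability 0.5 and a miscalled
    base is equally likely to be any of the three other bases.
    """
    def p_obs(obs, allele):
        return 1.0 - error_rate if obs == allele else error_rate / 3.0

    return {
        obs: {g: 0.5 * (p_obs(obs, g[0]) + p_obs(obs, g[1])) for g in GENOTYPES}
        for obs in BASES
    }


def prior_table(w_hom_ref=1000.0, w_het_ref=1.0, w_other=0.1):
    """Hypothetical target-sequence table, indexed table[reference_base][genotype].

    The weights are illustrative only; they are normalised so each row sums to 1.
    """
    table = {}
    for ref in BASES:
        weights = {}
        for g in GENOTYPES:
            if g == ref + ref:
                weights[g] = w_hom_ref
            elif ref in g:
                weights[g] = w_het_ref
            else:
                weights[g] = w_other
        total = sum(weights.values())
        table[ref] = {g: w / total for g, w in weights.items()}
    return table


def call_genotype(ref_base, observations, prior, hq_table, lq_table):
    """Return the genotype with the highest posterior score.

    observations: list of (observed_base, is_high_quality_read) pairs.
    Scores are log(prior) + sum of log(conditional), i.e. Bayes' rule
    up to a constant that is the same for every genotype.
    """
    scores = {}
    for g in GENOTYPES:
        score = math.log(prior[ref_base][g])
        for base, is_high_quality in observations:
            table = hq_table if is_high_quality else lq_table
            score += math.log(table[base][g])
        scores[g] = score
    return max(scores, key=scores.get), scores


prior = prior_table()
hq = read_table(error_rate=0.01)   # assumed error rate for high quality reads
lq = read_table(error_rate=0.10)   # assumed error rate for low quality reads

obs = [("A", True)] * 3 + [("G", True)] * 3
best, _ = call_genotype("A", obs, prior, hq, lq)
print(best)
```

Working in log space avoids numerical underflow when many reads cover one location; the choice between the two read tables per observation mirrors the "either ... or" wording of the claim.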


Abstract

Base calls for a target sequence may be identified relative to a reference sequence by using values from sequencing reads at locations satisfying a high-confidence condition to identify base calls at a given location not satisfying the high-confidence condition. The high-confidence condition may relate to the level of coverage by the sequencing reads at a location of the reference sequence. The quality of measurements of the sequencing reads may be incorporated into the base-call process.
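
As a rough illustration of a coverage-based high-confidence condition, the following Python sketch counts how many reads cover each reference position and calls a base only where the depth reaches a threshold. The read representation (a 0-based start position plus a gap-free sequence), the function names, and the use of a simple majority vote are all assumptions made for this example; the abstract does not prescribe them.

```python
from collections import Counter, defaultdict


def pileup(reads):
    """Collect the estimated base values covering each reference position.

    reads: iterable of (start, sequence) pairs, with a 0-based start and an
    assumed gap-free alignment to the reference.
    """
    columns = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append(base)
    return columns


def high_confidence_calls(reads, min_depth):
    """Return {position: base} for positions whose depth is at least min_depth.

    A majority vote stands in for the statistical method; it is only one
    possible choice.
    """
    calls = {}
    for pos, bases in pileup(reads).items():
        if len(bases) >= min_depth:
            calls[pos] = Counter(bases).most_common(1)[0][0]
    return calls


reads = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC"), (2, "GAAC")]
print(high_confidence_calls(reads, min_depth=3))
```

Positions that fall below the threshold are left out of the result; in the scheme the abstract describes, those are the locations whose base calls are later informed by the high-confidence ones.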

8 Claims

- - 2. The method of claim 1, wherein the hardware sequencing machine is next-generation sequencing equipment.
  - 3. The method of claim 2, wherein the next-generation sequencing equipment is an ILLUMINA sequencing machine.
  - 4. The method of claim 2, wherein the next-generation sequencing equipment is a high throughput sequencing machine.
  - 5. The method of claim 1, wherein the method is executed by a computer that is different and separate from the hardware sequencing machine.
  - 6. The method of claim 1, wherein a base value is selected as a high-confidence base value when the depth value associated with the base value is 50 or more.
  - 7. The method of claim 1, wherein identifying a base call included in the target sequence is further in accordance with a Bayesian prediction model, the Bayesian prediction model providing likelihood values that relate the base values of the sequencing reads at the high-confidence locations and the base values of the reference sequence at multiple locations with the one or more base calls for the target sequence at the given location.
  - 8. The method of claim 1, wherein the sequencing read data generated by the hardware sequencing machine have an error rate of at least 1% and base value differences between the target sequence and the reference sequence are less than 1%.
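
The Phred scores and error rates mentioned in the claims are related by the standard definition Q = -10 log10(p), so a 1% error rate corresponds to Q = 20. The Python sketch below converts a score to an error probability and separates reads into high and low quality groups by requiring every score in a read to be above a threshold. The function names, the read representation, and the threshold value of 20 are assumptions for illustration only; the claims do not state a specific threshold.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability (Q = -10 log10 p)."""
    return 10 ** (-q / 10)


def split_reads_by_quality(reads, threshold):
    """Split reads into (high_quality, low_quality) groups.

    reads: list of (sequence, phred_scores) pairs. A read counts as high
    quality only if every one of its scores is above the threshold.
    """
    high, low = [], []
    for read in reads:
        _, scores = read
        if all(q > threshold for q in scores):
            high.append(read)
        else:
            low.append(read)
    return high, low


reads = [
    ("ACGT", [30, 32, 35, 31]),
    ("ACGA", [30, 12, 35, 31]),  # one low score puts the whole read in the low group
]
high, low = split_reads_by_quality(reads, threshold=20)
print(len(high), len(low), phred_to_error_prob(20))
```

Grouping by whole reads rather than by individual bases follows the wording of claim 1, which tests whether every location within a read clears the threshold.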


Current Assignee: Accurascience, LLC
Original Assignee: Accurascience, LLC
Inventors: Li, Tongbin; Gong, Wuming; Rao, Jiang
Primary Examiner(s): Woitach, Joseph
Application Number: US14/358,643
Publication Number: US 20160026757A1
Time in Patent Office: 1,692 Days
Field of Search: None
US Class Current
CPC Class Codes

G16B 20/00   ICT specially adapted for f...

G16B 20/10   Ploidy or copy number detec...

G16B 20/20   Allele or variant detection...

G16B 30/00   ICT specially adapted for s...

G16B 40/00   ICT specially adapted for b...
