Detecting fetal sub-chromosomal aneuploidies

US 10,318,704 B2
Filed: 05/29/2015
Issued: 06/11/2019
Est. Priority Date: 05/30/2014
Status: Active Grant

Abstract

Disclosed are methods for determining copy number variation (CNV) known or suspected to be associated with a variety of medical conditions, including syndromes related to CNV of subchromosomal regions. In some embodiments, methods are provided for determining CNV of fetuses using maternal samples comprising maternal and fetal cell-free DNA. Some embodiments disclosed herein provide methods to improve the sensitivity and/or specificity of sequence data analysis by removing within-sample GC-content bias. In some embodiments, removal of within-sample GC-content bias is based on sequence data corrected for systematic variation common across unaffected training samples. In some embodiments, syndrome-related biases in sample data are also removed to increase the signal-to-noise ratio. Also disclosed are systems for evaluation of CNV of sequences of interest.
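
As a rough illustration of the two bias-removal stages the abstract mentions, the sketch below first divides a sample's per-bin coverage by a profile averaged over unaffected training samples, then removes the remaining trend against GC content. It is only a sketch under stated assumptions: the function names, the division-based correction, the number of GC levels, and the per-level median are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def normalize_counts(counts):
    """Convert raw tag counts per bin to coverage relative to the total count."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def remove_global_profile(test_cov, training_cov):
    """Divide test coverage by the per-bin mean coverage of the training set.

    training_cov has shape (n_training_samples, n_bins).
    Dividing (rather than subtracting) is an assumption of this sketch.
    """
    profile = np.asarray(training_cov, dtype=float).mean(axis=0)
    return np.asarray(test_cov, dtype=float) / profile

def remove_gc_trend(cov, gc, n_levels=10):
    """Divide each bin's coverage by the median coverage of bins at a similar
    GC level within the same sample (n_levels is an assumed parameter)."""
    cov = np.asarray(cov, dtype=float)
    gc = np.asarray(gc, dtype=float)
    # Split the observed GC range into equal-width levels.
    edges = np.linspace(gc.min(), gc.max(), n_levels + 1)
    level = np.clip(np.digitize(gc, edges) - 1, 0, n_levels - 1)
    corrected = cov.copy()
    for k in np.unique(level):
        mask = level == k
        corrected[mask] = cov[mask] / np.median(cov[mask])
    return corrected
```

Applying `remove_global_profile` before `remove_gc_trend` mirrors the order described in the abstract, where the GC correction operates on data already corrected for variation shared across training samples. Bins with zero median coverage would need separate handling, which this sketch omits.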

27 Claims

1. A method, implemented at a computer system that includes one or more processors and system memory, for evaluation of copy number of a sequence of interest in a test sample comprising nucleic acids, the method comprising:
- (a) receiving, by the computer system, sequence reads obtained by sequencing DNA in the test sample;
  
  (b) aligning, by the computer system, the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
  
  (c) determining, by the computer system, coverages of the test sequence tags for the bins in the reference genome including the sequence of interest;
  
  (d) adjusting, by the computer system, the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by;
  
  (i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
  
  (ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
  
  (iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
  
  (e) making, by the computer system, a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).
- Dependent claims (2 to 25):
  - 2. The method of claim 1, further comprising determining, based on (e), whether one or more of the genomes has a chromosomal aneuploidy.
  - 3. The method of claim 1, further comprising, before (d), adjusting the coverages of the test sequence tags by applying a global wave profile obtained from the training set, wherein the global wave profile comprises coverages of bins in the reference genome averaged across the training set.
  - 4. The method of claim 3, further comprising, before (d), adjusting the coverages of the test sequence tags based on a relation between GC content level and coverage among the bins of the test sample.
  - 5. The method of claim 1, wherein determining the coverages for the bins in (c) comprises normalizing counts of tags per bin with respect to a total number of sequence tags over all bins, and wherein the coverages adjusted in (d) are normalized coverages.
  - 6. The method of claim 1, wherein the bins outside the sequence of interest used in (d) are bins in one or more human autosomes other than chromosomes 13, 18, and 21.
  - 7. The method of claim 1, wherein the bins outside the sequence of interest are identified by determining correlation distances between a coverage in a bin under consideration within the sequence of interest and a coverage of each individual bin of the bins outside the sequence of interest.
  - 8. The method of claim 7, wherein the correlation distances are calculated as the distances between vectors of bin coverages created from samples of the training set.
  - 9. The method of claim 1, wherein selecting the training samples comprises identifying a cluster of the training samples in the training set.
  - 10. The method of claim 1, wherein obtaining the expected coverages comprises determining a central tendency of the coverages of the subset of the training set.
  - 11. The method of claim 1, further comprising repeating (d) for a number of iterations, wherein each iteration uses adjusted coverages from a previous iteration as the coverages to be adjusted in a current iteration, and wherein each iteration employs expected coverages obtained from a different subset of the training set.
  - 12. The method of claim 1, wherein adjusting the coverages of the test sequence tags for the bins in the reference genome in operation (d) comprises:
    - fitting a function to data points, each data point relating an expected coverage to a corresponding coverage for the test sample in a bin; and
      
      adjusting the coverages for the bins by applying the coverages for the bins to the function.
  - 13. The method of claim 12, wherein the function is a linear function.
  - 14. The method of claim 1, wherein adjusting the coverages of the test sequence tags in operation (d) comprises subtracting the expected coverages from the coverages of the test sequence tags.
  - 15. The method of claim 1, further comprising performing segmentation to determine start and end points of a syndrome specific region as the sequence of interest.
  - 16. The method of claim 1, wherein the test sample comprises a mixture of nucleic acids from two different genomes.
  - 17. The method of claim 1, wherein said DNA comprises cfDNA molecules.
  - 18. The method of claim 1, wherein the test sample comprises fetal and maternal cell-free nucleic acids.
  - 19. The method of claim 1, wherein the test sample comprises nucleic acids from cancerous and unaffected cells from the same subject.
  - 20. The method of claim 1, wherein the sequencing reads are obtained by an initial multiplex sequencing, further comprising:
    - determining that the test sample has a first value for calling a syndrome classification or a copy number variation higher than a first threshold;
      
      resequencing the test sample at a sequencing depth deeper than the initial multiplex sequencing to obtain resequenced data; and
      
      determining the syndrome classification or the copy number variation using the resequenced data.
  - 21. The method of claim 20, wherein determining the syndrome classification or the copy number variation using the resequenced data comprises:
    - obtaining a second value for calling a syndrome classification or a copy number variation from the resequenced data; and
      
      comparing the second value to a second threshold, wherein the second threshold is higher than the first threshold.
  - 22. The method of claim 20, wherein the test sample has the first value lower than a preset value, wherein the preset value is higher than the first threshold, and wherein samples lower than the first threshold are determined to be unaffected, samples higher than the preset value are determined to be affected, and samples ranging from the first threshold to the preset value are identified for resequencing.
  - 23. The method of claim 20, wherein the test sample's first value is relatively low compared to known affected samples.
  - 24. The method of claim 20, wherein the test sample's first value is lower than about 90% of known affected samples.
  - 25. The method of claim 1, wherein the genetic syndrome is selected from the group consisting of:
    - 1p36 deletion syndrome, Wolf-Hirschhorn syndrome, Cri-du-Chat syndrome, Angelman syndrome, Williams syndrome, and DiGeorge syndrome, and the method further comprises diagnosing the genetic syndrome.
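
The expected-coverage procedure of claim 1(d), combined with the central-tendency option of claim 10 and the subtraction option of claim 14, can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation. The function names, the use of Pearson correlation, the threshold values, the median-of-pairwise-correlations rule for samples, and the per-bin median are all assumptions; the claims only require "a first criterion" and "a second criterion" without fixing them.

```python
import numpy as np

def select_reference_bins(training_cov, target_bin, outside_bins, min_corr=0.8):
    """(d)(i): keep outside bins whose coverage across training samples
    correlates with the target bin at or above min_corr (assumed criterion).

    training_cov has shape (n_training_samples, n_bins).
    """
    target = training_cov[:, target_bin]
    keep = []
    for b in outside_bins:
        r = np.corrcoef(target, training_cov[:, b])[0, 1]
        if r >= min_corr:
            keep.append(b)
    return np.array(keep, dtype=int)

def select_training_subset(training_cov, ref_bins, min_corr=0.8):
    """(d)(ii): keep training samples whose median correlation with the other
    samples, over the reference bins, is at least min_corr (assumed criterion)."""
    corr = np.corrcoef(training_cov[:, ref_bins])  # sample-by-sample matrix
    np.fill_diagonal(corr, np.nan)                 # ignore self-correlation
    typical = np.nanmedian(corr, axis=1)
    return np.flatnonzero(typical >= min_corr)

def expected_coverage(training_cov, subset):
    """(d)(iii): per-bin median coverage over the selected training samples."""
    return np.median(training_cov[subset], axis=0)

def adjust_coverage(test_cov, expected):
    """Claim 14 variant: subtract the expected coverage from the test coverage."""
    return np.asarray(test_cov, dtype=float) - expected
```

Both selection steps exclude something by construction: bins that do not track the target bin are dropped in the first step, and training samples that disagree with the rest are dropped in the second. Claim 9 describes identifying a cluster of training samples instead; the median rule here is a simpler stand-in for that.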

26. A computer program product comprising a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement a method for evaluation of copy number of a sequence of interest, said program code comprising code for:
- (a) receiving sequence reads obtained by sequencing DNA in a test sample;
  
  (b) aligning the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
  
  (c) determining coverages of the test sequence tags for the bins in the reference genome, including the sequence of interest;
  
  (d) adjusting the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by;
  
  (i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
  
  (ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
  
  (iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
  
  (e) making a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).

27. A system for evaluation of copy number of a sequence of interest related to a genetic syndrome using a test sample comprising nucleic acids, the system comprising:
- a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the test sample;
  
  logic designed or configured to execute or cause the following operations;
  
  (a) receiving sequence reads obtained by sequencing DNA in the test sample;
  
  (b) aligning the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
  
  (c) determining coverages of the test sequence tags for the bins in the reference genome, including the sequence of interest;
  
  (d) adjusting the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by;
  
  (i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
  
  (ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
  
  (iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
  
  (e) making a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).
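
Claims 20 to 22 describe a two-stage decision in which borderline samples from an initial multiplex run are resequenced at greater depth. A minimal sketch of that triage logic is given below. The function names, the string labels, and the mapping of the second comparison to an "affected" or "unaffected" outcome are assumptions for illustration; claim 21 only states that the second value is compared to a second threshold that is higher than the first.

```python
def triage_initial(first_value, first_threshold, preset_value):
    """Classify a sample after the initial multiplex sequencing (claim 22).

    Below first_threshold: unaffected. Above preset_value: affected.
    Anything in between is flagged for deeper resequencing.
    """
    if not first_threshold < preset_value:
        raise ValueError("preset_value must be higher than first_threshold")
    if first_value < first_threshold:
        return "unaffected"
    if first_value > preset_value:
        return "affected"
    return "resequence"

def call_after_resequencing(second_value, second_threshold):
    """Compare the value from resequenced data to the second threshold
    (claim 21). Treating a value above the threshold as 'affected' is an
    assumption of this sketch."""
    return "affected" if second_value > second_threshold else "unaffected"
```

For example, with a hypothetical first threshold of 3.0 and preset value of 6.0, `triage_initial(4.0, 3.0, 6.0)` flags the sample for resequencing, after which `call_after_resequencing` is applied to the deeper data.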

Current Assignee: Verinata Health Incorporated (Illumina Incorporated)
Original Assignee: Verinata Health Incorporated (Illumina Incorporated)
Inventors: Chudova, Darya I.; Abdueva, Diana
Primary Examiner(s): Brusca, John S
Assistant Examiner(s): Wise, Olivia M
Application Number: US14/726,183
Publication Number: US 20160019338A1
Time in Patent Office: 1,474 Days
Field of Search: None

US Class Current
CPC Class Codes:
C12Q 1/6858   Allele-specific amplification
C12Q 1/6869   Methods for sequencing
C12Q 2535/122   Massive parallel sequencing
C12Q 2537/16   Assays for determining copy...
C12Q 2537/165   Mathematical modelling, e.g...
G16B 20/00   ICT specially adapted for f...
G16B 20/10   Ploidy or copy number detec...
G16B 20/20   Allele or variant detection...
G16B 30/00   ICT specially adapted for s...
G16B 30/10   Sequence alignment; Homolog...