Detecting fetal sub-chromosomal aneuploidies
Abstract
Disclosed are methods for determining copy number variation (CNV) known or suspected to be associated with a variety of medical conditions, including syndromes related to CNV of subchromosomal regions. In some embodiments, methods are provided for determining CNV of fetuses using maternal samples comprising maternal and fetal cell free DNA. Some embodiments disclosed herein provide methods to improve the sensitivity and/or specificity of sequence data analysis by removing within-sample GC-content bias. In some embodiments, removal of within-sample GC-content bias is based on sequence data corrected for systematic variation common across unaffected training samples. In some embodiments, syndrome related biases in sample data are also removed to increase signal to noise ratio. Also disclosed are systems for evaluation of CNV of sequences of interest.
27 Claims
1. A method, implemented at a computer system that includes one or more processors and system memory, for evaluation of copy number of a sequence of interest in a test sample comprising nucleic acids, the method comprising:
(a) receiving, by the computer system, sequence reads obtained by sequencing DNA in the test sample;
(b) aligning, by the computer system, the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
(c) determining, by the computer system, coverages of the test sequence tags for the bins in the reference genome including the sequence of interest;
(d) adjusting, by the computer system, the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by:
(i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
(ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
(iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
(e) making, by the computer system, a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).
Dependent claims: 2-25
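Steps (d)(i) to (d)(iii) of claim 1 can be illustrated with a minimal sketch. The claim does not fix what the first and second criteria are, how the expected coverage is summarized, or how the adjustment is applied, so everything concrete here is an assumption made for illustration: Pearson correlation, the two thresholds `bin_corr_min` and `sample_corr_min`, the use of the median pairwise correlation to judge a training sample, the per-bin median as the expected coverage, and division as the adjustment. The function names are hypothetical.

```python
import numpy as np

def expected_coverages(train, roi, bin_corr_min=0.5, sample_corr_min=0.5):
    """Sketch of steps (d)(i)-(iii); thresholds and statistics are assumptions.

    train: (n_samples, n_bins) coverages of unaffected training samples.
    roi:   boolean mask of length n_bins, True for bins in the sequence of interest.
    """
    train = np.asarray(train, dtype=float)
    roi = np.asarray(roi, dtype=bool)
    inside = np.flatnonzero(roi)
    outside = np.flatnonzero(~roi)

    # (i) Keep outside bins whose coverage, across training samples, correlates
    # with at least one bin in the sequence of interest (assumed criterion).
    bin_corr = np.corrcoef(train.T)                      # (n_bins, n_bins)
    best = bin_corr[np.ix_(outside, inside)].max(axis=1)
    selected_bins = outside[best >= bin_corr_min]

    # (ii) Keep training samples whose coverage profile over the selected bins
    # agrees with the other samples (assumed: median pairwise correlation).
    sample_corr = np.corrcoef(train[:, selected_bins])   # (n_samples, n_samples)
    np.fill_diagonal(sample_corr, np.nan)                # ignore self-correlation
    keep = np.nanmedian(sample_corr, axis=1) >= sample_corr_min

    # (iii) Expected coverage per bin from the retained samples (assumed: median).
    expected = np.median(train[keep], axis=0)
    return expected, selected_bins, keep

def adjust_coverages(test_cov, expected):
    """Assumed adjustment for step (d): ratio of observed to expected coverage."""
    return np.asarray(test_cov, dtype=float) / expected

# Toy data: bin 0 is the sequence of interest, bin 4 is anti-correlated with it,
# and the last training sample has a discordant profile over bins 1-3.
train = np.array([[10, 10, 20, 30, 9],
                  [12, 12, 24, 36, 7],
                  [14, 14, 28, 42, 5],
                  [20, 60, 50, 45, 1]])
roi = np.array([True, False, False, False, False])
expected, selected_bins, keep = expected_coverages(train, roi)
adjusted = adjust_coverages([6, 12, 24, 36, 7], expected)
```

In this toy input the anti-correlated bin is left out of the selected bins and the discordant training sample is left out of the subset, which is the exclusion behavior the claim requires of both criteria. An adjusted value near 0.5 in the bin of interest, with values near 1 elsewhere, is the kind of signal a call in step (e) could be based on; the calling rule itself is not sketched here.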
26. A computer program product comprising a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement a method for evaluation of copy number of a sequence of interest, said program code comprising code for:
(a) receiving sequence reads obtained by sequencing DNA in a test sample;
(b) aligning the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
(c) determining coverages of the test sequence tags for the bins in the reference genome, including the sequence of interest;
(d) adjusting the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by:
(i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
(ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
(iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
(e) making a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).
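Step (c), determining coverages of the sequence tags per bin, can be pictured with a short sketch. The claim language does not say how bins are sized or how coverage is quantified, so this assumes fixed-width bins on a single chromosome and takes coverage to be the raw count of tags whose start coordinate falls in each bin. The function name `bin_coverages` and its parameters are hypothetical.

```python
import numpy as np

def bin_coverages(tag_positions, chrom_length, bin_size):
    """Count aligned sequence tags per fixed-width bin on one chromosome.

    tag_positions: 0-based start coordinates of aligned tags (assumed input form).
    Returns an integer array with one count per bin.
    """
    n_bins = -(-chrom_length // bin_size)            # ceiling division
    idx = np.asarray(tag_positions, dtype=np.int64) // bin_size
    return np.bincount(idx, minlength=n_bins)

# Six tags on a 300 bp toy chromosome split into 100 bp bins.
counts = bin_coverages([0, 5, 99, 100, 250, 251], chrom_length=300, bin_size=100)
```

Raw counts are only one possible reading of "coverage"; a normalized quantity, such as counts divided by the sample's total tag count, would fit the same interface.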
27. A system for evaluation of copy number of a sequence of interest related to a genetic syndrome using a test sample comprising nucleic acids, the system comprising:
a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the test sample;
logic designed or configured to execute or cause the following operations:
(a) receiving sequence reads obtained by sequencing DNA in the test sample;
(b) aligning the sequence reads of the test sample to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins, wherein the sequence of interest is in a sub-chromosomal genomic region in which a copy number variation is associated with a genetic syndrome;
(c) determining coverages of the test sequence tags for the bins in the reference genome, including the sequence of interest;
(d) adjusting the coverages of the test sequence tags for the bins in the reference genome by employing expected coverages for the bins obtained from a subset of a training set of unaffected training samples sequenced and aligned in substantially the same manner as the test sample, wherein the expected coverages for the bins in the reference genome were obtained by:
(i) selecting a plurality of bins outside the sequence of interest, wherein each selected bin has a correlation in coverage meeting a first criterion with a bin in the sequence of interest, and wherein the first criterion excludes one or more bins outside the sequence of interest from being selected,
(ii) selecting training samples from the training set to form the subset of the training set, wherein the selected training samples have correlations meeting a second criterion with each other in their coverages in the plurality of bins outside the sequence of interest, and wherein the second criterion excludes one or more training samples from being selected, and
(iii) obtaining the expected coverages for the bins in the reference genome based on the subset of the training set's coverages in the bins in the reference genome; and
(e) making a call of the copy number variation of the sequence of interest in the test sample based on the adjusted coverages from (d).
Specification