Methods and systems for copy number variant detection
Abstract
Methods and systems for determining copy number variants are disclosed. An example method can comprise applying a sample grouping technique to select reference coverage data, normalizing sample coverage data comprising a plurality of genomic regions, and fitting a mixture model to the normalized sample coverage data based on the selected reference coverage data. An example method can comprise identifying one or more copy number variants (CNVs) according to a Hidden Markov Model (HMM) based on the normalized sample coverage data and the fitted mixture model. An example method can comprise outputting the one or more copy number variants.
28 Claims
1. A method comprising:
receiving, by a computing device, a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
grouping, by the computing device, sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
selecting, by the computing device, a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
normalizing, by the computing device, the sample coverage data set and the reference panel;
fitting, by the computing device, the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
identifying one or more copy number variants (CNVs) by comparing, by the computing device, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
Dependent claims: 2-20
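The grouping and selecting steps of claim 1 can be illustrated with a k-d tree, which is one common kind of multidimensional tree; the claim does not name a specific tree type, so this choice is an assumption. The sketch below is not the patented implementation. The function names (`build_kdtree`, `nearest_k`, `select_reference_panel`), the two-element metric vectors, the sample identifiers, and the Euclidean distance are all hypothetical, and the metrics are assumed to have been scaled to comparable ranges beforehand.

```python
from heapq import heappush, heappushpop
from math import dist


def build_kdtree(points, depth=0):
    """Build a k-d tree from (sqc_vector, reference_id) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])          # cycle through the metric dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                    # median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest_k(node, query, k, heap=None):
    """Collect the k references closest to `query` in a heap of (-distance, id)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    vec, ref_id = node["point"]
    item = (-dist(vec, query), ref_id)
    if len(heap) < k:
        heappush(heap, item)
    else:
        heappushpop(heap, item)               # drops the current farthest candidate
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    nearest_k(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the worst hit.
    if len(heap) < k or abs(diff) <= -heap[0][0]:
        nearest_k(far, query, k, heap)
    return heap


def select_reference_panel(tree, ssqc, k):
    """Return ids of the k references whose SQC metrics are closest to the SSQC metrics."""
    return [ref_id for _, ref_id in sorted(nearest_k(tree, ssqc, k), reverse=True)]


# Hypothetical, pre-scaled SQC vectors for five reference coverage data sets.
references = [
    ((0.10, 0.20), "ref_A"),
    ((0.12, 0.22), "ref_B"),
    ((0.80, 0.90), "ref_C"),
    ((0.15, 0.18), "ref_D"),
    ((0.75, 0.85), "ref_E"),
]
tree = build_kdtree(references)
panel = select_reference_panel(tree, (0.11, 0.20), k=3)
print(panel)
```

The tree is built once over all reference SQC vectors, so each new sample needs only a nearest-neighbor query rather than a comparison against every reference.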
21. A computer readable medium comprising processor-executable instructions adapted to cause one or more computing devices to:
receive, by a computing device, a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
group, by the computing device, sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
select, by the computing device, a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
normalize, by the computing device, the sample coverage data set and the reference panel;
fit, by the computing device, the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
identify one or more copy number variants (CNVs) by comparing, by the computing device, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
Dependent claims: 22, 23, 24
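The normalizing and fitting steps recited above can be sketched for a single genomic region as follows. The claims do not specify a normalization scheme or the form of the mixture model, so everything here is an assumption made for illustration: dividing each sample's read depths by its median, a two-component one-dimensional Gaussian mixture, fitting by expectation-maximization (EM), the function names, and the example panel values.

```python
from math import exp, pi, sqrt
from statistics import median


def normalize(depths):
    """Scale one sample's per-region read depths by its median (assumed scheme)."""
    m = median(depths)
    return [d / m for d in depths]


def gaussian_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def fit_mixture(values, means, sds, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    means, sds, weights = list(means), list(sds), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            p = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300          # guard against numerical underflow
            resp.append([q / total for q in p])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue                      # component has no support; leave as is
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            sds[j] = max(sqrt(var), 1e-3)     # floor keeps a component from collapsing
    return means, sds, weights


# Hypothetical normalized coverage of ten reference samples at one region:
# seven near 1.0 and three near 0.5.
panel_at_region = [0.98, 1.02, 1.00, 0.97, 1.03, 0.51, 0.49, 1.01, 0.99, 0.50]
means, sds, weights = fit_mixture(
    panel_at_region, means=(0.5, 1.0), sds=(0.1, 0.1), weights=(0.5, 0.5)
)
print(means, sds, weights)
```

Repeating the fit at every region yields one expected coverage distribution per region, which is what the identifying step compares the sample against.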
25. An apparatus, comprising:
one or more processors; and
a memory storing processor executable instructions that, when executed by the one or more processors, cause the apparatus to:
receive a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
group sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
select a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
normalize the sample coverage data set and the reference panel;
fit the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
identify one or more copy number variants (CNVs) by comparing, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
Dependent claims: 26, 27, 28
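The identifying step compares the normalized sample coverage with the per-region expected coverage distributions according to an HMM. A minimal sketch of one way to do this is Viterbi decoding over three hidden states. The claims do not fix the states, emission model, or probabilities, so the state names (`DEL`, `DIP`, `DUP`), the Gaussian emissions centered at 0.5, 1.0 and 1.5, the start and transition probabilities, and the helper names below are all assumptions.

```python
from math import log, pi

STATES = ("DEL", "DIP", "DUP")   # deletion, normal diploid, duplication (assumed)


def log_gaussian(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - log(sd) - 0.5 * log(2 * pi)


def viterbi(coverage, expected, start, trans):
    """Most likely state path; expected[t][state] is the (mean, sd) at region t."""
    score = [{s: log(start[s]) + log_gaussian(coverage[0], *expected[0][s]) for s in STATES}]
    back = []
    for t in range(1, len(coverage)):
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at region t.
            prev = max(STATES, key=lambda p: score[-1][p] + log(trans[p][s]))
            row[s] = (score[-1][prev] + log(trans[prev][s])
                      + log_gaussian(coverage[t], *expected[t][s]))
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):               # trace the back-pointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1]


def call_cnvs(path):
    """Merge consecutive non-diploid regions into (first, last, state) calls."""
    calls, begin = [], None
    for i, s in enumerate(path + ["DIP"]):   # sentinel closes a trailing run
        if begin is not None and s != path[begin]:
            calls.append((begin, i - 1, path[begin]))
            begin = None
        if begin is None and s != "DIP":
            begin = i
    return calls


# Hypothetical inputs: seven regions sharing the same expected distributions.
coverage = [1.02, 0.97, 0.52, 0.48, 0.51, 1.01, 0.99]
expected = [{"DEL": (0.5, 0.1), "DIP": (1.0, 0.1), "DUP": (1.5, 0.1)}] * len(coverage)
start = {"DEL": 0.01, "DIP": 0.98, "DUP": 0.01}
trans = {
    "DEL": {"DEL": 0.9, "DIP": 0.099, "DUP": 0.001},
    "DIP": {"DEL": 0.01, "DIP": 0.98, "DUP": 0.01},
    "DUP": {"DEL": 0.001, "DIP": 0.099, "DUP": 0.9},
}
path = viterbi(coverage, expected, start, trans)
print(path, call_cnvs(path))
```

Working in log space avoids underflow when many regions are chained, and the high self-transition probabilities favor contiguous calls over isolated single-region changes.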
Specification