Methods and systems for copy number variant detection

US 10,395,759 B2
Filed: 05/18/2015
Issued: 08/27/2019
Est. Priority Date: 05/18/2015
Status: Active Grant

1 Assignment

0 Petitions

Abstract

Methods and systems for determining copy number variants are disclosed. An example method can comprise applying a sample grouping technique to select reference coverage data, normalizing sample coverage data comprising a plurality of genomic regions, and fitting a mixture model to the normalized sample coverage data based on the selected reference coverage data. An example method can comprise identifying one or more copy number variants (CNVs) according to a Hidden Markov Model (HMM) based on the normalized sample coverage data and the fitted mixture model. An example method can comprise outputting the one or more copy number variants.
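
The reference-selection step named in the abstract is spelled out in claims 2, 4, and 16 to 18 as a nearest-neighbor lookup in a k-d tree over scaled QC metrics. The Python sketch below illustrates one way that lookup might work. The z-score scaling, the squared Euclidean distance, every function name, and the toy metric values are assumptions made for illustration and are not taken from the patent. The sketch also queries the tree with the sample's metrics instead of inserting them, which simplifies the wording of claim 4.

```python
import heapq
from statistics import mean, pstdev


def scale(rows, stats=None):
    """Z-score each column (assumed scaling); returns (scaled_rows, stats)."""
    if stats is None:
        stats = [(mean(c), pstdev(c) or 1.0) for c in zip(*rows)]
    scaled = [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]
    return scaled, stats


def build_kdtree(points, depth=0):
    """points: list of (vector, label). Node = (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, query, k, heap=None):
    """Keep the k closest points in a max-heap keyed on negated distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    (vec, label), axis, left, right = node
    dist = sum((a - b) ** 2 for a, b in zip(vec, query))
    item = (-dist, label)
    if len(heap) < k:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        heapq.heapreplace(heap, item)
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    nearest(near, query, k, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best distance (or fewer than k points are held).
    if len(heap) < k or diff ** 2 < -heap[0][0]:
        nearest(far, query, k, heap)
    return heap


def select_reference_panel(reference_sqc, sample_sqc, k):
    """Return labels of the k references whose SQC metrics are closest."""
    labels = list(reference_sqc)
    scaled, stats = scale([reference_sqc[name] for name in labels])
    (query,), _ = scale([sample_sqc], stats)  # reuse the reference scaling
    tree = build_kdtree(list(zip(scaled, labels)))
    heap = nearest(tree, query, k)
    return [label for _, label in sorted(heap, reverse=True)]


# Toy data: (mean depth, GC bias) per reference; values are made up.
reference_sqc = {
    "ref_a": (30.0, 0.40), "ref_b": (31.0, 0.41), "ref_c": (80.0, 0.60),
    "ref_d": (82.0, 0.62), "ref_e": (29.0, 0.39),
}
panel = select_reference_panel(reference_sqc, (30.4, 0.404), k=2)
print(panel)
```

Choosing `k` smaller than the number of available references mirrors claim 18, where a smaller panel is tied to reduced computational cost.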

28 Claims

1. A method comprising:
- receiving, by a computing device, a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
  
  grouping, by the computing device, sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
  
  selecting, by the computing device, a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
  
  normalizing, by the computing device, the sample coverage data set and the reference panel;
  
  fitting, by the computing device, the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
  
  identifying one or more copy number variants (CNVs) by comparing, by the computing device, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
  - 2. The method of claim 1, wherein selecting the reference panel of reference coverage data sets using the multidimensional tree data structure comprises:
    - defining a distance metric between the SSQC metrics and the SQC metrics of the reference coverage data sets; and
      
      selecting the reference panel of the reference coverage data sets based on the distance metric.
  - 3. The method of claim 1, wherein grouping the sets of SQC metrics comprises use of a clustering algorithm, a classification algorithm, or a combination thereof.
  - 4. The method of claim 1, wherein grouping the sets of SQC metrics comprises use of a k-Nearest Neighbors (knn) algorithm and wherein the method further comprises:
    - scaling the sets of SQC metrics of the reference coverage data sets;
      
      scaling the SSQC metrics;
      
      wherein grouping the sets of SQC metrics into the multidimensional tree data structure according to similarity comprises, generating a k-d tree based on the scaled sets of SQC metrics of the reference coverage data sets;
      
      adding the scaled SSQC metrics to the k-d tree; and
      
      wherein selecting the reference panel of reference coverage data sets using the multidimensional tree data structure comprises, identifying a predetermined number of nearest neighbors to the SSQC metrics as the selected reference coverage data sets.
  - 5. The method of claim 1, further comprising dividing the plurality of genomic regions into one or more calling windows.
  - 6. The method of claim 5, wherein normalizing the sample coverage data set comprises:
    - determining raw coverage for a calling window w;
      
      determining a median coverage for the sample coverage data set across the one or more calling windows conditional on a GC-fraction of the calling window w; and
      
      dividing the raw coverage by the median coverage, resulting in the normalized sample coverage data set.
  - 7. The method of claim 6, wherein determining a median coverage for the sample coverage data set across the plurality of windows conditional on the GC-fraction of the calling window w comprises:
    - binning the one or more calling windows by GC-fraction, resulting in a plurality of bins;
      
      determining a median coverage for each bin of the plurality of bins; and
      
      determining a normalizing factor for each distinct possible GC-fraction using a linear interpolation between the median coverage for two bins nearest to the calling window w.
  - 8. The method of claim 1, further comprising filtering the sample coverage data set.
  - 9. The method of claim 8, wherein filtering the sample coverage data set comprises:
    - filtering one or more calling windows based on a mappability score of a genomic region of the plurality of genomic regions; and
      
      filtering the one or more calling windows based on occurrence of a calling window in a multi-copy duplication genomic region.
  - 10. The method of claim 9, wherein filtering the one or more calling windows based on the mappability score comprises:
    - determining a mappability score for each genomic region of the plurality of genomic regions; and
      
      excluding a calling window of the one or more calling windows that contains the genomic region of the plurality of genomic regions if the mappability score of the genomic region of the plurality of genomic regions is below a predetermined threshold.
  - 11. The method of claim 9, wherein filtering the one or more calling windows based on occurrence of the calling window in a multi-copy duplication genomic region comprises:
    - excluding a calling window of the one or more calling windows if the calling window of the one or more calling windows occurs within a region where multi-copy duplications are known to be present.
  - 12. The method of claim 1, wherein fitting the normalized reference panel to the mixture model to determine the expected coverage distribution comprises:
    - determining a plurality of mixture models, one for each of the plurality of genomic regions, wherein each component of the plurality of mixture models comprises a probability distribution that represents an expected normalized coverage conditional on a particular copy number; and
      
      fitting the plurality of mixture models to the normalized reference panel data using an expectation-maximization algorithm to determine a likelihood for each copy number at each of the one or more calling windows, wherein the normalized reference panel is input to the expectation-maximization algorithm.
  - 13. The method of claim 12, wherein identifying one or more copy number variants (CNVs) by comparing, according to the HMM, the normalized sample coverage data set to an expected coverage distribution from the mixture model to identify one or more CNVs comprises:
    - inputting the normalized sample coverage data set for each calling window of the one or more calling windows into the HMM;
      
      determining one or more emission probabilities of the HMM based on the mixture model; and
      
      identifying a calling window of the one or more calling windows as a CNV if a maximum likelihood sequence of states of the calling window is non-diploid.
  - 14. The method of claim 13, wherein determining one or more emission probabilities of the HMM based on the mixture model comprises:
    - determining a probability of observing a normalized coverage value x, at a calling window w of the one or more calling windows, given HMM state s, based on a component of the mixture model for w that corresponds to state s.
  - 15. The method of claim 13, wherein identifying the calling window of the one or more calling windows as a CNV if a maximum likelihood sequence of states of the calling window is non-diploid comprises:
    - performing a Viterbi algorithm in a 5′ to 3′ direction on a genomic region of the plurality of genomic regions;
      
      performing the Viterbi algorithm in a 3′ to 5′ direction on the genomic region of the plurality of genomic regions; and
      
      identifying the calling window of the one or more calling windows as a CNV if the genomic region of the plurality of genomic regions associated with the calling window has a most-likely state of non-diploid in the 5′ to 3′ direction and the 3′ to 5′ direction.
  - 16. The method of claim 1, wherein the multidimensional tree data structure is a kd-tree data structure.
  - 17. The method of claim 1, wherein selecting the reference panel of reference coverage data sets using the multidimensional tree data structure comprises selecting a predetermined number of sets of SQC metrics from the multidimensional tree data structure and respective associated reference coverage data sets.
  - 18. The method of claim 17, wherein the predetermined number of sets of SQC metrics is less than a number of total reference coverage data sets thereby decreasing usage of a computational resource of one or more computing devices.
  - 19. The method of claim 1, further comprising sequencing the nucleic acid samples from the subject.
  - 20. The method of claim 1, wherein normalizing the sample coverage data set and the reference panel is performed via parallel processing.
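
Claims 6 and 7 describe normalizing each calling window by a median coverage that is conditional on the window's GC-fraction, with the normalizing factor interpolated linearly between the two nearest GC bins. A minimal Python sketch of that arithmetic follows. The equal-width bins, the use of bin centers as interpolation points, the clamping outside the outermost bins, and all names are assumptions for illustration. The sketch also assumes no bin has a median coverage of zero.

```python
from statistics import median


def gc_bin_medians(windows, n_bins):
    """windows: list of (gc_fraction, raw_coverage).

    Returns a sorted list of (bin_center, median_coverage) for non-empty bins.
    """
    bins = {}
    for gc, cov in windows:
        idx = min(int(gc * n_bins), n_bins - 1)  # gc == 1.0 goes in the last bin
        bins.setdefault(idx, []).append(cov)
    return sorted(((i + 0.5) / n_bins, median(v)) for i, v in bins.items())


def normalizing_factor(gc, medians):
    """Interpolate the expected coverage at `gc` between neighboring bins."""
    if len(medians) == 1 or gc <= medians[0][0]:
        return medians[0][1]   # clamp below the first bin center
    if gc >= medians[-1][0]:
        return medians[-1][1]  # clamp above the last bin center
    for (x0, y0), (x1, y1) in zip(medians, medians[1:]):
        if x0 <= gc <= x1:
            return y0 + (y1 - y0) * (gc - x0) / (x1 - x0)


def normalize(windows, n_bins=10):
    """Divide each window's raw coverage by its GC-conditional median."""
    medians = gc_bin_medians(windows, n_bins)
    return [cov / normalizing_factor(gc, medians) for gc, cov in windows]


# Toy windows: low-GC windows are covered about twice as deeply as high-GC ones.
windows = [(0.25, 100.0), (0.25, 120.0), (0.25, 80.0),
           (0.75, 50.0), (0.75, 60.0), (0.75, 40.0)]
normalized = normalize(windows, n_bins=2)
print(normalized)
```

After normalization the low-GC and high-GC windows sit on the same scale, so a diploid window is expected near 1.0 regardless of its GC content.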

21. A computer readable medium comprising processor-executable instructions adapted to cause one or more computing devices to:
- receive, by a computing device, a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
  
  group, by the computing device, sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
  
  select, by the computing device, a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
  
  normalize, by the computing device, the sample coverage data set and the reference panel;
  
  fit, by the computing device, the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
  
  identify one or more copy number variants (CNVs) by comparing, by the computing device, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
  - 22. The computer readable medium of claim 21, wherein the processor-executable instructions adapted to cause the one or more computing devices to select the reference panel of reference coverage data sets using the multidimensional tree data structure comprise processor-executable instructions adapted to cause the one or more computing devices to:
    - define a distance metric between the SSQC metrics and the sets of SQC metrics of the reference coverage data sets; and
      
      select the reference coverage data sets based on the distance metric.
  - 23. The computer readable medium of claim 21, further comprising sequencing the nucleic acid samples from the subject.
  - 24. The computer readable medium of claim 21, wherein the processor-executable instructions are further adapted to cause the one or more computing devices to normalize the sample coverage data set and the reference panel via parallel processing.
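
The fitting step recited in claims 1 and 21, and detailed in claim 12, fits a mixture model per genomic region to the normalized reference panel using expectation-maximization, with one component per copy number. The Python sketch below shows what such a fit might look like for a single window. Gaussian components, the three copy-number states, initial means of copy number divided by two, the standard-deviation floor, and all names are assumptions chosen for illustration; the patent text here does not fix any of them.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_mixture(values, copy_numbers=(1, 2, 3), n_iter=50, min_sigma=0.02):
    """EM for a 1-D Gaussian mixture over one window's panel coverages.

    Returns {copy_number: (weight, mean, sigma)}.
    """
    mus = [cn / 2 for cn in copy_numbers]  # coverage relative to diploid
    sigmas = [0.1] * len(copy_numbers)
    weights = [1 / len(copy_numbers)] * len(copy_numbers)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            p = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and spread of each component.
        for k in range(len(copy_numbers)):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            if nk > 1e-9:  # leave an unused component's shape unchanged
                mus[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
                var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, values)) / nk
                sigmas[k] = max(math.sqrt(var), min_sigma)
    return {cn: (weights[k], mus[k], sigmas[k]) for k, cn in enumerate(copy_numbers)}


# Toy panel for one window: mostly diploid, two deletions, one duplication.
panel_coverage = [0.98, 1.01, 1.03, 0.97, 1.0, 0.52, 0.49, 1.02, 0.99, 1.5]
fit = fit_mixture(panel_coverage)
for cn, (w, mu, sigma) in fit.items():
    print(cn, round(w, 2), round(mu, 3), round(sigma, 3))
```

The fitted component for each copy number can then serve as the expected coverage distribution for that state at that window, which is how claim 14 derives HMM emission probabilities.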

25. An apparatus, comprising:
- one or more processors; and
  
  a memory storing processor executable instructions that, when executed by the one or more processors, cause the apparatus to:
  
  receive a sample coverage data set comprising a plurality of genomic sequences obtained from sequencing of nucleic acid samples of a subject, and sample sequencing quality control (SSQC) metrics;
  
  group sets of sequencing quality control (SQC) metrics into a multidimensional tree data structure according to similarity, wherein each set of SQC metrics is associated with a respective reference coverage data set that comprises a plurality of genomic regions and read depths;
  
  select a reference panel of reference coverage data sets using the multidimensional tree data structure, wherein the selected reference coverage data sets have SQC metrics similar to the SSQC metrics;
  
  normalize the sample coverage data set and the reference panel;
  
  fit the normalized reference panel to a mixture model at each of the plurality of genomic regions to determine an expected coverage distribution at each of the plurality of genomic regions; and
  
  identify one or more copy number variants (CNVs) by comparing, according to a Hidden Markov Model (HMM), the normalized sample coverage data set to the expected coverage distribution at each of the plurality of genomic regions from the mixture model.
  - 26. The apparatus of claim 25, wherein the processor executable instructions that, when executed by the one or more processors, cause the apparatus to select the reference panel of reference coverage data sets using the multidimensional tree data structure comprise processor executable instructions that, when executed by the one or more processors, cause the apparatus to:
    - define a distance metric between the SSQC metrics and the sets of SQC metrics of the reference coverage data sets; and
      
      select the reference coverage data sets based on the distance metric.
  - 27. The apparatus of claim 25, further comprising sequencing the nucleic acid samples from the subject.
  - 28. The apparatus of claim 25, wherein the processor executable instructions, when executed by the one or more processors, cause the apparatus to normalize the sample coverage data set and the reference panel via parallel processing.
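
The identifying step shared by claims 1, 21, and 25 is refined in claims 13 to 15: emission probabilities come from the per-window mixture components, and a window is called as a CNV only when the most likely state is non-diploid in both a 5′ to 3′ and a 3′ to 5′ Viterbi pass. The Python sketch below shows how that agreement rule might be applied. The three state labels, Gaussian emissions, the fixed transition and start probabilities, and all names are assumptions for illustration, and the mixture components are taken as already fitted.

```python
import math

STATES = ("DEL", "DIP", "DUP")  # hypothetical labels for 1, 2, and 3 copies


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(obs, components, stay=0.9, start=(0.05, 0.9, 0.05)):
    """Most likely state path.

    obs[w] is the normalized coverage at window w; components[w][s] is the
    (mean, sigma) of the mixture component for state s at window w.
    """
    n = len(STATES)
    log_trans = [[math.log(stay if i == j else (1 - stay) / (n - 1))
                  for j in range(n)] for i in range(n)]
    score = [math.log(start[s]) + log_gauss(obs[0], *components[0][s])
             for s in range(n)]
    back = []
    for w in range(1, len(obs)):
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + log_gauss(obs[w], *components[w][s]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return [STATES[s] for s in reversed(path)]


def call_cnv_windows(obs, components):
    """True where both the forward and the reverse pass are non-diploid."""
    forward = viterbi(obs, components)
    backward = viterbi(obs[::-1], components[::-1])[::-1]
    return [f != "DIP" and b != "DIP" for f, b in zip(forward, backward)]


# Toy region: seven windows with a three-window drop to half coverage.
obs = [1.0, 1.02, 0.5, 0.48, 0.52, 1.0, 0.98]
components = [[(0.5, 0.1), (1.0, 0.1), (1.5, 0.1)] for _ in obs]
calls = call_cnv_windows(obs, components)
print(calls)
```

In this toy setup the transition matrix is symmetric, so the two passes usually agree; the sketch is only meant to show where the two-direction check would sit.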

Current Assignee
Regeneron Pharmaceuticals Incorporated
Original Assignee
Regeneron Pharmaceuticals Incorporated
Inventors
Reid, Jeffrey; Habegger, Lukas; Packer, Jonathan; Maxwell, Evan
Primary Examiner(s)
Skibinsky, Anna

Application Number

US14/714,949
Publication Number

US 20160342733A1
Time in Patent Office

1,562 Days
Field of Search

None
CPC Class Codes

C12Q 1/6869   Methods for sequencing

G16B 20/00   ICT specially adapted for f...

G16B 20/10   Ploidy or copy number detec...

G16B 20/20   Allele or variant detection...

G16B 30/00   ICT specially adapted for s...

G16B 40/00   ICT specially adapted for b...

G16B 40/30   Unsupervised data analysis
