System and methods for detecting genetic variation

US 9,092,401 B2
Filed: 10/31/2012
Issued: 07/28/2015
Est. Priority Date: 10/31/2012
Status: Active Grant

Abstract

The invention provides methods, apparatuses, and compositions for high-throughput amplification sequencing of specific target sequences in one or more samples. In some aspects, barcode-tagged polynucleotides are sequenced simultaneously and sample sources are identified on the basis of barcode sequences. In some aspects, sequencing data are used to determine one or more genotypes at one or more loci comprising a causal genetic variant. In some aspects, systems and methods of detecting genetic variation are provided.

103 Claims

1. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing a plurality of clusters of polynucleotides, wherein (i) each cluster comprises multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprises a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence; and
  
  (vi) each first molecule comprises a barcode sequence;
  
  (b) sequencing sequence G′ by extension of a first primer comprising sequence D to produce an R1 sequence for each cluster;
  
  (c) sequencing sequence B′ by extension of a second primer comprising sequence A to produce R2 sequence for each cluster;
  
  (d) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (e) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (f) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;
  
  (g) transmitting a report identifying sequence variation identified by steps (d) to (f) to a receiver; and
  
  (h) hybridizing a third primer to sequence C′ and sequencing the barcode sequence by extension of the third primer to produce a barcode sequence for each cluster.
- - 2. The method of claim 1, wherein the first reference sequence comprises a reference genome.
  - 3. The method of claim 1, wherein the second reference sequence consists of every sequence B for every different target polynucleotide.
  - 4. The method of claim 1, wherein R2 sequences are aligned independently of R1 sequences.
  - 5. The method of claim 1, further comprising discarding an R1 sequence that aligns to a first position in the first reference sequence that is more than 10,000 base pairs away from a second position in the first reference sequence to which the R2 sequence for the same cluster aligns.
  - 6. The method of claim 1, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of sequence B′ for that cluster and sequence G is shorter than the R1 sequence for that cluster.
  - 7. The method of claim 1, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of any sequence B′, the portion includes either the 5′ or 3′ nucleotide of R1, and either (i) no R2 sequence was produced for the cluster or (ii) R2 sequence produced is not identical to any sequence B.
  - 8. The method of claim 1, wherein performing the first alignment with a system using the first algorithm takes less time to align all R1 reads than would be taken if the system used the second algorithm to perform the first alignment.
  - 9. The method of claim 1, wherein performing the first alignment with a system using the first algorithm uses less system memory to align all R1 reads than would be used if the system used the second algorithm to perform the first alignment.
  - 10. The method of claim 1, wherein said first algorithm is based on Burrows-Wheeler transform.
  - 11. The method of claim 1, wherein said second algorithm is based on Smith-Waterman algorithm or a hash function.
  - 12. The method of claim 1, wherein R1 and R2 sequences are generated for at least 100 different target polynucleotides.
  - 13. The method of claim 1, wherein each barcode differs from every other barcode in a plurality of different barcodes analyzed in parallel.
  - 14. The method of claim 1, wherein the barcode sequence is associated with a single sample in a pool of samples sequenced in a single reaction.
  - 15. The method of claim 1, wherein each of a plurality of barcode sequences is uniquely associated with a single sample in a pool of samples sequenced in a single reaction.
  - 16. The method of claim 1, wherein the barcode sequence is located 5′ from sequence D′.
  - 17. The method of claim 1, further comprising grouping sequences from the clusters based on the barcode sequences.
  - 18. The method of claim 17, further comprising discarding all but one of a plurality of R1 sequences having the same sequence and alignment within a barcode sequence grouping.
  - 19. The method of claim 1, wherein sequences A, B, C, and D are at least 5 nucleotides in length.
  - 20. The method of claim 1, wherein sequence G of every cluster is 1 to 1000 nucleotides in length.
  - 21. The method of claim 1, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a causal genetic variant or a sequence within 200 nucleotides of a causal genetic variant.
  - 22. The method of claim 1, wherein an R1 sequence is produced for at least about 10⁸ clusters in a single reaction.
  - 23. The method of claim 1, wherein presence, absence, or allele ratio of one or more causal genetic variants is determined with an accuracy of at least about 90%.
  - 24. The method of claim 1, wherein the consensus sequence identifies an insertion, a deletion, or an insertion and a deletion in a target polynucleotide with an accuracy of at least about 90%.
  - 25. The method of claim 1, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a non-subject sequence or a sequence within 200 nucleotides of a non-subject sequence.
  - 26. The method of claim 1, wherein the presence or absence of one or more non-subject sequences is determined with an accuracy of at least about 90%.
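
The strand layout recited in claim 1(a)(ii)-(iii) can be checked with a short sketch. It assumes that a primed name (for example G′) denotes the reverse complement of the unprimed sequence, so that both strands read 5′ to 3′; the helper names `revcomp` and `build_duplex` and the example sequences are hypothetical and do not come from the patent.

```python
# Assumption: X' is the reverse complement of X, both written 5'->3'.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_duplex(a: str, b: str, g: str, d: str, c: str) -> tuple[str, str]:
    """Assemble both strands of the duplex, each written 5'->3'."""
    first = a + b + revcomp(g) + revcomp(d) + revcomp(c)   # A-B-G'-D'-C'
    second = c + d + g + revcomp(b) + revcomp(a)           # C-D-G-B'-A'
    return first, second


# Illustrative segments only; real adapters and probes are much longer.
first, second = build_duplex("AACC", "GGTA", "TTGACA", "CATG", "ACGG")

# The two molecules are reverse complements, so they can form a duplex.
print(revcomp(first) == second)
```

Under this reading, a primer comprising sequence D anneals to D′ on the first molecule and its extension reads out G, while a primer comprising sequence A anneals to A′ on the second molecule and reads out B.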

27. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing sequencing data for a plurality of clusters of polynucleotides, wherein (i) each cluster comprised multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprised a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence;
  
  (vi) the sequencing data comprise R1 sequences generated by extension of a first primer comprising sequence D;
  
  (vii) the sequencing data comprise R2 sequences generated by extension of a second primer comprising sequence A;
  
  (viii) each first molecule comprises a barcode sequence; and
  
  (ix) the sequencing data comprise a barcode sequence for each cluster generated by extension of a third primer comprising sequence C;
  
  (b) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (c) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (d) performing an R2 alignment by aligning all R2 sequences to a second reference sequence; and
  
  (e) transmitting a report identifying sequence variation identified by steps (b) to (d) to a receiver.
- - 28. The method of claim 27, wherein the first reference sequence comprises a reference genome.
  - 29. The method of claim 27, wherein the second reference sequence consists of every sequence B for every different target polynucleotide.
  - 30. The method of claim 27, wherein R2 sequences are aligned independently of R1 sequences.
  - 31. The method of claim 27, further comprising discarding an R1 sequence that aligns to a first position in the first reference sequence that is more than 10,000 base pairs away from a second position in the first reference sequence to which the R2 sequence for the same cluster aligns.
  - 32. The method of claim 27, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of sequence B′ for that cluster and sequence G is shorter than the R1 sequence for that cluster.
  - 33. The method of claim 27, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of any sequence B′, the portion includes either the 5′ or 3′ nucleotide of R1, and either (i) no R2 sequence was produced for the cluster or (ii) R2 sequence produced is not identical to any sequence B.
  - 34. The method of claim 27, wherein performing the first alignment with a system using the first algorithm takes less time to align all R1 reads than would be taken if the system used the second algorithm to perform the first alignment.
  - 35. The method of claim 27, wherein performing the first alignment with a system using the first algorithm uses less system memory to align all R1 reads than would be used if the system used the second algorithm to perform the first alignment.
  - 36. The method of claim 27, wherein said first algorithm is based on Burrows-Wheeler transform.
  - 37. The method of claim 27, wherein said second algorithm is based on Smith-Waterman algorithm or a hash function.
  - 38. The method of claim 27, wherein the sequencing data comprise R1 and R2 sequences for at least 100 different target polynucleotides.
  - 39. The method of claim 27, wherein each barcode differs from every other barcode in a plurality of different barcodes analyzed in parallel.
  - 40. The method of claim 27, wherein the barcode sequence is associated with a single sample in a pool of samples sequenced in a single reaction and represented in the sequencing data.
  - 41. The method of claim 27, wherein each of a plurality of barcode sequences is uniquely associated with a single sample in a pool of samples sequenced in a single reaction.
  - 42. The method of claim 27, wherein the barcode sequence is located 5′ from sequence D′.
  - 43. The method of claim 27, further comprising grouping sequences from the clusters based on the barcode sequences.
  - 44. The method of claim 43, further comprising discarding all but one of a plurality of R1 sequences having the same sequence and alignment within a barcode sequence grouping.
  - 45. The method of claim 27, wherein sequences A, B, C, and D are at least 5 nucleotides in length.
  - 46. The method of claim 27, wherein sequence G of every cluster is 1 to 1000 nucleotides in length.
  - 47. The method of claim 27, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a causal genetic variant or a sequence within 200 nucleotides of a causal genetic variant.
  - 48. The method of claim 27, wherein sequencing data comprise at least about 10⁸ R1 sequences from a single reaction.
  - 49. The method of claim 27, wherein presence, absence, or allele ratio of one or more causal genetic variants is determined with an accuracy of at least about 90%.
  - 50. The method of claim 27, wherein the consensus sequence identifies an insertion, a deletion, or an insertion and a deletion in a target polynucleotide with an accuracy of at least about 90%.
  - 51. The method of claim 27, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a non-subject sequence or a sequence within 200 nucleotides of a non-subject sequence.
  - 52. The method of claim 27, wherein the presence or absence of one or more non-subject sequences is determined with an accuracy of at least about 90%.
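
Claim 37 names the Smith-Waterman algorithm as one possible basis for the second, local alignment. The following is a minimal score-only sketch of that algorithm, assuming a linear gap penalty; the function name and the scoring values are illustrative assumptions and are not taken from the patent.

```python
def smith_waterman_score(read: str, ref: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between read and ref.

    Scoring values are illustrative assumptions (linear gap penalty).
    """
    prev = [0] * (len(ref) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + step,   # extend with a match or mismatch
                prev[j] + gap,        # gap in the reference
                curr[j - 1] + gap,    # gap in the read
            )
            best = max(best, curr[j])
        prev = curr
    return best


# A read with one base deleted relative to the reference still aligns end to
# end, paying a single gap penalty instead of being split in two.
print(smith_waterman_score("ACGTGCA", "ACGTTGCA"))
```

Filling the full matrix costs time proportional to the product of the two lengths, which is consistent with claims 34 and 35 reserving the slower algorithm for reads already flagged as likely to contain an insertion or deletion.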

53. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing a plurality of clusters of polynucleotides, wherein (i) each cluster comprises multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprises a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′,
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence; and
  
  (vi) each first molecule comprises a barcode sequence;
  
  (b) sequencing sequence G′ by extension of a first primer comprising sequence D to produce an R1 sequence for each cluster;
  
  (c) sequencing sequence B′ by extension of a second primer comprising sequence A to produce R2 sequence for each cluster;
  
  (d) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (e) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (f) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;
  
  (g) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (d) to (f), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait;
  
  (h) transmitting the report to a receiver; and
  
  (i) hybridizing a third primer to sequence C′ and sequencing the barcode sequence by extension of the third primer to produce a barcode sequence for each cluster.
- - 54. The method of claim 53, wherein the first reference sequence comprises a reference genome.
  - 55. The method of claim 53, wherein the second reference sequence consists of every sequence B for every different target polynucleotide.
  - 56. The method of claim 53, wherein R2 sequences are aligned independently of R1 sequences.
  - 57. The method of claim 53, further comprising discarding an R1 sequence that aligns to a first position in the first reference sequence that is more than 10,000 base pairs away from a second position in the first reference sequence to which the R2 sequence for the same cluster aligns.
  - 58. The method of claim 53, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of sequence B′ for that cluster and sequence G is shorter than the R1 sequence for that cluster.
  - 59. The method of claim 53, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of any sequence B′, the portion includes either the 5′ or 3′ nucleotide of R1, and either (i) no R2 sequence was produced for the cluster or (ii) R2 sequence produced is not identical to any sequence B.
  - 60. The method of claim 53, wherein performing the first alignment with a system using the first algorithm takes less time to align all R1 reads than would be taken if the system used the second algorithm to perform the first alignment.
  - 61. The method of claim 53, wherein performing the first alignment with a system using the first algorithm uses less system memory to align all R1 reads than would be used if the system used the second algorithm to perform the first alignment.
  - 62. The method of claim 53, wherein said first algorithm is based on Burrows-Wheeler transform.
  - 63. The method of claim 53, wherein said second algorithm is based on Smith-Waterman algorithm or a hash function.
  - 64. The method of claim 53, wherein R1 and R2 sequences are generated for at least 100 different target polynucleotides.
  - 65. The method of claim 53, wherein each barcode differs from every other barcode in a plurality of different barcodes analyzed in parallel.
  - 66. The method of claim 53, wherein the barcode sequence is associated with a single sample in a pool of samples sequenced in a single reaction.
  - 67. The method of claim 53, wherein each of a plurality of barcode sequences is uniquely associated with a single sample in a pool of samples sequenced in a single reaction.
  - 68. The method of claim 53, wherein the barcode sequence is located 5′ from sequence D′.
  - 69. The method of claim 53, wherein sequences A, B, C, and D are at least 5 nucleotides in length.
  - 70. The method of claim 53, wherein sequence G of every cluster is 1 to 1000 nucleotides in length.
  - 71. The method of claim 53, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a causal genetic variant or a sequence within 200 nucleotides of a causal genetic variant.
  - 72. The method of claim 53, wherein an R1 sequence is produced for at least about 10⁸ clusters in a single reaction.
  - 73. The method of claim 53, wherein presence, absence, or allele ratio of one or more causal genetic variants is determined with an accuracy of at least about 90%.
  - 74. The method of claim 53, wherein the consensus sequence identifies an insertion, a deletion, or an insertion and a deletion in a target polynucleotide with an accuracy of at least about 90%.
  - 75. The method of claim 53, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a non-subject sequence or a sequence within 200 nucleotides of a non-subject sequence.
  - 76. The method of claim 53, wherein the presence or absence of one or more non-subject sequences is determined with an accuracy of at least about 90%.
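
Step (g) of claim 53 recites calculating probabilities that a subject's offspring will have a disease or trait, but the claim does not say how. As one illustrative assumption only, the sketch below applies simple Mendelian inheritance to a single autosomal recessive variant; the function names and the inheritance model are hypothetical and are not taken from the patent.

```python
from fractions import Fraction


def transmit_prob(variant_alleles: int) -> Fraction:
    """Probability that a parent passes on the variant allele.

    variant_alleles is the number of variant copies the parent carries:
    0 (non-carrier), 1 (heterozygous carrier) or 2 (homozygous).
    """
    if variant_alleles not in (0, 1, 2):
        raise ValueError("variant_alleles must be 0, 1 or 2")
    return Fraction(variant_alleles, 2)


def offspring_affected_prob(parent1: int, parent2: int) -> Fraction:
    """Probability that a child inherits two variant copies.

    Assumes a fully penetrant autosomal recessive condition and
    independent transmission from each parent.
    """
    return transmit_prob(parent1) * transmit_prob(parent2)


# Two heterozygous carriers: each transmits the variant half of the time.
print(offspring_affected_prob(1, 1))
```

A real report would need to account for genotype-call uncertainty, penetrance, and untested variants, none of which this sketch models.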

77. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing a plurality of clusters of polynucleotides, wherein (i) each cluster comprises multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprises a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence; and
  
  (vi) each first molecule comprises a barcode sequence;
  
  (b) sequencing sequence G′ by extension of a first primer comprising sequence D to produce an R1 sequence for each cluster;
  
  (c) sequencing sequence B′ by extension of a second primer comprising sequence A to produce R2 sequence for each cluster;
  
  (d) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (e) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (f) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;
  
  (g) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (d) to (f), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait;
  
  (h) transmitting the report to a receiver;
  
  (i) hybridizing a third primer to sequence C′ and sequencing the barcode sequence by extension of the third primer to produce a barcode sequence for each cluster; and
  
  (j) grouping sequences from the clusters based on the barcode sequences.
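
Step (j) of claim 77 groups the per-cluster sequences by barcode. A minimal sketch of such grouping follows, assuming each cluster is represented as a dictionary with hypothetical keys `barcode`, `r1` and `r2`, and assuming exact barcode matching; the patent's claims do not prescribe a data layout or an error-tolerant matching scheme.

```python
from collections import defaultdict


def group_by_barcode(clusters: list[dict]) -> dict[str, list[dict]]:
    """Group cluster records by their barcode read (exact match only)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for cluster in clusters:
        groups[cluster["barcode"]].append(cluster)
    return dict(groups)


# Illustrative records; the keys and sequences are assumptions.
clusters = [
    {"barcode": "ACGTAC", "r1": "TTGACA", "r2": "GGTA"},
    {"barcode": "TGCATG", "r1": "CCATGA", "r2": "GGTA"},
    {"barcode": "ACGTAC", "r1": "AGGCTT", "r2": "CATC"},
]

for barcode, members in group_by_barcode(clusters).items():
    print(barcode, len(members))
```

When each barcode is associated with a single sample, as in claim 66, the resulting groups correspond to the pooled samples.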

78. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing a plurality of clusters of polynucleotides, wherein (i) each cluster comprises multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprises a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence; and
  
  (vi) each first molecule comprises a barcode sequence;
  
  (b) sequencing sequence G′ by extension of a first primer comprising sequence D to produce an R1 sequence for each cluster;
  
  (c) sequencing sequence B′ by extension of a second primer comprising sequence A to produce R2 sequence for each cluster;
  
  (d) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (e) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (f) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;
  
  (g) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (d) to (f), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait;
  
  (h) transmitting the report to a receiver;
  
  (i) hybridizing a third primer to sequence C′ and sequencing the barcode sequence by extension of the third primer to produce a barcode sequence for each cluster;
  
  (j) grouping sequences from the clusters based on the barcode sequences; and
  
  (k) discarding all but one of a plurality of R1 sequences having the same sequence and alignment within a barcode sequence grouping.
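
Step (k) of claim 78 keeps one representative of any R1 sequences that share the same sequence and alignment within a barcode group. A minimal sketch, assuming each read is a tuple of barcode, R1 sequence and alignment start position; this tuple layout and the function name are hypothetical, and the first occurrence is arbitrarily chosen as the one to keep.

```python
def dedup_r1(reads: list[tuple[str, str, int]]) -> list[tuple[str, str, int]]:
    """Keep the first read for each (barcode, R1 sequence, alignment start)."""
    seen: set[tuple[str, str, int]] = set()
    kept = []
    for read in reads:
        if read not in seen:   # identical barcode, sequence and position
            seen.add(read)
            kept.append(read)
    return kept


reads = [
    ("ACGTAC", "TTGACA", 100),
    ("ACGTAC", "TTGACA", 100),   # duplicate within the same barcode group
    ("TGCATG", "TTGACA", 100),   # same read, different barcode: kept
    ("ACGTAC", "TTGACA", 101),   # same read, different alignment: kept
]

print(len(dedup_r1(reads)))
```

Because the barcode is part of the key, identical reads from different samples are never collapsed into one.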

79. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing sequencing data for a plurality of clusters of polynucleotides, wherein (i) each cluster comprised multiple copies of a nucleic acid duplex attached to a support;
  
  (ii) each duplex in a cluster comprised a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;
  
  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;
  
  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;
  
  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence;
  
  (vi) the sequencing data comprise R1 sequences generated by extension of a first primer comprising sequence D;
  
  (vii) the sequencing data comprise R2 sequences generated by extension of a second primer comprising sequence A, (viii) each first molecule comprises a barcode sequence, and (ix) wherein the sequencing data further comprises a barcode sequence for each cluster generated by extension of a third primer comprising sequence C;
  
  (b) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;
  
  (c) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;
  
  (d) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;
  
  (e) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (b) to (d), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait; and
  
  (f) transmitting the report to a receiver.
- - 80. The method of claim 79, wherein the first reference sequence comprises a reference genome.
  - 81. The method of claim 79, wherein the second reference sequence consists of every sequence B for every different target polynucleotide.
  - 82. The method of claim 79, wherein R2 sequences are aligned independently of R1 sequences.
  - 83. The method of claim 79, further comprising discarding an R1 sequence that aligns to a first position in the first reference sequence that is more than 10,000 base pairs away from a second position in the first reference sequence to which the R2 sequence for the same cluster aligns.
  - 84. The method of claim 79, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of sequence B′ for that cluster and sequence G is shorter than the R1 sequence for that cluster.
  - 85. The method of claim 79, further comprising deleting a portion of an R1 sequence for a cluster when the portion of R1 sequence to be deleted is identical to at least a portion of any sequence B′, the portion includes either the 5′ or 3′ nucleotide of R1, and either (i) no R2 sequence was produced for the cluster or (ii) R2 sequence produced is not identical to any sequence B.
  - 86. The method of claim 79, wherein performing the first alignment with a system using the first algorithm takes less time to align all R1 reads than would be taken if the system used the second algorithm to perform the first alignment.
  - 87. The method of claim 79, wherein performing the first alignment with a system using the first algorithm uses less system memory to align all R1 reads than would be used if the system used the second algorithm to perform the first alignment.
  - 88. The method of claim 79, wherein said first algorithm is based on Burrows-Wheeler transform.
  - 89. The method of claim 79, wherein said second algorithm is based on Smith-Waterman algorithm or a hash function.
  - 90. The method of claim 79, wherein the sequencing data comprise R1 and R2 sequences for at least 100 different target polynucleotides.
  - 91. The method of claim 79, wherein each barcode differs from every other barcode in a plurality of different barcodes analyzed in parallel.
  - 92. The method of claim 79, wherein the barcode sequence is associated with a single sample in a pool of samples sequenced in a single reaction and represented in the sequencing data.
  - 93. The method of claim 79, wherein each of a plurality of barcode sequences is uniquely associated with a single sample in a pool of samples sequenced in a single reaction.
  - 94. The method of claim 79, wherein the barcode sequence is located 5′ from sequence D′.
  - 95. The method of claim 79, wherein sequences A, B, C, and D are at least 5 nucleotides in length.
  - 96. The method of claim 79, wherein sequence G of every cluster is 1 to 1000 nucleotides in length.
  - 97. The method of claim 79, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a causal genetic variant or a sequence within 200 nucleotides of a causal genetic variant.
  - 98. The method of claim 79, wherein presence, absence, or allele ratio of one or more causal genetic variants is determined with an accuracy of at least about 90%.
  - 99. The method of claim 79, wherein the consensus sequence identifies an insertion, a deletion, or an insertion and a deletion in a target polynucleotide with an accuracy of at least about 90%.
  - 100. The method of claim 79, wherein each probe sequence B of a plurality of clusters is complementary to a sequence comprising a non-subject sequence or a sequence within 200 nucleotides of a non-subject sequence.
  - 101. The method of claim 79, wherein the presence or absence of one or more non-subject sequence is determined with an accuracy of at least about 90%.
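
The discard rule of claim 83 can be illustrated with a minimal sketch. The record type `ClusterAlignment`, its field names, and the helper functions are hypothetical and not taken from the patent; only the 10,000 base pair threshold comes from the claim. The handling of clusters that lack an R2 position, and of R1/R2 pairs on different chromosomes, is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_R1_R2_DISTANCE = 10_000  # base pairs; the threshold stated in claim 83


@dataclass
class ClusterAlignment:
    """Hypothetical record of where one cluster's reads aligned."""
    cluster_id: str
    chrom_r1: str
    pos_r1: int                # R1 position in the first reference sequence
    chrom_r2: Optional[str]    # None when no R2 position is available
    pos_r2: Optional[int]      # R2 position in the first reference sequence


def keep_r1(aln: ClusterAlignment, max_distance: int = MAX_R1_R2_DISTANCE) -> bool:
    """Return False when R1 aligns more than max_distance bp away from R2."""
    if aln.chrom_r2 is None or aln.pos_r2 is None:
        # Assumption: clusters without an R2 position are not filtered here.
        return True
    if aln.chrom_r1 != aln.chrom_r2:
        # Assumption: different chromosomes are treated as too far apart.
        return False
    return abs(aln.pos_r1 - aln.pos_r2) <= max_distance


def filter_r1(alignments: List[ClusterAlignment]) -> List[ClusterAlignment]:
    """Keep only the clusters whose R1 passes the distance check."""
    return [a for a in alignments if keep_r1(a)]
```

Because the claim discards reads that are "more than" 10,000 base pairs away, a read exactly 10,000 base pairs from its R2 position is kept in this sketch.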

102. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing sequencing data for a plurality of clusters of polynucleotides, wherein (i) each cluster comprised multiple copies of a nucleic acid duplex attached to a support;

  (ii) each duplex in a cluster comprised a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;

  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;

  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;

  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence;

  (vi) the sequencing data comprise R1 sequences generated by extension of a first primer comprising sequence D;

  (vii) the sequencing data comprise R2 sequences generated by extension of a second primer comprising sequence A,

  (viii) each first molecule comprises a barcode sequence,

  (ix) wherein the sequencing data further comprises a barcode sequence for each cluster generated by extension of a third primer comprising sequence C; and

  (x) grouping sequences from the clusters based on the barcode sequences;

  (b) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;

  (c) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;

  (d) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;

  (e) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (b) to (d), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait; and

  (f) transmitting the report to a receiver.
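
Step (c) calls for a local alignment, and claim 89 names the Smith-Waterman algorithm as one possible basis for the second algorithm. The following is a minimal textbook sketch of Smith-Waterman with a linear gap penalty, not the implementation used by the patentee. The function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration; the claims do not specify them. The sketch aligns one read against one reference window and does not build the consensus alignment across reads that step (c) describes.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Locally align read to ref; return (score, aligned_read, aligned_ref).

    Scoring values are illustrative assumptions, not taken from the patent.
    """
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming matrix; scores are floored at zero so
    # that an alignment may start anywhere (the "local" property).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            diag = score[i - 1][j - 1] + sub
            up = score[i - 1][j] + gap      # read base with no reference base
            left = score[i][j - 1] + gap    # reference base with no read base
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if read[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref))
```

For example, aligning the read `AAAACCCC` to the reference `AAAAGCCCC` with these scores yields `AAAA-CCCC` against `AAAAGCCCC`, which places a one-base deletion in the read relative to the reference.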

103. A method of detecting genetic variation in a subject's genome comprising:
- (a) providing sequencing data for a plurality of clusters of polynucleotides, wherein (i) each cluster comprised multiple copies of a nucleic acid duplex attached to a support;

  (ii) each duplex in a cluster comprised a first molecule comprising sequences A-B-G′-D′-C′ from 5′ to 3′ and a second molecule comprising sequences C-D-G-B′-A′ from 5′ to 3′;

  (iii) sequence A′ is complementary to sequence A, sequence B′ is complementary to sequence B, sequence C′ is complementary to sequence C, sequence D′ is complementary to sequence D, and sequence G′ is complementary to sequence G;

  (iv) sequence G is a portion of a target polynucleotide sequence from a subject and is different for each of a plurality of clusters;

  (v) sequence B′ is located 5′ with respect to sequence G in the corresponding target polynucleotide sequence;

  (vi) the sequencing data comprise R1 sequences generated by extension of a first primer comprising sequence D;

  (vii) the sequencing data comprise R2 sequences generated by extension of a second primer comprising sequence A,

  (viii) each first molecule comprises a barcode sequence,

  (ix) wherein the sequencing data further comprises a barcode sequence for each cluster generated by extension of a third primer comprising sequence C;

  (x) grouping sequences from the clusters based on the barcode sequences; and

  (xi) discarding all but one of a plurality of R1 sequences having the same sequence and alignment within a barcode sequence grouping;

  (b) performing a first alignment using a first algorithm to align all R1 sequences to a first reference sequence;

  (c) performing a second alignment using a second algorithm to locally align R1 sequences identified in said first alignment as likely to contain an insertion or deletion with respect to the first reference sequence, to produce a single consensus alignment for each insertion or deletion;

  (d) performing an R2 alignment by aligning all R2 sequences to a second reference sequence;

  (e) calculating a plurality of probabilities based on the R1 sequences for the subject and including the probabilities in a report identifying sequence variation identified by steps (b) to (d), wherein each probability is a probability of the subject or a subject's offspring having or developing a disease or trait; and

  (f) transmitting the report to a receiver.
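
Elements (x) and (xi) of claim 103 describe grouping reads by barcode and then keeping a single copy of any R1 reads that share both sequence and alignment within a group. A minimal sketch of that bookkeeping follows. The `R1Read` record, its fields, and both function names are hypothetical; representing an alignment as a chromosome and position pair is an assumption, since the claim does not say how an alignment is encoded.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class R1Read(NamedTuple):
    """Hypothetical R1 record: barcode read, base sequence, and alignment."""
    barcode: str
    sequence: str
    chrom: str
    pos: int


def group_by_barcode(reads: List[R1Read]) -> Dict[str, List[R1Read]]:
    """Element (x): collect reads that carry the same barcode sequence."""
    groups = defaultdict(list)
    for read in reads:
        groups[read.barcode].append(read)
    return dict(groups)


def deduplicate(group: List[R1Read]) -> List[R1Read]:
    """Element (xi): keep the first read for each (sequence, alignment) pair."""
    seen = set()
    kept = []
    for read in group:
        key = (read.sequence, read.chrom, read.pos)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept
```

A typical use would be `{bc: deduplicate(g) for bc, g in group_by_barcode(reads).items()}`, which leaves reads with identical sequence and position untouched when they belong to different barcode groups.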


Current Assignee: Myriad Women's Health Incorporated (Myriad Genetics)
Original Assignee: Counsyl Incorporated (Myriad Genetics, Inc.)
Inventors: Richards, Hunter; Evans, Eric; Srinivasan, Balaji; Srinivasan, Subramaniam; Patterson, A. Scott; Chu, Clement; Shah, Abhik
Primary Examiner(s): Vivlemore, Tracy
Assistant Examiner(s): Weiler, Karen S

Application Number: US13/665,671
Publication Number: US 20140121116A1
Time in Patent Office: 1,000 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
C12Q 1/6874   involving nucleic acid arra...
C40B 30/00   Methods of screening libraries
G16B 20/00   ICT specially adapted for f...
G16B 20/10   Ploidy or copy number detec...
G16B 20/20   Allele or variant detection...
G16B 20/40   Population genetics; Linkag...
G16B 30/00   ICT specially adapted for s...
G16B 30/10   Sequence alignment; Homolog...


6 Assignments

0 Petitions

Accused Products

Abstract

62 Citations

103 Claims
