Methods and systems for detecting sequence variants

US 9,904,763 B2
Filed: 06/29/2016
Issued: 02/27/2018
Est. Priority Date: 08/21/2013
Status: Active Grant

Abstract

The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
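Claims 18 and 19 below make "rare" concrete as a mutation seen in fewer than 5% (or 1%) of the mapped reads that overlap it. A minimal sketch of that check follows. The `MappedRead` type, the shared 0-based coordinate, and the assumption of ungapped reads are all illustrative simplifications and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MappedRead:
    """Hypothetical mapped read: `start` is a 0-based offset in a shared coordinate."""
    start: int
    sequence: str


def allele_frequency(reads: List[MappedRead], position: int, alt_base: str) -> Optional[float]:
    """Fraction of reads overlapping `position` that carry `alt_base` there.

    Returns None when no read overlaps the position. Reads are assumed to be
    ungapped, so the base at `position` is found by simple offset arithmetic.
    """
    overlapping = [
        r for r in reads
        if r.start <= position < r.start + len(r.sequence)
    ]
    if not overlapping:
        return None
    carrying = sum(
        1 for r in overlapping
        if r.sequence[position - r.start] == alt_base
    )
    return carrying / len(overlapping)


def is_rare(frequency: Optional[float], threshold: float = 0.05) -> bool:
    """True when the mutation was observed at a frequency below `threshold`."""
    return frequency is not None and frequency < threshold
```

With 97 reads carrying the reference base and 3 carrying the alternative, `allele_frequency` returns 0.03, which `is_rare` accepts at the 5% threshold and rejects at `threshold=0.01`.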

20 Claims

1. A method of identifying variations in sequence data, the method comprising:
- using at least one computer hardware processor to perform:
  
  receiving a plurality of nucleotide sequence reads from a genomic sample, wherein at least one sequence read comprises a mutation and at least a portion of a structural variation;
  
  representing, in at least one non-transitory computer-readable storage medium connected to the computer hardware processor, a reference sequence and a plurality of known variations from the reference sequence as a reference graph, wherein the reference graph is an assembled construct represented as a directed graph stored in the at least one non-transitory computer-readable storage medium, the reference graph comprising a first node representing a conserved portion of the reference sequence, the first node connected by directed edges to a second node and a third node, the second node representing a first alternative sequence and the third node representing a second alternative sequence, the second alternative sequence comprising a sequence matching the structural variation;
  
  mapping, using the computer hardware processor, the at least one sequence read to the reference graph, the mapping comprising determining a location on the reference graph for which a score for the at least one sequence read is maximized, wherein the determined location of the mapped at least one sequence read spans the first node and the third node; and
  
  identifying the mutation within the mapped at least one sequence read with respect to the conserved portion of the reference sequence represented by the first node.
- Dependent claims:
  - 2. The method of claim 1, wherein the direction of the directed edges is towards the second and third nodes.
  - 3. The method of claim 1, wherein the direction of the directed edges is towards the first node.
  - 4. The method of claim 1, wherein the reference graph is acyclic.
  - 5. The method of claim 1, further comprising mapping a second sequence read to the reference graph.
  - 6. The method of claim 5, wherein the second sequence read does not comprise the mutation and at least a portion of the structural variation.
  - 7. The method of claim 5, wherein the determined location of the mapped second sequence read spans only the first node.
  - 8. The method of claim 1, wherein mapping the at least one sequence read to the reference graph comprises calculating a scoring matrix for each node.
  - 9. The method of claim 8, wherein the scoring matrix considers scores from predecessor nodes.
  - 10. The method of claim 1, wherein a length of the at least one sequence read is less than 500 base pairs.
  - 11. The method of claim 1, wherein the mutation and the structural variation are separated by 100 bp or fewer.
  - 12. The method of claim 1, wherein the structural variation is 100 bp to 3 megabases in length.
  - 13. The method of claim 1, wherein the structural variation is a marker for disease.
  - 14. The method of claim 1, wherein identifying the mutation within the mapped at least one sequence read comprises comparing the mapped at least one sequence read with the first node and the third node by comparing the mapped at least one sequence read with the conserved portion of the reference sequence and the second alternative sequence.
  - 15. The method of claim 1, wherein the mutation comprises a single nucleotide polymorphism.
  - 16. The method of claim 1, further comprising adding the mutation to the reference graph.
  - 17. The method of claim 16, wherein adding the mutation to the reference graph comprises:
    - segmenting the first node, at a position of the mutation in the conserved portion of the reference sequence, into two segmented nodes, the two nodes representing the 5′ sequence and 3′ sequence of the conserved portion of the reference sequence adjacent to the mutation, respectively;
      
      associating the mutation with a mutation node;
      
      associating the conserved portion of the reference sequence at the mutation with a reference node;
      
      connecting, within the reference graph, the two segmented nodes, the mutation node, and the reference node with a plurality of directed edges, such that a first path through the reference graph includes the mutation, and a second path through the reference graph includes the reference sequence.
  - 18. The method of claim 1, further comprising mapping the plurality of sequence reads to the reference graph, identifying a set of mapped sequence reads overlapping the mutation, and determining whether the mutation is present in the identified set of mapped sequence reads at a frequency of less than 5%.
  - 19. The method of claim 18, further comprising determining whether the mutation is present in the identified set of mapped sequence reads at a frequency of less than 1%.
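The node-splitting steps recited in claim 17 can be sketched as follows for the single-nucleotide case of claim 15. This is an illustration under stated assumptions, not the patented implementation: the `ReferenceGraph` class, the `add_snp` and `spell_paths` helpers, and the `_5p`/`_ref`/`_mut`/`_3p` node naming are all hypothetical, and the graph is assumed to be acyclic (claim 4).

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceGraph:
    """Hypothetical directed graph: node id -> sequence, plus (source, target) edges."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, source, target):
        self.edges.add((source, target))


def add_snp(graph, node_id, offset, alt_base):
    """Split `node_id` around `offset` and add parallel reference/mutation nodes."""
    seq = graph.nodes[node_id]
    if not 0 < offset < len(seq) - 1:
        raise ValueError("offset must leave non-empty 5' and 3' flanks")
    if seq[offset] == alt_base:
        raise ValueError("alternative base equals the reference base")

    five, ref = f"{node_id}_5p", f"{node_id}_ref"
    mut, three = f"{node_id}_mut", f"{node_id}_3p"

    # Remember how the original node was wired, then remove it.
    incoming = [s for s, t in graph.edges if t == node_id]
    outgoing = [t for s, t in graph.edges if s == node_id]
    graph.edges = {(s, t) for s, t in graph.edges if node_id not in (s, t)}
    del graph.nodes[node_id]

    # Two segmented nodes (5' and 3' flanks), a reference node, a mutation node.
    graph.add_node(five, seq[:offset])
    graph.add_node(ref, seq[offset])
    graph.add_node(mut, alt_base)
    graph.add_node(three, seq[offset + 1:])

    # One path runs through the mutation, the other through the reference base.
    for source in incoming:
        graph.add_edge(source, five)
    for target in outgoing:
        graph.add_edge(three, target)
    for middle in (ref, mut):
        graph.add_edge(five, middle)
        graph.add_edge(middle, three)
    return five, ref, mut, three


def spell_paths(graph, start, end):
    """Set of sequences spelled by every path from `start` to `end` (acyclic graphs only)."""
    if start == end:
        return {graph.nodes[start]}
    spelled = set()
    for source, target in graph.edges:
        if source == start:
            for suffix in spell_paths(graph, target, end):
                spelled.add(graph.nodes[start] + suffix)
    return spelled
```

Splitting a first node `ACGTACGT` at offset 3 with alternative base `C` leaves two paths between the flanking nodes, spelling `ACGTACGT` (reference) and `ACGCACGT` (mutation), while the edges that used to leave the first node now leave its 3′ segment.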

20. A system for identifying variations in sequence data, the system comprising:
- at least one computer hardware processor; and
  
  at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to:
  
  receive a plurality of nucleotide sequence reads from a genomic sample, wherein at least one sequence read comprises a mutation and at least a portion of a structural variation;
  
  represent a reference sequence and a plurality of known variations as a reference graph, the reference graph comprising a first node representing a conserved portion of the reference sequence, the first node connected by directed edges to a second node and a third node, the second node representing a first alternative sequence and the third node representing a second alternative sequence, the second alternative sequence comprising a sequence matching the structural variation, wherein the reference graph is an assembled construct represented as a directed graph stored in the computer-readable storage medium;
  
  map the at least one sequence read to the reference graph, the mapping comprising determining a location on the reference graph for which a score for the sequence read is maximized, wherein the determined location of the mapped at least one sequence read spans the first node and the third node; and
  
  identify the mutation within the mapped at least one sequence read with respect to the conserved portion of the reference sequence represented by the first node.
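The mapping step, together with the per-node scoring matrix of claims 8 and 9, can be sketched as a Smith-Waterman-style local alignment over a directed acyclic graph. The claims do not specify scoring values or an algorithm, so the `MATCH`, `MISMATCH`, and `GAP` constants, the linear gap penalty, and the function name `align_to_graph` are assumptions. The sketch reports only where the maximal score ends; the traceback needed to recover the full alignment and call the mutation is omitted.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed scoring scheme; the claims do not fix these values.
MATCH, MISMATCH, GAP = 1, -1, -2


def align_to_graph(read, nodes, edges):
    """Return (best score, node id, offset in node) of the best local alignment end.

    `nodes` maps node id -> sequence; `edges` is an iterable of (source, target).
    Each node gets its own scoring matrix, stored as a list of columns, and the
    first column of a node is seeded from the last columns of its predecessors.
    """
    preds = {node: [] for node in nodes}
    for source, target in edges:
        preds[target].append(source)

    n_rows = len(read) + 1
    matrices = {}
    best = (0, None, None)

    # Predecessors are always visited before the nodes that depend on them.
    for node in TopologicalSorter(preds).static_order():
        if preds[node]:
            # Row-wise maximum over the final columns of all predecessor nodes.
            prev = [
                max(matrices[p][-1][i] for p in preds[node])
                for i in range(n_rows)
            ]
        else:
            prev = [0] * n_rows

        columns = []
        for offset, base in enumerate(nodes[node]):
            col = [0] * n_rows
            for i in range(1, n_rows):
                step = MATCH if read[i - 1] == base else MISMATCH
                col[i] = max(
                    0,                  # start a new local alignment
                    prev[i - 1] + step, # align read base to graph base
                    prev[i] + GAP,      # graph base with no read base
                    col[i - 1] + GAP,   # read base with no graph base
                )
                if col[i] > best[0]:
                    best = (col[i], node, offset)
            columns.append(col)
            prev = col
        # An empty node simply passes its entry column through.
        matrices[node] = columns or [prev]
    return best
```

For a first node `ACGTACGT` with edges to a second node `TT` and a third node `GGGG`, the read `ACGAACGTGGG` (one mismatch in the first node, followed by three bases of the third node) scores 9 and ends in the third node, so the mapped location spans the first and third nodes as the claim requires.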

Current Assignee: Seven Bridges Genomics, Inc.
Original Assignee: Seven Bridges Genomics, Inc.
Inventors: Kural, Deniz
Primary Examiner(s): Martinell, James
Application Number: US15/196,345
Publication Number: US 20160306921A1
Time in Patent Office: 608 Days
Field of Search: None

CPC Class Codes

G16B 30/00   ICT specially adapted for s...

G16B 30/10   Sequence alignment; Homolog...

G16B 30/20   Sequence assembly

G16B 50/00   ICT programming tools or da...
