Methods and systems for detecting sequence variants

US 10,325,675 B2
Filed: 02/27/2018
Issued: 06/18/2019
Est. Priority Date: 08/21/2013
Status: Active Grant

Abstract

The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.

20 Claims

1. A method for identifying a structural variation, the method comprising:
- representing, in at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
  
  receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
  
  determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
  
  identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14):
  - 2. The method of claim 1, wherein the reference sequence represents a chromosome or a genome.
  - 3. The method of claim 1, wherein at least one path through the reference graph represents substantially a chromosome or a genome of an organism.
  - 4. The method of claim 1, wherein at least one path through the reference graph comprises an insertion, deletion, or substitution in the reference sequence.
  - 5. The method of claim 1, wherein at least one node further comprises diagnostic information associated with its string of one or more symbols.
  - 6. The method of claim 5, wherein the diagnostic information is a risk of cancer.
  - 7. The method of claim 1, wherein the sequence reads are from a subject, the method further comprising assigning a genotype to the subject based upon the location of one or more aligned sequence reads.
  - 8. The method of claim 7, further comprising correlating the assigned genotype with a risk of disease for the subject.
  - 9. The method of claim 1, wherein the reference graph represents a species.
  - 10. The method of claim 1, further comprising adding a deletion to the reference graph by breaking the reference sequence into nodes before and after the deletion, and creating two paths between the nodes, one path representing the reference sequence and the other path representing the deletion.
  - 11. The method of claim 1, wherein the reference graph is created by applying entries from a variant list stored in the storage medium as a text file.
  - 12. The method of claim 1, wherein each node further comprises a set of parent nodes, wherein the set of parent nodes define the edges.
  - 13. The method of claim 1, further comprising identifying a variation present in the one or more sequence reads that is not present in the reference graph, and adding that variation to the reference graph.
  - 14. The method of claim 1, further comprising identifying new mutations in the one or more sequence reads and recursively adding the mutations to the reference graph.
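The graph construction and backward-looking alignment recited in claims 1, 10, and 12 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Node`, `add_deletion`, and `align`, the scoring constants, and the example sequences are all assumptions made for illustration. The sketch uses a Smith-Waterman-style local alignment in which the first symbol of each node looks back to the last symbol of every parent node and keeps the maximum score.

```python
from dataclasses import dataclass, field

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scoring scheme, not from the patent


@dataclass
class Node:
    seq: str                                     # non-empty string of symbols
    parents: list = field(default_factory=list)  # parent indices define the edges


def add_deletion(ref, start, end):
    """Split ref around ref[start:end]; assumes 0 < start < end < len(ref)."""
    before = Node(ref[:start])
    deleted = Node(ref[start:end], parents=[0])
    # Two ways into 'after': via node 1 (reference) or straight from node 0 (deletion).
    after = Node(ref[end:], parents=[0, 1])
    return [before, deleted, after]


def align(nodes, read):
    """Local alignment of read to a DAG; nodes must be in topological order.

    Returns (best score, node indices on the best-scoring path).
    """
    first, last, chars, owner = [], [], [], []
    for i, node in enumerate(nodes):
        first.append(len(chars))
        chars.extend(node.seq)
        owner.extend([i] * len(node.seq))
        last.append(len(chars) - 1)
    n = len(read)
    H = [[0] * (n + 1) for _ in chars]
    ptr = [[None] * (n + 1) for _ in chars]
    best, best_cell = 0, None
    for p, c in enumerate(chars):
        if p > first[owner[p]]:
            preds = [p - 1]                      # previous symbol in the same node
        else:
            preds = [last[q] for q in nodes[owner[p]].parents]  # look back to every parent
        for j in range(1, n + 1):
            s = MATCH if c == read[j - 1] else MISMATCH
            cands = [(0, None), (H[p][j - 1] + GAP, (p, j - 1))]
            if not preds:                        # root node: start a fresh alignment
                cands.append((s, None))
            for q in preds:
                cands.append((H[q][j - 1] + s, (q, j - 1)))
                cands.append((H[q][j] + GAP, (q, j)))
            H[p][j], ptr[p][j] = max(cands, key=lambda t: t[0])
            if H[p][j] > best:
                best, best_cell = H[p][j], (p, j)
    # Trace back from the best cell, recording which nodes the alignment visits.
    path, cell = [], best_cell
    while cell is not None and H[cell[0]][cell[1]] > 0:
        if not path or path[-1] != owner[cell[0]]:
            path.append(owner[cell[0]])
        cell = ptr[cell[0]][cell[1]]
    return best, path[::-1]


# Hypothetical example: reference with bases 6..9 ("AGGC") optionally deleted.
nodes = add_deletion("GATTACAGGCTTCAGT", 6, 10)
deletion_read = align(nodes, "TTACTTCA")    # read spanning the deletion
reference_read = align(nodes, "TACAGGCTT")  # read following the reference
```

In this sketch a best path that goes from node 0 directly to node 2 is evidence for the deletion, while a path through node 1 supports the reference allele.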

15. A system for identifying a structural variation, the system comprising:
- at least one computer hardware processor; and
  
  at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
  
  representing, in the at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
  
  receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
  
  determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
  
  identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.
- Dependent claims (16, 17, 18, 19):
  - 16. The system of claim 15, wherein the plurality of nodes further comprises a first node, a second node, and a third node, the third node having the first node and the second node as parent nodes, and wherein each of the first node, second node, and third node comprise different strings comprising a plurality of symbols.
  - 17. The system of claim 16, wherein the second node represents the structural variation.
  - 18. The system of claim 15, wherein the sequence reads are from a subject, the method further comprising assigning a genotype to the subject based upon the location of one or more aligned sequence reads.
  - 19. The system of claim 15, further comprising correlating the assigned genotype with a risk of disease for the subject.
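The genotype assignment in claims 7 and 18 depends on where reads align in the graph. The claims do not specify a calling rule, so the following Python sketch is only one plausible reading: it assumes the three-node layout of claims 16 and 17, treats the first node as the reference allele and the second node as the structural variation, and calls a genotype from the fraction of informative reads that pass through the second node. The function name `call_genotype`, the node labels, and the 0.2 and 0.8 thresholds are hypothetical.

```python
def call_genotype(read_paths, ref_node, alt_node, low=0.2, high=0.8):
    """Assign a diploid genotype from the nodes each aligned read passes through.

    read_paths: one list of node labels per aligned read.
    low/high:   assumed allele-fraction cutoffs (illustrative, not from the patent).
    """
    alt = sum(1 for path in read_paths if alt_node in path)
    ref = sum(1 for path in read_paths if ref_node in path and alt_node not in path)
    informative = alt + ref
    if informative == 0:
        return "no-call"  # no read covers either allele
    fraction = alt / informative
    if fraction < low:
        return "0/0"      # homozygous reference
    if fraction > high:
        return "1/1"      # homozygous for the structural variation
    return "0/1"          # heterozygous


# Hypothetical read paths over nodes labelled "first", "second", and "third".
mixed = [["first", "third"], ["second", "third"],
         ["second", "third"], ["first", "third"]]
genotype = call_genotype(mixed, ref_node="first", alt_node="second")
```

A production caller would typically weigh base and mapping qualities rather than use fixed cutoffs; the fixed thresholds here only keep the example short.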

20. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
- representing, in the at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
  
  receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
  
  determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
  
  identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.

Current Assignee: Seven Bridges Genomics, Inc.
Original Assignee: Seven Bridges Genomics, Inc.
Inventors: Kural, Deniz
Primary Examiner(s): Martinell, James
Application Number: US15/906,404
Publication Number: US 20180336314A1
Time in Patent Office: 476 Days
Field of Search: None

CPC Class Codes
- G16B 30/00   ICT specially adapted for s...
- G16B 30/10   Sequence alignment; Homolog...
- G16B 30/20   Sequence assembly
- G16B 50/00   ICT programming tools or da...
