Methods and systems for detecting sequence variants
Abstract
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable, and can be used to align millions of reads to a construct thousands of bases long, or longer.
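As an illustration of the kind of reference sequence construct the abstract describes, the following is a minimal sketch of a graph in which conserved regions are single nodes and a variable region is represented by alternate nodes, one of which carries a structural variation. The class names, node identifiers, and toy sequences are all hypothetical and chosen only for illustration; nothing here is taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    seq: str
    is_variant: bool = False  # True for an alternate node absent from the linear reference


@dataclass
class ReferenceGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: dict = field(default_factory=dict)  # node_id -> list of successor node ids

    def add_node(self, node_id, seq, is_variant=False):
        self.nodes[node_id] = Node(node_id, seq, is_variant)
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def paths(self, start, end):
        """Enumerate every path from start to end (only sensible for toy graphs)."""
        if start == end:
            yield [start]
            return
        for nxt in self.edges[start]:
            for rest in self.paths(nxt, end):
                yield [start] + rest

    def spell(self, path):
        """Concatenate the sequences of the nodes along a path."""
        return "".join(self.nodes[n].seq for n in path)


def build_toy_graph():
    """Hypothetical example: two conserved flanks and one variable region."""
    g = ReferenceGraph()
    g.add_node("left", "ACGTAC")                          # conserved region
    g.add_node("ref_allele", "GG")                        # reference version of the variable region
    g.add_node("sv_insertion", "GGTTTTCCCCGG", is_variant=True)  # structural variation
    g.add_node("right", "TTAGCA")                         # conserved region
    for src, dst in [("left", "ref_allele"), ("left", "sv_insertion"),
                     ("ref_allele", "right"), ("sv_insertion", "right")]:
        g.add_edge(src, dst)
    return g
```

In this toy graph there are exactly two paths from "left" to "right": one spells the linear reference and the other spells the sequence carrying the insertion.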
20 Claims
1. A method for identifying a structural variation, the method comprising:
representing, in at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A system for identifying a structural variation, the system comprising:
at least one computer hardware processor; and
at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform:
representing, in the at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.
Dependent claims: 16, 17, 18, 19
20. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform:
representing, in the at least one tangible, non-transitory computer-readable storage medium, a reference sequence and variation of the reference sequence as a reference graph, the reference graph comprising a plurality of nodes and edges, wherein conserved regions of the reference sequence are represented as single nodes and regions that vary are represented as alternate nodes, wherein at least one of the alternate nodes comprises a structural variation not present in the reference sequence;
receiving one or more nucleotide sequence reads from a nucleic acid sample, wherein at least one sequence read comprises at least a portion of the structural variation;
determining optimal-scoring alignments between the one or more sequence reads and one or more paths within the reference graph, wherein the determining comprises considering two or more alternative paths by looking backward to any prior nodes on the reference graph to find a maximum score for the one or more sequence reads; and
identifying the structural variation as present in the nucleic acid sample based on the optimal-scoring alignments between the one or more sequence reads and the one or more paths.
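The determining and identifying steps recited in the independent claims can be illustrated with a small sketch. It is not the claimed implementation: it assumes a Smith-Waterman-style local alignment extended to a directed acyclic graph, in which the first base of each node looks back to the last base of every immediately preceding node and keeps the maximum score. The function names, the scoring values (match 2, mismatch -2, gap -3), the `min_support` threshold, and the toy graph are all hypothetical. Nodes are assumed to be non-empty and listed in topological order.

```python
def align_to_graph(read, nodes, preds, match=2, mismatch=-2, gap=-3):
    """Local alignment of a read against a DAG of sequence nodes.

    nodes: list of (node_id, seq) in topological order, seq non-empty.
    preds: dict node_id -> list of predecessor node ids.
    Returns (best_score, list of node ids on the best-scoring path).
    """
    # Flatten the graph into per-base rows, each with its predecessor rows.
    base, owner, prev, last_row = [], [], [], {}
    for node_id, seq in nodes:
        for k, b in enumerate(seq):
            row = len(base) + 1                      # row 0 is a virtual start row
            if k == 0:
                p = [last_row[q] for q in preds.get(node_id, [])] or [0]
            else:
                p = [row - 1]
            base.append(b); owner.append(node_id); prev.append(p)
        last_row[node_id] = len(base)

    n, m = len(base), len(read)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if base[i - 1] == read[j - 1] else mismatch
            cands = [(0, None)]                      # local alignment may restart
            for p in prev[i - 1]:                    # look back to every prior row
                cands.append((H[p][j - 1] + s, (p, j - 1)))   # match or mismatch
                cands.append((H[p][j] + gap, (p, j)))         # gap in the read
            cands.append((H[i][j - 1] + gap, (i, j - 1)))     # gap in the graph
            H[i][j], back[i][j] = max(cands, key=lambda c: c[0])
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)

    # Trace back from the best cell, recording which nodes were traversed.
    score, i, j = best
    path, cell = [], (i, j)
    while cell is not None and H[cell[0]][cell[1]] > 0:
        node_id = owner[cell[0] - 1]
        if not path or path[-1] != node_id:
            path.append(node_id)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return score, path


def call_structural_variation(reads, nodes, preds, sv_node, min_support=1):
    """Report the variant as present if enough reads align best through sv_node."""
    support = sum(1 for r in reads if sv_node in align_to_graph(r, nodes, preds)[1])
    return support >= min_support, support


# Hypothetical toy graph: two conserved flanks, a reference allele, and an insertion.
TOY_NODES = [("left", "ACGTAC"), ("ref", "G"), ("sv", "TTTTCCCC"), ("right", "AGCATG")]
TOY_PREDS = {"ref": ["left"], "sv": ["left"], "right": ["ref", "sv"]}
```

With this toy graph, a read spanning the insertion such as "GTACTTTTCCCCAGCA" aligns best along left, sv, right, while a read such as "GTACGAGCA" aligns best along left, ref, right. Counting reads whose best path traverses the alternate node is one simple way to decide presence; a production caller would likely weigh mapping quality and coverage as well.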
Specification