Methods and systems for detecting sequence variants
First Claim
1. A method of identifying variations in sequence data, the method comprising:
- using at least one computer hardware processor to perform:
receiving a plurality of nucleotide sequence reads from a genomic sample, wherein at least one sequence read comprises a mutation and at least a portion of a structural variation;
representing, in at least one non-transitory computer-readable storage medium connected to the computer hardware processor, a reference sequence and a plurality of known variations from the reference sequence as a reference graph, wherein the reference graph is an assembled construct represented as a directed graph stored in the at least one non-transitory computer-readable storage medium, the reference graph comprising a first node representing a conserved portion of the reference sequence, the first node connected by directed edges to a second node and a third node, the second node representing a first alternative sequence and the third node representing a second alternative sequence, the second alternative sequence comprising a sequence matching the structural variation;
mapping, using the computer hardware processor, the at least one sequence read to the reference graph, the mapping comprising determining a location on the reference graph for which a score for the at least one sequence read is maximized, wherein the determined location of the mapped at least one sequence read spans the first node and the third node; and
identifying the mutation within the mapped at least one sequence read with respect to the conserved portion of the reference sequence represented by the first node.
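The mapping and identification steps recited above can be illustrated with a minimal sketch. It is not the claimed implementation: the node names, the toy sequences, and the ungapped match/mismatch scoring (+1/-1) are all assumptions chosen for brevity, and a practical aligner would also score insertions and deletions. The sketch enumerates the paths leaving the first node, places the read at the location with the highest score, and reports mismatches that fall within the first (conserved) node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    seq: str
    out: list = field(default_factory=list)  # names of successor nodes


def paths(graph, start):
    """Yield every path (list of node names) from start to a node with no successors."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        successors = graph[path[-1]].out
        if not successors:
            yield path
        for nxt in successors:
            stack.append(path + [nxt])


def map_read(graph, start, read, match=1, mismatch=-1):
    """Return (score, path, offset) for the highest-scoring ungapped placement."""
    best = None
    for path in paths(graph, start):
        text = "".join(graph[name].seq for name in path)
        for off in range(len(text) - len(read) + 1):
            window = text[off:off + len(read)]
            score = sum(match if a == b else mismatch for a, b in zip(read, window))
            if best is None or score > best[0]:
                best = (score, path, off)
    return best


def mutations_in_first_node(graph, path, offset, read):
    """List (position, reference_base, read_base) mismatches inside path[0]."""
    first = graph[path[0]].seq
    found = []
    for i, base in enumerate(read):
        pos = offset + i
        if pos < len(first) and first[pos] != base:
            found.append((pos, first[pos], base))
    return found


# Hypothetical toy graph: n1 is the conserved portion, n2 and n3 are alternatives.
graph = {
    "n1": Node("n1", "ACGTACGT", ["n2", "n3"]),
    "n2": Node("n2", "GGAGGA"),
    "n3": Node("n3", "TTTCCCAA"),
}
# The read spans n1 and n3 and carries a C-to-A substitution at position 5 of n1.
read = "GTAAGTTTTCC"

score, path, offset = map_read(graph, "n1", read)
print(path, offset, score)
print(mutations_in_first_node(graph, path, offset, read))  # [(5, 'C', 'A')]
```

Because the read is scored against the branch containing the structural variation, the single substitution in the conserved node stands out as the only mismatch, rather than being lost among the mismatches that a placement on the other branch would produce.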
Abstract
The invention provides methods for identifying rare variants near a structural variation in a genetic sequence, for example, in a nucleic acid sample taken from a subject. The invention additionally includes methods for aligning reads (e.g., nucleic acid reads) to a reference sequence construct accounting for the structural variation, methods for building a reference sequence construct accounting for the structural variation or the structural variation and the rare variant, and systems that use the alignment methods to identify rare variants. The method is scalable and can be used to align millions of reads to a construct thousands of bases long, or longer.
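As a rough illustration of building a reference sequence construct that accounts for a structural variation, the sketch below splits a linear reference into conserved segments and adds one branch per alternative at each known variant. The function name `build_reference_graph`, the `(start, end, alt)` variant tuples with half-open coordinates, and the toy sequences are assumptions made for this example; they are not taken from the specification.

```python
def build_reference_graph(reference, variants):
    """Build a directed graph from a reference string and known variants.

    variants: iterable of (start, end, alt) tuples, where reference[start:end]
    is replaced by alt. Variants must not overlap.
    Returns (nodes, edges): nodes is a list of sequences indexed by node id,
    edges is a list of (from_id, to_id) pairs.
    """
    nodes, edges = [], []
    tails = []  # ids of nodes that must connect to whatever comes next
    pos = 0

    def add_node(seq, preds):
        nodes.append(seq)
        nid = len(nodes) - 1
        edges.extend((p, nid) for p in preds)
        return nid

    for start, end, alt in sorted(variants):
        if start < pos or end < start or end > len(reference):
            raise ValueError("variants must be in range and non-overlapping")
        if start > pos:
            # Conserved segment shared by every path.
            tails = [add_node(reference[pos:start], tails)]
        new_tails = []
        for seq in (reference[start:end], alt):
            if seq:
                new_tails.append(add_node(seq, tails))
            else:
                # Empty branch (pure insertion or deletion): bypass directly.
                new_tails.extend(tails)
        tails = list(dict.fromkeys(new_tails))  # drop duplicates, keep order
        pos = end

    if pos < len(reference):
        add_node(reference[pos:], tails)
    return nodes, edges


# Hypothetical example: one structural variation replacing reference[8:14].
reference = "ACGTACGTGGAGGATT"
variants = [(8, 14, "TTTCCCAA")]
nodes, edges = build_reference_graph(reference, variants)
print(nodes)  # ['ACGTACGT', 'GGAGGA', 'TTTCCCAA', 'TT']
print(edges)  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```

Each conserved segment is stored once, so adding a further known variant adds only its own branch node and edges rather than a second copy of the reference.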
20 Claims
1. A method of identifying variations in sequence data, as recited in full under First Claim. Claims 2-19 depend on claim 1.
20. A system for identifying variations in sequence data, the system comprising:
- at least one computer hardware processor; and
- at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to:
receive a plurality of nucleotide sequence reads from a genomic sample, wherein at least one sequence read comprises a mutation and at least a portion of a structural variation;
represent a reference sequence and a plurality of known variations as a reference graph, the reference graph comprising a first node representing a conserved portion of the reference sequence, the first node connected by directed edges to a second node and a third node, the second node representing a first alternative sequence and the third node representing a second alternative sequence, the second alternative sequence comprising a sequence matching the structural variation, wherein the reference graph is an assembled construct represented as a directed graph stored in the computer-readable storage medium;
map the at least one sequence read to the reference graph, the mapping comprising determining a location on the reference graph for which a score for the sequence read is maximized, wherein the determined location of the mapped at least one sequence read spans the first node and the third node; and
identify the mutation within the mapped at least one sequence read with respect to the conserved portion of the reference sequence represented by the first node.
Specification