Methods and systems for aligning sequences
First Claim
Patent Images
1. A system for aligning a plurality of sequence reads, the system comprising:
- a processor; and
a tangible, non-transitory memory storing a plurality of sequence reads from a subject, and a reference directed acyclic graph (DAG) representing a reference sequence and genetic variation of the reference sequence, wherein the reference DAG comprises a plurality of nodes, wherein each node is stored in the memory as a string of one or more symbols and a set of parent nodes, the string representing a sequence and the set of parent nodes defining the node'"'"'s position with respect to other nodes in the reference DAG;
wherein the plurality of nodes further comprises a first node, a second node, and a third node, the third node having the first node and the second node as parent nodes, and wherein each of the first node, second node, and third node comprise different strings comprising a plurality of symbols;
wherein the memory further comprises instructions that, when executed, cause the processor to align a sequence read of the plurality of sequence reads to the reference DAG by;
creating, in the memory, a matrix for each respective node of the reference DAG, the creating comprising creating for each of the first node, the second node, and the third node, a matrix representing a comparison between the sequence read and a string associated with its respective node;
for each created matrix, scoring overlaps between the sequence read and the string associated with its respective node, the scoring comprising calculating a score for each entry of the matrix based at least in part on whether a symbol of the sequence read matches a corresponding symbol of the string associated with its respective node;
wherein calculating a score further comprises determining highest scores from matrices from parent nodes if and only if a matrix entry comprises the first symbol of the string associated with its node, such that scores for matrix entries for the first symbol of the string associated with the third node comprise a highest score from the matrix of the first node and the matrix of the second node;
identifying a highest overall score in the calculated scores, the highest overall score having a position in one of the created matrices;
backtracking from the position of the highest overall score to produce an actual match for the sequence read to the reference DAG, thereby aligning the sequence read to the reference DAG; and
writing a file to the memory corresponding to the actual match for the sequence read.
12 Assignments
0 Petitions
Accused Products
Abstract
The invention includes methods for aligning reads (e.g., nucleic acid reads, amino acid reads) to a reference sequence construct, methods for building the reference sequence construct, and systems that use the alignment methods and constructs to produce sequences. The method is scalable, and can be used to align millions of reads to a construct thousands of bases or amino acids long. The invention additionally includes methods for identifying a disease or a genotype based upon alignment of nucleic acid reads to a location in the construct.
-
Citations
27 Claims
-
1. A system for aligning a plurality of sequence reads, the system comprising:
-
a processor; and a tangible, non-transitory memory storing a plurality of sequence reads from a subject, and a reference directed acyclic graph (DAG) representing a reference sequence and genetic variation of the reference sequence, wherein the reference DAG comprises a plurality of nodes, wherein each node is stored in the memory as a string of one or more symbols and a set of parent nodes, the string representing a sequence and the set of parent nodes defining the node'"'"'s position with respect to other nodes in the reference DAG; wherein the plurality of nodes further comprises a first node, a second node, and a third node, the third node having the first node and the second node as parent nodes, and wherein each of the first node, second node, and third node comprise different strings comprising a plurality of symbols; wherein the memory further comprises instructions that, when executed, cause the processor to align a sequence read of the plurality of sequence reads to the reference DAG by; creating, in the memory, a matrix for each respective node of the reference DAG, the creating comprising creating for each of the first node, the second node, and the third node, a matrix representing a comparison between the sequence read and a string associated with its respective node; for each created matrix, scoring overlaps between the sequence read and the string associated with its respective node, the scoring comprising calculating a score for each entry of the matrix based at least in part on whether a symbol of the sequence read matches a corresponding symbol of the string associated with its respective node; wherein calculating a score further comprises determining highest scores from matrices from parent nodes if and only if a matrix entry comprises the first symbol of the string associated with its node, such that scores for matrix entries for the first symbol of the string associated with the third node comprise a highest score from the matrix of the first node and the matrix of the second node; identifying a highest overall score in the calculated scores, the highest overall score having a position in one of the created matrices; backtracking from the position of the highest overall score to produce an actual match for the sequence read to the reference DAG, thereby aligning the sequence read to the reference DAG; and writing a file to the memory corresponding to the actual match for the sequence read. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
-
Specification