Systems and methods for genomic pattern analysis
First Claim
1. A method for analyzing a genetic sequence, the method comprising:
- obtaining a reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the graph, wherein at least one path through the graph represents substantially an entire chromosome;
identifying a data string for each path of the plurality of paths through the graph, each data string representing a concatenation of the substrings of genomic sequence and known variation in the genomic sequence stored in objects through the path;
for each data string;
identifying a plurality of k-mers in the data string; and
listing each identified k-mer'"'"'s location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains locations of all k-mers having that index;
obtaining a query sequence;
identifying a plurality of query k-mers from the query sequence;
determining the locations of at least one query k-mer within the graph by reading search index entries indexed according to hashes of query k-mers; and
identifying portions of the graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the query sequence.
13 Assignments
0 Petitions
Accused Products
Abstract
The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
124 Citations
20 Claims
-
1. A method for analyzing a genetic sequence, the method comprising:
-
obtaining a reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the graph, wherein at least one path through the graph represents substantially an entire chromosome; identifying a data string for each path of the plurality of paths through the graph, each data string representing a concatenation of the substrings of genomic sequence and known variation in the genomic sequence stored in objects through the path; for each data string; identifying a plurality of k-mers in the data string; and listing each identified k-mer'"'"'s location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains locations of all k-mers having that index; obtaining a query sequence; identifying a plurality of query k-mers from the query sequence; determining the locations of at least one query k-mer within the graph by reading search index entries indexed according to hashes of query k-mers; and identifying portions of the graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the query sequence. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A system for analyzing a genetic sequence, the system comprising:
-
a tangible memory subsystem storing; a reference graph, the reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the reference graph; and a processor executing instructions configured to; identify a plurality of paths through the reference graph, each path representing a concatenation of the substrings of the genomic sequence and known variation in the genomic sequence stored in objects through the path; for each path of the plurality of paths, identify a plurality of k-mers in the path, and list each identified k-mer'"'"'s location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains an ordered list of locations of all k-mers having that index; receive a paired-end sequence read comprising a 5′
sequence read and a 3′
sequence read; anddetermine candidate targets for alignment of the sequence read by; identifying a plurality of query k-mers from the 5′
sequence read and the 3′
sequence read;determining the locations of each query k-mer within the reference graph by reading search index entries indexed according to hashes of query k-mers; and identifying portions of the reference graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the sequence read wherein the identifying comprises; creating a global ordering of locations corresponding to query k-mers from the 5′
sequence read and the 3′
sequence read; andidentifying locations in the global ordering in which both the 5′
sequence read and 3′
sequence read have query k-mers within a window. - View Dependent Claims (18)
-
-
19. A method for analyzing a genetic sequence, the method comprising:
-
obtaining a reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the graph, wherein at least one path through the graph represents substantially an entire genome; identifying a data string for each path of the plurality of paths through the graph, each data string representing a concatenation of the substrings of genomic sequence and known variation in the genomic sequence stored in objects through the path; for each data string; identifying a plurality of k-mers in the data string; and listing each identified k-mer'"'"'s location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains locations of all k-mers having that index; obtaining a query sequence; identifying a plurality of query k-mers from the query sequence; determining the locations of at least one query k-mer within the graph by reading search index entries indexed according to hashes of query k-mers; and identifying portions of the graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the query sequence. - View Dependent Claims (20)
-
Specification