Systems and methods for genomic pattern analysis

US 10,192,026 B2
Filed: 03/04/2016
Issued: 01/29/2019
Est. Priority Date: 03/05/2015
Status: Active Grant

13 Assignments

0 Petitions

Abstract

The invention provides methods for analyzing sequence data in which a large amount and variety of reference data are efficiently modeled as a reference graph, such as a directed acyclic graph (DAG). The method includes determining positions of k-mers within a reference graph that represents a genomic sequence and known variation, storing the positions of each k-mer in a table entry indexed by a hash of that k-mer, and identifying a region within the reference graph that includes a threshold number of the k-mers by reading from the table entries indexed by hashes of substrings of a subject sequence. The subject sequence may subsequently be mapped to the candidate region.
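
The indexing and lookup steps summarized in the abstract can be sketched as follows. This is a minimal illustration and not the patented implementation: the reference graph is reduced to a list of already-enumerated path strings, locations are hypothetical `(path_id, offset)` pairs, the hash is an assumed two-bit-per-base packing, and "regions" are fixed-size bins. The names `kmer_hash`, `build_index`, `candidate_regions`, and `bin_size` are all introduced here for illustration.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer):
    """Pack a k-mer into an integer, two bits per base (an assumed hash)."""
    h = 0
    for base in kmer:
        h = h * 4 + CODE[base]
    return h


def kmers(seq, k):
    """Yield (offset, k-mer) for every length-k window of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(path_strings, k):
    """Map hash(k-mer) -> sorted list of (path_id, offset) locations."""
    table = defaultdict(set)
    for path_id, s in enumerate(path_strings):
        for offset, kmer in kmers(s, k):
            table[kmer_hash(kmer)].add((path_id, offset))
    return {h: sorted(locs) for h, locs in table.items()}


def candidate_regions(index, query, k, bin_size, threshold):
    """Return (path_id, bin) regions hit by at least `threshold` distinct query k-mers."""
    seen = defaultdict(set)
    for _, kmer in kmers(query, k):
        for path_id, offset in index.get(kmer_hash(kmer), []):
            seen[(path_id, offset // bin_size)].add(kmer)
    return sorted(region for region, ks in seen.items() if len(ks) >= threshold)


# Toy data: two path strings that differ at one position.
paths = ["ACGTACGGTCA", "ACGTTCGGTCA"]
index = build_index(paths, k=4)
regions = candidate_regions(index, "CGGTCA", k=4, bin_size=8, threshold=3)
print(regions)
```

Fixed bins are only one way to define a region; a hit cluster that straddles a bin boundary would be split, so a sliding window over ordered locations may be a better fit in practice.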

124 Citations

20 Claims

1. A method for analyzing a genetic sequence, the method comprising:
- obtaining a reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the graph, wherein at least one path through the graph represents substantially an entire chromosome;
  
  identifying a data string for each path of the plurality of paths through the graph, each data string representing a concatenation of the substrings of genomic sequence and known variation in the genomic sequence stored in objects through the path;
  
  for each data string:
  
  identifying a plurality of k-mers in the data string; and
  
  listing each identified k-mer's location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains locations of all k-mers having that index;
  
  obtaining a query sequence;
  
  identifying a plurality of query k-mers from the query sequence;
  
  determining the locations of at least one query k-mer within the graph by reading search index entries indexed according to hashes of query k-mers; and
  
  identifying portions of the graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the query sequence.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
  - 2. The method of claim 1, wherein the query sequence is from a subject organism.
  - 3. The method of claim 1, wherein identifying a plurality of query k-mers from the query sequence further comprises identifying a plurality of query k-mers for the reverse complement of the query sequence.
  - 4. The method of claim 1, wherein each entry of the search index comprises an ordered list of locations for that indexed k-mer.
  - 5. The method of claim 1, wherein each k-mer comprises a sequence of symbols with fixed length k.
  - 6. The method of claim 5, wherein the fixed length k is within an interval [8, 14].
  - 7. The method of claim 1, wherein listing each k-mer's location within the graph further comprises excluding locations identified in a previous path.
  - 8. The method of claim 1, wherein the locations comprise floating point projections for alternate branches.
  - 9. The method of claim 1, wherein substrings are stored in edge objects connected to one another by vertex objects to form a plurality of paths through the graph.
  - 10. The method of claim 1, further comprising identifying a best-fit location of the query sequence to the graph by performing a local alignment of the query sequence to each candidate target.
  - 11. The method of claim 10, wherein performing a local alignment comprises using a multidimensional Smith-Waterman algorithm.
  - 12. The method of claim 1, wherein the query sequence is a non-contiguous query.
  - 13. The method of claim 12, wherein the non-contiguous query comprises a pair of paired-end sequence reads.
  - 14. The method of claim 13, further comprising identifying portions of the graph in which the number of potential matches with different query k-mers is equal to or exceeds a threshold number within a window of ordered locations.
  - 15. The method of claim 14, wherein the window has a size that is less than 1000 base pairs.
  - 16. The method of claim 1, wherein the chromosome is a human chromosome.
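
Three of the dependent-claim details can be illustrated together: query k-mers from the reverse complement (claim 3), an ordered list of locations per entry (claim 4), and skipping locations already listed from a previous path (claim 7). The sketch below is an assumption-laden illustration rather than the claimed method itself: a path is a hypothetical list of `(node_id, substring)` pairs, a location is the `(node_id, offset)` of a k-mer's first base, and the index is keyed by the k-mer string directly as a stand-in for its hash. The toy `k = 3` is chosen for readability and falls outside the interval [8, 14] of claim 6.

```python
import bisect

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def query_kmers(seq, k):
    """Yield k-mers from the query and from its reverse complement."""
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            yield strand[i:i + k]


def path_kmers(path, k):
    """Yield (k-mer, location) where location is the (node_id, offset) of the first base."""
    data = "".join(sub for _, sub in path)
    locs = [(node_id, i) for node_id, sub in path for i in range(len(sub))]
    for p in range(len(data) - k + 1):
        yield data[p:p + k], locs[p]


def add_path(index, path, k):
    """List each k-mer's location, keeping entries ordered and skipping repeats."""
    for kmer, loc in path_kmers(path, k):
        entry = index.setdefault(kmer, [])
        i = bisect.bisect_left(entry, loc)
        if i == len(entry) or entry[i] != loc:  # already listed from a previous path?
            entry.insert(i, loc)


# Two paths sharing nodes 0 and 3 and differing in the middle node (1 vs 2).
ref_path = [(0, "ACGT"), (1, "A"), (3, "CCGG")]
alt_path = [(0, "ACGT"), (2, "G"), (3, "CCGG")]
index = {}
for path in (ref_path, alt_path):
    add_path(index, path, k=3)
```

After both paths are added, k-mers that lie entirely in shared nodes (such as `ACG` in node 0) appear once, while k-mers that cross into a branch (such as `GTA` and `GTG`) each get their own entry at the same starting location.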

17. A system for analyzing a genetic sequence, the system comprising:
- a tangible memory subsystem storing:

  a reference graph, the reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the reference graph; and

  a processor executing instructions configured to:

  identify a plurality of paths through the reference graph, each path representing a concatenation of the substrings of the genomic sequence and known variation in the genomic sequence stored in objects through the path;

  for each path of the plurality of paths, identify a plurality of k-mers in the path, and list each identified k-mer's location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains an ordered list of locations of all k-mers having that index;

  receive a paired-end sequence read comprising a 5′ sequence read and a 3′ sequence read; and

  determine candidate targets for alignment of the sequence read by:

  identifying a plurality of query k-mers from the 5′ sequence read and the 3′ sequence read;

  determining the locations of each query k-mer within the reference graph by reading search index entries indexed according to hashes of query k-mers; and

  identifying portions of the reference graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the sequence read wherein the identifying comprises:

  creating a global ordering of locations corresponding to query k-mers from the 5′ sequence read and the 3′ sequence read; and

  identifying locations in the global ordering in which both the 5′ sequence read and 3′ sequence read have query k-mers within a window.
- Dependent claims (18)
  - 18. The system of claim 17, wherein the window has a size less than 1000 base pairs.
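
The paired-end step of claims 17 and 18 (a global ordering of hit locations, then a search for windows containing k-mers from both mates) might look roughly like the sketch below. It assumes each hit location has already been reduced to a single number on one shared coordinate axis; how graph locations are mapped to such numbers is not shown. The function name `paired_candidates` and the merging of overlapping spans are choices made for this illustration.

```python
def paired_candidates(hits5, hits3, window):
    """Return merged (start, end) spans narrower than `window` in which both mates have hits.

    hits5 / hits3: numeric locations of query k-mer hits from the 5' and 3' reads.
    """
    # Global ordering of all hit locations, each tagged with the mate it came from.
    ordered = sorted([(loc, "5'") for loc in hits5] + [(loc, "3'") for loc in hits3])
    counts = {"5'": 0, "3'": 0}
    spans = []
    lo = 0
    for loc, mate in ordered:
        counts[mate] += 1
        # Shrink from the left until the span is strictly smaller than the window.
        while loc - ordered[lo][0] >= window:
            counts[ordered[lo][1]] -= 1
            lo += 1
        if counts["5'"] and counts["3'"]:
            start = ordered[lo][0]
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], loc)  # extend the overlapping span
            else:
                spans.append((start, loc))
    return spans


# Hits near 100-400 come from both mates; the hits at 5000 and 9000 are isolated.
spans = paired_candidates([100, 110, 5000], [400, 9000], window=1000)
print(spans)
```

A threshold on the number of distinct k-mers per span, as recited in the claim, could be layered on top by counting k-mers instead of testing for mere presence.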

19. A method for analyzing a genetic sequence, the method comprising:
- obtaining a reference graph representing a genomic sequence and known variation in the genomic sequence, in which substrings of the genomic sequence and known variation are stored in objects connected to one another to form a plurality of paths through the graph, wherein at least one path through the graph represents substantially an entire genome;
  
  identifying a data string for each path of the plurality of paths through the graph, each data string representing a concatenation of the substrings of genomic sequence and known variation in the genomic sequence stored in objects through the path;
  
  for each data string:
  
  identifying a plurality of k-mers in the data string; and
  
  listing each identified k-mer's location within the graph in an entry in a search index, wherein that entry is indexed according to a hash of that k-mer and contains locations of all k-mers having that index;
  
  obtaining a query sequence;
  
  identifying a plurality of query k-mers from the query sequence;
  
  determining the locations of at least one query k-mer within the graph by reading search index entries indexed according to hashes of query k-mers; and
  
  identifying portions of the graph in which a number of potential matches with different query k-mers is equal to or exceeds a threshold number as candidate targets within the graph for alignment of segments of the query sequence.
- Dependent claims (20)
  - 20. The method of claim 19, wherein substrings are stored in edge objects connected to one another by vertex objects to form a plurality of paths through the graph.
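
The edge-labeled layout of claims 9 and 20, together with the "data string for each path" step, can be sketched as below. This is a toy under stated assumptions: vertices are plain integers rather than full objects, the `Edge` class and `data_strings` function are names invented for this example, and the graph is assumed to be acyclic with a single source and sink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int  # vertex the edge leaves
    dst: int  # vertex the edge enters
    seq: str  # substring of reference sequence or known variation


def data_strings(edges, source, sink):
    """Concatenate edge substrings along every source-to-sink path (assumes a DAG)."""
    outgoing = {}
    for e in edges:
        outgoing.setdefault(e.src, []).append(e)

    results = []

    def walk(vertex, parts):
        if vertex == sink:
            results.append("".join(parts))
            return
        for e in outgoing.get(vertex, []):
            walk(e.dst, parts + [e.seq])

    walk(source, [])
    return results


# A single-base variant (A or G) between two shared flanking sequences.
edges = [
    Edge(0, 1, "ACGT"),
    Edge(1, 2, "A"),
    Edge(1, 2, "G"),
    Edge(2, 3, "CCGG"),
]
strings = data_strings(edges, source=0, sink=3)
print(strings)
```

Exhaustive enumeration grows exponentially with the number of variant sites, so this recursion is only suitable for small examples; it is not meant to suggest how a chromosome-scale graph would be traversed.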

Current Assignee: Seven Bridges Genomics, Inc.
Original Assignee: Seven Bridges Genomics, Inc.
Inventors: Semenyuk, Vladimir
Primary Examiner(s): Negin, Russell S
Application Number: US15/061,235
Publication Number: US 20160259880A1
Time in Patent Office: 1,061 Days
Field of Search: None
US Class Current
CPC Class Codes:
G16B 15/00   ICT specially adapted for a...
G16B 30/00   ICT specially adapted for s...
G16B 30/10   Sequence alignment; Homolog...
