EFFICIENT GENOMIC READ ALIGNMENT IN AN IN-MEMORY DATABASE

US 20140214334A1
Filed: 01/27/2014
Published: 07/31/2014
Est. Priority Date: 01/28/2013
Status: Active Grant

Abstract

A high-performance, low-cost gapped read alignment algorithm is disclosed that produces high-quality alignments of a complete human genome in a few minutes. Additionally, the algorithm is more than an order of magnitude faster than previous approaches using a low-cost workstation. The results are obtained via careful algorithm engineering of the seeding-based approach. The use of non-hashed seeds in combination with techniques from search engine ranking achieves fast, cache-efficient processing. The algorithm can also be efficiently parallelized. Integration into an in-memory database infrastructure (IMDB) leads to low overhead for data management and further analysis.
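
As a rough illustration of the seed-based approach the abstract describes, the following Python sketch builds a directly addressed (non-hashed) seed index by packing each k-mer into an integer at 2 bits per base, then lets the non-overlapping seeds of a read vote for candidate start positions. The function names, the seed length `k`, and the simple voting scheme are illustrative assumptions, not taken from the patent. Gapped alignment and the ranking heuristics are omitted, and reads are assumed to contain only A, C, G, and T.

```python
from collections import Counter

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base (direct addressing, no hashing)."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(reference, k):
    """Build a table with 4**k slots; slot i lists the reference positions of k-mer i."""
    index = [[] for _ in range(4 ** k)]
    for pos in range(len(reference) - k + 1):
        index[encode(reference[pos:pos + k])].append(pos)
    return index


def candidate_positions(read, index, k):
    """Split the read into non-overlapping seeds and count votes per implied read start."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index[encode(read[offset:offset + k])]:
            # A seed at read offset `offset` hitting reference position `hit`
            # implies the read starts at `hit - offset`.
            votes[hit - offset] += 1
    return votes.most_common()


if __name__ == "__main__":
    idx = build_index("ACGTACGGTCATTGCA", 4)
    print(candidate_positions("ACGGTCAT", idx, 4))
```

In this toy setup, a start position supported by several seeds is a stronger candidate than one supported by a single seed; a real aligner would verify the top candidates with a gapped alignment step.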

70 Citations

16 Claims

1. A computer-based system for processing nucleotide sequence data, which are provided as reads, wherein the system has an interface for importing the nucleotide sequence data from a sequencer machine (M), comprising:
- a platform layer for holding process logic and an in-memory database system (IMDB) for processing nucleotide sequence data, wherein the platform layer comprises:
  
  a worker framework with a plurality of workers, wherein each worker is running on a node of a cluster and wherein the workers are processing in parallel, wherein all results and intermediate results are stored in the in-memory database (IMDB), and with an alignment coordinator, which is adapted to provide the in-memory database system (IMDB) with a modified alignment functionality.
  - 2. The system according to claim 1, wherein the system further comprises:
    - an updater framework for automatically downloading and importing annotation updates from external sources into the in-memory database (IMDB).
  - 3. The system according to claim 1, wherein the system further comprises:
    - a user interface (UI) with at least a genome browser, which comprises
      - a section for displaying a comparison of the nucleotide sequence and multiple referenced cell lines/genomes and/or a reference sequence,
      - a section for displaying combined analysis information from multiple external databases, and
      - a section for selecting instructions for data processing for specific pipeline configurations.
  - 4. The system according to claim 3, wherein the specific pipeline configurations are an alignment of the genomic sequence data.

5. A computer-implemented method for processing human or non-human nucleotide sequence data with an in-memory database (IMDB), the method comprising:
- providing a cluster with a set of computing nodes with multiple CPU cores, each implementing a worker for parallel data processing,
- providing nucleotide sequence data as reads in the in-memory database (IMDB) and performing data processing concurrently to sequencing, wherein the data processing comprises:
  
  aligning chunks of the read in parallel on the set of computing nodes and aggregating partial aligning results (AR) to an alignment result to be stored in the in-memory database (IMDB).
  - 6. The method according to claim 5, further comprising:
    - executing variant calling in parallel on the set of computing nodes and aggregating partial variant calling results (VCR) to a variant calling result, and
    - automatically analyzing the variant calling result by accessing an updater framework in the in-memory database (IMDB), wherein the updater framework regularly and automatically checks a plurality of different external annotation sources for updates and which automatically downloads and imports said updates in the in-memory database (IMDB).
  - 7. The method according to claim 5, wherein the alignment is directly implemented in the in-memory database system (IMDB).
  - 8. The method according to claim 5, wherein the alignment is seed-based and a search strategy is used in order to evaluate previous matches for applying heuristics for earlier termination conditions.
  - 9. The method according to claim 5, wherein the alignment is based on a heuristic means in order to apply efficient algorithms to a high fraction of the reads and, optionally, to apply complex alignment algorithms to a small fraction of the reads.
  - 10. The method according to claim 5, wherein long hit lists are used for scoring previously found matches and/or hits.
  - 11. The method according to claim 5, wherein short hit lists are handled separately from long hit lists and are used for scoring of previously found matches and for finding new positions.
  - 12. The method according to claim 5, wherein alignment is based on a double indexing, in that hits from each of two subsequent seeds in a reference genome are combined and stored in a separate smaller index structure once a configurable threshold for seed matches in the two subsequent seeds is exceeded.
  - 13. The method according to claim 5, wherein alignment is executed on workers in parallel on different processing nodes in a distributed system and beyond boundaries of a computer node or processor.
  - 14. The method according to claim 5, wherein for alignment each read is divided into non-overlapping seeds.
  - 15. The method according to claim 5, wherein a single or a two-array index data structure is generated and stored in the in-memory database (IMDB).
  - 16. The method according to claim 5, wherein an index data structure is replicated over local memory of processor sockets or over multiple nodes of the cluster in order to allow for on-the-fly read alignments on a massive parallel machine.
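
The chunk-wise parallel alignment with aggregation of partial results recited in claim 5 can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the input is read as chunks of reads, threads stand in for workers on cluster nodes, a plain dictionary stands in for the result table in the IMDB, and `align_read` is a hypothetical exact-match placeholder rather than the claimed aligner. The names `align_chunk`, `parallel_align`, `n_workers`, and `chunk_size` are likewise invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def align_read(read, reference):
    """Hypothetical placeholder aligner: exact-match position, or -1 if absent."""
    return reference.find(read)


def align_chunk(chunk, reference):
    """One worker's task: align every (read_id, sequence) pair in its chunk."""
    return {read_id: align_read(seq, reference) for read_id, seq in chunk}


def parallel_align(reads, reference, n_workers=4, chunk_size=2):
    """Split reads into chunks, align the chunks concurrently, merge partial results."""
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    result = {}  # stand-in for the alignment result stored in the IMDB
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda c: align_chunk(c, reference), chunks):
            result.update(partial)  # aggregate partial alignment results (AR)
    return result


if __name__ == "__main__":
    reads = [("r1", "ACGG"), ("r2", "TCAT"), ("r3", "GGGG")]
    print(parallel_align(reads, "ACGTACGGTCATTGCA"))
```

Because each chunk is processed independently and the merge step only combines dictionaries keyed by read identifier, the same pattern could in principle start on the first chunks while later reads are still arriving from the sequencer.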

Current Assignee: Hasso-Plattner-Institut für Softwaresystemtechnik GmbH
Original Assignee: Hasso-Plattner-Institut für Softwaresystemtechnik GmbH
Inventors: Plattner, Hasso; Schapranow, Matthieu-Patrick; Ziegler, Emanuel

Granted Patent: US 10,381,106 B2
US Class Current: 702/19
CPC Class Codes:
- G16B 30/00   ICT specially adapted for s...
- G16B 30/10   Sequence alignment; Homolog...
- G16B 45/00   ICT specially adapted for b...
- G16B 50/00   ICT programming tools or da...
- G16B 50/30   Data warehousing; Computing...
