Method and Apparatus for Protein Sequence Alignment Using FPGA Devices
First Claim
1. A data processing apparatus for sequence alignment searching, the apparatus comprising:
- a first stage configured to (1) receive a bit stream that is representative of a database sequence and (2) process the bit stream to produce a plurality of seeds, each seed being indicative of a similarity between a substring of the database sequence and a substring of a query sequence;
a second stage configured to perform an ungapped extension analysis on the seeds generated by the first stage, thereby generating a plurality of high scoring pairs (HSPs); and
a third stage configured to perform a gapped extension analysis on the HSPs generated by the second stage, thereby determining whether any HSPs exist that will produce alignment of interest between the database sequence and the query sequence; and
wherein the first stage, the second stage, and at least a portion of the third stage are implemented as a data processing pipeline on at least one hardware logic device.
4 Assignments
0 Petitions
Accused Products
Abstract
Disclosed herein is a hardware implementation for performing sequence alignment that preferably deploys a seed generation stage, an ungapped extension stage, and at least a portion of a gapped extension stage as a data processing pipeline on at least one hardware logic device. Hardware circuits for the seed generation stage, the ungapped extension stage, and the gapped extension stage are individually disclosed. In a preferred embodiment, the pipeline is arranged for performing BLASTP sequence alignment searching. Also, in a preferred embodiment, the at least one hardware logic device comprises at least one reconfigurable logic device such as an FPGA.
-
Citations
99 Claims
-
1. A data processing apparatus for sequence alignment searching, the apparatus comprising:
-
a first stage configured to (1) receive a bit stream that is representative of a database sequence and (2) process the bit stream to produce a plurality of seeds, each seed being indicative of a similarity between a substring of the database sequence and a substring of a query sequence;
a second stage configured to perform an ungapped extension analysis on the seeds generated by the first stage, thereby generating a plurality of high scoring pairs (HSPs); and
a third stage configured to perform a gapped extension analysis on the HSPs generated by the second stage, thereby determining whether any HSPs exist that will produce alignment of interest between the database sequence and the query sequence; and
wherein the first stage, the second stage, and at least a portion of the third stage are implemented as a data processing pipeline on at least one hardware logic device. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method for sequence alignment searching, the method comprising:
-
generating a plurality of seeds from a database sequence and a query sequence;
performing an ungapped extension analysis on the generated seeds, thereby generating a plurality of high scoring pairs (HSPs);
performing a gapped extension analysis on the generated HSPs, thereby determining whether any HSPs will produce an alignment of interest between the database sequence and the query sequence; and
performing the seed generating step, ungapped extension analysis step and at least a portion of the gapped extension analysis step in a pipelined fashion on at least one hardware logic device. - View Dependent Claims (7, 8, 9, 10, 11, 13)
-
-
12. The method of 7 wherein the at least one reconfigurable logic device comprises at least one field programmable gate array (FPGA).
-
14. A hardware configured data processing apparatus for sequence alignment searching, the apparatus comprising at least one hardware logic device configured to (1) receive a bit stream that is representative of a database sequence, (2) process the bit stream to produce a plurality of seeds, each seed being indicative of a similarity between a substring of the database sequence and a substring of a query sequence, (3) perform an ungapped extension analysis on the generated seeds, thereby generating a plurality of high scoring pairs (HSPs), and (4) perform a gapped extension analysis on the generated HSPs, thereby determining whether any HSPs exist that will produce alignment of interest between the database sequence and the query sequence.
-
15. A computer-readable medium for sequence alignment searching, the computer-readable medium comprising:
a data structure comprising (1) first logic for a hardware logic device, the first logic configured to (a) receive a bit stream that is representative of a database sequence and (b) process the bit stream to produce a plurality of seeds, each seed being indicative of a similarity between a substring of the database sequence and a substring of a query sequence, (2) second logic for a hardware logic device, the second logic configured to perform an ungapped extension analysis on the generated seeds, thereby generating a plurality of high scoring pairs (HSPs), and (3) third logic for a hardware logic device, the third logic configured to perform a gapped extension analysis on the generated HSPs, thereby determining whether any HSPs exist that will produce alignment of interest between the database sequence and the query sequence, wherein the data structure is configured such that the first logic, the second logic, and the third logic will be implemented as a data processing pipeline on at least one hardware logic device, and wherein the data structure is resident on the computer-readable medium. - View Dependent Claims (16)
-
17. A method of providing a data structure for conversion into a configuration template for configuring a hardware logic device, the method comprising:
-
providing a data structure that is convertible into a configuration template for loading onto a hardware logic device, the data structure comprising logic for a seed generation stage for a BLAST pipeline, an ungapped analysis stage for a BLAST pipeline, and a gapped analysis stage for a BLAST pipeline, and providing a tool for use in a process of converting the data structure into the configuration template.
-
-
18. A method of converting a data structure to a new data structure, the method comprising:
-
accessing a first data structure that is convertible into a configuration template for loading onto a hardware logic device, the first data structure comprising logic for a seed generation stage for a BLAST pipeline, an ungapped analysis stage for a BLAST pipeline, and a gapped analysis stage for a BLAST pipeline, and using at least one tool to convert the first data structure into a second data structure, the second data structure comprising one selected from the group consisting of (1) a new data structure that is also convertible into a configuration template for loading onto a hardware logic device, and (2) a configuration template for loading onto a hardware logic device.
-
-
19. An apparatus for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the apparatus comprising:
-
a hit generator, the hit generator comprising (1) a memory configured as a lookup table and (2) a table lookup unit in communication with the memory;
wherein the lookup table is configured to store a plurality of hit identifiers, each hit identifier corresponding to a hit between a database substring and a query substring;
wherein the table lookup unit is configured to (1) receive a bit stream corresponding to a plurality of database substrings, and (2) for each received database substring, access the lookup table to retrieve therefrom each hit identifier corresponding to that database substring; and
wherein at least the table lookup unit is implemented on a hardware logic device. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
-
-
31. A method for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
maintaining a lookup table that stores a plurality of hit identifiers, each hit identifier corresponding to a hit between a database substring and a query substring;
receiving a bitstream comprising a plurality of database substrings;
for each received database substring, accessing the lookup table to retrieve therefrom each hit identifier corresponding to that database substring; and
wherein the receiving and accessing steps are performed by a hardware logic device. - View Dependent Claims (32, 33, 34, 35, 36)
-
-
37. A computer-readable medium for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the computer-readable medium comprising:
a data structure comprising logic for a table lookup unit for accessing a lookup table, the lookup table being configured to store a plurality of hit identifiers, each hit identifier corresponding to a hit between a database substring and a query substring, the table lookup unit being configured to (1) receive a bit stream corresponding to a plurality of database substrings, and (2) for each received database substring, access the lookup table to retrieve therefrom each hit identifier corresponding to that database substring, wherein the data structure is configured such that the table lookup unit will be implemented on a hardware logic device, and wherein the data structure is resident on the computer-readable medium. - View Dependent Claims (38)
-
39. An apparatus for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the apparatus comprising:
-
a word matching module configured to (1) receive a bitstream comprising a plurality of data substrings and (2) find a plurality of hits between the data substrings and a plurality of query substrings; and
a hit filtering module in communication with the word matching module, the hit filtering module configured to (1) receive a stream of hits from the word matching module and (2) filter the received hits based at least in part upon whether a plurality of hits are determined to be sufficiently close to each other in the database sequence; and
wherein the hit filtering module is implemented on a hardware logic device. - View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51)
-
-
52. A method for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
receiving a bitstream comprising a plurality of data substrings;
finding a plurality of hits between the data substrings and a plurality of query substrings; and
filtering the received hits based at least in part upon whether a plurality of hits are determined to be sufficiently close to each other in the database sequence; and
wherein the filtering step is performed by a hardware logic device. - View Dependent Claims (53, 54, 55, 56, 57)
-
-
58. A computer-readable medium for sequence alignment searching to find a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the computer-readable medium comprising:
a data structure comprising (1) first logic for a word matching module configured to (a) receive a bitstream comprising a plurality of data substrings and (b) find a plurality of hits between the data substrings and a plurality of query substrings, and (2) second logic for a hit filtering module in communication with the word matching module, the hit filtering module configured to (a) receive a stream of hits from the word matching module and (b) filter the received hits based at least in part upon whether a plurality of hits are determined to be sufficiently close to each other in the database sequence, wherein the data structure is configured such that the word matching module and the hit filtering module will be implemented on a hardware logic device as a data processing pipeline, and wherein the data structure is resident on the computer-readable medium. - View Dependent Claims (59)
-
60. A method for populating a lookup table for use in finding hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
for each of a plurality of query substrings, determining each position in the query sequence therefor;
modular delta encoding the determined positions; and
storing the modular delta encoded positions in a lookup table that is addressed at least partially by a plurality of item substrings corresponding to all possible combinations of items having a length equal to the query substrings. - View Dependent Claims (61, 62, 63, 64, 65, 66, 67)
-
-
68. A computer readable medium for populating a lookup table for use in finding hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the computer readable medium comprising:
-
a code segment executable by a processor for, for each of a plurality of query substrings, determining each position in the query sequence therefor;
a code segment executable by a processor for modular delta encoding the determined positions; and
a code segment executable by a processor for storing the modular delta encoded positions in a lookup table that is addressed at least partially by a plurality of item substrings corresponding to all possible combinations of items having a length equal to the query substrings.
-
-
69. An apparatus for sequence alignment searching for finding pairs of interest corresponding to a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the apparatus comprising:
-
a module for ungapped extension analysis, the module being configured to (1) receive a stream of hits between a plurality of database substrings and a plurality of query substrings, and (2) identify which hits are high scoring pairs (HSPs) on the basis of a scoring matrix; and
wherein the ungapped extension analysis module is implemented on at least one hardware logic device. - View Dependent Claims (70, 71, 72, 73, 74)
-
-
75. A method for sequence alignment searching to find pairs of interest corresponding to a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
receiving a stream of hits between a plurality of database substrings and a plurality of query substrings;
performing an ungapped extension analysis on the received hits using a scoring matrix to identify which hits are high scoring pairs (HSPs); and
wherein the receiving step and the performing step are performed by at least one hardware logic device. - View Dependent Claims (76, 77, 78, 79, 80)
-
-
81. A computer-readable medium for sequence alignment searching for finding pairs of interest corresponding to a plurality of hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the computer-readable medium comprising:
a data structure comprising logic for a module for ungapped extension analysis, the module being configured to (1) receive a stream of hits between a plurality of database substrings and a plurality of query substrings, and (2) identify which hits are high scoring pairs (HSPs) on the basis of a scoring matrix, wherein the data structure is configured such that the ungapped extension analysis module will be implemented on a hardware logic device, and wherein the data structure is resident on the computer-readable medium. - View Dependent Claims (82)
-
83. An apparatus for sequence alignment searching using hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the apparatus comprising:
-
a module for gapped extension analysis, the module being configured to (1) receive a stream of hits between a plurality of database substrings and a plurality of query substrings, and (2) identify the hits in the hit stream for which there exists an alignment of interest that exceeds a threshold using a banded Smith-Waterman algorithm;
wherein the gapped extension analysis module is implemented on at least one hardware logic device. - View Dependent Claims (84, 85, 86, 87)
-
-
88. A method for sequence alignment searching using hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
receiving a stream of hits between a plurality of database substrings and a plurality of query substrings; and
performing a banded Smith-Waterman gapped extension analysis on the received hits to identify whether any hits will produce an alignment that exceeds a threshold within a band geometry; and
wherein the receiving step and the performing step are performed by at least one hardware logic device. - View Dependent Claims (89)
-
-
90. A computer-readable medium for sequence alignment searching using hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the computer-readable medium comprising:
a data structure comprising logic for a module for gapped extension analysis, the module being configured to (1) receive a stream of hits between a plurality of database substrings and a plurality of query substrings, and (2) identify the hits in the hit stream for which there exists an alignment of interest that exceeds a threshold using at least one selected from the group consisting of a banded Smith-Waterman algorithm and a seeded Smith-Waterman algorithm, wherein the data structure is configured such that the gapped extension analysis module will be implemented on a hardware logic device, and wherein the data structure is resident on the computer-readable medium. - View Dependent Claims (91)
-
92. An apparatus for sequence alignment searching using hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the apparatus comprising:
-
a module for gapped extension analysis, the module being configured to (1) receive a stream of hits between a plurality of database substrings and a plurality of query substrings, and (2) identify hits within the hit stream for which there exists an alignment that exceeds a threshold using a seeded Smith-Waterman algorithm;
wherein the gapped extension analysis module is implemented on at least one hardware logic device. - View Dependent Claims (93)
-
-
94. A method for sequence alignment searching using hits between a plurality of database substrings and a plurality of query substrings, each database substring corresponding to a plurality of items of a database sequence and each query substring corresponding to a plurality of items of a query sequence, the method comprising:
-
receiving a stream of hits between a plurality of database substrings and a plurality of query substrings; and
performing a gapped extension analysis on the received hits using a seeded Smith-Waterman algorithm to identify whether any hits correspond to an alignment of interest; and
wherein the receiving step and the performing step are performed by at least one hardware logic device. - View Dependent Claims (95)
-
-
96. A computer-implemented method comprising:
-
storing a first template that defines a first sequence analysis algorithm;
storing a second template that defines a second sequence analysis algorithm;
selecting one of the stored templates; and
loading the selected template onto a reconfigurable logic device to thereby configure that reconfigurable logic device to perform a sequence analysis corresponding to the loaded template when a database sequence is streamed through the reconfigurable logic device. - View Dependent Claims (97)
-
-
98. A computer-implemented method comprising:
-
selecting whether a first sequence analysis algorithm or a second sequence analysis algorithm is to be performed;
defining a template for the selected one of the first and second sequence analysis algorithms; and
loading the defined template onto a reconfigurable logic device to thereby configure that reconfigurable logic device to perform a sequence analysis corresponding to the loaded template when a database sequence is streamed through the reconfigurable logic device. - View Dependent Claims (99)
-
Specification