Method and apparatus for performing similarity searching

US 10,580,518 B2
Filed: 01/11/2017
Issued: 03/03/2020
Est. Priority Date: 03/03/2005
Status: Active Grant

Abstract

A system and method for performing similarity searching is disclosed wherein programmable logic devices such as field programmable gate arrays (FPGAs) can be used to implement Bloom filters for identifying possible matches between a query and data. The Bloom filters can be implemented in a parallel architecture where the different parallel Bloom filters share access to the same memory units. Further, a hash table may be generated to map a set of strings to keys. In other examples, the hash table may be used to map a set of substrings to a position in a larger string.
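
The Bloom-filter screening described in the abstract can be illustrated in software. The following is a minimal sketch, not the disclosed FPGA design: the class name `BloomFilter`, the helper names `substrings` and `possible_matches`, the filter size, the number of hash functions, and the use of salted BLAKE2b digests as hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter. Sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive one bit position per hash function from salted BLAKE2b digests
        # (a software stand-in for hardware hash functions).
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def substrings(seq, w):
    """All length-w substrings of seq, in order of starting position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def possible_matches(query, database, w):
    """Return (database_position, substring) pairs that pass the filter."""
    bloom = BloomFilter()
    for wmer in substrings(query, w):
        bloom.add(wmer)
    return [
        (pos, wmer)
        for pos, wmer in enumerate(substrings(database, w))
        if wmer in bloom
    ]
```

A Bloom filter never misses a substring that was added to it, but it can report substrings that were not, which is why the results are only possible matches and need a later check.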

385 Citations

34 Claims

1. A system for generating a hash table for use in comparing a first biosequence string with a second biosequence string to assess similarity between the first and second biosequence strings, the system comprising:
- a processor configured to provide hashing on a plurality of substrings of the first biosequence string to (1) map each substring of the first biosequence string to a location in a hash table, and (2) generate the hash table, the hash table being configured to store an entry at each mapped location that is populated with a pointer to a position in the first biosequence string for the substring of the first biosequence string mapped to that location;
  
  a memory for storing the hash table; and
  
  a field programmable gate array (FPGA) configured to (1) detect substrings of the second biosequence string that are possible matches to substrings of the first biosequence string, and (2) link the detected substrings of the second biosequence string to corresponding positions in the first biosequence string where the detected substrings are located by applying hashing logic to the detected substrings as against the hash table to retrieve the pointers from the hash table entries to which the hashing logic maps the detected substrings.
- Dependent claims (2, 3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20):
  - 2. The system of claim 1, wherein the processor is further configured to provide near perfect hashing on the substrings of the first biosequence string to map the substrings of the first biosequence string to locations in the hash table.
  - 3. The system of claim 1, wherein the processor is further configured to provide perfect hashing on the substrings of the first biosequence string to map the substrings of the first biosequence string to locations in the hash table.
  - 7. The system of claim 1, wherein the pointers comprise pointers to starting positions in the first biosequence string for the substrings of the first biosequence string mapped to the entries.
  - 8. The system of claim 1, wherein each of a plurality of the entries further comprises a duplicate bit that is indicative of whether the first biosequence string includes multiple occurrences of the substrings mapped to those entries;
    - and wherein the processor is further configured to set the duplicate bits for the entries based on whether multiple occurrences of any of the substrings are found in the first biosequence string.
  - 9. The system of claim 8, wherein the hash table further comprises a duplicate table that identifies pointers to positions in the first biosequence string for duplicate substrings of the first biosequence string mapped to the entries.
  - 10. The system of claim 1, wherein the memory is resident on the FPGA.
  - 11. The system of claim 1, wherein the memory is external from the FPGA.
  - 12. The system of claim 1, wherein the FPGA comprises a Bloom filter to detect substrings of the second biosequence string that are possible matches to substrings of the first biosequence string.
  - 13. The system of claim 1, wherein the application of the detected substrings to the hashing logic is further configured to eliminate a plurality of false positives from the detected substrings, wherein the eliminated false positives correspond to detected substrings that are not mapped by the hashing logic to hash table entries that are populated with pointers for the mapped substrings of the first biosequence string.
  - 14. The system of claim 13, wherein the eliminated false positives are detected substrings of the second biosequence string that are mapped by the hashing logic to empty entries in the hash table.
  - 15. The system of claim 1, wherein the FPGA is configured as a multistage pipeline that includes pipeline stages for the detect and link operations on a stream of the second biosequence string.
  - 16. The system of claim 13, wherein the multistage pipeline provides BLAST Stage 1 operations.
  - 17. The system of claim 1, wherein the first biosequence string is a query sequence of DNA bases, and wherein the second biosequence string is a database sequence of DNA bases.
  - 18. The system of claim 1, wherein the FPGA further comprises an ungapped extension filter, and wherein the FPGA is further configured to apply the detected substrings linked to positions in the first biosequence string to the ungapped extension filter.
  - 19. The system of claim 18, wherein the FPGA is further configured to identify (1) windows of the second biosequence string around the detected substrings linked to positions in the first biosequence string, and (2) corresponding windows of the first biosequence string around the linked positions for the detected substrings.
  - 20. The system of claim 19, wherein the ungapped extension filter is configured to (1) quantify a similarity between pairs of longer substrings of the first and second biosequence strings within the identified corresponding windows, and (2) identify the pairs for which the quantified similarity is above a threshold.
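
The hash table of claims 1, 7 to 9, 13, and 14 can be sketched in software as follows. This is an illustration under stated assumptions, not the claimed implementation: it swaps in a direct 2-bit encoding of DNA bases as the hash function (collision-free for fixed-length DNA substrings, but not necessarily the perfect or near perfect hashing the claims refer to), and the names `Entry`, `hash_wmer`, `build_hash_table`, and `link` are hypothetical.

```python
from dataclasses import dataclass

# Assumed 2-bit encoding; other characters (e.g. "N") raise KeyError.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


@dataclass
class Entry:
    pointer: int             # starting position of the substring in the first string
    duplicate: bool = False  # set when the substring occurs more than once


def hash_wmer(wmer):
    """Map a DNA substring to a table slot by packing 2 bits per base."""
    h = 0
    for base in wmer:
        h = (h << 2) | BASE_CODE[base]
    return h


def build_hash_table(query, w):
    """Build the primary table plus a duplicate table of extra positions."""
    table = [None] * (4 ** w)
    duplicates = {}  # slot -> positions of the second and later occurrences
    for pos in range(len(query) - w + 1):
        slot = hash_wmer(query[pos:pos + w])
        entry = table[slot]
        if entry is None:
            table[slot] = Entry(pointer=pos)
        else:
            entry.duplicate = True
            duplicates.setdefault(slot, []).append(pos)
    return table, duplicates


def link(candidates, table, duplicates):
    """Turn (db_pos, substring) candidates into (db_pos, query_pos) pairs.

    Candidates that hash to an empty entry are dropped as false positives.
    """
    linked = []
    for db_pos, wmer in candidates:
        slot = hash_wmer(wmer)
        entry = table[slot]
        if entry is None:
            continue  # empty entry: the substring is not in the first string
        linked.append((db_pos, entry.pointer))
        if entry.duplicate:
            linked.extend((db_pos, q_pos) for q_pos in duplicates[slot])
    return linked
```

For example, with the query `"ACGTACGT"` and substring length 4, `"ACGT"` occurs at positions 0 and 4, so its entry holds pointer 0 with the duplicate bit set and the duplicate table holds position 4.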

4. A method for generating a hash table for use in comparing a first biosequence string with a second biosequence string to assess similarity between the first and second biosequence strings, the method comprising:
- hashing a plurality of substrings of the first biosequence string with a processor to map each substring of the first biosequence string to a location in a hash table;
  
  generating the hash table, the hash table storing an entry at each mapped location that is populated with a pointer to a position in the first biosequence string for the substring of the first biosequence string mapped to that location;
  
  storing the hash table within a memory;
  
  a field programmable gate array (FPGA) detecting substrings of the second biosequence string that are possible matches to substrings of the first biosequence string; and
  
  the FPGA linking the detected substrings of the second biosequence string to corresponding positions in the first biosequence string where the detected substrings are located by applying hashing logic to the detected substrings as against the hash table to retrieve the pointers from the hash table entries to which the hashing logic maps the detected substrings.
- Dependent claims (5, 6, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34):
  - 5. The method of claim 4, wherein the hashing step is performed using perfect hashing.
  - 6. The method of claim 4, wherein the hashing step is performed using near perfect hashing.
  - 21. The method of claim 4, wherein the pointers comprise pointers to starting positions in the first biosequence string for the substrings of the first biosequence string mapped to the entries.
  - 22. The method of claim 4, wherein each of a plurality of the entries further comprises a duplicate bit that is indicative of whether the first biosequence string includes multiple occurrences of the substrings mapped to those entries, the method further comprising:
    - the processor setting the duplicate bits for the entries based on whether multiple occurrences of any of the substrings are found in the first biosequence string.
  - 23. The method of claim 22, wherein the generating step further comprises generating the hash table so that the hash table further comprises a duplicate table that identifies pointers to positions in the first biosequence string for duplicate substrings of the first biosequence string mapped to the entries.
  - 24. The method of claim 4, wherein the memory is resident on the FPGA.
  - 25. The method of claim 4, wherein the memory is external from the FPGA.
  - 26. The method of claim 4, wherein the detecting step comprises a Bloom filter on the FPGA detecting substrings of the second biosequence string that are possible matches to substrings of the first biosequence string.
  - 27. The method of claim 4, wherein the application of the detected substrings to the hashing logic is further configured to eliminate a plurality of false positives from the detected substrings, wherein the eliminated false positives correspond to detected substrings that are not mapped by the hashing logic to hash table entries that are populated with pointers for the mapped substrings of the first biosequence string.
  - 28. The method of claim 27, wherein the eliminated false positives are detected substrings of the second biosequence string that are mapped by the hashing logic to empty entries in the hash table.
  - 29. The method of claim 4, wherein the FPGA is configured as a multistage pipeline that performs the detecting and linking steps in a pipelined manner on a stream of the second biosequence string.
  - 30. The method of claim 29, wherein the multistage pipeline provides BLAST Stage 1 operations.
  - 31. The method of claim 4, wherein the first biosequence string is a query sequence of DNA bases, and wherein the second biosequence string is a database sequence of DNA bases.
  - 32. The method of claim 4, wherein the FPGA further comprises an ungapped extension filter, the method further comprising:
    - the FPGA applying the detected substrings linked to positions in the first biosequence string to the ungapped extension filter.
  - 33. The method of claim 32, further comprising:
    - the FPGA identifying (1) windows of the second biosequence string around the detected substrings linked to positions in the first biosequence string, and (2) corresponding windows of the first biosequence string around the linked positions for the detected substrings.
  - 34. The method of claim 33, wherein the applying step further comprises:
    - the FPGA quantifying a similarity between pairs of longer substrings of the first and second biosequence strings within the identified corresponding windows; and
      
      the FPGA identifying the pairs for which the quantified similarity is above a threshold.
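
The windowing and thresholding of claims 32 to 34 (and claims 18 to 20) can be sketched as follows. This is a minimal software illustration under assumptions, not the claimed filter: the function names `ungapped_score` and `passes_filter`, the flank size, the match and mismatch scores, and the choice of scoring the best contiguous run inside the window are all hypothetical.

```python
def ungapped_score(query, database, q_pos, d_pos, w,
                   flank=8, match=1, mismatch=-3):
    """Score the best ungapped segment in aligned windows around a seed.

    q_pos and d_pos are the seed's starting positions in the first (query)
    and second (database) strings; w is the seed length.
    """
    # Clip the flanks so both windows stay inside their strings.
    left = min(flank, q_pos, d_pos)
    right = min(flank, len(query) - (q_pos + w), len(database) - (d_pos + w))
    q_win = query[q_pos - left:q_pos + w + right]
    d_win = database[d_pos - left:d_pos + w + right]

    # Maximum-scoring contiguous run of aligned positions (no gaps allowed).
    best = running = 0
    for a, b in zip(q_win, d_win):
        running = max(0, running + (match if a == b else mismatch))
        best = max(best, running)
    return best


def passes_filter(linked, query, database, w, threshold):
    """Keep (d_pos, q_pos) pairs whose window score is above the threshold."""
    return [
        (d_pos, q_pos)
        for d_pos, q_pos in linked
        if ungapped_score(query, database, q_pos, d_pos, w) > threshold
    ]
```

A seed that sits inside a longer run of matching bases scores well above one that is surrounded by mismatches, so only the former survives the threshold.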

Current Assignee: Washington University In St Louis
Original Assignee: Washington University In St Louis
Inventors: Buhler, Jeremy Daniel; Chamberlain, Roger Dean; Franklin, Mark Allen; Gyang, Kwame; Jacob, Arpith Chacko; Krishnamurthy, Praveen; Lancaster, Joseph Marion
Primary Examiner(s): Zeman, Mary K
Application Number: US15/403,687
Publication Number: US 20170124255A1
Time in Patent Office: 1,147 Days

CPC Class Codes

G06F 16/2255   Hash tables

G16B 30/00   ICT specially adapted for s...

G16B 30/10   Sequence alignment; Homolog...

G16B 50/00   ICT programming tools or da...

G16B 50/30   Data warehousing; Computing...
