Efficient query processing in columnar databases using bloom filters

US 9,367,574 B2
Filed: 03/02/2015
Issued: 06/14/2016
Est. Priority Date: 02/21/2013
Status: Active Grant

Abstract

A bloom filter is generated for efficient query processing for unsorted data in a column of a columnar database. Bloom filters represented as bitmaps are generated for data blocks storing data for a column of a columnar database table. An indication of a query directed toward the column is received and the bloom filter for each data block is examined to determine which ones of the data blocks do not need to be read in order to service the query for the select data. Data is then read from the data blocks storing data for the column excepting the ones which do not need to be read.

20 Claims

1. A distributed data warehouse system, comprising:
- a plurality of nodes, wherein at least some nodes of the plurality of nodes each comprise:
  
  storage for a columnar database table, wherein said storage comprises a plurality of data blocks; and
  
  a data access module;
  
  the data access module, configured to:
  
  generate a probabilistic data structure for each of one or more data blocks storing data for a column of the columnar database table, wherein each probabilistic data structure indicates data values not stored in the data block;
  
  receive an indication of a query directed to the column of the columnar database table for select data; and
  
  in response to the receipt of the indication of the query, examine the probabilistic data structure for each of the one or more data blocks storing data for the column to determine particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data.
- Dependent claims (2, 3, 4, 6, 7):
  - 2. The system of claim 1, wherein the data access module is further configured to:
    - receive additional data to be stored in an additional data block for the column of the columnar database table; and
      
      generate an additional probabilistic data structure for the additional data block.
  - 3. The system of claim 1, wherein to generate the probabilistic data structure for each of the one or more data blocks storing the data for the column of the columnar database table, the data access module is configured to:
    - generate a bitmap representing the probabilistic data structure and comprising a plurality of bits; and
      
      populate the bitmap with the different patterns of set bits based, at least in part, on the data stored in the data block to produce the probabilistic data structure.
  - 4. The system of claim 1, wherein the one or more nodes of the plurality of nodes are one or more compute nodes of a data warehouse cluster, wherein a different node of the plurality of nodes is a leader node of the data warehouse cluster, and wherein the leader node is configured to send one or more queries directed to the column of the columnar database table to the one or more compute nodes.
  - 6. The method of claim 2, wherein said generating the probabilistic data structure for each of the one or more data blocks storing the data for the column of the columnar database table comprises:
    - generating a bitmap representing the probabilistic data structure and comprising a plurality of bits; and
      
      populating the bitmap with the different patterns of set bits based, at least in part, on the data stored in the data block to produce the probabilistic data structure.
  - 7. The method of claim 6, wherein said examining the probabilistic data structure for each of the one or more data blocks comprises:
    - for a given data block:
      
      for each data value of the select data:
      
      determining bit pattern locations corresponding to the data value; and
      
      examining the bit pattern locations in the bitmap representing the probabilistic data structure for the given data block to determine whether the given data block is one of the particular ones which do not need to be read in order to service the query for the select data.
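
The system claims above (per-block filters, an additional filter for each additional block, the bit-position check, and a leader node forwarding queries to compute nodes) can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the patented implementation: the names `ComputeNode`, `LeaderNode`, and `bit_positions`, the 512-bit filter size, the three bit positions per value, and the SHA-256-based double hashing are all hypothetical choices made for the example.

```python
import hashlib

NUM_BITS = 512   # assumed filter size per data block
NUM_HASHES = 3   # assumed number of bit positions set per value


def bit_positions(value):
    """Map a value to NUM_HASHES bit positions (hypothetical hashing scheme)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


class ComputeNode:
    """Holds the data blocks of one column plus one bitmap per block."""

    def __init__(self):
        self.blocks = []
        self.bitmaps = []

    def store_block(self, values):
        # Each additional block gets its own additional filter.
        bitmap = 0
        for value in values:
            for pos in bit_positions(value):
                bitmap |= 1 << pos
        self.blocks.append(list(values))
        self.bitmaps.append(bitmap)

    def block_can_be_skipped(self, index, wanted):
        # For each requested value, examine its bit positions in the bitmap.
        bitmap = self.bitmaps[index]
        for value in wanted:
            if all((bitmap >> pos) & 1 for pos in bit_positions(value)):
                return False  # value may be present, so the block must be read
        return True  # every requested value is definitely absent

    def run_query(self, wanted):
        rows = []
        for index, block in enumerate(self.blocks):
            if not self.block_can_be_skipped(index, wanted):
                # Reading the block; an exact check removes false positives.
                rows.extend(v for v in block if v in wanted)
        return rows


class LeaderNode:
    """Forwards a query on the column to every compute node."""

    def __init__(self, compute_nodes):
        self.compute_nodes = compute_nodes

    def run_query(self, wanted):
        rows = []
        for node in self.compute_nodes:
            rows.extend(node.run_query(wanted))
        return rows
```

A block is skipped only when its bitmap shows that every requested value is absent. A false positive costs one unnecessary block read, but a stored value is never missed, because every bit set for it at write time is still set at query time.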

5. A method, comprising:
- performing, by one or more computing devices:
  
  generating a probabilistic data structure for each of one or more data blocks storing data for a column of a columnar database table, wherein each probabilistic data structure indicates data values not stored in the data block;
  
  receiving an indication of a query directed to the column of the columnar database table for select data; and
  
  in response to receiving the indication of the query, examining the probabilistic data structure for each of the one or more data blocks storing data for the column to determine particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data.
- Dependent claims (8, 9, 10, 11, 12, 13, 14, 15):
  - 8. The method of claim 5, further comprising in response to receiving the indication of the query, reading the data from the one or more data blocks storing data for the column in order to service the query for the select data excepting the particular ones of the one or more data blocks which do not need to be read.
  - 9. The method of claim 5, wherein the data for the column of the columnar database table is unsorted.
  - 10. The method of claim 5, further comprising for each of the one or more data blocks, storing the probabilistic data structure in a respective entry in a block metadata data structure that stores information about the one or more data blocks.
  - 11. The method of claim 5, further comprising:
    - receiving additional data to be stored in one of the one or more data blocks for the column of the columnar database table; and
      
      updating the probabilistic data structure for the one data block to include the additional data.
  - 12. The method of claim 5, wherein the indication of the query further indicates that the query is a data join query, wherein the plurality of computing devices are part of a larger collection of computing devices implementing a database cluster in a distributed data warehouse system, wherein the plurality of computing devices implement a compute node of the database cluster, wherein another plurality of computing devices that are part of the larger collection of computing devices implement a different compute node storing another columnar database table, and wherein the method further comprises:
    - performing, by the compute node:
      
      in response to receiving the indication of the query, sending the probabilistic data structure for each of at least some of the one or more data blocks to the different compute node;
      
      performing, by the different compute node:
      
      receiving the indication of the data join query for the select data;
      
      receiving the probabilistic data structure for each of the at least some of the one or more data blocks from the compute node; and
      
      in response to receiving the indication of the data join query and receiving the probabilistic data structure for the at least some of the one or more data blocks, obtaining data from data blocks storing data for the other columnar database table based, at least in part, on the probabilistic data structure for each of the at least some data blocks in order to service the data join query for the select data.
  - 13. The method of claim 5, further comprising:
    - detecting an indexing event for the column of the columnar database table; and
      
      in response to detecting the indexing event:
      
      for each of the one or more data blocks, generating a new probabilistic data structure which indicates one or more data values not stored in the data block in place of the probabilistic data structure.
  - 14. The method of claim 13, wherein said detecting an indexing event for the column of the columnar database table comprises:
    - for each of the one or more data blocks, evaluating the probabilistic data structure for the data block to determine a selectivity level for the bitmap; and
      
      determining that the selectivity level for at least some of the one or more data blocks is below a selectivity efficiency threshold.
  - 15. The method of claim 13, further comprising:
    - receiving a plurality of indications of a plurality of different queries directed to the column of the columnar database table; and
      
      wherein said detecting an indexing event for the column of the columnar database table comprises analyzing the plurality of different queries to determine that a number of the queries are range queries and that the number of range queries exceeds a query type threshold.
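
The join variant in claim 12 can be read as a filter-assisted join: one compute node ships its per-block filters to a different compute node, which uses them to decide what data from the other table is worth obtaining. The Python sketch below illustrates that reading under stated assumptions. The function names, the 512-bit filter size, the hashing scheme, and the choice to pre-filter at row granularity are all hypothetical, and network transfer is reduced to passing a list of bitmaps.

```python
import hashlib

NUM_BITS = 512   # assumed filter size per data block
NUM_HASHES = 3   # assumed number of bit positions set per value


def bit_positions(value):
    """Map a value to NUM_HASHES bit positions (hypothetical hashing scheme)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_bitmap(values):
    """Build the bitmap for one data block of the join column."""
    bitmap = 0
    for value in values:
        for pos in bit_positions(value):
            bitmap |= 1 << pos
    return bitmap


def might_contain(bitmap, value):
    """False means the value is definitely not in the block."""
    return all((bitmap >> pos) & 1 for pos in bit_positions(value))


def candidate_rows(other_rows, key_index, received_bitmaps):
    """Runs on the different compute node: keep rows whose key may match."""
    return [
        row for row in other_rows
        if any(might_contain(b, row[key_index]) for b in received_bitmaps)
    ]


def bloom_join(local_blocks, other_rows, key_index):
    # Compute node side: one bitmap per block is "sent" to the other node.
    bitmaps = [build_bitmap(block) for block in local_blocks]
    # Different compute node side: drop rows that cannot possibly join.
    candidates = candidate_rows(other_rows, key_index, bitmaps)
    # Exact comparison on the reduced set removes any false positives.
    local_keys = {key for block in local_blocks for key in block}
    return [row for row in candidates if row[key_index] in local_keys]
```

The benefit in this sketch comes from the size difference: a bitmap of a few hundred bits stands in for a whole block of join keys, so the other node can discard most non-matching rows before any exact comparison takes place.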

16. A non-transitory, computer-readable storage medium, storing program instructions that when executed by one or more computing devices implement:
- generating a probabilistic data structure for each of one or more data blocks storing data for a column of a columnar database table, wherein each probabilistic data structure indicates data values not stored in the data block;
  
  for each of the one or more data blocks, storing the probabilistic data structure in a respective entry for the data block in a block metadata data structure that stores information about the one or more data blocks;
  
  receiving an indication of a query directed to the column of the columnar database table for select data; and
  
  in response to receiving the indication of the query:
  
  analyzing the probabilistic data structure for each of the one or more data blocks storing data for the column to determine particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data; and
  
  reading the one or more data blocks storing data for the column in order to service the query for the select data excepting the particular ones of the one or more data blocks which do not need to be read.
- Dependent claims (17, 18, 19, 20):
  - 17. The non-transitory, computer-readable storage medium of claim 16, wherein, in said generating the probabilistic data structure for each of the one or more data blocks storing data for the column of the columnar database table, the program instructions when executed by the one or more computing devices implement:
    - generating a bitmap representing the probabilistic data structure and comprising a plurality of bits corresponding to the probabilistic data structure size; and
      
      populating the bitmap representing the probabilistic data structure based, at least in part, on the data values stored in the data block.
  - 18. The non-transitory, computer-readable storage medium of claim 17, wherein, in said examining the probabilistic data structure for each of the one or more data blocks storing data for the column to determine particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data, the program instructions when executed by the one or more computing devices implement:
    - for a given data block:
      
      for each data value of the select data:
      
      determining bit pattern locations according to the data value; and
      
      examining the bit pattern locations in the bitmap representing the probabilistic data structure for the given data block to determine whether the given data block is one of the particular ones which do not need to be read in order to service the query for the select data.
  - 19. The non-transitory, computer-readable storage medium of claim 16, wherein the program instructions when executed by the one or more computing devices further implement:
    - detecting an indexing event for the column of the columnar database table; and
      
      in response to detecting the indexing event:
      
      for each of the one or more data blocks, generating a new probabilistic data structure which indicates one or more data values not stored in the data block in place of the probabilistic data structure.
  - 20. The non-transitory, computer-readable storage medium of claim 16, wherein the data for the column of the columnar database table is unsorted.
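
Several claims store each filter in a per-block entry of a block metadata structure and regenerate the filters when an "indexing event" is detected, either because the filters have become unselective or because too many queries are range queries. The Python sketch below shows one way such detection could look. The claims do not define how a selectivity level is computed, so the metric used here (the fraction of bits still unset) is an assumption, as are the field names, the "any block below the threshold" rule, and the choice to treat the two triggers as alternatives.

```python
from dataclasses import dataclass


@dataclass
class BlockMetadataEntry:
    """One entry of the block metadata structure (field names are hypothetical)."""
    block_id: int
    bloom_bitmap: int  # the filter for this block, held as an integer bitmap
    num_bits: int      # size of the bitmap in bits


def selectivity_level(entry):
    """Assumed metric: fraction of bits still unset (1.0 means an empty filter)."""
    set_bits = bin(entry.bloom_bitmap).count("1")
    return 1.0 - set_bits / entry.num_bits


def unselective_blocks(entries, selectivity_threshold):
    """Blocks whose bitmap is so full that it rarely rules anything out."""
    return [e.block_id for e in entries
            if selectivity_level(e) < selectivity_threshold]


def range_queries_exceed(query_kinds, query_type_threshold):
    """Count queries labelled "range" and compare against the threshold."""
    range_count = sum(1 for kind in query_kinds if kind == "range")
    return range_count > query_type_threshold


def indexing_event_detected(entries, selectivity_threshold,
                            query_kinds, query_type_threshold):
    """Either trigger is treated as sufficient (an assumption of this sketch)."""
    return (bool(unselective_blocks(entries, selectivity_threshold))
            or range_queries_exceed(query_kinds, query_type_threshold))
```

The range-query trigger reflects a general property of Bloom filters: they answer membership questions for individual values and cannot rule out a block for a predicate such as "between 10 and 20" without testing every value in the range.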

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Gupta, Anurag Windlass
Primary Examiner(s): Stevens, Robert
Application Number: US14/635,844
Publication Number: US 20150169655A1
Time in Patent Office: 470 Days
Field of Search: 707/602
US Class Current: 1/1
CPC Class Codes:
G06F 16/221   Column-oriented storage; Ma...
G06F 16/254   Extract, transform and load...
G06F 16/283   Multi-dimensional databases...
