Efficient query processing using histograms in a columnar database

US 10,372,723 B2
Filed: 09/15/2017
Issued: 08/06/2019
Est. Priority Date: 01/15/2013
Status: Active Grant

Abstract

A probabilistic data structure is generated for efficient query processing using a histogram for unsorted data in a column of a columnar database. A bucket range size is determined for multiple buckets of a histogram of a column in a columnar database table. In at least some embodiments, the histogram may be a height-balanced histogram. For each data block storing data for the column, a probabilistic data structure is generated to indicate for which particular buckets in the histogram there is a data value stored in the data block. When an indication of a query directed to the column for select data is received, the probabilistic data structure for each of the data blocks storing data for the column may be examined to determine particular ones of the data blocks which do not need to be read in order to service the query for the select data.
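
The structure described above is probabilistic in one direction only: a set bit says that some value in that bucket's range is stored in the block, so a block may occasionally be read without containing the requested value, while a clear bit rules the block out. The following is a minimal Python sketch of that property, not the patented implementation. The split points in `SPLITS` and the helper names `bucket_of`, `block_bitmap`, and `may_contain` are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical split points separating four buckets (assumed for illustration):
# bucket 0: v < 10, bucket 1: 10 <= v < 20, bucket 2: 20 <= v < 30, bucket 3: v >= 30
SPLITS = [10, 20, 30]

def bucket_of(value, splits=SPLITS):
    """Return the index of the bucket whose range contains value."""
    return bisect_right(splits, value)

def block_bitmap(block_values, splits=SPLITS):
    """Build an integer bitmap with bit i set when the block holds a value in bucket i."""
    bits = 0
    for v in block_values:
        bits |= 1 << bucket_of(v, splits)
    return bits

def may_contain(bitmap, value, splits=SPLITS):
    """False: the block certainly lacks value. True: the block might hold it."""
    return bool(bitmap & (1 << bucket_of(value, splits)))

block = [3, 12, 17, 35]        # unsorted values stored in one data block
bitmap = block_bitmap(block)   # buckets 0, 1 and 3 are occupied -> 0b1011

print(may_contain(bitmap, 25))  # False: bucket 2 is empty, the block can be skipped
print(may_contain(bitmap, 15))  # True: a false positive, 15 is not actually stored
```

In this sketch a lookup for 15 still reads the block because 12 and 17 share its bucket, which is the cost of keeping only one bit per bucket.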

29 Citations

20 Claims

1. A system, comprising:
- one or more hardware processors and memory with program instructions to:
  
  determine a bucket range size for each of a plurality of buckets for a histogram of a column of a columnar database table, wherein each bucket of the plurality of buckets represents an existence of one or more data values of the data in the column within a range of values according to the determined bucket range size;
  
  generate a probabilistic data structure for each of one or more data blocks storing data for the column of the columnar database table, wherein the probabilistic data structure indicates for which particular buckets of the plurality of buckets in the histogram there is a data value stored in the data block; and
  
  examine the probabilistic data structure, responsive to a query, for each of the one or more data blocks storing data for the column to determine ones of the one or more data blocks which do not need to be read in order to service the query.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The system of claim 1, wherein to determine the plurality of bucket range sizes for the histogram of the column of the columnar database table, the program instructions are executable to:
    - obtain the data of the column;
      
      generate the plurality of buckets; and
      
      set a bucket range size of the plurality of bucket range sizes for each bucket for the histogram such that the data of the column is evenly distributed among the buckets.
  - 3. The system of claim 1, wherein the probabilistic data structure is a bitmap comprising a plurality of bits, wherein each bit of the bitmap represents each bucket of the plurality of buckets for the histogram, and for every data value included in the bucket range size stored in the data block the bit of the bitmap corresponding to the bucket is set.
  - 4. The system of claim 1, wherein the program instructions are executable to store the probabilistic data structure of each of the one or more data blocks in a respective entry in a block metadata structure that stores information about the one or more data blocks.
  - 5. The system of claim 1, further comprising at least one compute node as a leader node of a distributed data warehouse cluster.
  - 6. The system of claim 1, wherein the histogram of the column of the columnar database table is a height-balanced histogram.
  - 7. The system of claim 6, wherein the program instructions are executable to:
    - detect a rebalancing event for the distribution of data in the column among the plurality of buckets;
      
      in response to detecting the rebalancing event:
      
      modify the bucket range size for each of the plurality of buckets for the height-balanced histogram of the column; and
      
      update each probabilistic data structure for each of the one or more data blocks according to the modified bucket range size of the plurality of buckets.
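
Claims 2 and 3 can be illustrated with a short Python sketch: bucket ranges are chosen so the column's values are spread evenly across buckets, and each data block gets a bitmap with one bit per bucket. This is a sketch under stated assumptions rather than the claimed implementation. The function names, the use of sorted-order positions to pick split points, and the sample column are all hypothetical.

```python
from bisect import bisect_right

def height_balanced_splits(column_values, num_buckets):
    """Pick split points so each bucket receives roughly the same number of values.

    Split point i is the first value of bucket i + 1 in sorted order.
    """
    ordered = sorted(column_values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

def bucket_counts(column_values, splits):
    """Count how many column values fall into each bucket."""
    counts = [0] * (len(splits) + 1)
    for v in column_values:
        counts[bisect_right(splits, v)] += 1
    return counts

def block_bitmap(block_values, splits):
    """Set bit i when the block stores at least one value in bucket i."""
    bits = 0
    for v in block_values:
        bits |= 1 << bisect_right(splits, v)
    return bits

column = [5, 6, 300, 91, 7, 42, 8, 40]       # unsorted column data
splits = height_balanced_splits(column, 4)   # [7, 40, 91]
counts = bucket_counts(column, splits)       # [2, 2, 2, 2]

blocks = [column[0:4], column[4:8]]          # two data blocks of four values each
bitmaps = [block_bitmap(b, splits) for b in blocks]  # [0b1001, 0b0110]
```

Note that the bucket widths differ widely (the last bucket spans 91 to 300) while the counts stay equal, which is what distinguishes a height-balanced histogram from a fixed-width one. A column with many duplicate values can produce repeated split points; this sketch does not handle that case.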

8. A method, comprising:
- performing, by one or more computing devices comprising one or more hardware processors and memory:
  
  determining a bucket range size for each of a plurality of buckets for a histogram of a column of a columnar database table, wherein each bucket of the plurality of buckets represents an existence of one or more data values of the data in the column within a range of values according to the determined bucket range size;
  
  generating a probabilistic data structure for each of one or more data blocks storing data for the column of the columnar database table, wherein the probabilistic data structure indicates for which particular buckets of the plurality of buckets in the histogram there is a data value stored in the data block; and
  
  examining the probabilistic data structure, responsive to a query, for each of the one or more data blocks storing data for the column to determine ones of the one or more data blocks which do not need to be read in order to service the query.
- Dependent claims (9, 10, 11, 12, 13, 14)
  - 9. The method of claim 8, wherein said determining a bucket range size for each of a plurality of buckets for the histogram of the column of the columnar database table comprises:
    - obtaining the data of the column;
      
      generating the plurality of buckets; and
      
      setting a bucket range size of the plurality of bucket range sizes for each bucket such that the data of the column is evenly distributed among the buckets.
  - 10. The method of claim 8, wherein said generating the probabilistic data structure for each of the one or more data blocks storing data for the column of the columnar database table comprises:
    - generating a bitmap for the data block comprising a plurality of bits, wherein each bit represents a different bucket of the plurality of buckets for the histogram; and
      
      setting the respective bit in the bitmap for each of the particular buckets for which there is the data value stored in the data block.
  - 11. The method of claim 10, further comprising storing the probabilistic data structure of each of the one or more data blocks in a respective entry in a block metadata structure that stores information about the one or more data blocks.
  - 12. The method of claim 11, wherein said examining the probabilistic data structure for each of the one or more data blocks storing data for the column to determine the particular ones of the one or more data blocks which do not need to be read in order to service the query for the select data comprises:
    - determining one or more bits representing the one or more buckets within the range of values including the select data; and
      
      examining the one or more bits in each bitmap stored in the block metadata structure for the one or more data blocks to identify those data blocks without one of the one or more bits set as the particular ones which do not need to be read in order to service the query for the select data.
  - 13. The method of claim 8, wherein the histogram of the column of the columnar database table is a height-balanced histogram.
  - 14. The method of claim 13, further comprising:
    - detecting a rebalancing event for the distribution of data in the column among the plurality of buckets;
      
      in response to detecting the rebalancing event:
      
      modifying the bucket range size for each of the plurality of buckets for the height-balanced histogram of the column; and
      
      updating each probabilistic data structure for each of the one or more data blocks according to the modified bucket range size of the plurality of buckets.
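
The examination step of claim 12 can be sketched in Python as follows: the query's value range is mapped to the bits of the buckets it overlaps, and any block whose stored bitmap has none of those bits set is excluded from the read set. The split points, the `block_metadata` dictionary layout, and the function names are assumptions for illustration only; the claims do not prescribe them.

```python
from bisect import bisect_right

def query_mask(low, high, splits):
    """Return a mask with one bit per bucket overlapping the inclusive range [low, high]."""
    first = bisect_right(splits, low)
    last = bisect_right(splits, high)
    mask = 0
    for bucket in range(first, last + 1):
        mask |= 1 << bucket
    return mask

def blocks_to_read(block_metadata, low, high, splits):
    """Keep only blocks whose bitmap shares at least one bit with the query mask."""
    mask = query_mask(low, high, splits)
    return [block_id for block_id, bitmap in block_metadata.items() if bitmap & mask]

# Hypothetical split points and per-block bitmaps held in a block metadata structure.
splits = [7, 40, 91]
block_metadata = {0: 0b1001, 1: 0b0110, 2: 0b0001}

print(blocks_to_read(block_metadata, 41, 60, splits))  # [1]: blocks 0 and 2 are skipped
print(blocks_to_read(block_metadata, 1, 8, splits))    # [0, 1, 2]: every block may match
```

Because only the small bitmaps in the metadata are consulted, the decision to skip a block is made without touching the block's data.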

15. A non-transitory, computer-readable storage medium, storing program instructions that when executed by one or more computing devices implement:
- determining a bucket range size for each of a plurality of buckets for a histogram of a column of a columnar database table, wherein each bucket of the plurality of buckets represents an existence of one or more data values of the data in the column within a range of values according to the determined bucket range size;
  
  generating a probabilistic data structure for each of one or more data blocks storing data for the column of the columnar database table, wherein the probabilistic data structure indicates for which particular buckets of the plurality of buckets in the histogram there is a data value stored in the data block; and
  
  examining the probabilistic data structure, responsive to a query, for each of the one or more data blocks storing data for the column to determine ones of the one or more data blocks which do not need to be read in order to service the query.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The non-transitory, computer-readable storage medium of claim 15, wherein the program instructions when further executed by the one or more computing devices implement:
    - obtaining the data of the column;
      
      generating the plurality of buckets; and
      
      setting a bucket range size of the plurality of bucket range sizes for each bucket such that the data of the column is evenly distributed among the buckets.
  - 17. The non-transitory, computer-readable storage medium of claim 15, wherein the probabilistic data structure is a bitmap comprising a plurality of bits, wherein each bit of the bitmap represents each bucket of the plurality of buckets for the height-balanced histogram, and for every data value included in the bucket range size stored in the data block the bit of the bitmap corresponding to the bucket is set.
  - 18. The non-transitory, computer-readable storage medium of claim 15, wherein the height-balanced histogram generator is further configured to store the probabilistic data structure of each of the one or more data blocks in a respective entry in a block metadata structure that stores information about the one or more data blocks.
  - 19. The non-transitory, computer-readable storage medium of claim 15, wherein the program instructions when further executed by the one or more computing devices implement a leader node of a distributed data warehouse cluster.
  - 20. The non-transitory, computer-readable storage medium of claim 15, wherein the histogram for the column of the columnar database table is a height-balanced histogram, and wherein the program instructions when further executed by the one or more computing devices implement:
    - detecting a rebalancing event for the distribution of data in the column among the plurality of buckets;
      
      in response to detecting the rebalancing event:
      
      modifying the bucket range size for each of the plurality of buckets for the height-balanced histogram of the column; and
      
      updating each probabilistic data structure for each of the one or more data blocks according to the modified bucket range size of the plurality of buckets.
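
The rebalancing steps recited in claim 20 can be sketched in Python under explicit assumptions. The claims do not say what counts as a rebalancing event, so the skew threshold in `needs_rebalance` is a hypothetical trigger chosen for illustration, as are the function names and the `blocks` dictionary layout.

```python
from bisect import bisect_right

def bucket_counts(blocks, splits):
    """Count the values of all blocks that fall into each bucket."""
    counts = [0] * (len(splits) + 1)
    for values in blocks.values():
        for v in values:
            counts[bisect_right(splits, v)] += 1
    return counts

def needs_rebalance(counts, skew_limit=2.0):
    """Assumed trigger: some bucket holds more than skew_limit times its even share."""
    even_share = sum(counts) / len(counts)
    return max(counts) > skew_limit * even_share

def rebalance(blocks, num_buckets):
    """Recompute height-balanced split points, then rebuild every block's bitmap."""
    ordered = sorted(v for values in blocks.values() for v in values)
    n = len(ordered)
    splits = [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]
    bitmaps = {}
    for block_id, values in blocks.items():
        bits = 0
        for v in values:
            bits |= 1 << bisect_right(splits, v)
        bitmaps[block_id] = bits
    return splits, bitmaps

blocks = {0: [1, 2, 3, 4], 1: [5, 6, 7, 35]}   # hypothetical block contents
old_splits = [10, 20, 30]                      # stale ranges: most data lands in bucket 0

if needs_rebalance(bucket_counts(blocks, old_splits)):
    new_splits, new_bitmaps = rebalance(blocks, 4)   # [3, 5, 7], {0: 0b0011, 1: 0b1100}
```

With the stale ranges both blocks would set bit 0 and neither could be skipped for most queries; after rebalancing the two blocks occupy disjoint buckets, so a lookup for a value below 5 can skip block 1.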

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Gupta, Anurag Windlass
Primary Examiner(s): Le, Debbie M

Application Number: US15/706,511
Publication Number: US 20180025065A1
Time in Patent Office: 690 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 16/221   Column-oriented storage; Ma...

G06F 16/2237   Vectors, bitmaps or matrices

G06F 16/245   Query processing

G06F 16/24554   Unary operations; Data part...

G06F 16/254   Extract, transform and load...

G06F 16/278   Data partitioning, e.g. hor...

G06F 16/283   Multi-dimensional databases...
