Embracing and exploiting data skew during a join or groupby
First Claim
1. A method, comprising:
during a query optimization for a database query involving a join operation, obtaining a distribution of data values in a join column of an inner table;
using the distribution, identifying one or more data ranges containing skew;
performing a cost-benefit analysis for a skew specific join scheme, wherein the cost benefit analysis is based on a tradeoff between a number of the data values in a data range of the one or more data ranges and additional overhead costs of processing the number of data values using the skew specific join scheme;
for each data range identified as containing skew, performing the join operation using the skew specific join scheme based on the cost-benefit analysis; and
for each data range not identified as containing skew, performing the join operation using a non-skew specific join scheme.
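The steps of the claim can be illustrated with a minimal sketch. It is not the patented method: it assumes integer join keys, fixed-width ranges, a range counted as skewed when it holds more than `skew_factor` times the mean number of values per non-empty range, and a cost-benefit test that compares a per-row saving against a fixed setup overhead. The function names, thresholds, and cost constants are all hypothetical.

```python
from collections import Counter


def range_histogram(inner_keys, width):
    """Count inner-table join keys per fixed-width range [k*width, (k+1)*width)."""
    return Counter(key // width for key in inner_keys)


def choose_schemes(inner_keys, width, skew_factor=2.0,
                   setup_cost=4.0, saving_per_row=1.0):
    """Return {range_start: "skew" | "default"} for every non-empty range.

    skew_factor, setup_cost and saving_per_row are illustrative assumptions,
    not values taken from the patent.
    """
    hist = range_histogram(inner_keys, width)
    if not hist:
        return {}
    mean = sum(hist.values()) / len(hist)
    plan = {}
    for bucket, count in sorted(hist.items()):
        # Assumed skew test: the range holds far more values than average.
        is_skewed = count > skew_factor * mean
        # Assumed cost-benefit test: the saving over all values in the range
        # must outweigh the extra overhead of the skew specific scheme.
        pays_off = count * saving_per_row > setup_cost
        plan[bucket * width] = "skew" if (is_skewed and pays_off) else "default"
    return plan


inner = [1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 150, 275, 390]
plan = choose_schemes(inner, width=100)
expensive_plan = choose_schemes(inner, width=100, setup_cost=50.0)
```

Here only the range starting at 0 is routed to the skew specific scheme. In `expensive_plan` the same range falls back to the default scheme because its ten values no longer justify the assumed overhead.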
1 Assignment
0 Petitions
Abstract
A hybrid approach for performing a join in a database includes: obtaining a distribution of data values in a join column of an inner table; using the distribution, identifying one or more data ranges containing skew; for each data range identified as containing skew, performing, by the processor, the join operation using a skew specific join scheme; and for each data range not identified as containing skew, performing, by the processor, the join operation using a non-skew specific join scheme. One skew specific join scheme involves a compact array table, a highly populated array that represents the range of values that the inner table join column contains. One non-skew specific join scheme involves a compact hash table, an optimized hash table that allows high load factors with a small memory overhead. In combining multiple join techniques, joins may be performed more efficiently for skewed and non-skewed data.
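To make the combination of the two structures concrete, here is a hedged sketch of a hybrid build and probe. It assumes integer join keys and a single dense range `[lo, hi]` that has already been identified. A plain Python list indexed by `key - lo` stands in for the compact array table, and a `dict` stands in for the compact hash table; neither reproduces the memory layout or high load factors described in the abstract. The function names are hypothetical.

```python
def build_hybrid(inner_rows, lo, hi):
    """Split inner (key, payload) rows between an array and a hash table.

    Keys in [lo, hi] are stored in a list slot at index key - lo;
    every other key goes into a dict.
    """
    array = [None] * (hi - lo + 1)
    table = {}
    for key, payload in inner_rows:
        if lo <= key <= hi:
            slot = key - lo
            if array[slot] is None:
                array[slot] = []
            array[slot].append(payload)
        else:
            table.setdefault(key, []).append(payload)
    return array, table


def probe(outer_rows, array, table, lo, hi):
    """Join outer (key, payload) rows against the structures built above."""
    joined = []
    for key, outer_payload in outer_rows:
        if lo <= key <= hi:
            # Direct addressing: no hashing and no key comparison needed.
            matches = array[key - lo] or []
        else:
            matches = table.get(key, [])
        for inner_payload in matches:
            joined.append((key, outer_payload, inner_payload))
    return joined


inner = [(10, "a"), (11, "b"), (12, "c"), (12, "d"), (9000, "z")]
outer = [(12, "x"), (9000, "y"), (13, "w")]
array, table = build_hybrid(inner, lo=10, hi=12)
result = probe(outer, array, table, lo=10, hi=12)
```

The dense keys 10 to 12 are resolved by an index computation, the outlier 9000 goes through the hash table, and the outer key 13 finds no match in either structure.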
153 Citations
7 Claims
Claim 1 is recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7
Specification