Query selectivity estimation with confidence interval
First Claim
1. A method implemented at least in part by a processor executing computer-executable instructions stored in computer-readable storage media for selecting a query plan from at least two query plans, each of the at least two query plans having a query expression on a database, the method comprising:
- computing a uniform random sample of database tuples by randomly sampling tuples from the database;
for each of the at least two query plans, deriving, by the processor, a probability distribution for possible selectivity values of the query expression by evaluating each query expression on the sample of database tuples to determine an observed selectivity for the sample of database tuples;
forming a probability density function based on the observed selectivity using Bayes'"'"'s rule;
receiving a desired selectivity confidence threshold from a user input, wherein the desired selectivity confidence threshold represents a tradeoff between predictability and performance for query execution time;
evaluating, by the process, each of the derived probability distributions using the desired selectivity confidence threshold to derive an estimated selectivity for the derived probability distribution, wherein the estimated selectivity is derived by inverting a cumulative distribution function for the probability distribution and applying the inverted cumulative distribution function to the desired selectivity confidence threshold;
selecting one of the query plans from the at least two query plans based on the estimated selectivities and at least one crossover point, each at least one crossover point defining a selectivity where query execution time of one query plan is estimated to equal query execution time of another query plan of the at least two query plans; and
executing the selected query plan to retrieve and store records from a database.
3 Assignments
0 Petitions
Accused Products
Abstract
Selectivity estimates are produced that meet a desired confidence threshold. To determine the confidence level of a given selectivity estimate for a query expression, the query expression is evaluated on a sample tuples. A probability density function is derived based on the number of tuples in the sample that satisfy the query expression. The cumulative distribution for the probability density function is solved for the given threshold to determine a selectivity estimate at the given confidence value.
37 Citations
19 Claims
-
1. A method implemented at least in part by a processor executing computer-executable instructions stored in computer-readable storage media for selecting a query plan from at least two query plans, each of the at least two query plans having a query expression on a database, the method comprising:
-
computing a uniform random sample of database tuples by randomly sampling tuples from the database; for each of the at least two query plans, deriving, by the processor, a probability distribution for possible selectivity values of the query expression by evaluating each query expression on the sample of database tuples to determine an observed selectivity for the sample of database tuples; forming a probability density function based on the observed selectivity using Bayes'"'"'s rule; receiving a desired selectivity confidence threshold from a user input, wherein the desired selectivity confidence threshold represents a tradeoff between predictability and performance for query execution time; evaluating, by the process, each of the derived probability distributions using the desired selectivity confidence threshold to derive an estimated selectivity for the derived probability distribution, wherein the estimated selectivity is derived by inverting a cumulative distribution function for the probability distribution and applying the inverted cumulative distribution function to the desired selectivity confidence threshold; selecting one of the query plans from the at least two query plans based on the estimated selectivities and at least one crossover point, each at least one crossover point defining a selectivity where query execution time of one query plan is estimated to equal query execution time of another query plan of the at least two query plans; and executing the selected query plan to retrieve and store records from a database. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A system for selecting a query plan from at least two query plans, each of the at least two query plans having a query expression executed on a database comprising:
-
a processor; and memory including software components that are executed by the processor, the memory including a query expression evaluator that evaluates for each of the at least two query plans the query expression on a sample of database tuples to determine an observed selectivity; a probability distribution derivation module that derives a probability distribution of possible selectivity values for each of the query expressions based on the observed selectivity; a probability distribution resolver that receives a user input indicating a desired selectivity confidence threshold determines estimated selectivities for the at least two query plans based on the desired selectivity confidence and the probability distribution, wherein the desired selectivity confidence threshold represents a tradeoff between predictability and performance for query execution time; a query optimizer that selects one of the query plans from the at least two query plans based on the estimated selectivities and at least one crossover point, each at least one crossover point defining a selectivity at which query execution time of one query plan is estimated to equal query execution time of another query plan; and an execution engine that executes the selected query plan to retrieve and stores record from the database. - View Dependent Claims (8, 9, 10, 11, 12, 13, 14)
-
-
15. One or more computer readable media having computer-executable instructions stored thereon for selecting a query plan from at least two query plans, each of the at least two query plans having a query expression executed on a database, the instructions, when executed on a computer execute steps comprising:
-
evaluating, for each of the at least two query plans, the query expression on a sample of database tuples to determine an observed selectivity; for each of the at least two query plans, deriving a probability distribution of possible selectivity values of the query expression based on the observed selectivity; receiving a desired selectivity threshold from a user input, wherein the confidence threshold provides user control over a tradeoff between predictability and performance for query execution time; determining estimated selectivities for the at least two query plans based on the desired selectivity confidence threshold and the derived probability distributions, wherein the estimated selectivities are determined by inverting a cumulative distribution function for the probability distribution and applying the cumulative distribution function on the desired selectivity confidence threshold; selecting one of the query plans from the at least two query plans based on the estimated selectivities and at least one crossover point, each at least one crossover point defining a selectivity at which query execution time of one query plan is estimated to equal query execution time of another query plan, and executing the selected query plan to retrieve and store records from the database. - View Dependent Claims (16, 17, 18, 19)
-
Specification