Database system with methods for performing cost-based estimates using spline histograms
First Claim
1. In a computer system providing a database storing database objects, a method for improving modeling of cost estimates associated with data access occurring during execution of a database query, the method comprising:
- receiving a request to create a histogram for modeling data distribution of a particular database object;
in response to said request, creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; and
for each particular histogram cell, determining selectivity of the cell by;
applying interpolation to the cell for determining a spline weighting indicating how values are actually distributed within the particular range of the cell, andbased on the determination, assigning said spline weighting to the cell for indicating how values are distributed within the cell itself.
1 Assignment
0 Petitions
Accused Products
Abstract
Database system and methods are described for improving execution speed of database queries (e.g., for decision support) by provides methods employing spline histograms for improving the determination of selectivity estimates. The general approach improves histogram-based cost estimates as follows. The constant associated with a predicate (e.g., in r.a>5, the constant is "5") is used to do a binary search in an array of histogram boundary values, for determining a particular histogram cell. Once a cell has been found, the system employs interpolation to find out how much of the cell has been selected. Once this interpolation value is found, it is used with a cell weighting and a spline value or weighting to estimate the selectivity of the predicate value, which takes into account how data values are distributed within the cell. As a result of increased accuracy of estimates, the system can formulate better query plans and, thus, provides better performance.
54 Citations
32 Claims
-
1. In a computer system providing a database storing database objects, a method for improving modeling of cost estimates associated with data access occurring during execution of a database query, the method comprising:
-
receiving a request to create a histogram for modeling data distribution of a particular database object; in response to said request, creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; and for each particular histogram cell, determining selectivity of the cell by; applying interpolation to the cell for determining a spline weighting indicating how values are actually distributed within the particular range of the cell, and based on the determination, assigning said spline weighting to the cell for indicating how values are distributed within the cell itself. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
20. In a computer system providing a database storing database objects, a method for improving modeling of cost estimates associated with data access occurring during execution of a database query, the method comprising:
-
receiving a request to create a histogram for modeling data distribution of a particular database object; in response to said request, creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; and for each particular histogram cell, determining how values are actually distributed within the particular range of the cell, and based on the determination, assigning a spline weighting to the cell for indicating how values are distributed within the cell itself, wherein each range is defined by a lower boundary and an upper boundary, and wherein a histogram cell is assigned a positive spline weighting if values within the cell are distributed more densely near the upper boundary of the particular range for the cell.
-
-
21. In a computer system providing a database storing database objects, a method for improving modeling of cost estimates associated with data access occurring during execution of a database query, the method comprising:
-
receiving a request to create a histogram for modeling data distribution of a particular database object; in response to said request, creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; and for each particular histogram cell, determining how values are actually distributed within the particular range of the cell, and based on the determination, assigning a spline weighting to the cell for indicating how values are distributed within the cell itself, wherein each range is defined by a lower and upper boundary, and wherein a histogram cell is assigned a negative spline weighting if values within the cell are distributed more densely near the lower boundary of the particular range for the cell.
-
-
22. In a computer system providing a database storing database objects, a method for improving modeling of cost estimates associated with data access occurring during execution of a database query, the method comprising:
-
receiving a request to create a histogram for modeling data distribution of a particular database object; in response to said request, creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; and for each particular histogram cell, determining how values are actually distributed within the particular range of the cell, and based on the determination, assigning a spline weighting to the cell for indicating how values are distributed within the cell itself, receiving a query predicate which requires an estimate of selectivity of a given value; based on said given value, determining a particular histogram cell having a particular range which includes the given value; and from the determined particular histogram cell, estimating selectivity of the given value, wherein said estimating step includes calculating selectivity by (i) calculating a uniform component of weighting; (ii) calculating a spline component of weighting; and (iii) calculating selectivity by adding the uniform component to the spline component; wherein the uniform component for the particular cell is determined, for a positive spline weighting, by subtracting the spline weighting from the cell weighting and then multiplying that quantity by an interpolate fraction, which indicates what fraction of the cell is selected by the given value. - View Dependent Claims (23)
-
-
24. A database system providing spline-based cost estimates comprising:
-
a database storing database objects; an optimizer which employs a histogram for modeling data distribution of a particular database object; means for creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; means for determining how values are actually distributed within each particular cell by calculating a spline weighting for each particular cell using interpolation; and means for assigning said spline weighting to each particular cell for indicating selectivity of each particular cell based on how values are actually distributed within each particular cell. - View Dependent Claims (25, 26, 27, 28, 29, 30, 31)
-
-
32. A database system providing spline-based cost estimates comprising:
-
a database storing database objects; an optimizer which employs a histogram for modeling data distribution of a particular database object; means for creating a histogram having a plurality of histogram cells for modeling occurrence of values in the database object, each cell being associated with a particular range of values from the particular database object, and each cell having a weighting for indicating how often values in the particular range occur in the database object relative to occurrence of other values in the database object; means for determining how values are actually distributed within each particular cell, and means for assigning a spline weighting to each particular cell for indicating how values are actually distributed within each particular cell; wherein each range is defined by a lower boundary and an upper boundary, and wherein a histogram cell is assigned a positive spline weighting if values within the cell are distributed more densely near the upper boundary of the particular range for the cell and is assigned a negative spline weighting if values within the cell are distributed more densely near the lower boundary of the particular range for the cell.
-
Specification