Query optimization through the use of multi-column statistics to avoid the problems of non-indexed column correlation
First Claim
1. A method for use in a database management system for optimizing a query, the method comprising:
- collecting at least one type of multi-column statistic to reflect a relationship among multiple selected non-indexed columns of a table, wherein the step of collecting at least one type of multi-column statistic further comprises collecting a first type of multi-column quantile statistics for indicating a number of rows between two given values by dividing the data into a plurality of sub-ranges, each sub-range having an even distribution of data, and determining a frequency and cardinality of each sub-range; and
storing the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query.
2 Assignments
0 Petitions
Accused Products
Abstract
The system, method, and program of this invention collects multi-column statistics, by a database management system, to reflect a relationship among multiple columns of a table in a relational database. These statistics are stored in the system catalog, and are used during query optimization to obtain an estimate of the number of qualifying rows when a query has predicates on multiple columns of a table.
A multi-column linear quantile statistic is collected by dividing the data of multiple columns into sub-ranges where each sub-range has approximately an even distribution of data, and determining a frequency and cardinality of each sub-range. A multi-column polygonal quantile statistic is collected by dividing the data of multiple columns into sub-spaces where each sub-space contains approximately the same number of tuples, and determining a frequency and cardinality of each sub-space.
The system catalog is accessed for the stored multi-column linear quantile statistic for a query having a single range predicate and at least one equal predicate to determine the selectivity value for the predicates of the query. The system catalog is accessed for the stored multi-column polygonal quantile statistic for a query having more than one range predicate. These statistics are used in various ways to determine the selectivity value for the predicates of the query.
129 Citations
37 Claims
-
1. A method for use in a database management system for optimizing a query, the method comprising:
-
collecting at least one type of multi-column statistic to reflect a relationship among multiple selected non-indexed columns of a table, wherein the step of collecting at least one type of multi-column statistic further comprises collecting a first type of multi-column quantile statistics for indicating a number of rows between two given values by dividing the data into a plurality of sub-ranges, each sub-range having an even distribution of data, and determining a frequency and cardinality of each sub-range; and
storing the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
if there are multi-column frequent value statistics containing a same plurality of columns as a plurality of equal predicates in the query, and if literals in the predicates match literals in the stored multi-column frequent value, then determining the selectivity value as being equal to a frequency of the multi-column frequent value. -
7. The method of claim 4 further comprising accessing the table, and
if there are multi-column frequent value statistics containing a same plurality of columns as a plurality of equal predicates in the query, and if literals in the predicates do not match literals in the stored multi-column frequent value, then determining the selectivity value as: -
8. The method of claim 1 further comprising accessing the table for the first type of multi-column quantile statistics for the multiple columns for the query having a single range predicate and at least one equal predicate to determine the selectivity value for the predicates of the query.
-
9. The method of claim 8 further comprising determining the selectivity value as follows:
-
for predicates that are completely satisfied by one of the plurality of sub-ranges, the selectivity is the frequency;
for predicates that are partially satisfied by one of the sub-ranges, a final selectivity is equal to a selectivity of fully qualified sub-ranges plus a selectivity of partially qualified sub-ranges.
-
-
10. The method of claim 9 wherein the step of determining, for predicates that are partially satisfied by one of the sub-ranges, further comprises translating a boundary of a sub-range by concatenating a set of values of a lower boundary and concatenating a set of values of a higher boundary for the sub-range;
- and using the concatenated values in the selectivity determination.
-
-
11. A method for use in a database management system for optimizing a query, the method comprising:
-
collecting at least one type of multi-column statistic to reflect a relationship among multiple selected non-indexed columns of a table, wherein the step of collecting at least one type of multi-column statistic further comprises collecting a second type of multi-column quantile statistics for indicating a number of rows between two given sets of values by dividing the data into a plurality of sub-spaces, each sub-space containing approximately a same number of tuples, and determining a frequency and cardinality of each sub-space; and
storing the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (12, 13, 14)
Where; X1, X2 are the high and low bounds for the X-coordinate of the query Y1, Y2 are the high and low bounds for the Y-coordinate of the query Z1, Z2 are the high and low bounds for the Z-coordinate of the query . . . etc., for each coordinate (dimension) of the query XA, XB are the high and low bounds for the X-coordinate of the quantile YA, YB are the high and low bounds for the Y-coordinate of the quantile ZA, ZB are the high and low bounds for the Z-coordinate of the quantile . . . etc., for each coordinate (dimension) of the quantile.
-
-
15. A database management system comprising:
-
means for collecting at least one type of multi-column statistic to reflect a relationship among multiple non-indexed columns of a table, wherein the means for collecting at least one type of multi-column statistic further comprises means for collecting a first type of multi-column quantide statistics for indicating a number of rows between two given values by dividing the data into a plurality of sub-ranges, each sub-range having an even distribution of data, and means for determining a frequency and cardinality of each sub-range; and
means for storing the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (16, 17, 18, 19, 20, 21, 22)
for predicates that are completely satisfied by one of the plurality of sub-ranges, the selectivity is the frequency;
for predicates that are partially satisfied by one of the sub-ranges, a final selectivity is equal to a selectivity of fully qualified sub-ranges plus a selectivity of partially qualified sub-ranges.
-
-
22. The system of claim 21 wherein the means for determining further comprises translating a boundary of a sub-range by concatenating a set of values of a lower boundary and concatenating a set of values of a higher boundary for the sub-range;
- and using the concatenated values in the selectivity determination.
-
23. A database management system comprising:
-
means for collecting at least one type of multi-column statistic to reflect a relationship among multiple non-indexed columns of a table, wherein the means for collecting at least one type of multi-column statistic further comprises means for collecting a second type of multi-column quantile statistics for indicating a number of rows between two given sets of values by dividing the data into a plurality of sub-spaces, each sub-space containing approximately a same number of tuples, and means for determining a frequency and cardinality of each sub-space; and
means for storing the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (24, 25, 26)
Where; X1, X2 are the high and low bounds for the X-coordinate of the query Y1, Y2 are the high and low bounds for the Y-coordinate of the query Z1, Z2 are the high and low bounds for the Z-coordinate of the query . . . etc., for each coordinate (dimension) of the query XA, XB are the high and low bounds for the X-coordinate of the quantile YA, YB are the high and low bounds for the Y-coordinate of the quantile ZA, ZB are the high and low bounds for the Z-coordinate of the quantile . . . etc., for each coordinate (dimension) of the quantile.
-
-
27. Computer programming code residing on at least one computer usable medium (i.e., a program product) for use in a database management system, the program product comprising:
-
means for causing a collection of at least one type of multi-column statistic to reflect a relationship among multiple non-indexed columns of a table, wherein the means for causing a collection of at least one type of multi-column statistic further comprises means for causing a collection of a first type of multi-column quantile statistics for indicating a number of rows between two given values by dividing the data into a plurality of sub-ranges, each sub-range having an even distribution of data, and means for determining a frequency and cardinality of each sub-range; and
means for causing a storing of the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (28, 29, 30, 31, 32, 33, 34)
for predicates that are completely satisfied by one of the plurality of sub-ranges, the selectivity is the frequency;
for predicates that are partially satisfied by one of the sub-ranges, a final selectivity is equal to a selectivity of fully qualified sub-ranges plus a selectivity of partially qualified sub-ranges.
-
-
34. The program product of claim 33 wherein the means for causing a determination further comprises means for causing a translation of a boundary of a sub-range by concatenating a set of values of a lower boundary and concatenating a set of values of a higher boundary for the sub-range;
- and means for using the concatenated values in the selectivity determination.
-
35. Computer programming code residing on at least one computer usable medium (i.e., a program product) for use in a database management system, the program product comprising:
-
means for causing a collection of at least one type of multi-column statistic to reflect a relationship among multiple non-indexed columns of a table, wherein the means for causing a collection of at least one type of multi-column statistic further comprises means for causing a collection of a second type of multi-column quantile statistic for indicating a number of rows between two given sets of values by dividing the data into a plurality of sub-spaces, each sub-space containing approximately a same number of tuples, and means for causing a determination of a frequency and cardinality of each sub-space; and
means for causing a storing of the at least one type of multi-column statistic in a table for subsequent use in determining a selectivity value (a number of qualified rows) for predicates in the query, wherein the selectivity value is used in optimizing execution of the query. - View Dependent Claims (36, 37)
-
Specification