Self-tuning histogram and database modeling
First Claim
1. In a database system, a method of maintaining a self-tuning histogram having a plurality of buckets defined by at least two bucket boundaries and a bucket frequency comprising:
- (a) identifying the histogram;
(b) using the histogram to generate an estimated result size in response to a user query;
(c) calculating an estimation error based on the estimated result size and the size of the result of the user query;
(d) dividing the estimation error into one or more portions;
(e) applying the portions to the bucket frequencies of one or buckets to modify the bucket frequencies.
3 Assignments
0 Petitions
Accused Products
Abstract
Building histograms by using feedback information about the execution of query workload rather than by examining the data helps reduce the cost of building and maintaining histograms. A method of maintaining self-tuning histograms updates histograms based on feedback about the execution of a user query. A histogram may be initialized using an assumption of uniform distribution of data or by combining existing histograms. A histogram tuner accesses and estimated result in response to a user query generated by using the histogram. The histogram tuner calculates an estimation error based on the result of the user query and the estimated result. The frequencies of histogram buckets are refined based on the estimation error. The bucket bounds of the histogram are restructured based on the refined frequencies. The method may be performed on-line after a user query or off-line by accessing a workload log. By updating a histogram without accessing the database, the cost of building and maintaining histograms is significantly reduced.
-
Citations
41 Claims
-
1. In a database system, a method of maintaining a self-tuning histogram having a plurality of buckets defined by at least two bucket boundaries and a bucket frequency comprising:
-
(a) identifying the histogram;
(b) using the histogram to generate an estimated result size in response to a user query;
(c) calculating an estimation error based on the estimated result size and the size of the result of the user query;
(d) dividing the estimation error into one or more portions;
(e) applying the portions to the bucket frequencies of one or buckets to modify the bucket frequencies. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method of building and maintaining a self-tuning histogram having a plurality of buckets each defined by at least two bucket boundaries and a bucket frequency comprising the steps of:
-
a) initializing the histogram;
b) using the histogram to generate an estimated result size in response to a user query;
c) calculating an estimation error based on the size of the result of the user query and the estimated result size;
d) modifying the frequency of the buckets by applying a damping factor to the estimation error, dividing the estimation error into one or more portions and applying the portions to the bucket frequencies of the buckets used to generate the estimation error in proportion to their relative bucket frequency; and
e) redefining the bucket boundaries of one or more of the buckets by merging adjacent buckets having relatively similar frequencies into single buckets to free buckets, selecting buckets belonging to a set of buckets having the highest frequency to be split, and splitting the selected buckets into a plurality of buckets such that the total number of buckets remains constant. - View Dependent Claims (9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A method of building and maintaining an n dimensional self-tuning histogram having a plurality of buckets arranged in a grid, each bucket defined by at least four grid partitions and a bucket frequency, the histogram being defined by n scales, one scale for each dimension, and an n dimensional frequency matrix, each of the scales defining the partitions of the frequency matrix which divide the frequency matrix into a plurality of (n−
- 1) dimensional slices of the frequency matrix comprising the steps of;
a) initializing the histogram;
b) using the histogram to generate an estimated result size in response to a user query;
c) calculating an estimation error based on the size of the result of the user query and the estimated result size;
d) modifying the frequency of the buckets in the frequency matrix by dividing the estimation error into one or more portions and applying the portions to the bucket frequencies of the buckets used to generate the estimation error in proportion to their relative bucket frequency; and
e) changing the scales of the histogram in at least one dimension to redefine the grid partitions by merging adjacent slices of the frequency matrix having relatively similar frequencies to free buckets, selecting slices of the frequency matrix belonging to a set of slices having the highest marginal frequency to be split, and splitting the selected slices into a plurality of slices such that the total number of buckets remains constant. - View Dependent Claims (18, 19, 20, 21, 22)
- 1) dimensional slices of the frequency matrix comprising the steps of;
-
23. A computer-readable medium having computer executable instructions for performing steps for maintaining a self-tuning histogram in database wherein user queries are executed on the database to return query results, the method comprising:
-
(a) identifying a self-tuning histogram having a plurality of buckets defined by at least two bucket boundaries and a bucket frequency; and
(b) updating the histogram based on the results of an executed query against the database by using the histogram to generate an estimated result size in response to a user query, calculating an estimation error based on the size of the result of the user query and the estimated result size, and updating the histogram based on the estimation error. - View Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37)
I) dividing the estimation error into one or more portions; and
II) applying the portions to the bucket frequencies of the buckets used to generate the estimated result.
-
-
26. The computer readable medium of claim 24 wherein the step of modifying the bucket frequency of the one or more buckets comprises:
-
I) dividing the estimation error into one or more portions; and
II) applying the portions to bucket frequencies of the buckets in proportion to their relative bucket frequency.
-
-
27. The computer readable medium of claim 23 wherein the step of identifying the histogram comprises initializing the histogram.
-
28. The computer readable medium of claim 27 wherein the step of initializing the histogram comprises assigning substantially uniform bucket frequencies and bucket boundaries to the buckets.
-
29. The computer readable medium of claim 23 wherein the self-tuning histogram is multi-dimensional.
-
30. The computer readable medium of claim 23 additionally including computer readable instructions for redefining one or more bucket boundaries based on bucket frequencies.
-
31. The computer readable medium of claim 30 wherein the step of redefining the one or more bucket boundaries comprises splitting one or more buckets that meet a split criteria into a plurality of buckets.
-
32. The computer readable medium of claim 31 wherein the step of redefining the one or more bucket boundaries comprises merging one or more sets of adjacent buckets that meet a merge criteria.
-
33. The computer readable medium of claim 32 wherein the merge criteria comprises having a bucket frequency relatively similar to the bucket frequency of adjacent buckets.
-
34. The computer readable medium of claim 32 wherein the step of merging buckets generates free buckets for splitting buckets.
-
35. The computer readable medium of claim 34 wherein the step of splitting buckets comprises allocating freed buckets to one or more buckets in proportion to their relative bucket frequency.
-
36. The computer readable medium of claim 31 wherein the split criteria comprises belonging to a set of buckets having the highest bucket frequencies.
-
37. The computer readable medium of claim 31 wherein the step of redefining one or more bucket boundaries is performed such that the total number of buckets remains constant.
-
38. An apparatus for maintaining a self-tuning histogram in database wherein user queries are executed on the database to return query results, and wherein the histogram has a plurality of buckets defined by at least two bucket boundaries and a bucket frequency comprising:
-
a) means for refining the bucket frequencies based on the results of an executed query against the database;
b) means for redefining one or more of the bucket boundaries based on the bucket frequencies comprising means for calculating an estimation error based on the size of a result of a user query and an estimated result size estimated using the histogram; and
wherein the means for refining refines the frequencies by distributing the estimation error amongst bucket frequencies. - View Dependent Claims (39)
-
-
40. An apparatus for maintaining a self-tuning histogram in database wherein user queries are executed on the database to return query results, and wherein the histogram has a plurality of buckets defined by at least two bucket boundaries and a bucket frequency comprising:
-
a) a memory device for storing a database comprising multiple data records;
b) a computer having one or more processing units for executing a stored computer program, said computer including a rapid access memory store; and
c) an interface for coupling the memory device for storing the database to the computer to allow records to be retrieved from the database;
whereind) the computer executing a stored program having software components including i) a component for providing a statistical model of the data distribution in the form of a self-tuning histogram having a plurality of buckets defined by at least two bucket boundaries and a bucket frequency;
ii) a component for refining the bucket frequencies based on the results of an executed query against the database comprising a component for determining an estimation error based on the result size of a user query and an estimated result size as determined by the statistical model; and
iii) a component for redefining at least one bucket boundary based on the bucket frequencies by applying the estimation error to the bucket frequencies.- View Dependent Claims (41)
-
Specification