Statistical representation of skewed data
First Claim
1. A method for representing statistics about a table including one or more rows, each row including a respective value, the method including:
creating one or more histogram buckets, each histogram bucket including a width representing a respective range of values and a height representing a count of rows in the table having values in the range of values;
creating one or more high-bias buckets, each high-bias bucket including one or more high-bias values up to a maximum number of high-bias values (F) that appear in a minimum percentage of rows in the table and for each high-bias value a number of rows that contain the high-bias value;
where the minimum percentage of rows is computed using F and B, where B is the total number of buckets;
repeating the following:
(a) determining an average height of the histogram buckets;
(b) determining a reclassification threshold based on the average height of the histogram buckets; and
(c) concluding that a value associated with one of the one or more histogram buckets occurs in more rows of the table than the reclassification threshold, and, in response, concluding that the number of high-bias values associated with at least one of the one or more high-bias buckets has not reached the maximum number of high-bias values, and, in response, including the value in one of the high-bias buckets for which the number of high-bias values has not reached the maximum number of high-bias values;
until no values included in any of the ranges of values associated with the histogram buckets occur in more than the reclassification threshold number of rows in the table; and
saving in a memory the width and the height of each of the one or more histogram buckets and the one or more high-bias values and numbers of rows for each of the one or more high-bias buckets.
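The repeat-until loop in steps (a) through (c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `build_statistics`, `make_histogram`, and `HistogramBucket` are hypothetical, the reclassification threshold is assumed to equal the average bucket height, and the histogram is assumed to split the sorted distinct values into roughly equal groups. The claim does not state how the minimum percentage is computed from F and B, so that criterion is left out. The sketch also stops early when every high-bias bucket is full, which is an added assumption.

```python
from collections import Counter
from typing import NamedTuple


class HistogramBucket(NamedTuple):
    low: float    # smallest value in the bucket's range
    high: float   # largest value in the bucket's range (width = high - low)
    height: int   # number of rows whose value falls in [low, high]


def make_histogram(counts, num_buckets):
    """Assumed scheme: split the sorted distinct values into contiguous groups."""
    values = sorted(counts)
    if not values:
        return []
    n = min(num_buckets, len(values))
    size = -(-len(values) // n)  # ceiling division
    buckets = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        height = sum(counts[v] for v in chunk)
        buckets.append(HistogramBucket(chunk[0], chunk[-1], height))
    return buckets


def build_statistics(column, num_hist_buckets, num_high_bias_buckets, max_values):
    """Return (histogram buckets, high-bias buckets) for one column of values.

    max_values plays the role of F: the most values one high-bias bucket may hold.
    """
    remaining = Counter(column)  # value -> row count, still in the histogram
    high_bias = [dict() for _ in range(num_high_bias_buckets)]
    while True:
        histogram = make_histogram(remaining, num_hist_buckets)
        if not histogram:
            break
        # (a) average height of the histogram buckets
        average = sum(b.height for b in histogram) / len(histogram)
        # (b) reclassification threshold (assumed here: the average itself)
        threshold = average
        # (c) find a value that occurs in more rows than the threshold
        frequent = [v for v, c in remaining.items() if c > threshold]
        if not frequent:
            break  # the claim's "until" condition is met
        target = next((b for b in high_bias if len(b) < max_values), None)
        if target is None:
            break  # assumption: stop when no high-bias bucket has room
        value = max(frequent, key=lambda v: remaining[v])
        target[value] = remaining.pop(value)
    return histogram, high_bias


# Skewed column: two dominant values plus twenty values that occur once each.
column = [7] * 50 + [3] * 30 + list(range(10, 30))
histogram, high_bias = build_statistics(column, 4, 1, 2)
print(high_bias)                      # [{7: 50, 3: 30}]
print([tuple(b) for b in histogram])  # [(10, 14, 5), (15, 19, 5), (20, 24, 5), (25, 29, 5)]
```

Recomputing the histogram after each reclassification matters: removing a dominant value lowers the average height, so a value that did not exceed the first threshold may exceed the next one.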
Abstract
A method, database system, and computer program for collecting statistics about a table are disclosed. The table includes one or more rows and each row includes a respective value. The method includes creating zero or more histogram buckets. Each histogram bucket includes a width representing a respective range of values and a height representing a count of rows having values in the range of values. The method further includes creating one or more high-bias buckets. Each high-bias bucket represents one or more values that appear in a minimum percentage of rows.
15 Claims
6. A database system including:
a massively parallel processing system including:
one or more nodes;
a plurality of CPUs, each of the one or more nodes providing access to one or more CPUs;
a plurality of data storage facilities, each of the one or more CPUs providing access to one or more data storage facilities;
P partitions, each partition residing on one or more data storage facilities;
a process for representing statistics, where the database system represents statistics about a table including one or more rows, each row including a respective value, the process including:
creating one or more histogram buckets, each histogram bucket including a width representing a respective range of values and a height representing a count of rows in the table having values in the range of values;
creating one or more high-bias buckets, each high-bias bucket including one or more high-bias values up to a maximum number of high-bias values (F) that appear in a minimum percentage of rows in the table and for each high-bias value a number of rows that contain the high-bias value;
where the minimum percentage of rows is computed using F and B, where B is the total number of buckets;
repeating the following:
(a) determining an average height of the histogram buckets;
(b) determining a reclassification threshold based on the average height of the histogram buckets; and
(c) concluding that a value associated with one of the one or more histogram buckets occurs in more rows of the table than the reclassification threshold, and, in response, concluding that the number of high-bias values associated with at least one of the one or more high-bias buckets has not reached the maximum number of high-bias values, and, in response, including the value in one of the high-bias buckets for which the number of high-bias values has not reached the maximum number of high-bias values;
until no values included in any of the ranges of values associated with the histogram buckets occur in more than the reclassification threshold number of rows in the table; and
saving in a memory the width and the height of each of the one or more histogram buckets and the one or more high-bias values and numbers of rows for each of the one or more high-bias buckets.
11. A computer program, stored on a tangible storage medium, for use in representing statistics in a database running in a partitioned parallel environment including P partitions, each partition residing on one or more parallel processing systems, the database including a first table including one or more rows stored in one or more of the P partitions, the program including executable instructions that cause a computer to:
represent statistics about a table including one or more rows, each row including one or more values, the program further causing the computer to:
create one or more histogram buckets, each histogram bucket including a width representing a respective range of values and a height representing a count of rows in the table having values in the range of values;
create one or more high-bias buckets, each high-bias bucket including one or more high-bias values up to a maximum number of high-bias values (F) that appear in a minimum percentage of rows in the table and for each high-bias value a number of rows that contain the high-bias value;
where the minimum percentage of rows is computed using F and B, where B is the total number of buckets;
repeat the following:
(a) determine an average height of the histogram buckets;
(b) determine a reclassification threshold based on the average height of the histogram buckets; and
(c) conclude that a value associated with one of the one or more histogram buckets occurs in more rows of the table than the reclassification threshold, and, in response, conclude that the number of high-bias values associated with at least one of the one or more high-bias buckets has not reached the maximum number of high-bias values, and, in response, include the value in one of the high-bias buckets for which the number of high-bias values has not reached the maximum number of high-bias values;
until no values included in any of the ranges of values associated with the histogram buckets occur in more than the reclassification threshold number of rows in the table; and
save in a memory the width and the height of each of the one or more histogram buckets and the one or more high-bias values and numbers of rows for each of the one or more high-bias buckets.
Specification