Method and apparatus for estimating the number of occurrences of frequent values in a data set
First Claim
1. A method of estimating the number of occurrences of values of query search keys in a data set stored in a digital computer for use by a query optimizer of the computer, the method comprising the steps of:
- defining at least two independent hashing functions that map values of the data set to buckets of respective hashing tables that are maintained in data storage of the computer;
obtaining a current value from among the values in the data set;
mapping the current value to a multiplicity of hashing table buckets of the data storage that are defined by each hashing function and incrementing an associated bucket count in the data storage;
determining if the incremented bucket count of each hashing table satisfies predetermined criteria for being a popular bucket;
designating the current value as active if all of the buckets to which the current value is mapped are designated popular buckets and adding the current value to a list of active values in the data storage that are associated with at least one of the hashing tables;
collecting predetermined, statistical data related to the current value if it has been designated active;
repeating the steps of obtaining, mapping, determining, and designating until all values in the data set have been obtained; and
producing estimates of the most frequent values in the data set from the collected statistical data and providing them to the query optimizer.
3 Assignments
0 Petitions
Accused Products
Abstract
A data base management system estimates the number of occurrences of values of query search keys in a data set by defining at least two independent hashing functions that map the values of the data set to buckets of respective hashing tables and maintaining a bucket count as each value from the data set is mapped to the hashing tables. A bucket is defined to be a "popular" bucket if the bucket count of the value exceeds a predetermined threshold. If all of the buckets to which a value is mapped are designated popular buckets, that value is designated an "active" value. Once a value is designated active, statistical data related to the value is collected. Estimates of the most frequently occurring values in the data set are generated from the collected statistical data. In this way, a data base management system can more effectively produce a search plan that provides an efficient response to user queries.
189 Citations
73 Claims
-
1. A method of estimating the number of occurrences of values of query search keys in a data set stored in a digital computer for use by a query optimizer of the computer, the method comprising the steps of:
-
defining at least two independent hashing functions that map values of the data set to buckets of respective hashing tables that are maintained in data storage of the computer; obtaining a current value from among the values in the data set; mapping the current value to a multiplicity of hashing table buckets of the data storage that are defined by each hashing function and incrementing an associated bucket count in the data storage; determining if the incremented bucket count of each hashing table satisfies predetermined criteria for being a popular bucket; designating the current value as active if all of the buckets to which the current value is mapped are designated popular buckets and adding the current value to a list of active values in the data storage that are associated with at least one of the hashing tables; collecting predetermined, statistical data related to the current value if it has been designated active; repeating the steps of obtaining, mapping, determining, and designating until all values in the data set have been obtained; and producing estimates of the most frequent values in the data set from the collected statistical data and providing them to the query optimizer. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A method of estimating the most frequently occurring values of query search keys in a data set located in data storage of a digital computer for use by a data base manager of the computer in retrieving values from the data storage in accordance with a user query, the method comprising the steps of:
-
(1) for each value in the data set, repeating the processing steps of (a) obtaining a data set value from among the values stored in the data storage, (b) mapping the value to a bucket in respective hashing tables of data storage that are defined by each of at least two independent hashing functions and incrementing an associated bucket count of each bucket in the data storage, (c) designating the bucket a popular bucket if the incremented bucket count is one of the P highest bucket counts in the respective hashing table, where P is a predetermined popularity parameter, (d) detecting if the number of buckets in the data storage that are designated popular buckets is greater than the popularity parameter P and removing the designation of a previously designated different bucket as a popular bucket if the bucket count of the different bucket is the lowest among the popular buckets, (e) designating the value an active value if all of the buckets to which the value is mapped have been designated popular buckets and adding the value to a list of active values in the data storage that is associated with at least one of the hashing tables, (f) collecting predetermined statistical data in the data storage relating to the value if it was designated an active value at the step of designating active values or if it was previously designated an active value, until all values in the data set have been processed; (2) producing an estimate of the F most frequent values in the data set using the collected statistical data after all values in the data set have been processed, where F is a predetermined frequency estimator parameter; and (3) providing the estimated most frequent values to the data base manager for use in generating a query plan to retrieve values in the data set and return the retrieved values to an output device. - View Dependent Claims (18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
-
-
30. A computer combination having a central processor unit, data storage having a data set of values, and a data base management system having a data base manager that operates on the data set to retrieve values of query search keys in accordance with a user query, the combination including:
-
at least two independent hashing functions maintained by the data base manager in the data storage of the computer that map the values of the data set to buckets of respective hashing tables in the data storage and increment a bucket count associated with each bucket when the respective function maps a value to the bucket; a query optimizer that determines a query 7plan to be followed by the data base manager in response to the user query; and a frequent values estimator, wherein for each value of the data set the frequent values estimator; determines if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determines if the current value has been designated an active value, and maintains statistics on the value if it is active until all values of the data set have been mapped; and a report generator that determines the most frequent values in the data set and produces a report of the values to the query optimizer. - View Dependent Claims (31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45)
-
-
46. A frequent values estimator system for use in a computer system having a central processor unit, data storage having a data set of values, and a data base management system having a data base manager that operates on the data set to retrieve values of query search keys in accordance with a user query, the system including:
-
at least two independent hashing functions, maintained by the data base manager in the data storage of the computer, that map the values of the data set to buckets of respective hashing tables in the data storage and, increment a bucket count associated with each bucket when a value is mapped to the bucket; a query optimizer that determines a search plan to be followed by the data base manager in response to the user query; a frequent values estimator, wherein for each value of the data set, the frequent values estimator; determines if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determines if the current value has been designated an active value, and maintains statistics in the data storage relating to the value, if it is active, until all remaining values of the data set have been mapped; and a report generator that determines the estimated most frequent values in the data set and provides the determined values to the query optimizer for use by the query optimizer in determining the search plan. - View Dependent Claims (47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
-
-
58. A computer system comprising:
-
a computer terminal that receives commands from a terminal user; a data storage unit that receives a data set having values of query search keys; a data base management system having a data base manager that operates on the data set to retrieve values from among the values in the data storage unit in accordance with a user query and that maintains at least two independent hashing functions that map the values of the data set to buckets of respective hashing tables in the data storage unit and increment a bucket count of the data storage unit associated with each bucket when a value is mapped to the bucket; query optimizer means for determining a query plan to be followed by the data base manager in response to the user query; frequent values estimator means for determining, for each value of the data set, if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determining if the current value has been designated an active value, and maintaining statistics on the value if it is active, until all values of the data set have been mapped; and a report generator that determines the estimated most frequent values in the data set based on the statistics and produces a report of the values to the query optimizer means. - View Dependent Claims (59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73)
-
Specification