Method and apparatus for estimating the number of occurrences of frequent values in a data set

US 5,542,089 A
Filed: 07/26/1994
Issued: 07/30/1996
Est. Priority Date: 07/26/1994
Status: Expired due to Term

3 Assignments

0 Petitions

Abstract

A data base management system estimates the number of occurrences of values of query search keys in a data set by defining at least two independent hashing functions that map the values of the data set to buckets of respective hashing tables and maintaining a bucket count as each value from the data set is mapped to the hashing tables. A bucket is defined to be a "popular" bucket if the bucket count of the value exceeds a predetermined threshold. If all of the buckets to which a value is mapped are designated popular buckets, that value is designated an "active" value. Once a value is designated active, statistical data related to the value is collected. Estimates of the most frequently occurring values in the data set are generated from the collected statistical data. In this way, a data base management system can more effectively produce a search plan that provides an efficient response to user queries.
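
The scan described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `make_hash` and `scan`, the salted SHA-256 hashing, the table size, the threshold value, and the statistics kept per active value are all assumptions made for the example.

```python
import hashlib


def make_hash(salt, num_buckets):
    """Build one hashing function; different salts stand in for independent functions."""
    def h(value):
        digest = hashlib.sha256(f"{salt}:{value!r}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_buckets
    return h


def scan(values, num_buckets=64, threshold=3, salts=("h1", "h2")):
    """Make one pass over the data set and return statistics for active values."""
    hashes = [make_hash(s, num_buckets) for s in salts]
    tables = [[0] * num_buckets for _ in hashes]  # one bucket-count table per function
    active = {}
    for position, value in enumerate(values, start=1):
        buckets = [h(value) for h in hashes]
        for table, b in zip(tables, buckets):
            table[b] += 1  # maintain the bucket count in every hashing table
        if value in active:
            active[value]["count"] += 1  # already active: keep collecting statistics
        elif all(t[b] > threshold for t, b in zip(tables, buckets)):
            # Every bucket the value maps to is popular, so the value becomes active.
            # Assumption: the triggering occurrence is counted as the first one.
            active[value] = {
                "count": 1,
                "activated_at": position,
                "bucket_counts_at_activation": [t[b] for t, b in zip(tables, buckets)],
            }
    return active


stats = scan(["a"] * 10 + ["b", "c", "d"])
print(stats["a"])
```

Because the ten copies of "a" arrive first, its buckets pass the threshold of 3 on the fourth occurrence, and the remaining six occurrences are counted individually.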

189 Citations

73 Claims

1. A method of estimating the number of occurrences of values of query search keys in a data set stored in a digital computer for use by a query optimizer of the computer, the method comprising the steps of:
- defining at least two independent hashing functions that map values of the data set to buckets of respective hashing tables that are maintained in data storage of the computer;
  
  obtaining a current value from among the values in the data set;
  
  mapping the current value to a multiplicity of hashing table buckets of the data storage that are defined by each hashing function and incrementing an associated bucket count in the data storage;
  
  determining if the incremented bucket count of each hashing table satisfies predetermined criteria for being a popular bucket;
  
  designating the current value as active if all of the buckets to which the current value is mapped are designated popular buckets and adding the current value to a list of active values in the data storage that are associated with at least one of the hashing tables;
  
  collecting predetermined, statistical data related to the current value if it has been designated active;
  
  repeating the steps of obtaining, mapping, determining, and designating until all values in the data set have been obtained; and
  
  producing estimates of the most frequent values in the data set from the collected statistical data and providing them to the query optimizer.
- - 2. A method as defined in claim 1, wherein the step of collecting statistical data includes the step of maintaining an active value count that is incremented for each currently mapped bucket of the active value.
  - 3. A method as defined in claim 2, wherein the step of incrementing the active value count comprises the steps of:
    - incrementing the active value count if the value was previously designated an active value; and
      
      initializing the active value count and collecting predetermined statistics if the value was newly designated an active value.
  - 4. A method as defined in claim 1, wherein the step of determining comprises the steps of:
    - designating a bucket as a popular bucket if the bucket count is one of the P highest bucket counts in the respective hashing table, where P is a predetermined popularity parameter; and
      
      removing the designation of a bucket as a popular bucket if the bucket count is the lowest among the popular buckets and the number of popular buckets is greater than the popularity parameter P.
  - 5. A method as defined in claim 4, wherein the parameter P is provided by a computer user.
  - 6. A method as defined in claim 4, wherein the popularity parameter P is the same for all hashing tables.
  - 7. A method as defined in claim 1, wherein the step of producing estimates of frequent values comprises producing a predetermined number F of estimates, wherein F is a most frequent values estimator parameter.
  - 8. A method as defined in claim 7, wherein the most frequent values parameter F is provided by a computer user.
  - 9. A method as defined in claim 1, wherein the step of producing estimates of the most frequent values in the data set comprises the steps of:
    - generating a frequent value estimate for each active value; and
      
      selecting the F highest estimates, wherein F is a predetermined most frequent values estimator parameter.
  - 10. A method as defined in claim 9, wherein the step of generating a frequent value estimate comprises selecting one estimator from a plurality of estimators in accordance with expected value distribution characteristics of the data set.
  - 11. A method as defined in claim 9, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating a Constant Rate estimator defined by the product of the occurrences of an active value and the ratio of the count value when the value was first designated an active value and the time since the value was first designated an active value to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 12. A method as defined in claim 9, wherein the step of generating a frequent value estimate comprises the steps of:
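    - calculating a Bucket Rate estimator defined by the sum of the occurrences of an active value and the product of (the occurrences of an active value) and (the ratio of the bucket count at the time the value was designated active to the bucket count since the time the value was designated active) to generate the estimate for an active value;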
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 13. A method as defined in claim 9, wherein the step of generating a frequent value estimate comprises the steps of:
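    - calculating an Active Rate estimator defined by the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by (the ratio of the number of occurrences of the value before it was designated active to the time since it was designated active) multiplied by (the bucket count at the time the value was designated active subtracted by the active value count) to generate the estimate for an active value;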
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 14. A method as defined in claim 9, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating a Bucket Values estimator defined by the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by the product of added with the product of to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 15. A method as defined in claim 14, wherein the number of values in a bucket is estimated by using a distinct input values estimator.
  - 16. A method as defined in claim 14, wherein the average number of occurrences per value is estimated by dividing the total number of values by an estimated number of distinct values obtained by using a distinct input values estimator.
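
The popularity rule of claim 4 (a bucket is popular while its count is among the P highest in its hashing table, and the lowest popular bucket is dropped when more than P are designated) might be maintained incrementally as in the sketch below. The function name `update_popular`, the use of a Python set for the popular buckets, and the tie-breaking rule (a bucket must strictly exceed the lowest popular count to displace it) are assumptions for illustration; the claim does not specify them.

```python
def update_popular(counts, popular, bucket, p):
    """Increment one bucket of a hashing table and refresh that table's popular set.

    counts  -- list of bucket counts for one hashing table
    popular -- set of bucket indexes currently designated popular
    p       -- popularity parameter P (maximum number of popular buckets)
    Returns True if `bucket` is popular after the update.
    """
    counts[bucket] += 1
    if bucket not in popular:
        # Designate the bucket popular if there is room, or if its count now
        # exceeds the lowest count among the popular buckets (assumed tie rule).
        if len(popular) < p or counts[bucket] > min(counts[b] for b in popular):
            popular.add(bucket)
    if len(popular) > p:
        # Too many popular buckets: remove the designation of the lowest one.
        lowest = min(popular, key=lambda b: counts[b])
        popular.discard(lowest)
    return bucket in popular


counts, popular = [0] * 4, set()
for b in [0, 0, 0, 1, 1, 2, 2, 2]:
    update_popular(counts, popular, b, p=2)
print(sorted(popular))  # bucket 2 has displaced bucket 1
```

Scanning the popular set on every update costs O(P); a heap or sorted structure would be a natural refinement for large P, but is omitted to keep the sketch short.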

17. A method of estimating the most frequently occurring values of query search keys in a data set located in data storage of a digital computer for use by a data base manager of the computer in retrieving values from the data storage in accordance with a user query, the method comprising the steps of:
- (1) for each value in the data set, repeating the processing steps of
  - (a) obtaining a data set value from among the values stored in the data storage,
  - (b) mapping the value to a bucket in respective hashing tables of data storage that are defined by each of at least two independent hashing functions and incrementing an associated bucket count of each bucket in the data storage,
  - (c) designating the bucket a popular bucket if the incremented bucket count is one of the P highest bucket counts in the respective hashing table, where P is a predetermined popularity parameter,
  - (d) detecting if the number of buckets in the data storage that are designated popular buckets is greater than the popularity parameter P and removing the designation of a previously designated different bucket as a popular bucket if the bucket count of the different bucket is the lowest among the popular buckets,
  - (e) designating the value an active value if all of the buckets to which the value is mapped have been designated popular buckets and adding the value to a list of active values in the data storage that is associated with at least one of the hashing tables,
  - (f) collecting predetermined statistical data in the data storage relating to the value if it was designated an active value at the step of designating active values or if it was previously designated an active value,
  
  until all values in the data set have been processed;
  
  (2) producing an estimate of the F most frequent values in the data set using the collected statistical data after all values in the data set have been processed, where F is a predetermined frequency estimator parameter; and
  
  (3) providing the estimated most frequent values to the data base manager for use in generating a query plan to retrieve values in the data set and return the retrieved values to an output device.
- - 18. A method as defined in claim 17, wherein the step of collecting statistical data includes incrementing an active value count for the buckets to which the active value is mapped.
  - 19. A method as defined in claim 17, wherein the step of producing an estimate comprises the steps of:
    - generating a frequent value estimate for each active value; and
      
      selecting the F highest frequent value estimates.
  - 20. A method as defined in claim 19, wherein the step of generating a frequent value estimate comprises selecting one estimator from among a plurality of predetermined estimators in accordance with expected value distribution characteristics of the data set.
  - 21. A method as defined in claim 17, wherein the frequency estimator parameter F is provided by a computer user.
  - 22. A method as defined in claim 17, wherein the popularity parameter P is provided by a computer user.
  - 23. A method as defined in claim 17, wherein the popularity parameter P is the same for all the hashing tables.
  - 24. A method as defined in claim 17, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating a Constant Rate estimator defined by the product of the occurrences of an active value and the ratio of the count value when the value was first designated an active value and the time since the value was first designated an active value to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 25. A method as defined in claim 17, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating a Bucket Rate estimator defined by the sum of the occurrences of an active value and the product of (the occurrences of an active value) and (the ratio of the bucket count at the time the value was designated active to the bucket count since the time the value was designated active) to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 26. A method as defined in claim 17, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating an Active Rate estimator defined by the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by (the ratio of the number of occurrences of the value before it was designated active to the time since it was designated active) multiplied by (the bucket count at the time the value was designated active subtracted by the active value count) to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 27. A method as defined in claim 17, wherein the step of generating a frequent value estimate comprises the steps of:
    - calculating a Bucket Values estimator defined by the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by the product of added with the product of to generate the estimate for an active value;
      
      repeating the step of calculating for each active value; and
      
      returning the F highest estimates for the active values.
  - 28. A method as defined in claim 27, wherein the number of values in a bucket is estimated by a distinct input values estimator.
  - 29. A method as defined in claim 27, wherein the average number of occurrences per value is estimated by dividing the total number of values by an estimated number of distinct values obtained from a distinct input values estimator.
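
As a worked illustration of claims 19 and 25, the Bucket Rate estimate can be read as n + n × (B_at / B_since), where n is the number of occurrences counted for the active value, B_at is the bucket count when the value was designated active, and B_since is the bucket count accumulated since then. The sketch below follows that reading. The `ActiveStats` record, the function names, the choice to track a single bucket per value, and the fallback when B_since is zero are assumptions, since the claims leave these details open.

```python
import heapq
from dataclasses import dataclass


@dataclass
class ActiveStats:
    occurrences: int           # times the value was seen while active (n)
    bucket_at_activation: int  # bucket count when the value was designated active (B_at)
    bucket_since: int          # bucket count accumulated since activation (B_since)


def bucket_rate(s: ActiveStats) -> float:
    """Bucket Rate estimate: n + n * (B_at / B_since)."""
    if s.bucket_since == 0:
        return float(s.occurrences)  # assumed fallback: nothing to extrapolate from
    return s.occurrences + s.occurrences * (s.bucket_at_activation / s.bucket_since)


def top_f(stats: dict, f: int) -> list:
    """Generate an estimate for each active value and return the F highest."""
    estimates = ((value, bucket_rate(s)) for value, s in stats.items())
    return heapq.nlargest(f, estimates, key=lambda pair: pair[1])


stats = {
    "a": ActiveStats(occurrences=90, bucket_at_activation=50, bucket_since=100),
    "b": ActiveStats(occurrences=20, bucket_at_activation=10, bucket_since=40),
    "c": ActiveStats(occurrences=3, bucket_at_activation=4, bucket_since=8),
}
print(top_f(stats, 2))  # [('a', 135.0), ('b', 25.0)]
```

Under this reading, the ratio n / B_since is the value's share of its bucket since activation, and applying that share to B_at extrapolates the occurrences that went uncounted before the value became active.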

30. A computer combination having a central processor unit, data storage having a data set of values, and a data base management system having a data base manager that operates on the data set to retrieve values of query search keys in accordance with a user query, the combination including:
- at least two independent hashing functions maintained by the data base manager in the data storage of the computer that map the values of the data set to buckets of respective hashing tables in the data storage and increment a bucket count associated with each bucket when the respective function maps a value to the bucket;
  
  a query optimizer that determines a query plan to be followed by the data base manager in response to the user query; and
  
  a frequent values estimator, wherein for each value of the data set the frequent values estimator determines if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determines if the current value has been designated an active value, and maintains statistics on the value if it is active until all values of the data set have been mapped; and
  
  a report generator that determines the most frequent values in the data set and produces a report of the values to the query optimizer.
- - 31. A combination as defined in claim 30, wherein the frequent values estimator determines if a bucket is a popular bucket by:
    - designating a bucket as a popular bucket if the bucket count is one of the P highest bucket counts in the respective hashing table, where P is a predetermined popularity parameter; and
      
      removing the designation of a bucket as a popular bucket if the bucket count is the lowest among the popular buckets and the number of popular buckets is greater than the popularity parameter P.
  - 32. A combination as defined in claim 31, wherein the parameter P is provided by a computer user.
  - 33. A combination as defined in claim 31, wherein the parameter P is the same for all hashing tables.
  - 34. A combination as defined in claim 30, wherein the frequent values estimator maintains statistics in the data storage relating to active values by incrementing an active value count for all of the buckets to which an active value is mapped.
  - 35. A combination as defined in claim 34, wherein the frequent values estimator increments an active value count for a value if the value was previously designated an active value and initializes the active value count if it newly designated the value as an active value.
  - 36. A combination as defined in claim 34, wherein the report generator produces estimates of occurrences of the active values and selects a predetermined number F of the highest estimates, wherein F is a predetermined most frequent values estimator parameter.
  - 37. A combination as defined in claim 36, wherein the most frequent values estimator parameter F is provided by a computer user.
  - 38. A combination as defined in claim 30, wherein the report generator produces estimates of the most frequent values in the data set by:
    - generating a frequent value estimate for each active value; and
      
      selecting the F highest estimates, wherein F is a predetermined most frequent values estimator parameter.
  - 39. A combination as defined in claim 38, wherein the report generator generates a frequent value estimate by selecting an estimator from among a plurality of predetermined estimators in accordance with expected value distribution characteristics of the data set.
  - 40. A combination as defined in claim 39, wherein the selected estimator comprises a Constant Rate estimator such that the report generator generates a frequent value estimate by calculating the product of the occurrences of an active value and the ratio of the count value when the value was first designated an active value and the time since the value was first designated an active value to generate the estimate for an active value.
  - 41. A combination as defined in claim 39, wherein the selected estimator comprises a Bucket Rate estimator such that the report generator generates a frequent value estimate by calculating the sum of the occurrences of an active value and the product of (the occurrences of an active value) and (the ratio of the bucket count at the time the value was designated active to the bucket count since the time the value was designated active) to generate the estimate for an active value.
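  - 42. A combination as defined in claim 39, wherein the selected estimator comprises an Active Rate estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by (the ratio of the number of occurrences of the value before it was designated active to the time since it was designated active) multiplied by (the bucket count at the time the value was designated active subtracted by the active value count) to generate the estimate for an active value.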
  - 43. A combination as defined in claim 39, wherein the selected estimator comprises a Bucket Values estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by the product of added with the product of to generate the estimate for an active value.
  - 44. A combination as defined in claim 43, wherein the report generator includes a Distinct Input Values estimator that estimates the number of distinct values mapped to a bucket.
  - 45. A combination as defined in claim 44, wherein the report generator estimates the average number of occurrences per value by dividing the total number of values by an estimated number of distinct values provided by the Distinct Input Values estimator.
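
The arithmetic in claims 44 and 45 is a single division: the average number of occurrences per value is the total number of values divided by an estimated number of distinct values. The sketch below shows that calculation. The claims do not say how the Distinct Input Values estimator works, so `exact_distinct` is a stand-in that simply counts distinct values exactly; both function names are hypothetical.

```python
from typing import Callable, Hashable, Sequence


def exact_distinct(values: Sequence[Hashable]) -> int:
    """Stand-in for a distinct input values estimator (exact count, not an estimate)."""
    return len(set(values))


def average_occurrences_per_value(
    values: Sequence[Hashable],
    distinct_estimator: Callable[[Sequence[Hashable]], int] = exact_distinct,
) -> float:
    """Total number of values divided by the (estimated) number of distinct values."""
    distinct = distinct_estimator(values)
    if distinct == 0:
        return 0.0  # empty input: no values, so no average to report
    return len(values) / distinct


print(average_occurrences_per_value(["x", "x", "y", "z", "x", "y"]))  # 2.0
```

A real estimator would presumably trade exactness for bounded memory; any callable with the same signature can be passed as `distinct_estimator`.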

46. A frequent values estimator system for use in a computer system having a central processor unit, data storage having a data set of values, and a data base management system having a data base manager that operates on the data set to retrieve values of query search keys in accordance with a user query, the system including:
- at least two independent hashing functions, maintained by the data base manager in the data storage of the computer, that map the values of the data set to buckets of respective hashing tables in the data storage and increment a bucket count associated with each bucket when a value is mapped to the bucket;
  
  a query optimizer that determines a search plan to be followed by the data base manager in response to the user query;
  
  a frequent values estimator, wherein for each value of the data set, the frequent values estimator determines if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determines if the current value has been designated an active value, and maintains statistics in the data storage relating to the value, if it is active, until all remaining values of the data set have been mapped; and
  
  a report generator that determines the estimated most frequent values in the data set and provides the determined values to the query optimizer for use by the query optimizer in determining the search plan.
- - 47. A system as defined in claim 46, wherein the report generator produces estimates of the most frequent values in the data set by:
    - generating a frequent value estimate for each active value; and
      
      selecting the F highest estimates, wherein F is a most frequent values estimator parameter.
  - 48. A system as defined in claim 47, wherein the most frequent values estimator parameter F is provided by a computer user.
  - 49. A system as defined in claim 47, wherein the popularity parameter P is the same for all hashing tables.
  - 50. A system as defined in claim 47, wherein the popularity parameter P is provided by a computer user.
  - 51. A system as defined in claim 47, wherein the report generator generates a frequent value estimate by selecting an estimator from among a plurality of predetermined estimators in accordance with expected value distribution characteristics of the data set.
  - 52. A system as defined in claim 47, wherein the selected estimator comprises a Constant Rate estimator such that the report generator generates a frequent value estimate by calculating the product of the occurrences of an active value and the ratio of the count value when the value was first designated an active value and the time since the value was first designated an active value to generate the estimate for an active value.
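  - 53. A system as defined in claim 47, wherein the selected estimator comprises a Bucket Rate estimator such that the report generator generates a frequent value estimate by calculating the sum of the occurrences of an active value and the product of (the occurrences of an active value) and (the ratio of the bucket count at the time the value was designated active to the bucket count since the time the value was designated active) to generate the estimate for an active value.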
  - 54. A system as defined in claim 47, wherein the selected estimator comprises an Active Rate estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by (the ratio of the number of occurrences of the value before it was designated active to the time since it was designated active) multiplied by (the bucket count at the time the value was designated active subtracted by the active value count) to generate the estimate for an active value.
  - 55. A system as defined in claim 47, wherein the selected estimator comprises a Bucket Values estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by the product of added with the product of to generate the estimate for an active value.
  - 56. A system as defined in claim 55, wherein the report generator includes a distinct input values estimator that estimates the number of values in a bucket.
  - 57. A system as defined in claim 56, wherein the report generator estimates the average number of occurrences per value by dividing the total number of values by an estimated number of distinct values provided by the distinct input values estimator.

58. A computer system comprising:
- a computer terminal that receives commands from a terminal user;
  
  a data storage unit that receives a data set having values of query search keys;
  
  a data base management system having a data base manager that operates on the data set to retrieve values from among the values in the data storage unit in accordance with a user query and that maintains at least two independent hashing functions that map the values of the data set to buckets of respective hashing tables in the data storage unit and increment a bucket count of the data storage unit associated with each bucket when a value is mapped to the bucket;
  
  query optimizer means for determining a query plan to be followed by the data base manager in response to the user query;
  
  frequent values estimator means for determining, for each value of the data set, if each bucket to which the value is mapped satisfies predetermined criteria for being a popular bucket, determining if the current value has been designated an active value, and maintaining statistics on the value if it is active, until all values of the data set have been mapped; and
  
  a report generator that determines the estimated most frequent values in the data set based on the statistics and produces a report of the values to the query optimizer means.
- - 59. A computer system as defined in claim 58, wherein the frequent values estimator means determines if a bucket is a popular bucket by:
    - designating a bucket as a popular bucket if the bucket count is one of the P highest bucket counts in the respective hashing table, where P is a predetermined popularity parameter; and
      
      removing the designation of a bucket as a popular bucket if the bucket count is the lowest among the popular buckets and the number of popular buckets is greater than the popularity parameter P.
  - 60. A computer system as defined in claim 59, wherein the popularity parameter P is provided by a computer user.
  - 61. A computer system as defined in claim 59, wherein the popularity parameter P is the same for all hashing tables.
  - 62. A computer system as defined in claim 58, wherein the frequent values estimator means maintains statistics on active values by incrementing an active value count for all of the buckets to which an active value is mapped.
  - 63. A computer system as defined in claim 62, wherein the frequent values estimator means increments an active value count for a value if the value was previously designated an active value and initializes the active value count if it newly designated the value as an active value.
  - 64. A computer system as defined in claim 62, wherein the report generator produces estimates of occurrences of the active values and selects a predetermined number F of the highest estimates, wherein F is a predetermined most frequent values estimator parameter.
  - 65. A computer system as defined in claim 64, wherein the most frequent values estimator means parameter F is provided by a computer user.
  - 66. A computer system as defined in claim 58, wherein the report generator produces estimates of the most frequent values in the data set by:
    - generating a frequent value estimate for each active value; and
      
      selecting the F highest estimates, wherein F is a predetermined most frequent values estimator parameter.
  - 67. A computer system as defined in claim 66, wherein the report generator generates a frequent value estimate by selecting an estimator from among a plurality of predetermined estimators in accordance with expected value distribution characteristics of the data set.
  - 68. A computer system as defined in claim 67, wherein the selected estimator comprises a Constant Rate estimator such that the report generator generates a frequent value estimate by calculating the product of the occurrences of an active value and the ratio of the count value when the value was first designated an active value and the time since the value was first designated an active value to generate the estimate for an active value.
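  - 69. A computer system as defined in claim 67, wherein the selected estimator comprises a Bucket Rate estimator such that the report generator generates a frequent value estimate by calculating the sum of the occurrences of an active value and the product of (the occurrences of an active value) and (the ratio of the bucket count at the time the value was designated active to the bucket count since the time the value was designated active) to generate the estimate for an active value.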
  - 70. A computer system as defined in claim 67, wherein the selected estimator comprises an Active Rate estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by (the ratio of the number of occurrences of the value before it was designated active to the time since it was designated active) multiplied by (the bucket count at the time the value was designated active subtracted by the active value count) to generate the estimate for an active value.
  - 71. A computer system as defined in claim 67, wherein the selected estimator comprises a Bucket Values estimator such that the report generator generates a frequent value estimate by calculating the number of occurrences of an active value added with the bucket count at the time the value was designated active subtracted by the product of added with the product of to generate the estimate for an active value.
  - 72. A computer system as defined in claim 71, wherein the report generator includes a distinct input values estimator that estimates the number of values in a bucket.
  - 73. A computer system as defined in claim 71, wherein the report generator estimates the average number of occurrences per value by dividing the total number of values by an estimated number of distinct values provided by the distinct input values estimator.

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Lindsay, Bruce G.; Shekita, Eugene J.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Alam, Hosain T.
Application Number: US08/280,623
Time in Patent Office: 735 Days
Field of Search: 395/425, 395/575, 395/600, 395/650, 395/800, 364/419.13, 364/419.19, 370/85.13, 370/92, 370/96.1
US Class Current: 1/1
CPC Class Codes:
G06F 16/24547   Optimisations to support sp...
G06F 16/2462   Approximate or statistical ...
G06F 16/9014   hash tables
Y10S 707/99932   Access augmentation or opti...
