Database system and method employing data cube operator for group-by operations

US 5,832,475 A
Filed: 03/29/1996
Issued: 11/03/1998
Est. Priority Date: 03/29/1996
Status: Expired due to Fees

Abstract

Disclosed is a system and method for performing database queries including GROUP-BY operations, in which aggregate values for attributes are desired for distinct, partitioned subsets of tuples satisfying a query. A special case of the aggregation problem is addressed, employing a structure, called the data cube operator, which provides information useful for expediting execution of GROUP-BY operations in queries. Algorithms are provided for constructing the data cube by efficiently computing a collection of GROUP-BYs on the attributes of the relation. Decision support systems often require computation of multiple GROUP-BY operations on a given set of attributes, the GROUP-BYs being related in the sense that their attributes are subsets or supersets of each other. The invention extends hash-based and sort-based grouping methods with optimizations, including combining common operations across multiple GROUP-BYs and using pre-computed GROUP-BYs for computing other GROUP-BYs. An extension of the cube algorithms handles any given collection of aggregates.
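
As a rough illustration of the idea in the abstract, the sketch below computes every GROUP-BY over a set of attributes and derives each smaller GROUP-BY from an already computed one rather than from the base relation. It is a minimal sketch, not the patented implementation: the names `aggregate` and `data_cube`, the row layout (a list of dicts), and the use of SUM as the only aggregate are assumptions made for this example.

```python
from itertools import combinations

def aggregate(parent, parent_attrs, child_attrs):
    """Roll a GROUP-BY result up to a subset of its attributes by summing."""
    keep = [parent_attrs.index(a) for a in child_attrs]
    child = {}
    for key, total in parent.items():
        sub = tuple(key[i] for i in keep)
        child[sub] = child.get(sub, 0) + total
    return child

def data_cube(rows, attrs, measure):
    """Compute SUM(measure) for every subset of attrs, largest subsets first."""
    full = tuple(attrs)
    base = {}
    for r in rows:
        key = tuple(r[a] for a in full)
        base[key] = base.get(key, 0) + r[measure]
    cube = {full: base}
    for size in range(len(full) - 1, -1, -1):
        for child in combinations(full, size):
            # Candidate parents have exactly one more attribute; use the one
            # with the fewest groups (an assumed "smallest parent" rule).
            parents = [p for p in cube
                       if len(p) == size + 1 and set(child) <= set(p)]
            best = min(parents, key=lambda p: len(cube[p]))
            cube[child] = aggregate(cube[best], best, child)
    return cube

# Hypothetical sample data for illustration only.
rows = [
    {"product": "A", "store": "X", "year": 1995, "sales": 10},
    {"product": "A", "store": "Y", "year": 1995, "sales": 5},
    {"product": "B", "store": "X", "year": 1996, "sales": 7},
]
cube = data_cube(rows, ["product", "store", "year"], "sales")
```

Reusing a parent result is valid here because SUM is distributive; an aggregate such as a median could not be derived from partial results in this way.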

188 Citations

69 Claims

1. A method, for execution by a database system, for performing GROUP-BY operations on a dataset of attributes, the method comprising the steps of:
- generating an operator which represents (i) possible GROUP-BY operations on the dataset of attributes and (ii) hierarchies of the possible GROUP-BY operations, using a predetermined optimization technique;
  
  computing a first GROUP-BY operation on a first subset of the dataset of attributes using the operator; and
  
  computing a second GROUP-BY operation on a second subset of the dataset of attributes using the operator, the second subset being a proper subset of the first subset, based on the results of the step of computing a first GROUP-BY operation on the first subset of the attributes.
- - 2. A method as recited in claim 1, wherein the step of generating an operator includes generating a data cube operator.
  - 3. A method as recited in claim 1, wherein:
    - the method further includes using the operator to identify, for a given one of possible GROUP-BY operations for the attributes of the relation, parent ones of the possible GROUP-BY operations from which the given GROUP-BY operation may be computed; and
      
      the step of computing a second GROUP-BY operation includes computing the second GROUP-BY operation based on the results of one of the parent ones of the possible GROUP-BY operations, identified for the second GROUP-BY operation in the step of identifying, which is smallest of the parent ones of the possible GROUP-BY operations.
  - 4. A method as recited in claim 1, wherein the step of generating an operator includes using an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in cache memory; and
      
      computing another GROUP-BY based on the results of the first GROUP-BY operation stored in cache memory by the step of storing.
  - 5. A method as recited in claim 1, further comprising using an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in disk storage;
      
      scanning the results of the first GROUP-BY in disk storage; and
      
      computing a plurality of other GROUP-BYs concurrently based on the scanned results of the first GROUP-BY.
  - 6. A method as recited in claim 1, wherein the steps of computing first and second GROUP-BYs are performed using a sort-based method.
  - 7. A method as recited in claim 6, further comprising the step of computing multiple GROUP-BYs using a single sort.
  - 8. A method as recited in claim 7, wherein the step of computing multiple GROUP-BYs using a single sort includes computing the GROUP-BYs in a pipelined fashion.
  - 9. A method as recited in claim 7, wherein the step of computing multiple GROUP-BYs using a single sort includes the steps of:
    - identifying respective levels within the operator generated by the step of generating an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      determining, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have one greater number of attributes, the step of determining being made based on a reduction to weighted bipartite matching.
  - 10. A method as recited in claim 1, wherein the steps of computing first and second GROUP-BYs are performed using a hash-based method.
  - 11. A method as recited in claim 10, further comprising the steps of:
    - producing hash tables for multiple GROUP-BYs; and
      
      storing the hash tables in cache memory; and
      
      computing successive GROUP-BYs from the hash tables stored in cache memory.
  - 12. A method as recited in claim 11, wherein the step of producing hash tables includes one of (i) employing a multidimensional array, and (ii) concatenating a predetermined number of bits for each attribute, based on how sparse is a distribution of data for the attributes.
  - 13. A method as recited in claim 10, further comprising the steps of:
    - partitioning based on an attribute; and
      
      computing multiple GROUP-BYs which include the attribute upon which the step of partitioning was based.
  - 14. A method as recited in claim 13, wherein the step of partitioning is performed on a portion of the dataset.
  - 15. A method as recited in claim 1, further comprising the steps of:
    - estimating a size of a GROUP-BY; and
      
      selecting, based on the estimated size, whether the step of computing the GROUP-BY is to be done using a sort-based method or a hash-based method.
  - 16. A method as recited in claim 1, further comprising the steps of:
    - estimating a size of a GROUP-BY; and
      
      selecting, based on the estimated size, whether the step of computing the GROUP-BY is to be done using a hash-based method including producing hash tables by (i) employing a multidimensional array, and (ii) concatenating a predetermined number of bits for each attribute.
  - 17. A method as recited in claim 1, further comprising the steps of:
    - identifying respective levels within the operator generated by the step of generating an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      determining, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have more than one greater number of attributes.
  - 18. A method as recited in claim 1, wherein:
    - the method is performed responsive to a received query which requests a subset of all possible GROUP-BY operations; and
      
      the step of generating an operator includes generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset.
  - 19. A method as recited in claim 18, wherein the step of generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes:
    - generating an intermediate GROUP-BY operation not included in the requested subset; and
      
      generating one of the requested GROUP-BY operations from the intermediate GROUP-BY operation.
  - 20. A method as recited in claim 18, wherein the step of generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes generating a minimal cost Steiner tree.
  - 21. A method as recited in claim 18, wherein the step of generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a sort-based method.
  - 22. A method as recited in claim 18, wherein the step of generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a hash-based method.
  - 23. A method as recited in claim 1, wherein:
    - one of the attributes of the dataset has a hierarchy of levels defined on it;
      
      the method is performed responsive to a received query which requests GROUP-BY operations in which different designated levels of the hierarchy for the attribute are requested; and
      
      the step of generating an operator includes using an optimization technique of identifying, for a given one of the GROUP-BY operations including a higher level of the hierarchy for the attribute having a hierarchy of levels, a parent one of the GROUP-BY operations which includes a lower level of the hierarchy for the attribute having a hierarchy of levels, from which the given GROUP-BY operation may be computed.
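
Claims 7 and 8 recite computing multiple GROUP-BYs from a single sort in a pipelined fashion. The following is a minimal sketch of one way this can work, assuming SUM as the aggregate: after sorting on an attribute order, every prefix of that order can be aggregated in the same scan, and each finished group feeds its total into the next coarser GROUP-BY. The function name `pipelined_group_bys` and the row layout are hypothetical.

```python
def pipelined_group_bys(rows, order, measure):
    """Sort once on `order`; one scan yields a GROUP-BY for each prefix."""
    n = len(order)
    data = sorted(rows, key=lambda r: tuple(r[a] for a in order))
    results = {tuple(order[:k]): [] for k in range(n + 1)}
    totals = [0] * (n + 1)   # running total per prefix length
    current = None           # full key of the group currently open

    def close(down_to):
        # Emit finished groups for prefix lengths n .. down_to, passing each
        # total to the next coarser GROUP-BY instead of rescanning rows.
        for k in range(n, down_to - 1, -1):
            results[tuple(order[:k])].append((current[:k], totals[k]))
            if k > 0:
                totals[k - 1] += totals[k]
            totals[k] = 0

    for r in data:
        key = tuple(r[a] for a in order)
        if current is not None and key != current:
            # First attribute position where the new row leaves the open group.
            d = next(i for i in range(n) if key[i] != current[i])
            close(d + 1)
        current = key
        totals[n] += r[measure]
    if current is not None:
        close(0)
    return results

# Hypothetical sample data for illustration only.
rows = [
    {"product": "B", "store": "X", "sales": 7},
    {"product": "A", "store": "Y", "sales": 5},
    {"product": "A", "store": "X", "sales": 10},
]
out = pipelined_group_bys(rows, ["product", "store"], "sales")
```

Only GROUP-BYs whose attributes form a prefix of the sort order share the sort in this sketch; choosing which GROUP-BYs to pipeline together is the cost-based step addressed by claim 9 and is not shown here.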

24. A database system, for performing GROUP-BY operations on a dataset of attributes, the system comprising:
- means for generating an operator which represents (i) possible GROUP-BY operations on the dataset of attributes and (ii) hierarchies of the possible GROUP-BY operations, using a predetermined optimization technique;
  
  means for computing a first GROUP-BY operation on a first subset of the dataset of attributes using the operator; and
  
  means for computing a second GROUP-BY operation on a second subset of the dataset of attributes using the operator, the second subset being a proper subset of the first subset, based on the results of the means for computing a first GROUP-BY operation on the first subset of the attributes.
- - 25. A system as recited in claim 24, wherein the means for generating an operator includes means for generating a data cube operator.
  - 26. A system as recited in claim 24, wherein:
    - the system further includes means for using the operator to identify, for a given one of possible GROUP-BY operations for the attributes of the relation, parent ones of the possible GROUP-BY operations from which the given GROUP-BY operation may be computed; and
      
      the means for computing a second GROUP-BY operation includes means for computing the second GROUP-BY operation based on the results of one of the parent ones of the possible GROUP-BY operations, identified for the second GROUP-BY operation by the means for identifying, which is smallest of the parent ones of the possible GROUP-BY operations.
  - 27. A system as recited in claim 24, wherein the means for generating an operator includes means for using an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in cache memory; and
      
      computing another GROUP-BY based on the results of the first GROUP-BY operation stored in cache memory by the step of storing.
  - 28. A system as recited in claim 24, further comprising means for using an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in disk storage;
      
      scanning the results of the first GROUP-BY in disk storage; and
      
      computing a plurality of other GROUP-BYs concurrently based on the scanned results of the first GROUP-BY.
  - 29. A system as recited in claim 24, wherein the means for computing first and second GROUP-BYs include means for using a sort-based method.
  - 30. A system as recited in claim 29, further comprising the means for computing multiple GROUP-BYs using a single sort.
  - 31. A system as recited in claim 30, wherein the means for computing multiple GROUP-BYs using a single sort includes means for computing the GROUP-BYs in a pipelined fashion.
  - 32. A system as recited in claim 30, wherein the means for computing multiple GROUP-BYs using a single sort includes:
    - means for identifying respective levels within the operator generated by the means for generating an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      means for determining, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have one greater number of attributes, the means for determining being made based on a reduction to weighted bipartite matching.
  - 33. A system as recited in claim 24, wherein the means for computing first and second GROUP-BYs include means for using a hash-based method.
  - 34. A system as recited in claim 33, further comprising:
    - means for producing hash tables for multiple GROUP-BYs; and
      
      means for storing the hash tables in cache memory; and
      
      means for computing successive GROUP-BYs from the hash tables stored in cache memory.
  - 35. A system as recited in claim 34, wherein the means for producing hash tables includes means for one of (i) employing a multidimensional array, and (ii) concatenating a predetermined number of bits for each attribute, based on how sparse is a distribution of data for the attributes.
  - 36. A system as recited in claim 33, further comprising:
    - means for partitioning based on an attribute; and
      
      means for computing multiple GROUP-BYs which include the attribute upon which the means for partitioning was based.
  - 37. A system as recited in claim 36, wherein the means for partitioning operates on a portion of the dataset.
  - 38. A system as recited in claim 24, further comprising:
    - means for estimating a size of a GROUP-BY; and
      
      means for selecting, based on the estimated size, whether the means for computing the GROUP-BY is to operate using a sort-based method or a hash-based method.
  - 39. A system as recited in claim 24, further comprising:
    - means for estimating a size of a GROUP-BY; and
      
      means for selecting, based on the estimated size, whether the means for computing the GROUP-BY is to operate using a hash-based method including producing hash tables by (i) employing a multidimensional array, and (ii) concatenating a predetermined number of bits for each attribute.
  - 40. A system as recited in claim 24, further comprising:
    - means for identifying respective levels within the operator generated by the means for generating an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      means for determining, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have more than one greater number of attributes.
  - 41. A system as recited in claim 24, wherein:
    - the system is operative responsive to a received query which requests a subset of all possible GROUP-BY operations; and
      
      the means for generating an operator includes means for generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset.
  - 42. A system as recited in claim 41, wherein the means for generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes:
    - means for generating an intermediate GROUP-BY operation not included in the requested subset; and
      
      means for generating one of the requested GROUP-BY operations from the intermediate GROUP-BY operation.
  - 43. A system as recited in claim 41, wherein the means for generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes generating a minimal cost Steiner tree.
  - 44. A system as recited in claim 41, wherein the means for generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a sort-based method.
  - 45. A system as recited in claim 41, wherein the means for generating GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a hash-based method.
  - 46. A system as recited in claim 24, wherein:
    - one of the attributes of the dataset has a hierarchy of levels defined on it;
      
      the system is operative responsive to a received query which requests GROUP-BY operations in which different designated levels of the hierarchy for the attribute are requested; and
      
      the means for generating an operator includes means for using an optimization technique of identifying, for a given one of the GROUP-BY operations including a higher level of the hierarchy for the attribute having a hierarchy of levels, a parent one of the GROUP-BY operations which includes a lower level of the hierarchy for the attribute having a hierarchy of levels, from which the given GROUP-BY operation may be computed.
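
Claims 12 and 35 describe two ways of building hash tables, chosen according to how sparse the data is: a multidimensional array, or a key formed by concatenating a fixed number of bits per attribute. The sketch below illustrates both under stated assumptions: attribute values are already encoded as integers in `[0, cardinality)`, the aggregate is SUM, and the density threshold in `choose_method` is an arbitrary illustrative value. All function names are hypothetical.

```python
from math import prod

def bit_widths(cardinalities):
    """Bits needed to store a code in [0, c) for each attribute (at least 1)."""
    return [max(1, (c - 1).bit_length()) for c in cardinalities]

def pack_key(codes, widths):
    """Concatenate the per-attribute codes into one integer hash key."""
    key = 0
    for code, width in zip(codes, widths):
        key = (key << width) | code
    return key

def group_by_array(coded_rows, cardinalities):
    """Dense case: a flat list stands in for a multidimensional array."""
    cells = [0] * prod(cardinalities)
    for codes, value in coded_rows:
        idx = 0
        for code, c in zip(codes, cardinalities):
            idx = idx * c + code          # row-major cell position
        cells[idx] += value
    return cells

def group_by_packed(coded_rows, cardinalities):
    """Sparse case: hash table keyed by the bit-concatenated codes."""
    widths = bit_widths(cardinalities)
    table = {}
    for codes, value in coded_rows:
        key = pack_key(codes, widths)
        table[key] = table.get(key, 0) + value
    return table

def choose_method(num_rows, cardinalities, density_threshold=0.5):
    """Pick the array when rows fill enough of the cell space (assumed rule)."""
    density = num_rows / prod(cardinalities)
    return "array" if density >= density_threshold else "packed"

# Hypothetical coded rows: ((attribute codes), measure value).
coded = [((0, 1), 10), ((1, 2), 5), ((0, 1), 7)]
dense = group_by_array(coded, [2, 3])
sparse = group_by_packed(coded, [2, 3])
```

The array costs memory proportional to the product of the cardinalities regardless of how many cells are occupied, which is why a sparse distribution favors the packed-key table in this sketch.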

47. A computer program product, for use with a database and processing system, for directing the system to perform GROUP-BY operations on a dataset of attributes, the computer program product comprising:
- a computer readable medium;
  
  means, provided on the computer-readable medium, for directing the database and processing system to generate an operator which represents (i) possible GROUP-BY operations on the dataset of attributes and (ii) hierarchies of the possible GROUP-BY operations, using a predetermined optimization technique;
  
  means, provided on the computer-readable medium, for directing the database and processing system to compute a first GROUP-BY operation on a first subset of the dataset of attributes using the operator; and
  
  means, provided on the computer-readable medium, for directing the database and processing system to compute a second GROUP-BY operation on a second subset of the dataset of attributes using the operator, the second subset being a proper subset of the first subset, based on the results of the means for directing to compute a first GROUP-BY operation on the first subset of the attributes.
- - 48. A computer program product as recited in claim 47, wherein the means for directing to generate an operator includes means, provided on the computer-readable medium, for directing the database and processing system to generate a data cube operator.
  - 49. A computer program product as recited in claim 47, wherein:
    - the computer program product further includes means, provided on the computer-readable medium, for directing the database and processing system to use the operator to identify, for a given one of possible GROUP-BY operations for the attributes of the relation, parent ones of the possible GROUP-BY operations from which the given GROUP-BY operation may be computed; and
      
      the means for directing to compute a second GROUP-BY operation includes means, provided on the computer-readable medium, for directing the database and processing system to compute the second GROUP-BY operation based on the results of one of the parent ones of the possible GROUP-BY operations, identified for the second GROUP-BY operation by the means for directing to identify, which is smallest of the parent ones of the possible GROUP-BY operations.
  - 50. A computer program product as recited in claim 47, wherein the means for directing to generate an operator includes means, provided on the computer-readable medium, for directing the database and processing system to use an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in cache memory; and
      
      computing another GROUP-BY based on the results of the first GROUP-BY operation stored in cache memory by the step of storing.
  - 51. A computer program product as recited in claim 47, further comprising means, provided on the computer-readable medium, for directing the database and processing system to use an optimization technique of:
    - computing a first GROUP-BY to produce results thereof;
      
      storing the results of the first GROUP-BY in disk storage;
      
      scanning the results of the first GROUP-BY in disk storage; and
      
      computing a plurality of other GROUP-BYs concurrently based on the scanned results of the first GROUP-BY.
  - 52. A computer program product as recited in claim 47, wherein the means for directing to compute first and second GROUP-BYs include means, provided on the computer-readable medium, for directing the database and processing system to use a sort-based method.
  - 53. A computer program product as recited in claim 52, further comprising the means, provided on the computer-readable medium, for directing the database and processing system to compute multiple GROUP-BYs using a single sort.
  - 54. A computer program product as recited in claim 53, wherein the means for directing to compute multiple GROUP-BYs using a single sort includes means, provided on the computer-readable medium, for directing the database and processing system to compute the GROUP-BYs in a pipelined fashion.
  - 55. A computer program product as recited in claim 53, wherein the means for directing to compute multiple GROUP-BYs using a single sort includes:
    - means, provided on the computer-readable medium, for directing the database and processing system to identify respective levels within the operator generated by the means for directing to generate an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to determine, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have one greater number of attributes, the means for directing to determine being made based on a reduction to weighted bipartite matching.
  - 56. A computer program product as recited in claim 47, wherein the means for directing to compute first and second GROUP-BYs include means, provided on the computer-readable medium, for directing the database and processing system to use a hash-based method.
  - 57. A computer program product as recited in claim 56, further comprising:
    - means, provided on the computer-readable medium, for directing the database and processing system to produce hash tables for multiple GROUP-BYs; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to store the hash tables in cache memory; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to compute successive GROUP-BYs from the hash tables stored in cache memory.
  - 58. A computer program product as recited in claim 57, wherein the means for directing to produce hash tables includes means, provided on the computer-readable medium, for directing the database and processing system to perform one of (i) employ a multidimensional array, and (ii) concatenate a predetermined number of bits for each attribute, based on how sparse is a distribution of data for the attributes.
  - 59. A computer program product as recited in claim 56, further comprising:
    - means, provided on the computer-readable medium, for directing the database and processing system to partition based on an attribute; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to compute multiple GROUP-BYs which include the attribute upon which the step of partitioning was based.
  - 60. A computer program product as recited in claim 59, wherein the means for directing to partition operates on a portion of the dataset.
  - 61. A computer program product as recited in claim 47, further comprising:
    - means, provided on the computer-readable medium, for directing the database and processing system to estimate a size of a GROUP-BY; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to select, based on the estimated size, whether the means for directing to compute the GROUP-BY is operable using a sort-based method or a hash-based method.
  - 62. A computer program product as recited in claim 47, further comprising:
    - means, provided on the computer-readable medium, for directing the database and processing system to estimate a size of a GROUP-BY; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to select, based on the estimated size, whether the means for directing to compute the GROUP-BY is operable using a hash-based method including producing hash tables by (i) employing a multidimensional array, and (ii) concatenating a predetermined number of bits for each attribute.
  - 63. A computer program product as recited in claim 47, further comprising:
    - means, provided on the computer-readable medium, for directing the database and processing system to identify respective levels within the operator generated by the means for directing to generate an operator, the levels having a hierarchy based on the number of attributes in each GROUP-BY in the levels; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to determine, for each level having a successively smaller number of attributes per GROUP-BY, a lowest-cost way of computing the GROUP-BYs of that level, based on results of computations of GROUP-BYs for a previous level whose GROUP-BYs have more than one greater number of attributes.
  - 64. A computer program product as recited in claim 47, wherein:
    - the computer program product is operable responsive to a received query which requests a subset of all possible GROUP-BY operations; and
      
      the means for directing to generate an operator includes means, provided on the computer-readable medium, for directing the database and processing system to generate GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset.
  - 65. A computer program product as recited in claim 64, wherein the means for directing to generate GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes:
    - means, provided on the computer-readable medium, for directing the database and processing system to generate an intermediate GROUP-BY operation not included in the requested subset; and
      
      means, provided on the computer-readable medium, for directing the database and processing system to generate one of the requested GROUP-BY operations from the intermediate GROUP-BY operation.
  - 66. A computer program product as recited in claim 64, wherein the means for directing to generate GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset includes generating a minimal cost Steiner tree.
  - 67. A computer program product as recited in claim 64, wherein the means for directing to generate GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a sort-based method.
  - 68. A computer program product as recited in claim 64, wherein the means for directing to generate GROUP-BY operations based on whether the GROUP-BY operations are within the requested subset is performed using a hash-based method.
  - 69. A computer program product as recited in claim 47, wherein:
    - one of the attributes of the dataset has a hierarchy of levels defined on it;
      
      the computer program product is operative responsive to a received query which requests GROUP-BY operations in which different designated levels of the hierarchy for the attribute are requested; and
      
      the means for directing to generate an operator includes means, provided on the computer-readable medium, for directing the database and processing system to use an optimization technique of identifying, for a given one of the GROUP-BY operations including a higher level of the hierarchy for the attribute having a hierarchy of levels, a parent one of the GROUP-BY operations which includes a lower level of the hierarchy for the attribute having a hierarchy of levels, from which the given GROUP-BY operation may be computed.
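
Claims 23, 46, and 69 cover attributes that carry a hierarchy of levels, where a GROUP-BY at a higher level can be computed from a parent GROUP-BY at a lower level. A minimal sketch of that roll-up follows, assuming SUM as the aggregate and a month-to-year hierarchy encoded in strings such as `"1995-01"`; the function `roll_up` and the sample data are hypothetical.

```python
def roll_up(parent, position, to_higher):
    """Derive a GROUP-BY at a coarser hierarchy level from a finer one.

    parent    -- dict mapping group-key tuples to SUM totals
    position  -- index of the hierarchical attribute within each key
    to_higher -- function mapping a lower-level value to its higher level
    """
    child = {}
    for key, total in parent.items():
        new_key = key[:position] + (to_higher(key[position]),) + key[position + 1:]
        child[new_key] = child.get(new_key, 0) + total
    return child

# Hypothetical GROUP-BY on (product, month), already computed.
by_month = {
    ("A", "1995-01"): 10,
    ("A", "1995-02"): 5,
    ("A", "1996-01"): 3,
    ("B", "1995-01"): 7,
}
# GROUP-BY on (product, year) derived from the month-level result.
by_year = roll_up(by_month, 1, lambda month: month[:4])
```

The month-level result is usually far smaller than the base relation, so deriving the year-level GROUP-BY from it avoids another pass over the raw tuples.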

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Sarawagi, Sunita; Agrawal, Rakesh; Gupta, Ashish
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Coby, Frantz
Application Number: US08/624,283
Time in Patent Office: 949 Days
Field of Search: 385/602, 385/607, 382/49, 707/1, 707/2, 707/3, 707/4
US Class Current: 1/1
CPC Class Codes:
G06F 16/24556   Aggregation; Duplicate elim...
Y10S 707/99931   Database or file accessing
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99937   Sorting
