Hash-based database grouping system and method
First Claim
1. A method for performing a hash grouping operation on a relational database table, said table comprising rows and columns, said columns including at least group column and zero or more data columns, said method comprising the steps of:
- (1) reading a row of said relational database table, said row comprising a identifier from said at least one group column and zero or more data fields to said zero or more data columns;
(2) applying a hash function to said group identifier to generate a hashed group value, said hashed group value serving as an index into a memory-resident hash table, said hash table mapping hashed group values into corresponding memory-resident group table entries, each group table entry including group data fields that store for a single group aggregated raw data from said data fields, a group identifier and housekeeping data;
(3) determining from contents of said hash table whether a matching group table entry exists for a group defined by said group identifier;
(4) when said matching group table entry exists, aggregating contents of said data fields into corresponding group data fields of said group table entry and updating said housekeeping data;
(5) when said matching group table entry does not exist and available memory meets predefined memory availability criteria, allocating an additional group table entry, writing into said additional group table entry said group identifier and said data fields, and initializing said housekeeping data.(6) when said matching group table entry does not exist and available memory does not meet said predefined memory availability criteria, applying an overflow procedure to said row wherein contents of said row are used to update an overflow file.(7) repeating steps (1) through (6) until all rows of said table have been read;
(8) reporting said group table entries;
(9) when at least one of said rows were written to said overflow file, repeating steps (1) through (8), except in step (1) reading said overflow file rather than said relational database table.
4 Assignments
0 Petitions
Accused Products
Abstract
A structured query language (SQL) grouping and aggregation system and method that incorporates hash-based techniques, several overflow handling strategies and statistics-based process-selection criteria. The method can execute SQL group-by queries on distributed database tables or tables stored locally to the database management system (DBMS) processor executing the grouping method. Hash-based techniques allow groupings and aggregates to be generated on the fly through the use of partial aggregates maintained in primary memory. Where primary memory is limited, groups and aggregates are still generated for as many groups as can be maintained in primary memory, while various overflow procedures are provided for buffering ungrouped data and writing that data to an overflow disk file for later processing. In one overflow procedure, raw data from groups that cannot be aggregated in primary memory are buffered then written to the overflow disk file. In a second overflow procedure, ungroupable raw data is formatted the same as data being aggregated in the group table, buffered, and then written to the overflow file. In a third overflow procedure, ungroupable raw data is partially aggregated in an output buffer maintained in primary memory before being written to the overflow file maintained in secondary memory. Database table statistics maintained by a cataloger are consulted to determine whether hash-based grouping or conventional sort based grouping should be used to execute a group-by query. The system is adaptable to running a grouping query against a partitioned database on distributed processors.
-
Citations
8 Claims
-
1. A method for performing a hash grouping operation on a relational database table, said table comprising rows and columns, said columns including at least group column and zero or more data columns, said method comprising the steps of:
-
(1) reading a row of said relational database table, said row comprising a identifier from said at least one group column and zero or more data fields to said zero or more data columns; (2) applying a hash function to said group identifier to generate a hashed group value, said hashed group value serving as an index into a memory-resident hash table, said hash table mapping hashed group values into corresponding memory-resident group table entries, each group table entry including group data fields that store for a single group aggregated raw data from said data fields, a group identifier and housekeeping data; (3) determining from contents of said hash table whether a matching group table entry exists for a group defined by said group identifier; (4) when said matching group table entry exists, aggregating contents of said data fields into corresponding group data fields of said group table entry and updating said housekeeping data; (5) when said matching group table entry does not exist and available memory meets predefined memory availability criteria, allocating an additional group table entry, writing into said additional group table entry said group identifier and said data fields, and initializing said housekeeping data. (6) when said matching group table entry does not exist and available memory does not meet said predefined memory availability criteria, applying an overflow procedure to said row wherein contents of said row are used to update an overflow file. (7) repeating steps (1) through (6) until all rows of said table have been read; (8) reporting said group table entries; (9) when at least one of said rows were written to said overflow file, repeating steps (1) through (8), except in step (1) reading said overflow file rather than said relational database table. - View Dependent Claims (2, 3, 4)
-
-
5. A database management system (DBMS) including:
-
(a) a secondary memory including a relational database table comprising rows and columns, said columns including at least one group column and zero or more data columns, an overflow file and compiled database routines, including a grouping function and a hash function; (b) a primary memory .including a group table and a hash table, said group table comprising a plurality of group table entries each storing for a single group aggregated member data, a group name and housekeeping data, and said hash table comprising a plurality of hash table entries each storing the address of a corresponding group table entry, said address being retrievable from said hash table at an index generated by applying said hash function to said group name of said corresponding group entry; (c) a communications interface through which user database queries are relayed to said DBMS from a user workstation, and through which query results from said DBMS are made available to said user workstation; (d) a processor that controls interactions between said primary memory, said secondary memory and said communications interface, and executes said compiled database routines, including said grouping function in response to a group by query received from said communications interface;
wherein said grouping function comprises;(i) an input procedure that reads said database table row by row and outputs for each row read a member row comprising a group identifier from said at least one group column and zero or more data fields corresponding to said zero or more data columns; (ii) a matching procedure that calls said hash function, said hash function generating a hashed group value from said group identifier; and
determines from contents of a hash table entry indexed by said hashed group value whether a matching group table entry exists with said group name identical to said group identifier;(iii) an aggregation procedure that executes an aggregation function specified by said group by query, said aggregation procedure comprising a set of instructions for updating said matching group table entry with data from said member row; (iv) an initialization procedure that allocates a new group table entry from said primary memory, initializes said new group table entry based on data from said member row and sets a pointer to said new group table entry in said hash table whenever said matching group table entry does not exist and predetermined memory availability criteria indicate there is sufficient primary memory for a new group table entry; (v) an overflow procedure that generates an overflow row from said member row whenever said matching group table entry does not exist and said memory availability criteria indicate there is not sufficient primary memory for a new group table entry, said overflow procedure also updating said overflow file based on contents of said overflow row; and (vi) an overflow input procedure that reads data from said overflow file when all rows of said input table have been read and said overflow procedure has been executed, said overflow file reflecting data from all member rows not aggregated in said group table. - View Dependent Claims (6, 7, 8)
-
Specification