System and method for multiple distinct aggregate queries
First Claim
1. A data processor implemented method of executing multiple distinct aggregate type queries, comprising:
- providing at least one Counting Bloom Filter for each distinct column of an input data stream;
reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column;
using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing;
in a build phase, during execution of a group by operation, creating for each group of tuples a Counting Bloom Filter for each distinct column, hashing the values of each distinct column into their respective Counting Bloom Filters, and incrementing the Counting Bloom Filter bit settings; and
in a probe phase, once the group by operation is finished, reviewing the count values in the Counting Bloom Filter for each distinct column, and if the count value for a distinct column is greater than one, then sending the identified duplicate values to the distinct hash operator,wherein in the probe phase, reviewing the count values in the Counting Bloom Filter comprises;
for each tuple in a group, querying the values of the distinct columns in their respective Counting Bloom Filters;
when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator;
probing a value into a distinct hash table if the value is not bypassed, andwhen a match is found, discarding the value;
when a match is not found, turning the probing into an insertion;
after finishing the processing of the tuples in group, traversing the distinct hash tables and flowing the distinct values up to the aggregate operator, andwherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the method further comprises;
sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and
when the estimated number of incoming distinct values for the next group is different, then resizing the hash table in dependence upon the estimated number.
4 Assignments
0 Petitions
Accused Products
Abstract
There is disclosed a system and method for executing multiple distinct aggregate queries. In an embodiment, the method comprises: providing at least one Counting Bloom Filter for each distinct column of an input data stream; reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column; and if necessary, using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing. Also, the use of Counting Bloom Filters for monitoring data streams allow an early duplicate removal of the input stream of data, resulting in savings in computation time and memory resources.
60 Citations
12 Claims
-
1. A data processor implemented method of executing multiple distinct aggregate type queries, comprising:
-
providing at least one Counting Bloom Filter for each distinct column of an input data stream; reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column; using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing; in a build phase, during execution of a group by operation, creating for each group of tuples a Counting Bloom Filter for each distinct column, hashing the values of each distinct column into their respective Counting Bloom Filters, and incrementing the Counting Bloom Filter bit settings; and in a probe phase, once the group by operation is finished, reviewing the count values in the Counting Bloom Filter for each distinct column, and if the count value for a distinct column is greater than one, then sending the identified duplicate values to the distinct hash operator, wherein in the probe phase, reviewing the count values in the Counting Bloom Filter comprises; for each tuple in a group, querying the values of the distinct columns in their respective Counting Bloom Filters; when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator; probing a value into a distinct hash table if the value is not bypassed, and when a match is found, discarding the value; when a match is not found, turning the probing into an insertion; after finishing the processing of the tuples in group, traversing the distinct hash tables and flowing the distinct values up to the aggregate operator, and wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the method further comprises; sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and when the estimated number of incoming distinct values for the next group is different, then resizing the hash table in dependence upon the estimated number. - View Dependent Claims (2, 3, 4)
-
-
5. A system for executing multiple distinct aggregate type queries, comprising:
-
at least one Counting Bloom Filter for each distinct column of an input data stream; means for reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column; a distinct hash operator for removing duplicates, from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing, means for creating in a build phase, during execution of a group by operation, for each group of tuples, a Counting Bloom Filter for each distinct column; means for reviewing in a probe phase, once the group by operation is finished, the count values in the Counting Bloom Filter for each distinct column, and if the count value for a distinct column is greater than one, then sending the identified duplicate values to the distinct hash operator; wherein the system is adapted such that, in the build phase, when executing the group by operation, the values of each distinct column are hashed into their respective Counting Bloom Filters, and the Counting Bloom Filter bit settings are incremented; the system further comprising; means for querying the values of the distinct columns in their respective Counting Bloom Filters for each tuple in a group, and when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator; means for probing a value into a hash table if the value is not bypassed, and when a match is found, discarding the value; when a match is not found, turning the probing into an insertion; means for traversing the hash tables after finishing the processing of the tuples in group, and flowing the distinct values up to the aggregate operator, wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the system further comprises; means for sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and means for resizing the hash table in dependence upon the estimated number when the estimated number of incoming distinct values for the next group is different. - View Dependent Claims (6, 7, 8)
-
-
9. A data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method of executing multiple distinct aggregate type queries, the data processor readable medium comprising:
-
code for providing at least one Counting Bloom Filter for each distinct column of an input data stream; code for reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column; code for using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing; code for creating, in a build phase during execution of a group by operation, for each group of tuples, a Counting Bloom Filter for each distinct column; code for reviewing, in a probe phase, once the group by operation is finished, the count values in the Counting Bloom Filter for each distinct column and sending the values to an aggregation operator after discarding duplicate values; code for hashing in the build phase, when executing the group by operation, the values of each distinct column into their respective Counting Bloom Filters, and incrementing the Counting Bloom Filter bit settings; code for reviewing, in the probe phase, the count values in the Counting Bloom Filter; code for querying the values of the distinct columns in their respective Counting Bloom Filters for each tuple in a group, and when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator; code for probing a value into a distinct hash table if the value is not bypassed, and when a match is found, discarding the value; when a match is not found, turning the probing into an insertion; code for traversing the distinct hash tables after finishing the processing of the tuples in group, and flowing the distinct values up to the aggregate operator, wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the data processor readable medium further comprises; code for sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and code for resizing the hash table in dependence upon the estimated number when the estimated number of incoming distinct values for the next group is different. - View Dependent Claims (10, 11, 12)
-
Specification