System and method for multiple distinct aggregate queries

US 8,005,868 B2
Filed: 03/07/2008
Issued: 08/23/2011
Est. Priority Date: 03/07/2008
Status: Active Grant

Abstract

There is disclosed a system and method for executing multiple distinct aggregate queries. In an embodiment, the method comprises: providing at least one Counting Bloom Filter for each distinct column of an input data stream; reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column; and, if necessary, using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing. Also, the use of Counting Bloom Filters for monitoring data streams allows early removal of duplicates from the input data stream, resulting in savings in computation time and memory resources.
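As an illustration of the data structure the abstract relies on, the following is a minimal sketch of a Counting Bloom Filter, not the patented implementation. The class name, the default sizes, and the use of salted `hashlib.blake2b` digests as hash functions are assumptions made for this example. The property that matters is that the lowest counter for a value is never smaller than the number of times that value was added, so a lowest counter of one means the value occurred exactly once.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant whose slots hold counters instead of single bits."""

    def __init__(self, num_counters=1024, num_hashes=3):
        # Sizes are illustrative defaults, not values taken from the patent.
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, value):
        # Hypothetical hashing scheme: one salted digest per hash function.
        # A set is returned so that a slot is touched at most once per value.
        positions = set()
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(repr(value).encode(), salt=bytes([seed])).digest()
            positions.add(int.from_bytes(digest[:8], "little") % self.num_counters)
        return positions

    def add(self, value):
        """Build phase: increment every counter the value hashes to."""
        for pos in self._positions(value):
            self.counters[pos] += 1

    def min_count(self, value):
        """Probe phase: the lowest counter is an upper bound on the true count."""
        return min(self.counters[pos] for pos in self._positions(value))


cbf = CountingBloomFilter()
for v in ["a", "a", "b"]:
    cbf.add(v)
# "a" was added twice, so its lowest counter is at least 2;
# "b" was added once, so its lowest counter is at least 1.
```

Hash collisions can only raise a counter, so the filter may report a unique value as a possible duplicate, but it never reports a duplicated value as unique.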

12 Claims

1. A data processor implemented method of executing multiple distinct aggregate type queries, comprising:
- providing at least one Counting Bloom Filter for each distinct column of an input data stream;
  
  reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column;
  
  using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing;
  
  in a build phase, during execution of a group by operation, creating for each group of tuples a Counting Bloom Filter for each distinct column, hashing the values of each distinct column into their respective Counting Bloom Filters, and incrementing the Counting Bloom Filter bit settings; and
  
  in a probe phase, once the group by operation is finished, reviewing the count values in the Counting Bloom Filter for each distinct column, and if the count value for a distinct column is greater than one, then sending the identified duplicate values to the distinct hash operator, wherein in the probe phase, reviewing the count values in the Counting Bloom Filter comprises:
  
  for each tuple in a group, querying the values of the distinct columns in their respective Counting Bloom Filters;
  
  when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator;
  
  probing a value into a distinct hash table if the value is not bypassed, and when a match is found, discarding the value;
  
  when a match is not found, turning the probing into an insertion;
  
  after finishing the processing of the tuples in group, traversing the distinct hash tables and flowing the distinct values up to the aggregate operator, and wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the method further comprises:
  
  sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and
  
  when the estimated number of incoming distinct values for the next group is different, then resizing the hash table in dependence upon the estimated number.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein the estimated number is calculated by counting the number of values in the respective Counter Bloom Filters that are greater than one.
  - 3. The method of claim 1, further comprising providing a stop condition in the query execution plan in order to pass through the input data stream and perform a build phase.
  - 4. The method of claim 1, further comprising:
    - in the absence of a group by operation having a stop condition, obtaining a cost decision based on estimations made by an optimizer, including avoiding the build phase such that all values are processed by the distinct hash operator; and
      
      introducing an additional stop condition in the query execution plan such that all of the input stream is passed over during the build phase.
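The build and probe phases recited in claim 1 can be sketched for a single group as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: tuples are modelled as Python dicts, each per-column hash table is a Python `set`, and the helper names `slots` and `distinct_values_for_group` as well as the salted `hashlib.blake2b` hashing are hypothetical.

```python
import hashlib


def slots(value, num_counters, num_hashes=3):
    """Hypothetical helper: the distinct counter slots a value hashes to."""
    out = set()
    for seed in range(num_hashes):
        digest = hashlib.blake2b(repr(value).encode(), salt=bytes([seed])).digest()
        out.add(int.from_bytes(digest[:8], "little") % num_counters)
    return out


def distinct_values_for_group(rows, distinct_columns, num_counters=1024):
    """Return {column: list of distinct values} for one group of tuples."""
    # Build phase: one Counting Bloom Filter (a list of counters) per distinct column.
    filters = {col: [0] * num_counters for col in distinct_columns}
    for row in rows:
        for col in distinct_columns:
            for s in slots(row[col], num_counters):
                filters[col][s] += 1

    # Probe phase: bypass values that are provably unique, deduplicate the rest.
    bypassed = {col: [] for col in distinct_columns}
    hash_tables = {col: set() for col in distinct_columns}
    for row in rows:
        for col in distinct_columns:
            value = row[col]
            lowest = min(filters[col][s] for s in slots(value, num_counters))
            if lowest == 1:
                bypassed[col].append(value)      # unique: goes straight to the aggregate
            elif value not in hash_tables[col]:
                hash_tables[col].add(value)      # probe miss turns into an insertion
            # probe hit: the value is a duplicate and is discarded

    # After the group is processed, the hash table contents flow up to the aggregate.
    return {col: bypassed[col] + list(hash_tables[col]) for col in distinct_columns}


group = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "y"},
    {"a": 2, "b": "y"},
    {"a": 3, "b": "z"},
]
result = distinct_values_for_group(group, ["a", "b"])
count_distinct_a = len(result["a"])  # COUNT(DISTINCT a) for this group
sum_distinct_a = sum(result["a"])    # SUM(DISTINCT a) for this group
```

Both distinct columns are handled in one pass over the group, which is the point of the approach: the input stream is not replicated once per distinct column, and only values that might be duplicates reach a hash table.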

5. A system for executing multiple distinct aggregate type queries, comprising:
- at least one Counting Bloom Filter for each distinct column of an input data stream;
  
  means for reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column;
  
  a distinct hash operator for removing duplicates, from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing,
  
  means for creating in a build phase, during execution of a group by operation, for each group of tuples, a Counting Bloom Filter for each distinct column;
  
  means for reviewing in a probe phase, once the group by operation is finished, the count values in the Counting Bloom Filter for each distinct column, and if the count value for a distinct column is greater than one, then sending the identified duplicate values to the distinct hash operator;
  
  wherein the system is adapted such that, in the build phase, when executing the group by operation, the values of each distinct column are hashed into their respective Counting Bloom Filters, and the Counting Bloom Filter bit settings are incremented;
  
  the system further comprising:
  
  means for querying the values of the distinct columns in their respective Counting Bloom Filters for each tuple in a group, and when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator;
  
  means for probing a value into a hash table if the value is not bypassed, and when a match is found, discarding the value;
  
  when a match is not found, turning the probing into an insertion;
  
  means for traversing the hash tables after finishing the processing of the tuples in group, and flowing the distinct values up to the aggregate operator, wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the system further comprises:
  
  means for sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and
  
  means for resizing the hash table in dependence upon the estimated number when the estimated number of incoming distinct values for the next group is different.
- Dependent claims (6, 7, 8):
  - 6. The system of claim 5, wherein the system is adapted to calculate the estimate number by counting the number of values in the respective Counter Bloom Filters that are greater than one.
  - 7. The system of claim 5, further comprising providing a stop condition in the query execution plan in order to pass through the input data stream and perform a build phase.
  - 8. The system of claim 5, further comprising:
    - means for obtaining a cost decision based on estimations made by an optimizer in the absence of a group by operation having a stop condition, including avoiding the build phase such that all values are processed by the distinct hash operator; and
      
      means for introducing an additional stop condition in the query execution plan such that all of the input stream is passed over during the build phase.
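The sizing estimate described in claims 2 and 6 (counting the filter counters that are greater than one) and the per-group resizing of the hash table can be sketched as below. The function names, the power-of-two capacity rule, and the 0.75 load factor are assumptions chosen for illustration; the claims do not specify how a capacity is derived from the estimate.

```python
def estimate_non_bypassed(counters):
    """Estimate per claims 2 and 6: how many counters are greater than one."""
    return sum(1 for c in counters if c > 1)


def table_capacity(estimate, load_factor=0.75, minimum=8):
    """Hypothetical sizing rule: smallest power-of-two capacity (at least
    `minimum`) that holds `estimate` entries at the given load factor."""
    capacity = minimum
    while capacity * load_factor < estimate:
        capacity *= 2
    return capacity


# Resize only when the estimate for the next group calls for a different size.
capacity = None
resize_log = []
for counters in ([0, 1, 2, 3, 1, 0, 5], [1, 2, 0, 2], [2] * 40):
    needed = table_capacity(estimate_non_bypassed(counters))
    if needed != capacity:
        capacity = needed
        resize_log.append(capacity)
```

When a filter uses several hash functions, one duplicated value raises several counters, so this count is a rough estimate rather than an exact number of values that will reach the hash table.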

9. A data processor readable medium storing data processor code that when loaded into a data processing device adapts the device to perform a method of executing multiple distinct aggregate type queries, the data processor readable medium comprising:
- code for providing at least one Counting Bloom Filter for each distinct column of an input data stream;
  
  code for reviewing count values in the at least one Counting Bloom Filter for the existence of duplicates in each distinct column;
  
  code for using a distinct hash operator to remove duplicates from each distinct column of the input data stream, thereby removing the need for replicating the input data stream and minimizing distinct hash operator processing;
  
  code for creating, in a build phase during execution of a group by operation, for each group of tuples, a Counting Bloom Filter for each distinct column;
  
  code for reviewing, in a probe phase, once the group by operation is finished, the count values in the Counting Bloom Filter for each distinct column and sending the values to an aggregation operator after discarding duplicate values;
  
  code for hashing in the build phase, when executing the group by operation, the values of each distinct column into their respective Counting Bloom Filters, and incrementing the Counting Bloom Filter bit settings;
  
  code for reviewing, in the probe phase, the count values in the Counting Bloom Filter;
  
  code for querying the values of the distinct columns in their respective Counting Bloom Filters for each tuple in a group, and when the lowest of the counters for a given value is one, then determining that the value is unique, and passing the value to the aggregate operator and bypassing the distinct hash operator;
  
  code for probing a value into a distinct hash table if the value is not bypassed, and when a match is found, discarding the value;
  
  when a match is not found, turning the probing into an insertion;
  
  code for traversing the distinct hash tables after finishing the processing of the tuples in group, and flowing the distinct values up to the aggregate operator, wherein the distinct hash operator comprises a set of hash tables, one per distinct column, and the data processor readable medium further comprises:
  
  code for sizing the hash table depending on an estimated number of distinct values per group, per distinct column, that are not bypassed to the aggregate operator; and
  
  code for resizing the hash table in dependence upon the estimated number when the estimated number of incoming distinct values for the next group is different.
- Dependent claims (10, 11, 12):
  - 10. The data processor readable medium of claim 9, further comprising code for calculating the estimated number by counting the number of values in the respective Counter Bloom Filters that are greater than one.
  - 11. The data processor readable medium of claim 9, further comprising code for providing a stop condition in the query execution plan in order to pass through the input data stream and perform a build phase.
  - 12. The data processor readable medium of claim 9, further comprising:
    - code for obtaining, in the absence of a group by operation having a stop condition, a cost decision based on estimations made by an optimizer, including avoiding the build phase such that all values are processed by the distinct hash operator; and
      
      code for introducing an additional stop condition in the query execution plan such that all of the input stream is passed over during the build phase.

Current Assignee: Domo, Inc.
Original Assignee: International Business Machines Corporation
Inventors: Sharpe, David C.; Kandil, Mokhtar; Saborit, Josep Aguilar; Rielau, Serge Philippe; Flasza, Miroslaw Adam; Zuzarte, Calisto Paul
Primary Examiner(s): Ali, Mohammad
Assistant Examiner(s): Tran, Bao G
Application Number: US12/044,348
Publication Number: US 20090228433A1
Time in Patent Office: 1,264 Days
Field of Search: 707/999.002, 707/999.003, 707/999.005, 707/796, 707/706, 709/226, 711/216, 715/213
US Class Current: 707/796
CPC Class Codes: G06F 16/24556 Aggregation; Duplicate elim...
