Parallel processing of count distinct values
Abstract
A system and method for efficiently determining the number of distinct values in a column of source data is disclosed. Source data (e.g., source table) may be in the form of rows and columns that represent information. From the source table a count distinct function may be carried out to determine the number of distinct values in one or more columns of the source table. Results from an in memory count distinct function performed by a plurality of parallel query processors may be placed into a results grid. Another aspect of the invention relates to determining how many distinct values fall into each cell of the results grid.
18 Claims
1. A method for performing a count distinct function on values in at least one column of data comprising:
a) splitting the data into chunks based on the values in the at least one column of data upon which the count distinct function is to be performed, where no value appears in more than one chunk;
b) determining if each chunk is of a size that enables it to fit into available memory, and i) if not, recursively splitting the oversized chunks until each chunk is of a size that enables it to fit into available memory; and
c) performing an in memory count distinct function on each chunk and summing a number of distinct values from each chunk for display in at least one cell of a results grid.
Dependent claims: 2, 3, 4, 5.
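Steps a) through c) of claim 1 can be illustrated with a short sketch. This is not the patented implementation, only a minimal Python reading of the claim under stated assumptions: "fits into available memory" is approximated by a row limit (`max_chunk_rows`), chunks are formed by hashing each value, and the names `split_by_value`, `count_distinct`, `fanout` and `max_depth` are hypothetical. The depth guard is an addition of the sketch so that a chunk holding many copies of a single value cannot recurse forever.

```python
from typing import Hashable, Iterable, List


def split_by_value(values: List[Hashable], fanout: int, level: int) -> List[List[Hashable]]:
    """Step a): partition so that equal values always land in the same chunk."""
    chunks: List[List[Hashable]] = [[] for _ in range(fanout)]
    for v in values:
        # Mixing the recursion level into the hash makes each re-split
        # use a different partitioning than the one before it.
        chunks[hash((level, v)) % fanout].append(v)
    return chunks


def count_distinct(values: Iterable[Hashable], max_chunk_rows: int,
                   fanout: int = 4, level: int = 0, max_depth: int = 16) -> int:
    """Count distinct values by splitting, re-splitting, then summing."""
    values = list(values)
    # Step c): a chunk that "fits" (assumed: by row count) is counted in memory.
    # The max_depth guard is an assumption of this sketch, not part of the claim.
    if len(values) <= max_chunk_rows or level >= max_depth:
        return len(set(values))
    # Step b): the chunk is oversized, so split it again and recurse.
    total = 0
    for chunk in split_by_value(values, fanout, level):
        total += count_distinct(chunk, max_chunk_rows, fanout, level + 1, max_depth)
    # Chunks share no values, so their distinct counts can simply be added.
    return total


if __name__ == "__main__":
    print(count_distinct([1, 2, 2, 3, 3, 3, 4] * 10, max_chunk_rows=5))
```

Because no value appears in more than one chunk, the per-chunk counts are independent of one another, which is what allows them to be computed separately and then summed.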

6. A method for performing a count distinct function on values in at least one column of data from source data having at least one or more rows and one or more columns, comprising:
a) assigning a row of the source data to a cell in a grid;
b) creating a hash table based on a value in a column of the row and the cell assigned to the row;
c) splitting the hash table of cell-value pairs into chunks based on the values, where no value appears in more than one chunk;
d) determining if each chunk is of a size that enables it to fit into available memory, and i) if not, recursively splitting the oversized chunks until each chunk is of a size that enables it to fit into available memory; and
e) performing an in memory count distinct function on each chunk and summing a number of distinct values from each chunk for display in at least one cell of a results grid.
Dependent claims: 7, 8, 9.
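The grid variant in claim 6 can be sketched the same way. The example below is an illustration under assumptions, not the patent's implementation: a Python set of `(cell, value)` pairs stands in for the hash table, the memory limit is simulated by a pair count (`max_chunk_pairs`), and `count_distinct_per_cell`, `cell_of`, `value_of`, `fanout` and `MAX_LEVEL` are hypothetical names. Since the sketch holds all pairs in one set, it shows only the splitting and summing logic, not the memory savings.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

MAX_LEVEL = 16  # assumed guard against endless re-splitting; not part of the claim

Pair = Tuple[Hashable, Hashable]


def count_distinct_per_cell(rows: Iterable[object],
                            cell_of: Callable[[object], Hashable],
                            value_of: Callable[[object], Hashable],
                            max_chunk_pairs: int,
                            fanout: int = 4) -> Dict[Hashable, int]:
    """Return, for each grid cell, the number of distinct values assigned to it."""
    # Assign each row to a cell and record the (cell, value) pair once.
    pairs: Set[Pair] = {(cell_of(row), value_of(row)) for row in rows}
    totals: Counter = Counter()
    pending: List[Tuple[Set[Pair], int]] = [(pairs, 0)]
    while pending:
        chunk, level = pending.pop()
        if len(chunk) <= max_chunk_pairs or level >= MAX_LEVEL:
            # The chunk fits: every pair in it is unique, so each pair adds
            # one distinct value to its cell's running total.
            totals.update(cell for cell, _ in chunk)
            continue
        # The chunk is oversized: split it on the value only, so all pairs
        # sharing a value stay together in one sub-chunk.
        parts: List[Set[Pair]] = [set() for _ in range(fanout)]
        for cell, value in chunk:
            parts[hash((level, value)) % fanout].add((cell, value))
        pending.extend((part, level + 1) for part in parts if part)
    return dict(totals)


if __name__ == "__main__":
    sample = [("east", "a"), ("east", "a"), ("east", "b"),
              ("west", "a"), ("west", "c"), ("west", "c")]
    print(count_distinct_per_cell(sample, lambda r: r[0], lambda r: r[1], 2))
```

Here the cell is taken from the first field of each row and the counted value from the second; any other assignment of rows to cells could be passed in through `cell_of`.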

10. A relational database system having data storage and one or more processors for performing a count distinct function on values in at least one column of data comprising:
a) means for splitting the data into chunks based on the values in the column(s) of data upon which the count distinct function is to be performed so that no value appears in more than one chunk;
b) means for determining if each chunk is of a size that enables it to fit into available memory, and i) if not, recursively splitting the chunks until each chunk is of a size that enables it to fit into available memory; and
c) means for performing an in memory count distinct function on each chunk and summing a number of distinct values from each chunk for display in at least one cell of a results grid.
Dependent claims: 11, 12, 13, 14.

15. A relational database system having data storage and one or more processors for performing a count distinct function on values in at least one column of data from source data having at least one or more rows and one or more columns, comprising:
a) means for assigning a row of the source data to a cell in a grid;
b) means for creating a hash table based on a value in a column of the row and the cell assigned to the row;
c) means for splitting the hash table of cell-value pairs into chunks based on the values, where no value appears in more than one chunk;
d) means for determining if each chunk is of a size that enables it to fit into available memory, and i) if not, a means for recursively splitting the oversized chunks until each chunk is of a size that enables it to fit into available memory; and
e) means for performing an in memory count distinct function on each chunk and summing a number of distinct values from each chunk for display in at least one cell of a results grid.
Dependent claims: 16, 17, 18.
Specification