METHODS AND SYSTEMS TO OPERATE ON GROUP-BY SETS WITH HIGH CARDINALITY
Abstract
This disclosure describes methods, systems, computer-readable media, and apparatuses for efficiently calculating group-by statistics. A data set that includes multiple entries is accessed. The multiple entries are grouped into group-by subsets, which are formed on two or more group-by variables and which are subsets of the data set. Cardinality data is determined for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset. At least one summary of data in each of the group-by subsets is generated, wherein each of the summaries includes the cardinality data determined for the group-by subset. Objects for the group-by subsets are initialized such that the objects store the summaries. The objects may then be used to generate multiple statistical summaries of the data set.
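The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The class name `GroupObject`, the helpers `build_objects`, `per_group_summaries`, and `overall_summary`, the choice of running sum and sum of squares as the stored summary, and the sample rows are all assumptions made for illustration. The sketch makes one pass over the entries, keeps one object per combination of group-by values (holding the cardinality and the summary), and then derives several statistics from the objects without rereading the data.

```python
from dataclasses import dataclass


@dataclass
class GroupObject:
    """One object per group-by subset (hypothetical layout)."""
    key: tuple              # values of the group-by variables for this subset
    n: int = 0              # cardinality: number of entries in the subset
    total: float = 0.0      # running sum of the analysis variable
    total_sq: float = 0.0   # running sum of squares of the analysis variable

    def add(self, x: float) -> None:
        self.n += 1
        self.total += x
        self.total_sq += x * x


def build_objects(rows, group_vars, value_var):
    """Single pass: group entries on two or more variables, one object per subset."""
    objects = {}
    for row in rows:
        key = tuple(row[v] for v in group_vars)
        if key not in objects:
            objects[key] = GroupObject(key)
        objects[key].add(row[value_var])
    return objects


def per_group_summaries(objects):
    """Derive several statistics per subset from the stored objects only."""
    out = {}
    for key, o in objects.items():
        mean = o.total / o.n
        # Sample variance is undefined for a single entry, so report None.
        var = (o.total_sq - o.n * mean * mean) / (o.n - 1) if o.n > 1 else None
        out[key] = {"n": o.n, "sum": o.total, "mean": mean, "var": var}
    return out


def overall_summary(objects):
    """Combine the objects into a data-set-level summary."""
    n = sum(o.n for o in objects.values())
    total = sum(o.total for o in objects.values())
    return {"n": n, "sum": total, "mean": total / n}


# Illustrative data only; the variable names are assumptions.
rows = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "a", "sales": 14.0},
    {"region": "east", "product": "b", "sales": 6.0},
    {"region": "west", "product": "a", "sales": 8.0},
]

objects = build_objects(rows, ("region", "product"), "sales")
per_group = per_group_summaries(objects)
overall_stats = overall_summary(objects)
```

Because each object carries its own cardinality and sums, the per-group and overall statistics are computed from the objects alone. Memory use in this sketch grows with the number of distinct group-by combinations, not with the number of entries.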
26 Citations
36 Claims
1. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the storage medium having instructions stored thereon, and the instructions being operable to cause a data-processing apparatus to perform operations including:
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets are subsets of the data set;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A system comprising:
one or more processors configured to perform operations that include;
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets are subsets of the data set;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
25. A method comprising:
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets are subsets of the data set;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
Dependent claims: 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36
Specification