Methods and systems to operate on group-by sets with high cardinality
First Claim
1. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the storage medium having instructions stored thereon, and the instructions being operable to cause a data-processing apparatus to perform operations including:
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets include multiple disjoint subsets of the data set, multiple intersecting subsets of the data set, or multiple subsets of the data set which are formed on different combinations of group-by variables;
displaying an interface that facilitates defining a subset of the data set by referencing one or more of the group-by subsets;
receiving an input at the interface, the input defining a subset of the data set by referencing at least one of the group-by subsets;
generating a statistical summary of the defined subset;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
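The operations recited above can be illustrated with a minimal sketch. The claim does not name a data layout or specific statistics, so everything here is an assumption made for illustration: the `GroupObject` class, the `build_objects` and `summarize` helpers, the choice of a running sum as the per-group summary, and the sample `region`/`product`/`sales` data are all hypothetical. The sketch forms group-by subsets on two variables in one pass, stores the cardinality and a summary in one object per subset, and then produces several statistical summaries from the objects alone.

```python
from dataclasses import dataclass


@dataclass
class GroupObject:
    """Hypothetical per-subset object (not defined by the source)."""
    key: tuple          # values of the group-by variables forming the subset
    count: int = 0      # cardinality: number of entries in the subset
    total: float = 0.0  # assumed summary: running sum of the analysis variable


def build_objects(entries, group_vars, value_var):
    """Single pass over the data set: one object per group-by subset."""
    objects = {}
    for entry in entries:
        key = tuple(entry[v] for v in group_vars)
        obj = objects.setdefault(key, GroupObject(key))
        obj.count += 1
        obj.total += entry[value_var]
    return objects


def summarize(objects, selector=lambda key: True):
    """Combine objects whose key passes `selector`; raw entries are not re-read."""
    chosen = [o for o in objects.values() if selector(o.key)]
    count = sum(o.count for o in chosen)
    total = sum(o.total for o in chosen)
    return {"count": count, "mean": total / count if count else None}


entries = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "b", "sales": 20.0},
    {"region": "west", "product": "a", "sales": 30.0},
    {"region": "west", "product": "a", "sales": 50.0},
]

objects = build_objects(entries, ("region", "product"), "sales")

# Several summaries generated from the same objects.
overall = summarize(objects)
west = summarize(objects, lambda key: key[0] == "west")    # subset defined via groups
product_a = summarize(objects, lambda key: key[1] == "a")  # different variable
```

The `selector` argument stands in for the input received at the interface: a subset of the data set is defined by referencing group-by subsets, and its summary is computed from the stored cardinalities and sums rather than from the original entries.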
Abstract
This disclosure describes methods, systems, computer-readable media, and apparatuses for efficiently calculating group-by statistics. A data set that includes multiple entries is accessed. The multiple entries are grouped into group-by subsets which are formed on two or more group-by variables and which are subsets of the data set. Cardinality data is determined for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset. At least one summary of data in each of the group-by subsets is generated, wherein each of the summaries includes the cardinality data determined for the group-by subset. Objects for the group-by subsets are initialized such that the objects store the summaries. The objects may then be used to generate multiple statistical summaries of the data set.
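One way to see why storing cardinality with each summary makes later statistics cheap is to merge per-group moments. The sketch below is an assumption, not the disclosed method: the abstract does not say which statistics the summaries hold, and the `Moments` class and its fields are hypothetical. It uses Welford's running update for a single group and the standard pairwise combination formula (commonly attributed to Chan, Golub, and LeVeque) to obtain the mean and variance of a union of disjoint groups from their stored counts, means, and squared deviations.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Hypothetical summary object: cardinality plus mean and squared deviations."""
    n: int = 0         # cardinality of the group-by subset
    mean: float = 0.0  # running mean
    m2: float = 0.0    # sum of squared deviations from the mean

    def add(self, x):
        """Welford's update for one new entry."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two disjoint groups without revisiting their entries."""
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2)
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

    def variance(self):
        """Sample variance, or None when fewer than two entries are present."""
        return self.m2 / (self.n - 1) if self.n > 1 else None


# Two disjoint group-by subsets summarized independently.
a, b = Moments(), Moments()
for x in (1.0, 2.0, 3.0):
    a.add(x)
for x in (4.0, 5.0, 6.0, 7.0):
    b.add(x)

# Summary of the union, computed from the two objects only.
merged = a.merge(b)
```

The merge weights each group by its cardinality, which is why the count has to travel with the summary: without it, the combined mean and variance could not be recovered from the per-group values.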
30 Claims
11. A system comprising:
one or more processors configured to perform operations that include;
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets include multiple disjoint subsets of the data set, multiple intersecting subsets of the data set, or multiple subsets of the data set which are formed on different combinations of group-by variables;
displaying an interface that facilitates defining a subset of the data set by referencing one or more of the group-by subsets;
receiving an input at the interface, the input defining a subset of the data set by referencing at least one of the group-by subsets;
generating a statistical summary of the defined subset;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
21. A method comprising:
accessing a data set that includes multiple entries, each of the entries including data corresponding to multiple variables;
grouping the multiple entries into group-by subsets, wherein the group-by subsets are formed on two or more group-by variables, and wherein the group-by subsets include multiple disjoint subsets of the data set, multiple intersecting subsets of the data set, or multiple subsets of the data set which are formed on different combinations of group-by variables;
displaying an interface that facilitates defining a subset of the data set by referencing one or more of the group-by subsets;
receiving an input at the interface, the input defining a subset of the data set by referencing at least one of the group-by subsets;
generating a statistical summary of the defined subset;
determining cardinality data for each of the group-by subsets, wherein cardinality data represents a number of entries in a group-by subset;
generating at least one summary of data in each of the group-by subsets, wherein each of the summaries includes the cardinality data determined for the group-by subset;
initializing objects for the group-by subsets, wherein each of the objects include the cardinality data and the at least one summary, and wherein each of the objects includes values of the group-by variables used in forming the group-by subset; and
generating multiple statistical summaries of the data set using the objects.
Specification