Sampling for aggregation queries
First Claim
1. A method that computes an approximate result to an aggregation query on a relation with at least one attribute comprising:
- identifying outlier tuples in the relation having an attribute value that meets an outlier criteria that is a variance of the attribute value in each tuple with respect to the other tuples in the relation by;
sorting the tuples based on the tuple value for the attribute;
determining a minimum number of tuples to be classified as outliers;
determining a set of contiguous sorted tuples that includes at least the minimum number of tuples and for which the variance is minimized; and
classifying tuples not in the set of contiguous sorted tuples as outliers;
executing the query on the identified outlier tuples to obtain an outlier result;
estimating a non-outlier contribution of non-outlier tuples to the result of the query; and
combining the outlier result and the non-outlier contribution.
2 Assignments
0 Petitions
Accused Products
Abstract
Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by.
-
Citations
26 Claims
-
1. A method that computes an approximate result to an aggregation query on a relation with at least one attribute comprising:
-
identifying outlier tuples in the relation having an attribute value that meets an outlier criteria that is a variance of the attribute value in each tuple with respect to the other tuples in the relation by;
sorting the tuples based on the tuple value for the attribute;
determining a minimum number of tuples to be classified as outliers;
determining a set of contiguous sorted tuples that includes at least the minimum number of tuples and for which the variance is minimized; and
classifying tuples not in the set of contiguous sorted tuples as outliers;
executing the query on the identified outlier tuples to obtain an outlier result;
estimating a non-outlier contribution of non-outlier tuples to the result of the query; and
combining the outlier result and the non-outlier contribution. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
-
-
13. A method for computing an approximate result to an aggregation query on a relation with at least one attribute comprising:
-
analyzing the workload to trace tuple usage by queries in the workload;
identifying outlier tuples in the relation having an attribute value that meets an outlier criteria;
estimating a non-outlier contribution of non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples that is weighted based on tuple usage;
executing the query on the identified outlier tuples to obtain an outlier result by accessing an index of outlier tuples that have a relatively high tuple usage; and
combining the outlier result and the non-outlier contribution.
-
-
14. A computer readable medium comprising computer executable instructions for performing a method that approximates a result to an aggregation query on a relation with at least one attribute comprising:
-
identifying outlier tuples in the relation having an attribute value that meets an outlier criteria that is a variance of the attribute value in each tuple with respect to the other tuples in the relation by;
sorting the tuples based on the tuple value for the attribute;
determining a minimum number of tuples to be classified as outliers;
determining a set of contiguous sorted tuples that includes at least the minimum number of tuples and for which the variance is minimized; and
classifying tuples not in the set of contiguous sorted tuples as outliers;
executing the query on the identified outlier tuples to obtain an outlier result;
estimating a non-outlier contribution of non-outlier tuples to the result of the query; and
combining the outlier result and the non-outlier contribution. - View Dependent Claims (15, 16, 17, 18, 19, 20)
-
-
21. A computer readable medium comprising computer executable instructions for performing a method for approximating a result to an aggregation query on a relation with at least one attribute comprising:
-
analyzing the workload to trace tuple usage by queries in the workload;
identifying outlier tuples in the relation having an attribute value that meets an outlier criteria;
estimating a non-outlier contribution of non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples that is weighted based on tuple usage;
executing the query on the identified outlier tuples to obtain an outlier result by accessing an index of outlier tuples that have a relatively high tuple usage; and
combining the outlier result and the non-outlier contribution.
-
-
22. A system for computing an approximate result to an aggregation query on a database comprising:
-
one or more computers that store data that is organized according to a hierarchy of related fields in one or more relations;
a database management system including a processor that selectively extracts tuples from the database relations and including processor components that evaluate the contents of the tuples; and
the processor comprising a query result approximation component comprising i) an outlier identification module that identifies outlier tuples that have an attribute value that meets an outlier criteria;
wherein the outlier identification module sorts the tuples based on the tuple value for the attribute, determines a minimum number of tuples to be classified as outliers;
determines a set of contiguous sorted tuples that includes at least the minimum number of tuples and from which the variance is minimized, and classifies the tuples not in the set of contiguous sorted tuples as outliers;
ii) an outlier query execution module that executes the query on the identified outlier tuples;
iii) a non-outlier query contribution module that estimates the contribution of the non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples; and
iv) a result aggregation module that combines a result of the query executed on the outlier tuples and the estimated result. - View Dependent Claims (23, 24, 25, 26)
-
Specification