Sampling for aggregation queries

US 6,842,753 B2
Filed: 01/12/2001
Issued: 01/11/2005
Est. Priority Date: 01/12/2001
Status: Active Grant

Abstract

Aggregation queries are performed by first identifying outlier values, aggregating the outlier values, and sampling the remaining data after pruning the outlier values. The sampled data is extrapolated and added to the aggregated outlier values to provide an estimate for each aggregation query. Outlier values are identified by selecting values outside of a selected sliding window of data having the lowest variance. An index is created for the outlier values. The outlier data is removed from the window of data, and separately aggregated. The remaining data without the outliers is then sampled in one of many known ways to provide a statistically relevant sample that is then aggregated and extrapolated to provide an estimate for the remaining data. This sampled estimate is combined with the outlier aggregate to form an estimate for the entire set of data. Further methods involve the use of weighted sampling and weighted selection of outlier values for low selectivity queries, or queries having group by.
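
The approach in the abstract can be sketched for a SUM query as follows. This is a minimal illustration and not the patented implementation: the function names `split_outliers` and `approximate_sum`, the fixed outlier budget `num_outliers`, and the use of uniform random sampling for the remaining data are all assumptions made for the example.

```python
import random
from itertools import accumulate


def split_outliers(values, num_outliers):
    """Split values into (outliers, non_outliers).

    Assumption: the window holds len(values) - num_outliers contiguous
    sorted values, and the window with the lowest variance is kept as the
    non-outlier set. Everything outside that window is an outlier.
    """
    ordered = sorted(values)
    n = len(ordered)
    window = n - num_outliers
    if num_outliers < 0 or window <= 0:
        raise ValueError("num_outliers must be in [0, len(values))")

    # Prefix sums of x and x^2 give each window's variance in O(1).
    # (Fine for a sketch; can lose precision on very large magnitudes.)
    s1 = [0.0] + list(accumulate(ordered))
    s2 = [0.0] + list(accumulate(x * x for x in ordered))

    best_start, best_var = 0, float("inf")
    for start in range(n - window + 1):
        total = s1[start + window] - s1[start]
        total_sq = s2[start + window] - s2[start]
        var = total_sq / window - (total / window) ** 2
        if var < best_var:
            best_start, best_var = start, var

    inside = ordered[best_start:best_start + window]
    outside = ordered[:best_start] + ordered[best_start + window:]
    return outside, inside


def approximate_sum(values, num_outliers, sample_size, rng=None):
    """Exact sum over outliers plus a scaled-up sum over a uniform sample."""
    rng = rng or random.Random()
    outliers, rest = split_outliers(values, num_outliers)
    k = min(sample_size, len(rest))
    if k <= 0:
        raise ValueError("sample_size must be positive")
    sample = rng.sample(rest, k)
    extrapolated = sum(sample) * len(rest) / k  # scale sample to population
    return sum(outliers) + extrapolated
```

Because the extreme values are aggregated exactly, the sampled portion has a much smaller spread, which is what keeps the extrapolation error low in this sketch.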

26 Claims

1. A method that computes an approximate result to an aggregation query on a relation with at least one attribute comprising:
- identifying outlier tuples in the relation having an attribute value that meets an outlier criteria that is a variance of the attribute value in each tuple with respect to the other tuples in the relation by:
  - sorting the tuples based on the tuple value for the attribute;
  - determining a minimum number of tuples to be classified as outliers;
  - determining a set of contiguous sorted tuples that includes at least the minimum number of tuples and for which the variance is minimized; and
  - classifying tuples not in the set of contiguous sorted tuples as outliers;
- executing the query on the identified outlier tuples to obtain an outlier result;
- estimating a non-outlier contribution of non-outlier tuples to the result of the query; and
- combining the outlier result and the non-outlier contribution.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method of claim 1 comprising constructing an outlier index that records a location of outlier tuples in the relation.
  - 3. The method of claim 2 wherein the results of multiple queries are approximated and comprising maintaining a union of the outlier indexes.
  - 4. The method of claim 1 wherein the non-outlier contribution is estimated by executing the query on a sample of the non-outlier tuples.
  - 5. The method of claim 4 wherein the sample of non-outlier tuples is compiled by performing a random sample of the non-outlier tuples.
  - 6. The method of claim 4 wherein the sample of non-outliers is compiled by including a tuple in the sample based on a probability that the tuple has been accessed by a previous user query.
  - 7. The method of claim 1 wherein the outlier criteria is based on a type of query that generates the result being approximated.
  - 8. The method of claim 1 wherein the variance is determined by calculating a standard deviation of the set of contiguous sorted tuples.
  - 9. The method of claim 1 comprising estimating an error associated with the approximated query result and executing the query on the relation when the estimated error is above a predetermined threshold.
  - 10. The method of claim 1 wherein the query divides the relation into sub-relations and wherein outlier tuples are identified for each sub-relation.
  - 11. The method of claim 10 comprising constructing an outlier index for each sub-relation that records a location of outlier tuples in the relation.
  - 12. The method of claim 11 comprising calculating a variance for the attribute values for the tuples in each sub-relation and wherein outlier indexes are stored for those sub-relations having a variance above a threshold variance.
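
The fallback described in claim 9 (run the exact query when the estimated error is too high) can be illustrated with a short sketch. The claims do not say how the error is estimated; the standard error of a uniform random sample with a finite population correction is an assumption here, as are the function name `estimate_or_execute` and its parameters.

```python
import math
import random
import statistics


def estimate_or_execute(values, sample_size, max_relative_error, rng=None):
    """Return (sum_result, used_exact_query).

    Estimates SUM(values) from a uniform sample. If the estimated relative
    standard error exceeds max_relative_error, falls back to the exact sum.
    """
    rng = rng or random.Random()
    n = len(values)
    k = min(sample_size, n)
    if k < 2:
        raise ValueError("need at least two sampled values to estimate error")

    sample = rng.sample(values, k)
    estimate = n * statistics.fmean(sample)

    # Standard error of the scaled-up sum, with finite population correction.
    std_err = n * statistics.stdev(sample) / math.sqrt(k) * math.sqrt(1 - k / n)
    relative = std_err / abs(estimate) if estimate else math.inf

    if relative > max_relative_error:
        return float(sum(values)), True  # exact query on the full relation
    return estimate, False
```

In a combined scheme, `values` would be the non-outlier tuples only, since the outlier tuples are aggregated exactly and contribute no sampling error.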

13. A method for computing an approximate result to an aggregation query on a relation with at least one attribute comprising:
- analyzing the workload to trace tuple usage by queries in the workload;
- identifying outlier tuples in the relation having an attribute value that meets an outlier criteria;
- estimating a non-outlier contribution of non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples that is weighted based on tuple usage;
- executing the query on the identified outlier tuples to obtain an outlier result by accessing an index of outlier tuples that have a relatively high tuple usage; and
- combining the outlier result and the non-outlier contribution.
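
The usage-weighted sample in claim 13 can be sketched as follows. This covers only the weighted sampling step, not the outlier index. The workload representation (each query as a list of touched tuple positions), the additive `smoothing` term, and the with-replacement estimator that divides each sampled value by its selection probability are all assumptions for illustration; the claim itself does not specify them.

```python
import random
from collections import Counter


def usage_weights(workload, num_tuples, smoothing=1.0):
    """Turn a workload trace into per-tuple selection probabilities.

    workload: iterable of queries, each an iterable of tuple positions
    that the query accessed. Smoothing keeps never-accessed tuples
    selectable so the estimator below stays unbiased.
    """
    counts = Counter(i for query in workload for i in query)
    raw = [counts[i] + smoothing for i in range(num_tuples)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sum_estimate(values, probs, sample_size, rng=None):
    """Estimate SUM(values) from a sample drawn with probabilities probs.

    Sampling is with replacement; each drawn value is divided by its
    selection probability and the results are averaged.
    """
    rng = rng or random.Random()
    picks = rng.choices(range(len(values)), weights=probs, k=sample_size)
    return sum(values[i] / probs[i] for i in picks) / sample_size
```

Tuples that past queries touched often are more likely to appear in the sample, so queries resembling the workload see more relevant rows for the same sample size.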

14. A computer readable medium comprising computer executable instructions for performing a method that approximates a result to an aggregation query on a relation with at least one attribute comprising:
- identifying outlier tuples in the relation having an attribute value that meets an outlier criteria that is a variance of the attribute value in each tuple with respect to the other tuples in the relation by:
  - sorting the tuples based on the tuple value for the attribute;
  - determining a minimum number of tuples to be classified as outliers;
  - determining a set of contiguous sorted tuples that includes at least the minimum number of tuples and for which the variance is minimized; and
  - classifying tuples not in the set of contiguous sorted tuples as outliers;
- executing the query on the identified outlier tuples to obtain an outlier result;
- estimating a non-outlier contribution of non-outlier tuples to the result of the query; and
- combining the outlier result and the non-outlier contribution.
- Dependent claims (15, 16, 17, 18, 19, 20)
  - 15. The computer readable medium of claim 14 comprising constructing an outlier index that records a location of outlier tuples in the relation.
  - 16. The computer readable medium of claim 14 wherein the contribution of non-outlier tuples to the result of the query is estimated by executing the query on a sample of the non-outlier tuples.
  - 17. The computer readable medium of claim 16 wherein the sample of non-outlier tuples is compiled by performing a random sample of the non-outlier tuples.
  - 18. The computer readable medium of claim 16 wherein the sample of non-outlier tuples is compiled by including a tuple in the sample based on a probability that the tuple has been accessed by a previous user query.
  - 19. The computer readable medium of claim 14 wherein the variance is determined by calculating a standard deviation of the set of contiguous sorted tuples.
  - 20. The computer readable medium of claim 14 wherein the query divides the relation into sub-relations and wherein outlier tuples are selected for each sub-relation.

21. A computer readable medium comprising computer executable instructions for performing a method for approximating a result to an aggregation query on a relation with at least one attribute comprising:
- analyzing the workload to trace tuple usage by queries in the workload;
- identifying outlier tuples in the relation having an attribute value that meets an outlier criteria;
- estimating a non-outlier contribution of non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples that is weighted based on tuple usage;
- executing the query on the identified outlier tuples to obtain an outlier result by accessing an index of outlier tuples that have a relatively high tuple usage; and
- combining the outlier result and the non-outlier contribution.

22. A system for computing an approximate result to an aggregation query on a database comprising:
- one or more computers that store data that is organized according to a hierarchy of related fields in one or more relations;
- a database management system including a processor that selectively extracts tuples from the database relations and including processor components that evaluate the contents of the tuples; and
- the processor comprising a query result approximation component comprising
  - i) an outlier identification module that identifies outlier tuples that have an attribute value that meets an outlier criteria; wherein the outlier identification module sorts the tuples based on the tuple value for the attribute, determines a minimum number of tuples to be classified as outliers; determines a set of contiguous sorted tuples that includes at least the minimum number of tuples and from which the variance is minimized, and classifies the tuples not in the set of contiguous sorted tuples as outliers;
  - ii) an outlier query execution module that executes the query on the identified outlier tuples;
  - iii) a non-outlier query contribution module that estimates the contribution of the non-outlier tuples to the result of the query by executing the query on a sample of non-outlier tuples; and
  - iv) a result aggregation module that combines a result of the query executed on the outlier tuples and the estimated result.
- Dependent claims (23, 24, 25, 26)
  - 23. The system of claim 22 wherein the outlier identification module comprises an index that stores a location for each identified outlier tuple.
  - 24. The system of claim 22 wherein the outlier identification module identifies outliers by determining the variance of the value of the attribute of a given tuple with respect to the other tuples in the relation.
  - 25. The system of claim 22 wherein the non-outlier contribution estimation module accesses a sample of the non-outliers to estimate the contribution of non-outlier tuples to the result of the query.
  - 26. The system of claim 25 wherein tuples are included in the sample based on a probability that the tuple has been accessed by a previous user query.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chaudhuri, Surajit; Datar, Mayur D.; Narasayya, Vivek R.; Motwani, Rajeev
Primary Examiner(s): ALAM, SHAHID AL
Application Number: US09/759,799
Publication Number: US 20020124001A1
Time in Patent Office: 1,460 Days
Field of Search: 707/2, 707/3, 707/5, 707/10, 707/100, 707/102, 707/1, 707/6, 707/101, 707/4, 706/45, 706/62, 714/25, 702/97
US Class Current: 1/1

CPC Class Codes
- G06F 16/24556   Aggregation; Duplicate elim...
- G06F 16/2462   Approximate or statistical ...
- G06F 2216/03   Data mining
- Y10S 707/957   Multidimensional
- Y10S 707/99932   Access augmentation or opti...
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99942   Manipulating data structure...
- Y10S 707/99943   Generating database or data...
