Sampling for database systems
First Claim
1. A method for obtaining a sample in a single sequential pass of a plurality of records in a database system, the method comprising the steps of:
- (a) identifying the plurality of records and sampling semantics;
(b) select a sampling operator based on the sampling semantics wherein the sampling operator comprises a probability function that outputs a number of times an input tuple should be included in the sample to obtain a sample according to the sampling semantics; and
(c) obtaining a sample from the identified plurality of records using the sampling operator by;
(i) inputing each of the plurality of records the sampling operator; and
(ii) determining a number of times the record should be included in the sample based on the output of the sample operator; and
(iii) selectively outputting the record for inclusion in the sample the determined number of times.
3 Assignments
0 Petitions
Accused Products
Abstract
A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
78 Citations
65 Claims
-
1. A method for obtaining a sample in a single sequential pass of a plurality of records in a database system, the method comprising the steps of:
-
(a) identifying the plurality of records and sampling semantics;
(b) select a sampling operator based on the sampling semantics wherein the sampling operator comprises a probability function that outputs a number of times an input tuple should be included in the sample to obtain a sample according to the sampling semantics; and
(c) obtaining a sample from the identified plurality of records using the sampling operator by;
(i) inputing each of the plurality of records the sampling operator; and
(ii) determining a number of times the record should be included in the sample based on the output of the sample operator; and
(iii) selectively outputting the record for inclusion in the sample the determined number of times. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
wherein the selecting step (b) comprises the step of selecting a sampling operator whose probability function is based on the identified sample size.
-
-
4. The method of claim 1, wherein the identifying step (a) comprises the step of identifying a weight function to specify a weight for each record;
- and
wherein the selecting step (b) comprises the step of selecting a sampling operator whose probability function is based on the specified weight of each record.
- and
-
5. The method of claim 1, wherein the obtaining step (c) comprises the steps of:
-
(i) inputting one record from the plurality of records to the sampling operator in step (c)(i), (ii) selectively resetting one or more records of a reservoir to be the one record input to the sampling operator based on the output of the sampling operator in step (c)(ii), and (iii) repeating steps (c)(i) and (c)(ii) for other records of the plurality of records such that the records of the reservoir from the sample.
-
-
6. The method of claim 5, wherein the selectively resetting step (c) comprises the step of selectively resetting one or more records of the reservoir to be the one record obtained in step (i) with a probability based on a weight of the one record obtained in step (i) divided by a sum of weight(s) of record(s) that have been obtained for step (i).
-
7. The method of claim 5, wherein the selectively resetting step (c) comprises the step of selectively resetting a random record of the reservoir to be the one record obtained in step (i) based on a probability a number of time(s) equal in number to the weight of the one record obtained in step (i).
-
8. The method of claim 5, wherein the selectively resetting step (c) comprises the step of selectively resetting a random record of the reservoir to be the one record obtained in step (i) with a probability based on a number of records in the reservoir divided by a sum of record(s) evaluated for reset in the reservoir.
-
9. The method of claim 1, wherein at least one record obtained in step (a) may be output more than one time for step (c).
-
10. The method of claim 9, wherein the plurality of records are a relation produced as a stream of records as a result of a query.
-
11. The method of claim 9, wherein the plurality of records are materialized as a base relation in a database of the database system.
-
12. The method of claim 1, wherein the selectively outputting step (c)(iii) comprises the steps of
(i) determining a random number based on the probability such that the random number is greater than or equal to zero; - and
(ii) outputting the one record the determined number of times.
- and
-
13. The method of claim 1, wherein the sampling operator determines a random number from a binomial distribution based on the probability function.
-
14. The method of claim 1, wherein the sampling operator determines a random number such that the random number is less than or equal to a number of record(s) remaining to be output for inclusion in the sample.
-
15. The method of claim 1, wherein the step of the sampling operator determines a random number based on a probability based on a number of record(s) of the plurality of records to be evaluated for output for inclusion in the sample.
-
16. The method of claim 1, wherein the selectively outputting step (c)(iii) comprises the step of selectively outputting the one record one or more times based on a weight specified for the one record.
-
17. The method of claim 1, wherein the determining step (c)(ii) comprises the step of determining a random number based on a probability based on a weight of the one record divided by a sum of weight(s) of record(s) of the plurality of records to be evaluated for output for inclusion in the sample.
-
18. The method of claim 1, wherein the selectively outputting step (c)(iii) comprises the step of selectively outputting the one record based on a probability a number of time(s) equal in number to a weight of the one record.
-
19. The method of claim 18, wherein the selectively outputting step comprises the step of outputting the one record with a probability based on a number of record(s) remaining to be output for the sample divided by a number of possible record(s) that may be output for inclusion in the sample.
-
20. The method of claim 1, wherein the determining step (c)(ii) comprises the step of determining a random number such that the random number is less than or equal to a weight of the one record.
-
21. The method of claim 1, wherein the determining step (c)(ii) comprises the step of determining random number based on a probability based on a fraction of the plurality of records.
-
22. A computer readable medium having computer-executable instructions for performing a sequential sampling of records in one pass, the computer-executable instructions for performing the steps of:
-
(a) obtaining one record from a plurality of records;
(b) selectively outputting the one record obtained in step (a) one or more times based on a probability; and
(c) repeating steps (a) and (b) for one or more other records of the plurality of records to form a sample of the plurality of records, wherein at least one record obtained in step (a) may be output more than one time for step (b). - View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
(i) determining a random number based on the probability such that the random number is greater than or equal to zero, and (ii) outputting the one record the determined random number of times.
-
-
26. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number from a binomial distribution based on the probability.
-
27. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number such that the random number is less than or equal to a number of record(s) remaining to be output for the sample.
-
28. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a number of record(s) of the plurality of records to be evaluated for output for step (b).
-
29. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a weight of the one record divided by a sum of weight(s) of record(s) of the plurality of records to be evaluated for output for step (b).
-
30. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number such that the random number is less than or equal to a weight of the one record.
-
31. The computer readable medium of claim 25, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a fraction of the plurality of records.
-
32. The computer readable medium of claim 22, wherein the selectively outputting step (b) comprises the step of selectively outputting the one record one or more times based on a weight specified for the one record.
-
33. The computer readable medium of claim 22, wherein the selectively outputting step (b) comprises the step of selectively outputting the one record based on a probability a number of time(s) equal in number to the weight of the one record.
-
34. The computer readable medium of claim 33, wherein the selectively outputting step comprises the step of outputting the one record with a probability based on a number of record(s) remaining to be output for the sample divided by a number of possible record(s) that may be output for step (b).
-
35. The computer readable medium of claim 22, wherein the plurality of records form a relation and wherein the sample is to be joined with records of another relation.
-
36. A computer readable medium having computer-executable instructions for performing a sequential sampling of records in one pass, the computer-executable instructions for performing the steps of:
-
(a) obtaining one record from a plurality of records;
(b) selectively resetting one or more records of a reservoir to be the one record obtained in step (a) based on a probability; and
(c) repeating steps (a) and (b) for other records of the plurality of records such that the records of the reservoir form a sample of the plurality of records, wherein at least one record obtained in step (a) may be used to reset more than one record of the reservoir for step (b). - View Dependent Claims (37, 38, 39, 40, 41, 42, 43)
-
-
44. A database system for performing a sequential sampling of records in one pass in the database system, the database system comprising:
-
means for performing the step of (a) obtaining one record from a plurality of records;
means for performing the step of (b) selectively outputting the one record one or more times based on a probability; and
means for repeating steps (a) and (b) for one or more other records of the plurality of records to form a sample of the plurality of records, wherein at least one record obtained in step (a) may be output more than one time for step (b). - View Dependent Claims (45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57)
(i) determining a random number based on the probability such that the random number is greater than or equal to zero, and (ii) outputting the one record the determined random number of times.
-
-
48. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number from a binomial distribution based on the probability.
-
49. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number such that the random number is less than or equal to a number of record(s) remaining to be output for the sample.
-
50. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a number of record(s) of the plurality of records to be evaluated for output for step (b).
-
51. The database system of claim 47, wherein the selectively outputting step (b) comprises the step of selectively outputting the one record one or more times based on a weight specified for the one record.
-
52. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a weight of the one record divided by a sum of weight(s) of record(s) of the plurality of records to be evaluated for output for step (b).
-
53. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number such that the random number is less than or equal to a weight of the one record.
-
54. The database system of claim 47, wherein the determining step (b)(i) comprises the step of determining the random number based on a probability based on a fraction of the plurality of records.
-
55. The database system, of claim 44, wherein the selectively outputting step (b) comprises the step of selectively outputting the one record based on a probability a number of time(s) equal in number to the weight of the one record.
-
56. The database system of claim 55, wherein the selectively outputting step comprises the step of outputting the one record with a probability based on a number of record(s) remaining to be output for the sample divided by a number of possible record(s) that may be output for step (b).
-
57. The database system of claim 44, wherein the plurality of records form a relation and wherein the sample is to be joined with records of another relation.
-
58. A database system for performing a sequential sampling of records in one pass in the database system, the database system comprising:
-
means for performing the step of (a) obtaining one record from a plurality of records;
means for performing the step of (b) selectively resetting one or more records of a reservoir to be the one record obtained in step (a) based on a probability; and
means for performing the step of (c) repeating steps (a) and (b) for other records of the plurality of records such that the records of the reservoir form a sample of the plurality of records, wherein at least one record obtained in step (a) may be used to reset more than one record of the reservoir for step (b). - View Dependent Claims (59, 60, 61, 62, 63, 64, 65)
-
Specification