Sampling for database systems
First Claim
1. A method that samples with replacement a plurality of records in a database system in a single sequential pass of the records, comprising:
- identifying the plurality of records;
determining a sum of weights to be assigned to all records in the plurality of records;
when a record is selected to be included in the sample, tabulating a remaining weight that subtracts all weights associated with records included in the sample from the sum of weights;
for each of the plurality of records;
inputting the record to a sampling operator that determines a number of times the record is included in the sample by generating a random value from a binomial distribution using the remaining weight and a weight of the current record as parameters; and
selectively outputting and storing the record for inclusion in the sample the determined number of times, wherein the record may be included more than one time in the sample.
2 Assignments
0 Petitions
Accused Products
Abstract
A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records, such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
48 Citations
11 Claims
-
1. A method that samples with replacement a plurality of records in a database system in a single sequential pass of the records, comprising:
-
identifying the plurality of records; determining a sum of weights to be assigned to all records in the plurality of records; when a record is selected to be included in the sample, tabulating a remaining weight that subtracts all weights associated with records included in the sample from the sum of weights; for each of the plurality of records; inputting the record to a sampling operator that determines a number of times the record is included in the sample by generating a random value from a binomial distribution using the remaining weight and a weight of the current record as parameters; and selectively outputting and storing the record for inclusion in the sample the determined number of times, wherein the record may be included more than one time in the sample. - View Dependent Claims (2, 5)
-
-
3. A method that samples with replacement a plurality of database records in a database system in a single sequential pass of the database records, the method comprising:
-
(a) initializing a sample reservoir with dummy records; (b) associating a weight with each of the database records; and (c) for each database record, selectively replacing one or more dummy records in the sample reservoir with the database record according to a given probability such that multiple copies of a database record may be included in the sample, wherein the dummy records are selectively replaced in the sample reservoir with the record by considering the weight associated with the record. - View Dependent Claims (4, 6)
-
-
7. An apparatus for sampling a plurality of records in a database system in a sequential sampling of the records, the method comprising:
-
means for identifying the plurality of records; means for inputting each of the plurality of records to a sampling operator that determines a number of times the record will be included in the sample by generating a random value from a binomial distribution having a desired sample size and a number of records beings sampled as parameters; and means for selectively outputting and storing the record for inclusion in the sample the determined number of times, wherein the record is output one or more times, which may result in multiple copies of a record being included in the sample. - View Dependent Claims (8, 9)
-
-
10. An apparatus for sampling with replacement a plurality of database records in a database system in a sequential sampling of the database records, comprising:
-
(a) means for initializing a sample reservoir with dummy records; (b) weighting means configured to associate a weight with each database record and determine a sum of weights of sampled records; (c) examining means for examining each database record and selectively replacing one or more dummy records in the sample reservoir with a database record according to a probability that is based on the weight of the database record and the sum of weights, which can result in multiple copies of a record being included in the sample. - View Dependent Claims (11)
-
Specification