Stratified sampling of data in a database system
Abstract
A stratified sampling mechanism is provided in a database system. The stratified sampling mechanism includes defining a clause in a query that indicates stratified sampling is desired. Data from a source table is stratified into different subgroups based on stratification conditions in the query. Sampling is performed within each subgroup.
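The mechanism described in the abstract can be illustrated with a minimal in-memory sketch. It assumes rows are plain dictionaries, each stratum is given as a (condition, sample size) pair, and a row belongs to the first stratum whose condition it satisfies. The function name `stratified_sample` and the first-match rule are illustrative assumptions, not details taken from the patent.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]
Condition = Callable[[Row], bool]


def stratified_sample(
    rows: Sequence[Row],
    strata: Sequence[Tuple[Condition, int]],
    seed: Optional[int] = None,
) -> List[List[Row]]:
    """Return one random sample per stratum (hypothetical helper).

    Each row is placed in the first stratum whose condition it satisfies;
    rows matching no condition are ignored. A stratum holding fewer rows
    than its requested size is returned in full.
    """
    rng = random.Random(seed)
    buckets: List[List[Row]] = [[] for _ in strata]

    # Stratify: route each row to the subgroup selected by its condition.
    for row in rows:
        for index, (condition, _size) in enumerate(strata):
            if condition(row):
                buckets[index].append(row)
                break

    # Sample within each subgroup, without replacement.
    return [
        rng.sample(bucket, min(size, len(bucket)))
        for bucket, (_condition, size) in zip(buckets, strata)
    ]


if __name__ == "__main__":
    table = [{"id": i, "age": 20 + i} for i in range(20)]
    requested = [
        (lambda r: r["age"] < 30, 3),   # stratum 1: three rows
        (lambda r: r["age"] >= 30, 4),  # stratum 2: four rows
    ]
    for sample in stratified_sample(table, requested, seed=0):
        print([r["id"] for r in sample])
```

Sampling without replacement is a choice made for this sketch; the abstract does not say which mode is used.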
29 Citations
26 Claims
1. A method of performing stratified sampling in a database system, comprising:
- receiving a query containing a clause indicating stratified sampling of a source table is to be performed, the clause containing plural stratification conditions that specify plural strata in which sampling is to occur, the clause further containing sample sizes associated with respective strata; and
- generating one or more commands to send to a processing module, the one or more commands containing instructions to evaluate the stratification conditions contained in the clause and to perform stratified sampling of data from the source table, the stratified sampling producing samples having respective sizes specified by the sample sizes in the clause for respective strata.
Dependent claims: 2, 3, 4, 5, 7, 8, 9, 22, 23
6. A method of performing stratified sampling in a database system, comprising:
- receiving a query containing a clause indicating stratified sampling of a source table is to be performed, the clause containing plural stratification conditions;
- generating one or more commands, the one or more commands containing instructions to evaluate the stratification conditions and to perform sampling of data from the source table, wherein the plural stratification conditions correspond to plural strata;
- writing data from a row of the source table into one of plural files depending on which of the stratification conditions the row satisfies;
- performing sampling of data in the plural files in response to the one or more commands, wherein the database system has plural access modules across which each file is partitioned, wherein performing the sampling comprises performing sampling by each of the plural access modules of data in a corresponding partition of the file; and
- determining a number of samples to request from each access module, wherein determining the number of samples to request from each access module comprises calculating a number that is proportional to the number of rows in the corresponding partition.
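The last element of claim 6 calls for a per-module request that is proportional to the number of rows in that module's partition. The sketch below is one possible way to compute such requests for a single stratum. The function name `allocate_samples` and the largest-remainder rounding rule are assumptions made for illustration; the claim itself does not state how fractional shares are rounded.

```python
from typing import List, Sequence


def allocate_samples(partition_row_counts: Sequence[int], total_samples: int) -> List[int]:
    """Split a stratum's sample size across access modules (hypothetical helper).

    Each module's share is proportional to the rows it holds. Integer
    arithmetic is used throughout, and the rows left over after flooring
    go to the modules with the largest fractional remainders.
    """
    total_rows = sum(partition_row_counts)
    if total_rows == 0:
        return [0] * len(partition_row_counts)

    # Never ask for more rows than the stratum contains.
    total_samples = min(total_samples, total_rows)

    # Floor of the exact proportional share, plus the remainder of that division.
    shares = [total_samples * n // total_rows for n in partition_row_counts]
    remainders = [total_samples * n % total_rows for n in partition_row_counts]

    # Hand out the samples lost to flooring, largest remainder first.
    leftover = total_samples - sum(shares)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: remainders[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    print(allocate_samples([500, 300, 200], 10))  # [5, 3, 2]
    print(allocate_samples([3, 3, 4], 5))         # [2, 1, 2]
```

With this rounding rule the requests always add up to the stratum's sample size, and no module is asked for more rows than its partition holds.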
10. An article comprising at least one storage medium containing instructions that when executed cause a database system to:
- generate one or more commands to perform stratified sampling of data contained in a relational table partitioned across plural access modules of the database system; and
- send the one or more commands to the plural access modules of the database system to cause the plural access modules to perform the stratified sampling in parallel, each access module to perform the stratified sampling by writing records satisfying stratification conditions to respective files corresponding to respective strata specified by the stratification conditions, and performing random sampling of records in each of the files.
Dependent claims: 24
11. An article comprising at least one storage medium containing instructions that when executed cause a database system to:
- generate one or more commands to perform stratified sampling;
- send the one or more commands to plural access modules of the database system to cause the plural access modules to perform the stratified sampling in parallel; and
- receive a query containing a clause containing plural stratification conditions for the stratified sampling, the plural stratification conditions specifying plural strata, the clause further containing sample sizes for respective strata, wherein generating the one or more commands to perform the stratified sampling is in response to the received query.
Dependent claims: 12, 13, 14
15. A database system comprising:
- a storage to store a base table; and
- a controller adapted to receive a request containing plural stratification conditions to divide data in the base table into corresponding plural strata, the request further containing sample sizes associated with respective strata, the controller adapted to perform random sampling, in response to the request, of data in each stratum, the random sampling producing samples corresponding to the plural strata, the samples having respective sizes specified by the sample sizes in the request.
Dependent claims: 16, 17, 18, 19, 20, 25
21. A database system comprising:
- a plurality of storage modules;
- a plurality of access modules to manage respective storage modules; and
- a parsing engine to receive a stratified sampling query specifying plural stratification conditions, the parsing engine to generate one or more commands to indicate performance of the stratified sampling, the parsing engine to send the one or more commands to the access modules, in response to the one or more commands, each access module to generate plural input spool files corresponding to plural strata, the input spool files to store qualifying rows from a source table, the access module to selectively write a given row into one of the input spool files based on which stratification condition the given row satisfies, each access module to further perform random sampling of the rows in each input spool file.
Dependent claims: 26
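The division of work recited in claim 21 can be pictured with a small simulation. It assumes each access module is a function running over its own partition of the source table, spool files are in-memory lists, and a thread pool stands in for parallel hardware. The names `access_module_step` and `parsing_engine_dispatch`, the per-module request lists, and the seeds are all hypothetical and serve only to make the sketch runnable.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Condition = Callable[[Row], bool]


def access_module_step(
    partition: Sequence[Row],
    conditions: Sequence[Condition],
    requests: Sequence[int],
    seed: int,
) -> List[List[Row]]:
    """Simulate one access module: build spool lists, then sample each one."""
    rng = random.Random(seed)
    spools: List[List[Row]] = [[] for _ in conditions]

    # Write each qualifying row into the spool of the condition it satisfies.
    for row in partition:
        for index, condition in enumerate(conditions):
            if condition(row):
                spools[index].append(row)
                break

    # Randomly sample the requested number of rows from every spool.
    return [
        rng.sample(spool, min(count, len(spool)))
        for spool, count in zip(spools, requests)
    ]


def parsing_engine_dispatch(
    partitions: Sequence[Sequence[Row]],
    conditions: Sequence[Condition],
    requests_per_module: Sequence[Sequence[int]],
) -> List[List[Row]]:
    """Send the same sampling step to every module and merge results per stratum."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(access_module_step, part, conditions, reqs, seed)
            for seed, (part, reqs) in enumerate(zip(partitions, requests_per_module))
        ]
        module_results = [future.result() for future in futures]

    merged: List[List[Row]] = [[] for _ in conditions]
    for result in module_results:
        for index, sample in enumerate(result):
            merged[index].extend(sample)
    return merged


if __name__ == "__main__":
    rows = [{"id": i, "region": "east" if i % 3 else "west"} for i in range(30)]
    partitions = [rows[:10], rows[10:20], rows[20:]]  # one partition per module
    conditions = [lambda r: r["region"] == "east", lambda r: r["region"] == "west"]
    requests = [[2, 1], [2, 1], [2, 1]]               # per module, per stratum
    for stratum in parsing_engine_dispatch(partitions, conditions, requests):
        print(sorted(r["id"] for r in stratum))
```

The merge step at the end is a convenience of this sketch; the claim stops at each access module sampling its own input spool files.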
Specification