Managing uncertain data using Monte Carlo techniques
Abstract
According to one embodiment of the present invention, a method for managing uncertain data is provided. The method includes specifying data uncertainty using at least one variable generation (VG) function. The VG function generates pseudorandom samples of uncertain data values. A random database based on the VG function is specified and multiple Monte Carlo instantiations of the random database are generated. Using a Monte Carlo method, a query is repeatedly executed over the multiple Monte Carlo instantiations to output a Monte Carlo method result and associated query-results. The Monte Carlo method result may then be used to estimate statistical properties of a probability distribution of the query-result.
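The flow described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the `normal_vg` function, the `PARAMS` table, the SUM query, and the use of consecutive seeds are all hypothetical choices made for this example.

```python
import random
import statistics

def normal_vg(rng, mean, std):
    """Hypothetical VG function: one pseudorandom sample of an uncertain value."""
    return rng.gauss(mean, std)

# Hypothetical parameters: (item, mean demand, standard deviation of demand).
PARAMS = [("widget", 100.0, 10.0), ("gadget", 50.0, 5.0)]

def instantiate(seed):
    """Generate one Monte Carlo instantiation of the random database."""
    rng = random.Random(seed)
    return [(item, normal_vg(rng, mean, std)) for item, mean, std in PARAMS]

def total_demand(db):
    """The query: SUM(demand) over a single instantiation."""
    return sum(demand for _, demand in db)

def estimate(n, base_seed=0):
    """Run the query over n instantiations and summarize the query-results."""
    results = [total_demand(instantiate(base_seed + i)) for i in range(n)]
    return statistics.mean(results), statistics.stdev(results)

mean, std = estimate(1000)
print(f"estimated mean={mean:.1f}, std={std:.1f}")
```

Here each query-result is one draw from the distribution of the query answer, so the sample mean and standard deviation serve as estimates of that distribution's statistical properties.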
60 Citations
13 Claims
1. A computer implemented method comprising:
specifying data uncertainty using at least one variable generation (VG) function wherein said VG function generates pseudorandom samples of uncertain data values;
specifying a random database based on said VG function;
generating, by a processor coupled to memory, a number of N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
identifying a plurality of database tuple bundles t, wherein each of the plurality of database tuple bundles t is a data structure comprising a correlated tuple from each of the N Monte Carlo instantiations;
representing the plurality of database tuple bundles t in a compressed form in which only pseudorandom numbers used to generate the uncertain data values are represented;
expanding the plurality of database tuple bundles t represented in the compressed form to an expanded form, wherein the plurality of database tuple bundles t is represented in the expanded form when all instantiated attribute values are explicitly represented;
executing, by a processor coupled to memory, a query Q over the N Monte Carlo instantiations, wherein said executing comprises;
executing a query plan for the query Q once over each of the plurality of database tuple bundles; and
outputting query-results, where zero or more numerical values that are used to estimate statistical properties of the probability distribution of the result of the query Q;
maintaining a running statistical property of query-results as each query-result is determined, wherein each query-result corresponds to one or more of the N Monte Carlo instantiations; and
after N query-results are determined, outputting the final value of the running statistical property to be the estimated statistical property of the probability distribution of the result of the query Q.
Dependent claims: 2, 3, 4, 5, 6
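The tuple-bundle steps recited in claim 1 can be sketched in Python as follows. This is one possible reading, not the claimed implementation: the `CompressedBundle` and `ExpandedBundle` classes, the choice to store a single seed per bundle, the normal distribution, and the SUM query are all assumptions introduced for illustration.

```python
import random
from dataclasses import dataclass

N = 4  # number of Monte Carlo instantiations (hypothetical value)

@dataclass(frozen=True)
class CompressedBundle:
    """Compressed form: deterministic attributes plus only the seed."""
    customer: str
    mean: float
    std: float
    seed: int

@dataclass(frozen=True)
class ExpandedBundle:
    """Expanded form: every instantiated attribute value is explicit."""
    customer: str
    amounts: tuple  # amounts[i] belongs to instantiation i

def expand(bundle, n=N):
    """Regenerate the n uncertain values of a bundle from its seed."""
    rng = random.Random(bundle.seed)
    amounts = tuple(rng.gauss(bundle.mean, bundle.std) for _ in range(n))
    return ExpandedBundle(bundle.customer, amounts)

def run_query(bundles, n=N):
    """Execute SUM(amount) once per bundle, updating all n totals together."""
    totals = [0.0] * n
    for bundle in bundles:
        for i, amount in enumerate(expand(bundle, n).amounts):
            totals[i] += amount
    return totals  # totals[i] is the query-result for instantiation i

bundles = [
    CompressedBundle("alice", 100.0, 10.0, seed=1),
    CompressedBundle("bob", 50.0, 5.0, seed=2),
]
totals = run_query(bundles)
print(totals)
```

Because expansion is deterministic given the seed, a bundle can be carried in compressed form and expanded only when an operator needs the explicit values.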
7. A computer-implemented method comprising:
specifying data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values;
specifying a random database based on said VG function;
generating, by a processor coupled to memory, a number N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
representing the plurality of database tuple bundles t in a compressed form in which only pseudorandom numbers used to generate the uncertain data values are represented;
expanding the plurality of database tuple bundles t represented in the compressed form to an expanded form, wherein the plurality of database tuple bundles t is represented in the expanded form when all instantiated attribute values are explicitly represented;
repeatedly executing a query, by the processor coupled to memory, over said multiple Monte Carlo instantiations to output a Monte Carlo result and associated query-results by identifying a plurality of database tuple bundles comprising correlated tuples from each of the Monte Carlo instantiations and executing the query on each of the plurality of database tuple bundles; and
estimating statistical properties of the probability distribution of said query-result;
maintaining a running statistical property of query-results as each query-result is determined, wherein each query-result corresponds to one or more of the N Monte Carlo instantiations; and
after N query-results are determined, outputting the final value of the running statistical property to be the estimated statistical property of the probability distribution of the result of the query Q.
Dependent claims: 8, 9, 10, 11
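The "running statistical property" step can be illustrated with a short Python sketch. The claim does not name an update rule; Welford's online algorithm is used here as one well-known option, and the `RunningStats` class and the sample query-results are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new query-result into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (NaN) for fewer than two results."""
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)

stats = RunningStats()
for result in [1.0, 2.0, 3.0, 4.0, 5.0]:  # stand-ins for N = 5 query-results
    stats.update(result)
print(stats.mean, stats.variance)  # prints 3.0 2.5
```

With this approach the individual query-results need not be stored: after the N-th update, the current values are output as the final estimates.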
12. A system comprising:
a database containing data values and zero or more parameters tables;
a variable generation (VG) function component that receives the results of SQL queries over said parameters as input and that output pseudorandom samples of said uncertain data values;
a random database comprising said pseudorandom samples;
specifying a random database based on said VG function;
a processor generating multiple Monte Carlo instantiations of said random database;
a query execution component receiving a query and executing a query over said multiple Monte Carlo instantiations to output a Monte Carlo result and associated query-results by identifying a plurality of database tuple bundles comprising correlated tuples from each of the Monte Carlo instantiations and executing the query on each of the plurality of database tuple bundles, wherein the plurality of database tuple bundles are in a compressed form in which only pseudorandom numbers used to generate the uncertain data values are represented and expand the plurality of database tuple bundles t represented in the compressed form to an expanded form, wherein the plurality of database tuple bundles t is represented in the expanded form when all instantiated attribute values are explicitly represented; and
a statistical property estimator receiving said Monte Carlo result, estimating statistical properties of the probability distribution of said query result, maintaining a running statistical property of query results as each query result is determined, wherein each query result corresponds to one or more of the N Monte Carlo instantiations; and
outputting the final value of the running statistical property, after N query results are determined, wherein the running statistical property to be the estimated statistical property of the probability distribution of the result of the query Q.
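The VG function component of claim 12, which takes the results of SQL queries over parameter tables as input, might look roughly like the following Python sketch. It uses the standard-library `sqlite3` module as a stand-in database; the `demand_params` table, its columns, and the `normal_vg` function are hypothetical names chosen for this example.

```python
import random
import sqlite3

# Hypothetical parameter table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demand_params (item TEXT, mean REAL, std REAL)")
conn.executemany(
    "INSERT INTO demand_params VALUES (?, ?, ?)",
    [("widget", 100.0, 10.0), ("gadget", 50.0, 5.0)],
)

def normal_vg(param_rows, rng, n):
    """Hypothetical VG function: SQL result rows in, n samples per row out."""
    return {
        item: [rng.gauss(mean, std) for _ in range(n)]
        for item, mean, std in param_rows
    }

# The SQL query over the parameter table supplies the VG function's input.
rows = conn.execute(
    "SELECT item, mean, std FROM demand_params ORDER BY item"
).fetchall()
samples = normal_vg(rows, random.Random(42), n=3)
print(samples)
```

Keeping the distribution parameters in ordinary tables means the uncertainty model can be changed with an SQL update rather than a code change.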
13. A computer program product for managing uncertain data, said computer program product comprising a non-transitory computer readable medium having computer usable program code embodied therewith, said computer usable program code configured to:
specify data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values;
specify a random database based on said VG function;
generate a number N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
represent the plurality of database tuple bundles t in a compressed form in which only pseudorandom numbers used to generate the uncertain data values are represented;
expand the plurality of database tuple bundles t represented in the compressed form to an expanded form, wherein the plurality of database tuple bundles t is represented in the expanded form when all instantiated attribute values are explicitly represented;
repeatedly execute a query over said multiple Monte Carlo instantiations to output a Monte Carlo result and associated query-results by identifying a plurality of database tuple bundles comprising correlated tuples from each of the Monte Carlo instantiations and executing the query on each of the plurality of database tuple bundles; and
estimate statistical properties of the probability distribution of said query-result;
maintain a running statistical property of query-results as each query-result is determined, wherein each query-result corresponds to one or more of the N Monte Carlo instantiations; and
after N query-results are determined, output the final value of the running statistical property to be the estimated statistical property of the probability distribution of the result of the query Q.
Specification