Managing uncertain data using Monte Carlo techniques
First Claim
1. A method comprising:
specifying data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values;
specifying a random database based on said VG function;
generating a number N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
identifying a database tuple bundle t, wherein the database tuple bundle t is a data structure representing N instantiations of a tuple in the N Monte Carlo instantiations;
using a processor, executing a query Q over the N Monte Carlo instantiations, wherein said query Q is an aggregation query, wherein said executing comprises:
executing a query plan for the query Q once over the set of all database tuple bundles; and
outputting zero or more numerical values that are used to estimate statistical properties of the probability distribution of the result of the query Q, wherein said outputting comprises outputting a set of pairs (v,f), where each v is a distinct tuple and f is a fraction of said Monte Carlo instantiations in which said tuple v appears at least once in a query result;
computing a table having entries of the form (v,n), wherein v is a query-result and n is a number of N Monte Carlo instantiations in which the query result is equal to v; and
executing a second query over the table to obtain an estimate of a statistical property of the probability distribution of the result of the query Q.
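The steps recited in this claim can be illustrated with a small sketch. It is not the patented implementation: the Gaussian VG function, the parameter rows, the threshold query, and every function name below are assumptions chosen for illustration. Each tuple bundle holds the N sampled values of one tuple, the aggregation query (a count of tuples above a threshold) is evaluated in a single pass over the bundles, and the N results are condensed into a table of (v, n) entries that a second computation then queries. Because this query returns exactly one result per instantiation, the fraction f in each (v, f) pair is simply n / N.

```python
import random
from collections import Counter


def vg_normal(rng, mean, std):
    """Hypothetical VG function: one pseudorandom sample of an uncertain value."""
    return rng.gauss(mean, std)


def build_bundles(param_rows, n, seed=0):
    """Build one tuple bundle per row: a key plus the N sampled values of that row."""
    rng = random.Random(seed)
    return [(key, [vg_normal(rng, mean, std) for _ in range(n)])
            for key, mean, std in param_rows]


def count_above(bundles, n, threshold):
    """Aggregation query Q (a COUNT with a filter), run once over all bundles.

    Position i of the returned list is the result of Q in instantiation i.
    """
    results = [0] * n
    for _key, values in bundles:
        for i, value in enumerate(values):
            if value > threshold:
                results[i] += 1
    return results


def result_table(results):
    """Table of (v, n): query result v and the number of instantiations yielding it."""
    return sorted(Counter(results).items())


def expected_value(table):
    """Second query over the (v, n) table: estimate of the expected result of Q."""
    total = sum(n for _v, n in table)
    return sum(v * n for v, n in table) / total


N = 1000
PARAMS = [("a", 90.0, 10.0), ("b", 100.0, 10.0), ("c", 110.0, 10.0)]  # assumed data
bundles = build_bundles(PARAMS, N)
results = count_above(bundles, N, threshold=100.0)
table = result_table(results)
pairs = [(v, n / N) for v, n in table]  # (v, f) pairs for this single-result query
mean_estimate = expected_value(table)
```

Other statistical properties, such as the probability that the count is at least 2, could be read off the same table by summing n over the qualifying rows and dividing by N.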
Abstract
According to one embodiment of the present invention, a method for managing uncertain data is provided. The method includes specifying data uncertainty using at least one variable generation (VG) function, wherein the VG function generates pseudorandom samples of uncertain data values. A random database based on the VG function is specified, and multiple Monte Carlo instantiations of the random database are generated. Using a Monte Carlo method, a query is repeatedly executed over the multiple Monte Carlo instantiations to output a Monte Carlo method result and associated query-results. The Monte Carlo method result may then be used to estimate statistical properties of a probability distribution of the query-result.
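As a rough illustration of the flow described in the abstract, the following sketch generates instantiations one at a time, runs the same query over each, and summarizes the results. The uniform VG function, the two uncertain rows, and the SUM-style query are assumptions made for the example and do not come from the patent.

```python
import random
import statistics


def vg_uniform(rng, low, high):
    """Hypothetical VG function: one pseudorandom sample of an uncertain value."""
    return rng.uniform(low, high)


def instantiate(rng, uncertain_rows):
    """One Monte Carlo instantiation: every uncertain value replaced by a sample."""
    return [(name, vg_uniform(rng, low, high)) for name, low, high in uncertain_rows]


def total_query(instance):
    """Example query: the sum of the sampled attribute over all rows."""
    return sum(value for _name, value in instance)


def monte_carlo(uncertain_rows, query, n, seed=0):
    """Run the query once per instantiation and return the n query results."""
    rng = random.Random(seed)
    return [query(instantiate(rng, uncertain_rows)) for _ in range(n)]


ROWS = [("x", 0.0, 10.0), ("y", 5.0, 15.0)]  # assumed uncertain rows
results = monte_carlo(ROWS, total_query, n=2000)

# Estimated statistical properties of the query-result distribution.
mean = statistics.fmean(results)
stdev = statistics.stdev(results)
cut_points = statistics.quantiles(results, n=20)  # 19 cut points, 5% apart
q05, q95 = cut_points[0], cut_points[-1]
```

Here `mean`, `stdev`, and the interval from `q05` to `q95` play the role of the estimated statistical properties; which properties to report is a design choice left to the caller.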
25 Claims
1. A method comprising:
specifying data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values;
specifying a random database based on said VG function;
generating a number N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
identifying a database tuple bundle t, wherein the database tuple bundle t is a data structure representing N instantiations of a tuple in the N Monte Carlo instantiations;
using a processor, executing a query Q over the N Monte Carlo instantiations, wherein said query Q is an aggregation query, wherein said executing comprises:
executing a query plan for the query Q once over the set of all database tuple bundles; and
outputting zero or more numerical values that are used to estimate statistical properties of the probability distribution of the result of the query Q, wherein said outputting comprises outputting a set of pairs (v,f), where each v is a distinct tuple and f is a fraction of said Monte Carlo instantiations in which said tuple v appears at least once in a query result;
computing a table having entries of the form (v,n), wherein v is a query-result and n is a number of N Monte Carlo instantiations in which the query result is equal to v; and
executing a second query over the table to obtain an estimate of a statistical property of the probability distribution of the result of the query Q.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer-implemented method comprising:
specifying data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values, wherein said VG function is user-defined, wherein the VG function is parameterized and is based on results of SQL queries executed over parameter tables stored in a relational database;
specifying a random database based on said VG function, wherein said specifying the random database comprises using an extension of a SQL Create Table syntax that links uncertain attributes to their corresponding VG functions and specifies how the VG functions are to be parameterized prior to invocation;
using a processor, generating multiple Monte Carlo instantiations of said random database;
using a Monte Carlo method, repeatedly executing a query over said multiple Monte Carlo instantiations to output a Monte Carlo method result and associated query-results; and
using said Monte Carlo method result, estimating statistical properties of a probability distribution of said query-result.
Dependent claims: 9, 10, 11
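The idea of a VG function that is parameterized by the results of SQL queries over parameter tables can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the tables `patients` and `sbp_param`, their columns, the Gaussian VG function, and the helper `realize_random_table` are all hypothetical, and the Python loop stands in for the claimed extension of the SQL Create Table syntax rather than reproducing it.

```python
import random
import sqlite3


def normal_vg(rng, mean, std, n):
    """Hypothetical user-defined VG function: n pseudorandom samples of one value."""
    return [rng.gauss(mean, std) for _ in range(n)]


# Assumed schema: a deterministic table plus a parameter table for the VG function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (pid INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE sbp_param (gender TEXT PRIMARY KEY, mean REAL, std REAL);
    INSERT INTO patients VALUES (1, 'F'), (2, 'M'), (3, 'F');
    INSERT INTO sbp_param VALUES ('F', 118.0, 9.0), ('M', 124.0, 11.0);
""")


def realize_random_table(conn, n, seed=0):
    """Link the uncertain attribute 'sbp' to normal_vg, parameterized per row by SQL."""
    rng = random.Random(seed)
    patients = conn.execute(
        "SELECT pid, gender FROM patients ORDER BY pid").fetchall()
    rows = []
    for pid, gender in patients:
        # The parameter query runs before the VG function is invoked.
        mean, std = conn.execute(
            "SELECT mean, std FROM sbp_param WHERE gender = ?", (gender,)
        ).fetchone()
        rows.append({"pid": pid, "gender": gender,
                     "sbp": normal_vg(rng, mean, std, n)})
    return rows


sbp_data = realize_random_table(conn, n=500)
```

Each row of `sbp_data` carries its deterministic attributes once and 500 sampled values for the uncertain attribute, so position i across all rows forms the i-th Monte Carlo instantiation of the random table.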
12. A system comprising:
a computer-readable memory;
a database containing uncertain data values and zero or more parameter tables;
a variable generation (VG) function component that receives the results of SQL queries over said parameter tables as input and that outputs pseudorandom samples of said uncertain data values, wherein said VG function is user-defined, wherein the VG function is parameterized and is based on results of SQL queries executed over parameter tables stored in a relational database;
a random database comprising said pseudorandom samples, wherein specification of said random database comprises using an extension of a SQL Create Table syntax that links uncertain attributes to their corresponding VG functions and specifies how the VG functions are to be parameterized prior to invocation;
a processor generating multiple Monte Carlo instantiations of said random database;
a query execution component receiving a query and executing a query over said multiple Monte Carlo instantiations to output a Monte Carlo result and associated query-results; and
a statistical property estimator receiving said Monte Carlo result and estimating statistical properties of a probability distribution of said query result.
Dependent claims: 14, 15
13. A computer program product for managing uncertain data, said computer program product comprising:
a non-transitory computer readable medium having computer usable program code embodied therewith, said computer usable program code comprising:
computer usable program code configured to:
specify data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values, wherein said VG function is user-defined, wherein the VG function is parameterized and is based on results of SQL queries executed over parameter tables stored in a relational database;
specify a random database based on said VG function, wherein said specifying the random database comprises using an extension of a SQL Create Table syntax that links uncertain attributes to their corresponding VG functions and specifies how the VG functions are to be parameterized prior to invocation;
generate multiple Monte Carlo instantiations of said random database;
using a Monte Carlo method, repeatedly execute a query over said multiple Monte Carlo instantiations to output a Monte Carlo method result and associated query-results; and
using said Monte Carlo method result, estimate statistical properties of a probability distribution of said query-result.
Dependent claims: 16, 17, 18
19. A method comprising:
specifying data uncertainty using at least one variable generation (VG) function, wherein said VG function generates pseudorandom samples of uncertain data values;
specifying a random database based on said VG function;
generating a number N Monte Carlo instantiations of said random database, wherein N is a number greater than 1;
identifying a database tuple bundle t, wherein the database tuple bundle t is a data structure representing N instantiations of a tuple in the N Monte Carlo instantiations;
using a processor, executing a query Q over the N Monte Carlo instantiations, wherein said executing comprises:
executing a query plan for the query Q once over the set of all database tuple bundles; and
outputting zero or more numerical values that are used to estimate statistical properties of the probability distribution of the result of the query Q;
outputting N query-results, wherein each query-result corresponds to one of the N Monte Carlo instantiations; and
using the N query-results, computing an estimate of a statistical property of the probability distribution of the result of the query Q, wherein computing an estimate of a statistical property of the probability distribution of the result of the query Q comprises computing an average of the N query-results and equating the average to be an estimate of the expected value of the probability distribution of the result of query Q.
Dependent claims: 20, 21, 22, 23, 24, 25
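The estimator recited in claim 19 (averaging the N query-results to estimate the expected value) can be sketched as follows. The `TupleBundle` class, the `region` and `revenue` attributes, the Gaussian sampling, and the filtered SUM query are assumptions made for illustration, not details taken from the patent.

```python
import random
from dataclasses import dataclass
from statistics import fmean


@dataclass
class TupleBundle:
    """One tuple across all N instantiations (hypothetical layout)."""
    region: str    # deterministic attribute, stored once
    revenue: list  # uncertain attribute: one sampled value per instantiation


def make_bundles(specs, n, seed=0):
    """Sample N values of the uncertain attribute for every tuple."""
    rng = random.Random(seed)
    return [TupleBundle(region, [rng.gauss(mean, std) for _ in range(n)])
            for region, mean, std in specs]


def sum_revenue(bundles, region, n):
    """Query Q: total revenue for one region, run once over all bundles.

    Returns N query-results, one per Monte Carlo instantiation.
    """
    totals = [0.0] * n
    for bundle in bundles:
        if bundle.region != region:  # filter on the deterministic attribute once
            continue
        for i, value in enumerate(bundle.revenue):
            totals[i] += value
    return totals


N = 1000
SPECS = [("east", 100.0, 5.0), ("east", 50.0, 5.0), ("west", 70.0, 5.0)]  # assumed
bundles = make_bundles(SPECS, N)
query_results = sum_revenue(bundles, "east", N)

# The average of the N query-results serves as the expected-value estimate.
expected_estimate = fmean(query_results)
```

Because the filter touches only the deterministic attribute, it is evaluated once per bundle rather than once per instantiation, which is one reason a bundle-at-a-time plan can be cheaper than running the query N separate times.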
Specification