Information reservoir
First Claim
1. A computer-implemented information reservoir creation process wherein:
- a table collection is constructed from a data source;
said table collection includes a subset of tables designated as sampling initiation tables;
each table in said table collection is a member of either a directly-sampled table set or a descendent-sampled table set;
said directly-sampled table set is characterized by tables that are either sampling initiation tables or ancestor tables to one or more sampling initiation tables;
said descendant-sampled table set is characterized by tables that are descendant tables to a sampling initiation table;
said table collection is characterized by a table collection schema equivalent to a data source schema of said data source, with the exception that a list of attributes for each table of said directly-sampled table set includes an additional attribute containing actual rate of inclusion values;
each tuple included in said table collection is equivalent to one and only one tuple in the corresponding table of said data source;
an actual rate of inclusion value stored with a select data source tuple and included in a directly-sampled table of said table collection represents the probability that a randomly selected table collection produced by the process will contain said select data source tuple.
2 Assignments
0 Petitions
Accused Products
Abstract
Approximate answers to queries are provided by executing queries against a representation of a data source in addition to, or in lieu of accessing the source data itself. A representation of a data source, referred to herein as an Information Reservoir, is constructed and maintained using probabilistic methodologies based upon a Poisson sampling approach. The Information Reservoir provides approximate answers to ad hoc queries, potentially in a small fraction of the time required to calculate an exact answer. Associated variances are also provided that may additionally be used to calculate confidence intervals bounding the exact answer. An Information Reservoir may be biased toward a subset of the information in the original data source and/or tailored to the anticipated query workload. Queries expressed as if directed to the original data source may be automatically translated to run against the Information Reservoir with little or no additional burden placed on the Information Reservoir user. Information Reservoir collections may be created that offer users approximate answers of varying levels of precision. Information Reservoirs may also be combined with non-sampling concise representations to increase the precision of approximate answers for certain classes of queries. For example, approximations to specific multidimensional histograms may be combined with an Information Reservoir to accommodate highly selective queries that sampling does not effectively address.
652 Citations
111 Claims
-
1. A computer-implemented information reservoir creation process wherein:
-
a table collection is constructed from a data source;
said table collection includes a subset of tables designated as sampling initiation tables;
each table in said table collection is a member of either a directly-sampled table set or a descendent-sampled table set;
said directly-sampled table set is characterized by tables that are either sampling initiation tables or ancestor tables to one or more sampling initiation tables;
said descendant-sampled table set is characterized by tables that are descendant tables to a sampling initiation table;
said table collection is characterized by a table collection schema equivalent to a data source schema of said data source, with the exception that a list of attributes for each table of said directly-sampled table set includes an additional attribute containing actual rate of inclusion values;
each tuple included in said table collection is equivalent to one and only one tuple in the corresponding table of said data source;
an actual rate of inclusion value stored with a select data source tuple and included in a directly-sampled table of said table collection represents the probability that a randomly selected table collection produced by the process will contain said select data source tuple. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A computer-implemented method for constructing a representation from a data source in order to provide relatively quick response to queries related to information in said data source, wherein said data source has a plurality of tuples stored in said data source and a data source schema that includes defined relationships among at least a subset of the tuples in the data source, said method comprising:
-
creating said representation by copying at least a subset of said data source schema to define a representation schema;
adding additional data to said representation that represents information that is not in said data source;
defining tuples of interest within said data source and a degree of interest for each tuple of interest;
sampling tuples from said tuples of interest into said representation based upon said degree of interest in a manner that preserves at least a subset of said relationships among tuples in the data source; and
storing values in the representation that relate to a likelihood that each tuple sampled into said representation would be sampled into the representation if the sampling process were to be repeated. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77)
-
-
78. A system for constructing a representation from a data source in order to provide response to queries related to information in said data source, wherein said data source has a plurality of tuples stored in said data source and a data source schema that includes defined relationships among at least a subset of the tuples in the data source, said system comprising:
-
at least one processor;
at least one storage device communicably coupled to said at least one processor arranged to store said data source and said representation; and
software executable by said at least one processor for;
creating said representation by copying at least a subset of said data source schema to define a representation schema;
adding additional data to said representation that represents information that is not in said data source;
defining tuples of interest within said data source and a degree of interest for each tuple of interest;
sampling tuples from said tuples of interest into said representation based upon said degree of interest in a manner that preserves at least a subset of said relationships among tuples in the data source; and
storing values in the representation that relate to the likelihood that each tuple sampled into said representation would be sampled into the representation if the sampling process were to be repeated. - View Dependent Claims (79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95)
-
-
96. A computer readable medium including program code representing computer implemented operations for constructing a representation from a data source in order to provide relatively quick response to queries related to information in said data source, wherein said data source has a plurality of tuples stored in said data source and a data source schema that includes defined relationships among at least a subset of the tuples in the data source, said operations comprising:
-
creating said representation by copying at least a subset of said data source schema to define a representation schema;
adding additional data to said representation that represents information that is not in said data source;
defining tuples of interest within said data source and a degree of interest for each tuple of interest;
sampling tuples from said tuples of interest into said representation based upon said degree of interest in a manner that preserves at least a subset of said relationships among tuples in the data source; and
storing values in the representation that relate to the likelihood that each tuple sampled into said representation would be sampled into the representation if the sampling process were to be repeated.
-
-
97. A method for translating simple SQL queries directed at sampling initiation and ancestor-sampled tables of a data source into revised SQL queries directed at an Information Reservoir or Information Reservoir collection created from said data source in order to calculate both approximate query answers and variances for the approximate answers, said method comprising:
-
comparing said simple SQL query to a list containing both SQL query types that can be translated and the associated translation rule or rules to be applied for each SQL query type that can be translated; and
applying said translation rule or rules associated with said simple SQL query to translate said simple SQL query into a revised query directed at said Information Reservoir or Information Reservoir collection created from said data source. - View Dependent Claims (98, 99, 100)
-
-
101. A computer-implemented method for translating queries directed at a data source into revised queries directed at an Information Reservoir or Information Reservoir collection created from said data source in order to calculate both approximate query answers and variances for the approximate answers, said method comprising:
-
translating queries directed at said data source into a sequence of atomic operations that act on said data source; and
translating atomic operations that act on said data source to atomic operations that act on said Information Reservoir or Information Reservoir collection in order to calculate both approximate query answers and variances for the approximate answers. - View Dependent Claims (102, 103, 104, 105, 106, 107, 108)
-
-
109. A query system for use with an Information Reservoir or Information Reservoir collection created from a data source comprising at least one processor programmed to:
-
translate queries directed against said data source into a sequence of atomic operations that act on said data source; and
translate atomic operations that act on said data source to atomic operations that act on said Information Reservoir or Information Reservoir collection in order to calculate both approximate query answers and variances for the approximate answers. - View Dependent Claims (110)
-
-
111. A computer readable medium including program code representing computer implemented operations for constructing a representation from a data source in order to provide relatively quick response to queries related to information in said data source, wherein said data source has a plurality of tuples stored in said data source and a data source schema that includes defined relationships among at least a subset of the tuples in the data source, said operations comprising:
-
creating said representation by copying at least a subset of said data source schema to define a representation schema;
adding additional data to said representation that represents information that is not in said data source;
defining tuples of interest within said data source and a degree of interest for each tuple of interest;
sampling tuples from said tuples of interest into said representation based upon said degree of interest in a manner that preserves at least a subset of said relationships among tuples in the data source; and
storing values in the representation that relate to the likelihood that each tuple sampled into said representation would be sampled into the representation if the sampling process were to be repeated.
-
Specification