Sampling over joins for database systems
First Claim
1. A method for obtaining a sample of a join of first and second relations of records in a database system, the method comprising the steps of:
- (a) sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
(b) joining one or more records of the first sample with one or more records of the second relation.
4 Assignments
0 Petitions
Accused Products
Abstract
A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
84 Citations
43 Claims
-
1. A method for obtaining a sample of a join of first and second relations of records in a database system, the method comprising the steps of:
-
(a) sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
(b) joining one or more records of the first sample with one or more records of the second relation. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
(i) selectively outputting a record of the first relation one or more times based on a probability; and
(ii) repeating step (a)(i) for each record of the first relation.
-
-
9. The method of claim 1, wherein the sampling step (a) comprises the steps of:
-
(i) initializing a reservoir of records, (ii) selectively resetting one or more records of the reservoir to be a record of the first relation based on a probability, and (iii) repeating step (a)(ii) for each record of the first relation.
-
-
10. The method of claim 1, wherein the joining step (b) comprises the steps of:
-
(i) sampling a record from the second relation having a matching join attribute value with an identified record of ,the first sample, and (ii) joining the identified record of the first sample with the sampled record of the second relation.
-
-
11. The method of claim 10, wherein the joining step (b) further comprises the step of:
(iii) repeating steps(b)(i) and (b)(ii) for each record of the first sample.
-
12. The method of claim 1, wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and
wherein the method further comprises the step of;
(c) sampling one record from each group.
- and
-
13. The method of claim 1, wherein the joining step (b) comprises the steps of:
-
(i) sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) joining the records of the first sample with the records of the second sample.
-
-
14. The method of claim 13, wherein the joining step (b)(ii) comprises the steps of:
-
(A) sampling a record without replacement from the first sample having a matching join attribute value with an identified record of the second sample, (B) joining the sampled record of the first sample with the identified record of the second sample, and (C) repeating steps (b)(ii)(A) and (b)(ii)(B) for each record of the second sample.
-
-
15. The method of claim 1, wherein the sampling step (a) comprises the step of sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
-
wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation; and
wherein the method further comprises the step of;
(c) joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
-
-
16. A computer readable medium having computer-executable instructions for obtaining a sample of a join of first and second relations of records, the computer-executable instructions for performing the steps of:
-
(a) sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
(b) joining one or more records of the first sample with one or more records of the second relation. - View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
(i) selectively outputting a record of the first relation one or more times based on a probability; and
(ii) repeating step (a)(i) for each record of the first relation.
-
-
24. The computer readable medium of claim 16, wherein the sampling step (a) comprises the steps of:
-
(i) initializing a reservoir of records, (ii) selectively resetting one or more records of the reservoir to be a record of the first relation based on a probability, and (iii) repeating step (a)(ii) for each record of the first relation.
-
-
25. The computer readable medium of claim 16, wherein the joining step (b) comprises the steps of:
-
(i) sampling a record from the second relation having a matching join attribute value with an identified record of the first sample, and (ii) joining the identified record of the first sample with the sampled record of the second relation.
-
-
26. The computer readable medium of claim 25, wherein the joining step (b) further comprises the step of:
(iii) repeating steps (b)(i) and (b)(ii) for each record of the first sample.
-
27. The computer readable medium of claim 16, wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and
wherein the computer readable medium comprises further computer-executable instructions for performing the step of;
(c) sampling one record from each group.
- and
-
28. The computer readable medium of claim 16, wherein the joining step (b) comprises the steps of:
-
(i) sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) joining the records of the first sample with the records of the second sample.
-
-
29. The computer readable medium of claim 28, wherein the joining step (b)(ii) comprises the steps of:
-
(A) sampling a record without replacement from the first sample having a matching join attribute value with an identified record of the second sample, (B) joining the sampled record of the first sample with the identified record of the second sample, and (C) repeating steps (b)(ii)(A) and (b)(ii)(B) for each record of the second sample.
-
-
30. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
-
wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation; and
wherein the computer readable medium comprises further computer-executable instructions for performing the step of;
(c) joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
-
-
31. A database system for obtaining a sample of a join of first and second relations of records in the database system, the database system comprising:
-
(a) sampling means for sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
(b) join means for joining one or more records of the first sample with one or more records of the second relation. - View Dependent Claims (32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43)
(i) means for initializing a reservoir of records, and (ii) means for selectively resetting one or more records of the reservoir to be a record of the first relation based on a probability.
-
-
40. The database system of claim 31, wherein the join means comprises:
-
(i) means for sampling a record from the second relation having a matching join attribute value with an identified record of the first sample, and (ii) means for joining the identified record of the first sample with the sampled record of the second relation.
-
-
41. The database system of claim 31, wherein the join means comprises means for joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and
wherein the database system comprises;
(c) means for sampling one record from each group.
- and
-
42. The database system of claim 31, wherein the join means comprises:
-
(i) means for sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) means for joining the records of the first sample with the records of the second sample.
-
-
43. The database system of claim 31, wherein the sampling means comprises means for sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
- and
wherein the join means comprises means for joining the records of the first sample with the records of the second relation and means for joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
- and
Specification