Sampling over joins for database systems

US 6,542,886 B1
Filed: 03/15/1999
Issued: 04/01/2003
Est. Priority Date: 03/15/1999
Status: Expired due to Term

Abstract

A database server supports weighted and unweighted sampling of records or tuples in accordance with desired sampling semantics such as with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics, for example. The database server may perform such sampling sequentially not only to sample non-materialized records such as those produced as a stream by a pipeline in a query tree for example, but also to sample records, whether materialized or not, in a single pass. The database server also supports sampling over a join of two relations of records or tuples without requiring the computation of the full join and without requiring the materialization of both relations and/or indexes on the join attribute values of both relations.
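The three sampling semantics named in the abstract can be illustrated with a minimal sketch for the unweighted case. The function names and the in-memory list of records are assumptions made for illustration only; they do not come from the patent.

```python
import random

def sample_wr(records, r, rng):
    """With replacement (WR): r independent uniform draws; duplicates are possible."""
    return [rng.choice(records) for _ in range(r)]

def sample_wor(records, r, rng):
    """Without replacement (WoR): r distinct positions chosen uniformly."""
    return rng.sample(records, r)

def sample_cf(records, f, rng):
    """Coin flips (CF): keep each record independently with probability f."""
    return [rec for rec in records if rng.random() < f]

rng = random.Random(0)          # fixed seed so the run is repeatable
records = list(range(100))      # stand-in for a relation of 100 records
wr = sample_wr(records, 10, rng)
wor = sample_wor(records, 10, rng)
cf = sample_cf(records, 0.1, rng)
```

WR and WoR return exactly the requested number of records, while the CF sample size is itself random, with an expected value of `f` times the relation size.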

84 Citations

43 Claims

1. A method for obtaining a sample of a join of first and second relations of records in a database system, the method comprising the steps of:
- (a) sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
  
  (b) joining one or more records of the first sample with one or more records of the second relation.
- - 2. The method of claim 1, wherein the sampling step (a) comprises the step of sampling each record of the first relation based on a weight specified for the record of the first relation based on the number of records having a matching join attribute value in the second relation.
  - 3. The method of claim 1, wherein the sampling step (a) comprises the step of using frequency statistics on join attribute values of the second relation to sample records of the first relation.
  - 4. The method of claim 1, wherein the sampling step (a) comprises the step of sampling records of the first relation using a with replacement, without replacement, or coin flip sampling technique.
  - 5. The method of claim 1, wherein the sampling step (a) comprises the step of sampling records of the first relation in one pass using a sequential sampling technique.
  - 6. The method of claim 5, wherein the first relation is produced as a stream of records as a result of a query.
  - 7. The method of claim 5, wherein the first relation is a base relation materialized in a database of the database system.
  - 8. The method of claim 1, wherein the sampling step (a) comprises the steps of:
9. The method of claim 1, wherein the sampling step (a) comprises the steps of:
- (i) initializing a reservoir of records, (ii) selectively resetting one or more records of the reservoir to be a record of the first relation based on a probability, and (iii) repeating step (a)(ii) for each record of the first relation.
10. The method of claim 1, wherein the joining step (b) comprises the steps of:
- (i) sampling a record from the second relation having a matching join attribute value with an identified record of the first sample, and (ii) joining the identified record of the first sample with the sampled record of the second relation.
11. The method of claim 10, wherein the joining step (b) further comprises the step of:
- (iii) repeating steps (b)(i) and (b)(ii) for each record of the first sample.
12. The method of claim 1, wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and wherein the method further comprises the step of:
  
  (c) sampling one record from each group.
13. The method of claim 1, wherein the joining step (b) comprises the steps of:
- (i) sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) joining the records of the first sample with the records of the second sample.
14. The method of claim 13, wherein the joining step (b)(ii) comprises the steps of:
- (A) sampling a record without replacement from the first sample having a matching join attribute value with an identified record of the second sample, (B) joining the sampled record of the first sample with the identified record of the second sample, and (C) repeating steps (b)(ii)(A) and (b)(ii)(B) for each record of the second sample.
15. The method of claim 1, wherein the sampling step (a) comprises the step of sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
- wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation; and
  
  wherein the method further comprises the step of:
  
  (c) joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
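One way to read claims 1 to 3 together with claims 10 and 11 is sketched below: each record of the first relation is weighted by the number of matching records in the second relation, and every sampled record is then joined with one matching record drawn at random. This is a sketch under stated assumptions, not the patented implementation. The name `sample_join_wr`, the tuple layout, and the in-memory dictionary standing in for frequency statistics or an index on the second relation are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def sample_join_wr(r1, r2, key1, key2, r, rng):
    """Draw r join results with replacement without computing the full join."""
    # Frequency statistics on the join attribute of the second relation.
    freq = Counter(key2(t) for t in r2)
    # Weight of each first-relation record = number of matching records in r2.
    weights = [freq[key1(t)] for t in r1]
    if sum(weights) == 0:
        return []  # the join is empty, so there is nothing to sample
    # Step (a): weighted sample of the first relation.
    first_sample = rng.choices(r1, weights=weights, k=r)
    # Assumed lookup structure on r2 (an index would play this role).
    by_key = defaultdict(list)
    for t in r2:
        by_key[key2(t)].append(t)
    # Step (b): join each sampled record with one random matching r2 record.
    return [(t1, rng.choice(by_key[key1(t1)])) for t1 in first_sample]

rng = random.Random(1)
r1 = [("a", 1), ("b", 2), ("c", 3)]        # join attribute is the second field
r2 = [(1, "x"), (1, "y"), (3, "z")]        # join attribute is the first field
sample = sample_join_wr(r1, r2, lambda t: t[1], lambda t: t[0], 50, rng)
```

Because record `("b", 2)` has no match in `r2`, its weight is zero and it never appears in the sample. Weighting by match count is what makes each row of the join equally likely to be drawn.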

16. A computer readable medium having computer-executable instructions for obtaining a sample of a join of first and second relations of records, the computer-executable instructions for performing the steps of:
- (a) sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
  
  (b) joining one or more records of the first sample with one or more records of the second relation.
- - 17. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of sampling each record of the first relation based on a weight specified for the record of the first relation based on the number of records having a matching join attribute value in the second relation.
  - 18. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of using frequency statistics on join attribute values of the second relation to sample records of the first relation.
  - 19. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of sampling records of the first relation using a with replacement, without replacement, or coin flip sampling technique.
  - 20. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of sampling records of the first relation in one pass using a sequential sampling technique.
  - 21. The computer readable medium of claim 20, wherein the first relation is produced as a stream of records as a result of a query.
  - 22. The computer readable medium of claim 20, wherein the first relation is a base relation materialized in a database.
  - 23. The computer readable medium of claim 16, wherein the sampling step (a) comprises the steps of:
24. The computer readable medium of claim 16, wherein the sampling step (a) comprises the steps of:
- (i) initializing a reservoir of records, (ii) selectively resetting one or more records of the reservoir to be a record of the first relation based on a probability, and (iii) repeating step (a)(ii) for each record of the first relation.
25. The computer readable medium of claim 16, wherein the joining step (b) comprises the steps of:
- (i) sampling a record from the second relation having a matching join attribute value with an identified record of the first sample, and (ii) joining the identified record of the first sample with the sampled record of the second relation.
26. The computer readable medium of claim 25, wherein the joining step (b) further comprises the step of:
- (iii) repeating steps (b)(i) and (b)(ii) for each record of the first sample.
27. The computer readable medium of claim 16, wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and wherein the computer readable medium comprises further computer-executable instructions for performing the step of:
  
  (c) sampling one record from each group.
28. The computer readable medium of claim 16, wherein the joining step (b) comprises the steps of:
- (i) sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) joining the records of the first sample with the records of the second sample.
29. The computer readable medium of claim 28, wherein the joining step (b)(ii) comprises the steps of:
- (A) sampling a record without replacement from the first sample having a matching join attribute value with an identified record of the second sample, (B) joining the sampled record of the first sample with the identified record of the second sample, and (C) repeating steps (b)(ii)(A) and (b)(ii)(B) for each record of the second sample.
30. The computer readable medium of claim 16, wherein the sampling step (a) comprises the step of sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
- wherein the joining step (b) comprises the step of joining the records of the first sample with the records of the second relation; and
  
  wherein the computer readable medium comprises further computer-executable instructions for performing the step of:
  
  (c) joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
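The reservoir steps recited in claims 9 and 24 (initialize a reservoir, selectively reset its entries based on a probability, repeat for each record) admit a one-pass weighted reading, sketched below. The reset probability `w / total` is a standard choice for weighted sampling with replacement and is an assumption here; the claims say only "based on a probability". The function name and the record layout are hypothetical.

```python
import random

def weighted_reservoir_wr(stream, weight, r, rng):
    """One-pass weighted sample of size r, with replacement, over a stream."""
    reservoir = [None] * r          # (i) initialize the reservoir
    total = 0.0                     # running sum of the weights seen so far
    for rec in stream:              # (iii) repeat for each record
        w = weight(rec)
        if w <= 0:
            continue                # zero-weight records are never selected
        total += w
        p = w / total               # assumed reset probability
        for i in range(r):          # (ii) reset each slot independently
            if rng.random() < p:
                reservoir[i] = rec
    return reservoir

rng = random.Random(2)
stream = iter([("a", 2), ("b", 0), ("c", 1)])   # second field is the weight
res = weighted_reservoir_wr(stream, lambda t: t[1], 5, rng)
```

The stream is consumed exactly once and never materialized, which matches the single-pass setting described in the abstract. The first record with positive weight fills every slot, since its reset probability is 1.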

31. A database system for obtaining a sample of a join of first and second relations of records in the database system, the database system comprising:
- (a) sampling means for sampling records of the first relation based on the number of records having a matching join attribute value in the second relation to obtain a first sample of records; and
  
  (b) join means for joining one or more records of the first sample with one or more records of the second relation.
- - 32. The database system of claim 31, wherein the sampling means comprises means for sampling each record of the first relation based on a weight specified for the record of the first relation based on the number of records having a matching join attribute value in the second relation.
  - 33. The database system of claim 31, wherein the sampling means comprises means for using frequency statistics on join attribute values of the second relation to sample records of the first relation.
  - 34. The database system of claim 31, wherein the sampling means comprises means for sampling records of the first relation using a with replacement, without replacement, or coin flip sampling technique.
  - 35. The database system of claim 31, wherein the sampling means comprises means for sampling records of the first relation in one pass using a sequential sampling technique.
  - 36. The database system of claim 35, wherein the first relation is produced as a stream of records as a result of a query.
  - 37. The database system of claim 35, wherein the first relation is a base relation materialized in a database of the database system.
  - 38. The database system of claim 31, wherein the sampling means comprises means for selectively outputting a record of the first relation one or more times based on a probability.
  - 39. The database system of claim 31, wherein the sampling means comprises:
40. The database system of claim 31, wherein the join means comprises:
- (i) means for sampling a record from the second relation having a matching join attribute value with an identified record of the first sample, and (ii) means for joining the identified record of the first sample with the sampled record of the second relation.
41. The database system of claim 31, wherein the join means comprises means for joining the records of the first sample with the records of the second relation to produce a relation having groups of records with each group corresponding to a respective one of the records of the first sample;
- and wherein the database system comprises:
  
  (c) means for sampling one record from each group.
42. The database system of claim 31, wherein the join means comprises:
- (i) means for sampling records from the second relation to obtain a second sample of records such that the number of records in the first sample with any one join attribute value is the same as that in the second sample, and (ii) means for joining the records of the first sample with the records of the second sample.
43. The database system of claim 31, wherein the sampling means comprises means for sampling records of the first relation having a matching join attribute value with at least a predetermined number of records in the second relation;
- and wherein the join means comprises means for joining the records of the first sample with the records of the second relation and means for joining records of the first relation having a matching join attribute value with less than the predetermined number of records in the second relation with the records of the second relation.
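The frequency-based split in claims 15, 30, and 43 can be sketched as follows: join attribute values with at least a predetermined number of matches are handled by weighted sampling, and the remaining matching values are joined in full. The claims do not state how the two results are combined into a final sample, so this sketch returns them separately. The name `frequency_partition`, the `threshold` parameter, and the tuple layout are assumptions.

```python
import random
from collections import Counter, defaultdict

def frequency_partition(r1, r2, key1, key2, threshold, r, rng):
    """Split r1 by match frequency in r2; sample the high part, join the low part."""
    freq = Counter(key2(t) for t in r2)
    by_key = defaultdict(list)
    for t in r2:
        by_key[key2(t)].append(t)
    cutoff = max(threshold, 1)      # guard: sampled records need a positive weight
    hi = [t for t in r1 if freq[key1(t)] >= cutoff]
    lo = [t for t in r1 if 0 < freq[key1(t)] < cutoff]
    # (a) weighted sample of high-frequency records of the first relation.
    hi_sample = rng.choices(hi, weights=[freq[key1(t)] for t in hi], k=r) if hi else []
    # (b) join the first sample with the second relation.
    hi_join = [(t1, t2) for t1 in hi_sample for t2 in by_key[key1(t1)]]
    # (c) full join for the low-frequency records.
    lo_join = [(t1, t2) for t1 in lo for t2 in by_key[key1(t1)]]
    return hi_join, lo_join

rng = random.Random(3)
r1 = [("a", 1), ("b", 1), ("c", 2), ("d", 3)]
r2 = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]
hi_join, lo_join = frequency_partition(
    r1, r2, lambda t: t[1], lambda t: t[0], threshold=2, r=4, rng=rng)
```

With `threshold=2`, value 1 (three matches) goes to the sampled part, value 2 (one match) goes to the full join, and value 3 (no match) contributes nothing.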

Current Assignee
Zhigu Holdings Limited
Original Assignee
Microsoft Corporation
Inventors
Narasayya, Vivek; Chaudhuri, Surajit; Motwani, Rajeev
Primary Examiner(s)
Alam, Hosain T.
Assistant Examiner(s)
Nguyen, Tam V

Application Number

US09/268,275
Time in Patent Office

1,478 Days
Field of Search

707/6, 707/2, 707/103, 707/5, 707/538, 707/520, 707/1, 707/3, 707/4, 707/7, 707/8, 707/9, 707/10, 707/100, 707/101, 707/102
US Class Current

1/1
CPC Class Codes

G06F 16/2456   Join operations

G06F 16/2462   Approximate or statistical ...

G06F 16/30   of unstructured textual dat...

G06F 2216/03   Data mining

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99937   Sorting
