Systems and methods for estimating functional relationships in a database

US 7,562,067 B2
Filed: 05/06/2005
Issued: 07/14/2009
Est. Priority Date: 05/06/2005
Status: Active Grant

Abstract

A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.

14 Claims

1. A system that facilitates estimating functional relationships associated with one or more columns in a database, the system, comprising at least a processor executing the following components:
- a sampling component that receives a random sample of records within the database;
  
  an estimate generator component that calculates an estimate of strength of the functional relationships associated with the one or more columns based at least in part upon a subset of the received sample and a selected measure;
  
  an estimate selector component that facilitates selection of a measure of strength to be calculated by the estimate generator component;
  
  an overhead calculator component that estimates a measure of overhead associated with a column in the database by utilizing;
  
  $\text{Estimated Overhead}(A) = \frac{N}{\widehat{SJ}_{A}(R)},$
  
  where $\widehat{SJ}_{A}(R) = \frac{1}{p^{2}} \cdot |S_{k,1} \bowtie_{A} S_{k,2}|,$
  
  $p$ is a sampling fraction $(\frac{k}{N}),$
  
  $N$ is a number of rows that have a column $A$ in a relation $R$, and $S_{k,1}$ and $S_{k,2}$ are two independent uniform random samples of size $k$ drawn from the relation $R$;
  
  a row strength computation component that estimates strength $|\hat{X}|$ of a column comprising one or more default values as a key column based at least on a number of clean records within the column in the database by utilizing;
  
  $|\hat{X}| = |\hat{X}_{small}| + |\hat{X}_{large}|$ wherein $X_{small}$ is a set of “dirty” rows in a relation $R$ that have either zero or one conflicting representative tuple pairs in a set of tuples $S$, and $X_{large}$ corresponds to a set of “dirty” rows that have more than one conflicting pair represented in $S$;
  
  the estimate generator component calculates an estimate of strength of a column as a key column as a function of the received samples utilizing the overhead calculator component, or the row strength computation component based at least on the selection of a measure of strength.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The system of claim 1, further comprising a randomization component that is employed to provide the sampling component with the random sample of records.
  - 3. The system of claim 1, wherein the overhead calculator component utilizes a self-join algorithm in connection with estimating the measure of overhead.
  - 4. The system of claim 3, further comprising an error threshold component that determines a number of sample records to provide to the sampling component based at least in part upon a threshold amount of error allowed for the estimate generator component.
  - 5. The system of claim 1, wherein the estimate generator component employs a row strength computation component that estimates a number of clean records within a column in the database.
  - 6. The system of claim 1, further comprising a monitoring component that determines a number of sample records to provide to the sampling component based at least in part upon size of the database.
  - 7. The system of claim 1, further comprising monitoring component that determines a number of sample records to provide to the sampling component based at least in part upon a threshold performance associated with the estimate generator component.
  - 8. The system of claim 1, wherein the estimate generator component utilizes the following algorithm in connection with calculating an estimate of strength of functional relationships:
  - 9. The system of claim 8, wherein the estimate generator component utilizes the following algorithm in connection with calculating an estimate of strength of functional relationships:
    
    $|\hat{X}_{large}| = z_{1} \frac{N}{k},$ where $z_{1}$ is a total number of “dirty” rows over all large groups in the sample.
  - 10. The system of claim 1, further comprising a machine-learning component that generates inferences regarding a type of measure of strength to be estimated by the estimate generator component by analyzing contextual data and historical data.
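
The scale-up in claim 9 can be illustrated with a minimal Python sketch. The claims do not spell out how groups or per-group dirty-row counts are formed, so the following are assumptions made only for illustration: a "group" is the set of sampled rows sharing one column value, a group is "large" when it holds at least three sampled rows (so that more than one conflicting pair is represented), and every row in a large group except one representative is counted as dirty. The function name `estimate_x_large` and its parameters are hypothetical.

```python
from collections import Counter


def estimate_x_large(sample_values, n_rows, min_group_size=3):
    """Scale the dirty-row count of large sample groups by N / k.

    Assumptions (not taken from the claims): groups are sampled rows
    sharing a value, a large group has at least `min_group_size` rows,
    and each large group contributes (group size - 1) dirty rows.
    """
    k = len(sample_values)
    if k == 0:
        raise ValueError("sample must contain at least one row")

    group_sizes = Counter(sample_values)
    # z1: total number of dirty rows over all large groups in the sample.
    z1 = sum(size - 1 for size in group_sizes.values() if size >= min_group_size)
    # Claim 9: |X_large| estimate = z1 * N / k.
    return z1 * n_rows / k


# Usage: a sample of k = 6 values drawn from a relation with N = 60 rows.
sample = ["a", "a", "a", "b", "b", "c"]
x_large_hat = estimate_x_large(sample, n_rows=60)
```

Only the final multiplication by $N/k$ comes from the claim text; the grouping rule is exposed as a parameter so that a different definition of "large" can be substituted.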

11. A method for estimating strength of key dependencies in a database, comprising the following executable by a processor:
- receiving random samples from the database, the samples are associated with a column comprising one or more default values associated therewith;
  
  selecting a measure of strength to be estimated for assessing strength of the column as a key column;
  
  estimating a measure of overhead associated with the column in the database by utilizing;
  
  $\text{Estimated Overhead}(A) = \frac{N}{\widehat{SJ}_{A}(R)},$
  
  where $\widehat{SJ}_{A}(R) = \frac{1}{p^{2}} \cdot |S_{k,1} \bowtie_{A} S_{k,2}|,$
  
  $p$ is a sampling fraction $(\frac{k}{N}),$
  
  $N$ is a number of rows that have a column $A$ in a relation $R$, and $S_{k,1}$ and $S_{k,2}$ are two independent uniform random samples of size $k$ drawn from the relation $R$, if a overhead associated with the column is selected as the measure of strength;
  
  estimating strength $|\hat{X}|$ of the column comprising the one or more default values as a key column based at least on a number of clean records within the column in the database by utilizing;
  
  $|\hat{X}| = |\hat{X}_{small}| + |\hat{X}_{large}|$ wherein $X_{small}$ is a set of “dirty” rows in a relation $R$ that have either zero or one conflicting representative tuple pairs in a set of tuples $S$, and $X_{large}$ corresponds to a set of “dirty” rows that have more than one conflicting pair represented in $S$, if a row strength computation is selected as the measure of strength;
  
  calculating an estimate of strength of a column as a key column as a function of the received samples utilizing the overhead calculator component or the row strength computation component based at least on the selection.
- View Dependent Claims (12, 13, 14)
  - 12. The method of claim 11, further comprising:
    - determining size of the database; and
      
      determining one of a number and size of the samples based at least in part upon the determined size.
  - 13. The method of claim 11, further comprising:
    - computing a sampling fraction; and
      
      estimating strength of the column as a key column based at least in part upon the computed sampling fraction.
  - 14. The method of claim 11, further comprising:
    - defining a threshold amount of error tolerance that can be associated with the estimated strength; and
      
      determining one of a number and size of the samples based at least in part upon the defined threshold.
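
The overhead estimate recited in claims 1 and 11, together with the sampling fraction of claim 13, can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the column is held in memory as a list, the two samples are drawn independently without replacement, and the quantity $|S_{k,1} \bowtie_{A} S_{k,2}|$ is read as the number of pairs of sampled rows (one from each sample) that agree on column $A$. The function names are hypothetical.

```python
import random
from collections import Counter


def estimate_self_join_size(sample1, sample2, p):
    """Estimate SJ_A(R) as (1 / p^2) * |S1 join_A S2|.

    The join size is the number of (row1, row2) pairs, one row from
    each sample, whose column values are equal.
    """
    counts1 = Counter(sample1)
    counts2 = Counter(sample2)
    join_size = sum(c * counts2[value] for value, c in counts1.items())
    return join_size / (p * p)


def estimated_overhead(column_values, k, rng):
    """Return N / SJ-hat for one column, or None if no pairs matched."""
    n = len(column_values)
    p = k / n  # sampling fraction, as in claim 13
    # Two independent uniform samples of size k (assumed without replacement).
    s1 = rng.sample(column_values, k)
    s2 = rng.sample(column_values, k)
    sj_hat = estimate_self_join_size(s1, s2, p)
    if sj_hat == 0:
        return None  # the samples share no value, so the ratio is undefined
    return n / sj_hat


# Usage: estimate the overhead of a column with many repeated values.
column = ["x"] * 50 + [f"id{i}" for i in range(50)]
overhead = estimated_overhead(column, k=40, rng=random.Random(0))
```

When $k = N$ the sampling fraction is 1 and both samples are the whole column, so the estimate reduces to the exact self-join size, which is a convenient sanity check for the sketch.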

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chaudhuri, Surajit; Ganti, Venkatesh; Shriraghav, Kaushik
Primary Examiner(s): Stevens, Robert
Application Number: US11/123,901
Publication Number: US 20060282436A1
Time in Patent Office: 1,530 Days
Field of Search: 707/100, 707/2
US Class Current: 1/1
CPC Class Codes: G06F 16/2462 Approximate or statistical ...; Y10S 707/99932 Access augmentation or opti...
