Systems and methods for estimating functional relationships in a database
Abstract
A system that facilitates estimating functional relationships associated with one or more columns in a database comprises a sampling component that receives a random sample of records within the database. An estimate generator component calculates an estimate of strength of functional relationships based at least in part upon the received samples. For example, the estimate generator component can calculate an estimate of strength of a column as a key column based at least in part upon the received samples.
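As a rough illustration of the idea in the abstract, the sketch below draws a uniform random sample of rows and scores how close a column is to being a key by the fraction of sampled rows whose value is unique within the sample. The function names (`sample_rows`, `sample_key_strength`) and the uniqueness-fraction measure are assumptions made for illustration only; they are not the estimators recited in the claims. A sample-based uniqueness fraction tends to overstate strength over the full table, because duplicates that exist in the table may not both land in the sample.

```python
import random
from collections import Counter


def sample_rows(rows, k, seed=None):
    """Return a uniform random sample of k rows, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def sample_key_strength(sample, column):
    """Fraction of sampled rows whose value in `column` is unique in the sample.

    Illustrative measure only (an assumption, not the claimed estimator):
    1.0 means no duplicate values were observed in the sample.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    counts = Counter(row[column] for row in sample)
    unique_rows = sum(1 for row in sample if counts[row[column]] == 1)
    return unique_rows / len(sample)


if __name__ == "__main__":
    # Hypothetical table: "id" is a true key, "city" repeats every 50 rows.
    table = [{"id": i, "city": "c%d" % (i % 50)} for i in range(1000)]
    sample = sample_rows(table, k=100, seed=0)
    print(sample_key_strength(sample, "id"))
    print(sample_key_strength(sample, "city"))
```

In this hypothetical table the `id` column always scores 1.0, while `city` scores lower because repeated values show up even in a modest sample.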
14 Claims
1. A system that facilitates estimating functional relationships associated with one or more columns in a database, the system, comprising at least a processor executing the following components:
- a sampling component that receives a random sample of records within the database;
- an estimate generator component that calculates an estimate of strength of the functional relationships associated with the one or more columns based at least in part upon a subset of the received sample and a selected measure;
- an estimate selector component that facilitates selection of a measure of strength to be calculated by the estimate generator component;
- an overhead calculator component that estimates a measure of overhead associated with a column in the database by utilizing:
[equation not reproduced in the source text]
where p is a sampling fraction, N is a number of rows that have a column A in a relation R, and $S_{k,1}$ and $S_{k,2}$ are two independent uniform random samples of size k drawn from the relation R;
- a row strength computation component that estimates strength $|\hat{X}|$ of a column comprising one or more default values as a key column based at least on a number of clean records within the column in the database by utilizing:
$|\hat{X}| = |\hat{X}_{\text{small}}| + |\hat{X}_{\text{large}}|$
wherein $X_{\text{small}}$ is a set of “dirty” rows in a relation R that have either zero or one conflicting representative tuple pairs in a set of tuples S, and $X_{\text{large}}$ corresponds to a set of “dirty” rows that have more than one conflicting pair represented in S;
- the estimate generator component calculates an estimate of strength of a column as a key column as a function of the received samples utilizing the overhead calculator component, or the row strength computation component based at least on the selection of a measure of strength.
where, N is a number of rows that have a column A in the relation R, z2 is a number of conflicting groups of size two in the random sample, and k is a size of the random sample.
9. The system of claim 8, wherein the estimate generator component utilizes the following algorithm in connection with calculating an estimate of strength of functional relationships:
[equation not reproduced in the source text]
where z1 is a total number of “dirty” rows over all large groups in the sample.
10. The system of claim 1, further comprising a machine-learning component that generates inferences regarding a type of measure of strength to be estimated by the estimate generator component by analyzing contextual data and historical data.
11. A method for estimating strength of key dependencies in a database, comprising the following executable by a processor:
- receiving random samples from the database, the samples are associated with a column comprising one or more default values associated therewith;
- selecting a measure of strength to be estimated for assessing strength of the column as a key column;
- estimating a measure of overhead associated with the column in the database by utilizing:
[equation not reproduced in the source text]
where p is a sampling fraction, N is a number of rows that have a column A in a relation R, and $S_{k,1}$ and $S_{k,2}$ are two independent uniform random samples of size k drawn from the relation R, if an overhead associated with the column is selected as the measure of strength;
- estimating strength $|\hat{X}|$ of the column comprising the one or more default values as a key column based at least on a number of clean records within the column in the database by utilizing:
$|\hat{X}| = |\hat{X}_{\text{small}}| + |\hat{X}_{\text{large}}|$
wherein $X_{\text{small}}$ is a set of “dirty” rows in a relation R that have either zero or one conflicting representative tuple pairs in a set of tuples S, and $X_{\text{large}}$ corresponds to a set of “dirty” rows that have more than one conflicting pair represented in S, if a row strength computation is selected as the measure of strength;
- calculating an estimate of strength of a column as a key column as a function of the received samples utilizing the overhead calculator component or the row strength computation component based at least on the selection.
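The dependent claims refer to two sample statistics: z2, a number of conflicting groups of size two in the random sample, and z1, a total number of “dirty” rows over all large groups in the sample. A minimal sketch of how such counts might be gathered follows. It assumes that a conflicting group is a set of sampled rows sharing the same value in the candidate key column, that a large group has more than two rows, and that every row in a large group counts toward z1. These readings, and the name `group_statistics`, are assumptions rather than definitions taken from the claims. The sketch stops at the counts and does not attempt any estimator formula.

```python
from collections import Counter


def group_statistics(sample_values):
    """Count duplicate-value groups in a sample of one column's values.

    Returns (z2, z1) under the assumptions stated above:
      z2 = number of groups containing exactly two sampled rows
      z1 = total number of sampled rows in groups of more than two rows
    """
    counts = Counter(sample_values)
    z2 = sum(1 for size in counts.values() if size == 2)
    z1 = sum(size for size in counts.values() if size > 2)
    return z2, z1


if __name__ == "__main__":
    # Hypothetical sample: "a" and "e" form groups of two, "c" a group of three.
    values = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]
    print(group_statistics(values))  # (2, 3)
```

Keeping the counting separate from the estimator makes it easy to swap in a different definition of a large group if the specification defines one.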
Specification