Database analysis using clusters
First Claim
1. A method for mapping relationships in a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database and wherein each of the plurality of tables includes a corresponding set of records, the method comprising:
- for each of the plurality of tables, grouping, by a computer system, a sample of the corresponding set of records into clusters, wherein records grouped in a cluster instantiate a common set of table join edges;
identifying cluster pairs, wherein a cluster pair corresponds to two clusters from different tables, wherein the two clusters instantiate a common table join edge;
weighting the cluster pairs according to a number of records that instantiate the common table join edge;
filtering any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining each cluster associated with each cluster pair weighted below the threshold weighting with another cluster;
selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;
selecting a third table in the database, wherein the third table shares a table join edge with the first table; and
determining a relative frequency, with respect to the first table, with which the second table is reached from the third table.
1 Assignment
0 Petitions
Accused Products
Abstract
A method for mapping relationships in a database results in a cluster graph. A representative sample of records in each of a plurality of tables in the database is analyzed for nearest neighbor join edges instantiated by the record. Records with corresponding nearest neighbor join edges are grouped into clusters. Cluster pairs which share a join relationship between two tables are identified. A weighting may be applied to cluster pairs based on the number of records for the cluster pair. Meaningful cluster pairs above a weighted threshold may be ordered according to table and displayed as a cluster graph. Analyses of the cluster graph may reveal important characteristics of the database.
147 Citations
19 Claims
-
1. A method for mapping relationships in a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database and wherein each of the plurality of tables includes a corresponding set of records, the method comprising:
-
for each of the plurality of tables, grouping, by a computer system, a sample of the corresponding set of records into clusters, wherein records grouped in a cluster instantiate a common set of table join edges; identifying cluster pairs, wherein a cluster pair corresponds to two clusters from different tables, wherein the two clusters instantiate a common table join edge; weighting the cluster pairs according to a number of records that instantiate the common table join edge; filtering any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining each cluster associated with each cluster pair weighted below the threshold weighting with another cluster; selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables; selecting a third table in the database, wherein the third table shares a table join edge with the first table; and determining a relative frequency, with respect to the first table, with which the second table is reached from the third table. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A computer system for generating a cluster graph of a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database, wherein each of the plurality of tables includes a corresponding set of records, the computer system comprising:
-
a processor; and memory media, accessible to the processor, including processor executable program instructions, the program instructions including instructions to; determine, for each given table in the plurality of tables, nearest neighbor table join edges associated with each record in a sample of records from the given table; group the sample of records for each given table into clusters based on the nearest neighbor table join edges; identify cluster pairs, wherein a cluster pair corresponds to two clusters from different tables of the database, wherein the two clusters instantiate a common table join edge; weight the cluster pairs according to a number of records that instantiate the common table join edge; filter any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining clusters in each cluster pair weighted below the threshold weighting with a different cluster; select a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables; select a third table in the database, wherein the third table shares a table join edge with the first table; and determine a relative frequency, with respect to the first table, with which the second table is reached from the third table. - View Dependent Claims (7)
-
-
8. The computer system of 6, wherein a sample size of each of the sample of records is based at least in part on a number of records in a corresponding table.
-
9. The computer system of 6, further including processor executable instructions to:
calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk includes randomly selecting table join edges to advance between neighboring clusters, such that a given cluster is not selected more than once for the random walk.
-
10. The computer system of 9, further including processor executable instructions to:
compare join relationships between the first table and the second table based on the probability.
-
11. The computer system of 9, wherein the instructions to determine the relative frequency include instructions to determine the relative frequency based on the probability.
-
12. A non-transitory computer-readable memory, including program instructions, executable by a processor, for generating cluster graphs of databases, the program instructions including instructions to:
-
group records within each of a plurality of database tables into clusters based on matching nearest neighbor table join edges, wherein a first cluster associated with a first table shares a join relationship with a second cluster associated with a second table, wherein the first cluster and the second cluster form a cluster pair; identify cluster pairs for clusters in the database; weight the cluster pairs according to a number of records that instantiate the join relationship shared by the first cluster and the second cluster; apply a weighting threshold to the cluster pairs, wherein cluster pairs weighted below the weighting threshold are excluded from the ordered clusters; select a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables; select a third table in the database, wherein the third table shares a table join edge with the first table; and determine a relative frequency, with respect to the first table, with which the second table is reached from the third table.
-
-
13. The memory media of 12, wherein the program instructions include instructions to:
calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk includes a random selection of table join edges to advance between tables, such that a given cluster is not selected more than once for the random walk.
-
14. The memory media of 13, further including processor executable instructions to:
compare the join relationships between the first table and the second table based on the probability.
-
15. The memory media of 13, wherein the instructions executable to determine the relative frequency include:
instructions to determine the relative frequency based on the probability.
-
16. The memory media of 13, wherein the program instructions include instructions to:
repeat the program instructions to select and to calculate, wherein a different target cluster is selected from among the clusters calculate a plurality of probabilities corresponding to the plurality of tables in the database.
-
17. The memory media of 16, further including processor executable instructions to:
compare join relationships between the plurality of tables in the database based on the plurality of probabilities.
-
18. A method for identifying meaningful join paths in a relational database having a plurality of tables, each table having a set of records, comprising:
-
identifying clusters in the tables, wherein records within a cluster all instantiate a common set of neighboring join edges; identifying cluster pairs, wherein a cluster pair includes a first cluster from a first table and a second cluster from a second table and further wherein the first cluster and the second cluster share a join relationship; for at least some of the identified cluster pairs, determining a number of records in the first cluster and the second cluster that instantiate the join relationship shared; storing a cluster pair matrix indicative of the cluster pairs and their corresponding numbers of records; weighting the cluster pairs according to their corresponding number of records, wherein a threshold is applied to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the cluster pair matrix; selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables; selecting a third table in the database, wherein the third table shares a table join edge with the first table; and determining, based on the probability, a relative frequency, with respect to the first table, with which the second table is reached from the third table.
-
-
19. The method of 18, wherein the identifying of clusters in a selected table is performed for a representative sample of the set of records in the given table.
Specification