Database Analysis Using Clusters
First Claim
1. A method for mapping relationships in a database, comprising:
- generating a directed schema graph for the database;
grouping samples of records within tables of the directed schema graph into clusters, wherein said grouping includes approximating classes for said tables indicative of at least one of a table-reach equivalent class and a join-path equivalent class;
based on the classes, identifying cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database;
generating a first mapping of the cluster pairs, wherein the cluster pairs are weighted according to a number of associated records; and
storing the first mapping in a first storage device.
1 Assignment
0 Petitions
Accused Products
Abstract
A method for mapping relationships in a database results in a cluster graph. A representative sample of records in each of a plurality of tables in the database is analyzed for nearest neighbor join edges instantiated by the record. Records with corresponding nearest neighbor join edges are grouped into clusters. Cluster pairs which share a join relationship between two tables are identified. A weighting may be applied to cluster pairs based on the number of records for the cluster pair. Meaningful cluster pairs above a weighted threshold may be ordered according to table and displayed as a cluster graph. Analyses of the cluster graph may reveal important characteristics of the database.
231 Citations
25 Claims
-
1. A method for mapping relationships in a database, comprising:
-
generating a directed schema graph for the database; grouping samples of records within tables of the directed schema graph into clusters, wherein said grouping includes approximating classes for said tables indicative of at least one of a table-reach equivalent class and a join-path equivalent class; based on the classes, identifying cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database; generating a first mapping of the cluster pairs, wherein the cluster pairs are weighted according to a number of associated records; and storing the first mapping in a first storage device. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computer system for generating a schema graph of a database, comprising:
-
a processor; and memory media accessible to the processor, including processor executable instructions to; determine nearest neighbor join edges for a sample of records in the tables of the database; group records within tables of the database into clusters based on corresponding nearest neighbor join edges; identify cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database; and order the clusters according to the tables of the database to which the clusters belong, wherein the cluster pairs are weighted according to a number of records in clusters of a cluster pair. - View Dependent Claims (9)
-
-
10. The computer system of 8, further including processor executable instructions to:
apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the ordered clusters.
-
11. The computer system of 8, wherein the sample of records is a representative sample for individual tables in the database.
-
12. The computer system of 8, further including processor executable instructions to:
-
select a source cluster from a first table and a target cluster from a second table, wherein the first and second tables are different tables in the database; and calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk involves random selection of join relationships to advance between tables, such that a given cluster is not selected more than once for the random walk.
-
-
13. The computer system of 12, further including processor executable instructions to:
use the calculated probability to compare the join relationships between the first and second tables.
-
14. The computer system of 12, further including processor executable instructions to:
-
select a third table in the database, wherein the third table shares a join relationship with the first table; and use the calculated probability to determine the relative frequency, with respect to the first table, with which the second table is reached from the third table.
-
-
15. Computer-readable memory media, including processor instructions for generating schema graphs of databases, executable to:
-
group records within tables of the database into clusters based on matching nearest neighbor join edges for a sample of records in the tables, wherein records in a first cluster belonging to a first table share a join relationship, not exceeding a predetermined number of join edges, with a second cluster belonging to a second table, wherein the first and second clusters form a cluster pair; identify cluster pairs for clusters in the database; and order the clusters according to the tables to which the clusters belong, wherein the cluster pairs are weighted according to a number of records in clusters of a cluster pair.
-
-
16. The memory media of 15, further including processor executable instructions to:
apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are excluded from the ordered clusters.
-
17. The memory media of 15, further including processor executable instructions to:
apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are combined into a common cluster for respective tables.
-
18. The memory media of 15, further including processor executable instructions to:
-
select a source cluster from a first table and a target cluster from a second table, wherein the first and second tables are different tables in the database; and calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk involves random selection of join relationships to advance between tables, such that a given cluster is not selected more than once for the random walk.
-
-
19. The memory media of 18, further including processor executable instructions to:
use the calculated probability to compare the join relationships between the first and second tables.
-
20. The memory media of 18, further including processor executable instructions to:
-
select a third table in the database, wherein the third table shares a join relationship with the first table; and use the calculated probability to determine the relative frequency, with respect to the first table, with which the second table is reached from the third table.
-
-
21. The memory media of 18, further including processor executable instructions to:
repeat the instructions executable to select and to calculate, wherein a different target cluster is selected from among the clusters in a plurality of tables of the database, and wherein a plurality of probabilities corresponding to the plurality of tables in the database is calculated.
-
22. The memory media of 21, further including processor executable instructions to:
use the plurality of calculated probabilities to compare the join relationships between the plurality of tables in the database.
-
23. A method for identifying meaningful join paths in a relational database having a plurality of tables, each table having a plurality of records, comprising:
-
identifying clusters in the tables, wherein records within a cluster all instantiate a common set of neighboring join edges; identifying cluster pairs, wherein a cluster pair includes a first cluster from a first table and a second cluster from a second table and further wherein the first cluster and the second cluster share a join relationship; for at least some of the identified cluster pairs, determining a number of records in the first and second clusters that instantiate the shared relationship; and storing a cluster pair matrix indicative of the cluster pairs and their corresponding numbers of records.
-
-
24. The method of 23, wherein said identifying is performed for a representative sample of records in the tables.
-
25. The method of 23, further comprising:
weighting the cluster pairs according to their corresponding number of records, wherein a threshold is applied to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the cluster pair matrix.
Specification