Database analysis using clusters

US 8,161,048 B2
Filed: 05/08/2009
Issued: 04/17/2012
Est. Priority Date: 04/24/2009
Status: Active Grant

First Claim

Patent Images

1. A method for mapping relationships in a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database and wherein each of the plurality of tables includes a corresponding set of records, the method comprising:

for each of the plurality of tables, grouping, by a computer system, a sample of the corresponding set of records into clusters, wherein records grouped in a cluster instantiate a common set of table join edges;

identifying cluster pairs, wherein a cluster pair corresponds to two clusters from different tables, wherein the two clusters instantiate a common table join edge;

weighting the cluster pairs according to a number of records that instantiate the common table join edge;

filtering any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining each cluster associated with each cluster pair weighted below the threshold weighting with another cluster;

selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;

selecting a third table in the database, wherein the third table shares a table join edge with the first table; and

determining a relative frequency, with respect to the first table, with which the second table is reached from the third table.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method for mapping relationships in a database results in a cluster graph. A representative sample of records in each of a plurality of tables in the database is analyzed for nearest neighbor join edges instantiated by the record. Records with corresponding nearest neighbor join edges are grouped into clusters. Cluster pairs which share a join relationship between two tables are identified. A weighting may be applied to cluster pairs based on the number of records for the cluster pair. Meaningful cluster pairs above a weighted threshold may be ordered according to table and displayed as a cluster graph. Analyses of the cluster graph may reveal important characteristics of the database.

147 Citations

19 Claims

1. A method for mapping relationships in a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database and wherein each of the plurality of tables includes a corresponding set of records, the method comprising:
- for each of the plurality of tables, grouping, by a computer system, a sample of the corresponding set of records into clusters, wherein records grouped in a cluster instantiate a common set of table join edges;
  
  identifying cluster pairs, wherein a cluster pair corresponds to two clusters from different tables, wherein the two clusters instantiate a common table join edge;
  
  weighting the cluster pairs according to a number of records that instantiate the common table join edge;
  
  filtering any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining each cluster associated with each cluster pair weighted below the threshold weighting with another cluster;
  
  selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;
  
  selecting a third table in the database, wherein the third table shares a table join edge with the first table; and
  
  determining a relative frequency, with respect to the first table, with which the second table is reached from the third table.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The method of claim 1, wherein each sample is a random sample of records in a given table, and wherein a sample size of each sample is proportional to a total number of records for the given table.
  - 3. The method of claim 1, wherein the table join structure is limited to a maximum number of table join edges.
  - 4. The method of claim 1, further comprising:
    - displaying a cluster graph indicative of the cluster pairs and their corresponding weightings on a display device.
  - 5. The method of claim 1, further comprising:
    - calculating a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk includes randomly selecting table join edges to advance between neighboring clusters, such that a given cluster is not selected more than once for the random walk.

6. A computer system for generating a cluster graph of a database, the database including a plurality of tables having a table join structure, wherein the table join structure is indicated by table join edges in a schema graph of the database, wherein each of the plurality of tables includes a corresponding set of records, the computer system comprising:
- a processor; and
  
  memory media, accessible to the processor, including processor executable program instructions, the program instructions including instructions to;
  
  determine, for each given table in the plurality of tables, nearest neighbor table join edges associated with each record in a sample of records from the given table;
  
  group the sample of records for each given table into clusters based on the nearest neighbor table join edges;
  
  identify cluster pairs, wherein a cluster pair corresponds to two clusters from different tables of the database, wherein the two clusters instantiate a common table join edge;
  
  weight the cluster pairs according to a number of records that instantiate the common table join edge;
  
  filter any cluster pairs weighted below a threshold weighting, wherein the filtering includes a process selected from excluding the cluster pairs weighted below the threshold and combining clusters in each cluster pair weighted below the threshold weighting with a different cluster;
  
  select a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;
  
  select a third table in the database, wherein the third table shares a table join edge with the first table; and
  
  determine a relative frequency, with respect to the first table, with which the second table is reached from the third table.
- View Dependent Claims (7)
- - 7. The computer system of claim 6, further comprising:
    - a storage device; and
      
      a display device,wherein the program instructions include instructions to;
      
      limit a number of table join edges in the table join structure to a predetermined maximum; and
      
      display the cluster graph on the display device.

8. The computer system of 6, wherein a sample size of each of the sample of records is based at least in part on a number of records in a corresponding table.

9. The computer system of 6, further including processor executable instructions to:
- calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk includes randomly selecting table join edges to advance between neighboring clusters, such that a given cluster is not selected more than once for the random walk.

10. The computer system of 9, further including processor executable instructions to:
- compare join relationships between the first table and the second table based on the probability.

11. The computer system of 9, wherein the instructions to determine the relative frequency include instructions to determine the relative frequency based on the probability.

12. A non-transitory computer-readable memory, including program instructions, executable by a processor, for generating cluster graphs of databases, the program instructions including instructions to:
- group records within each of a plurality of database tables into clusters based on matching nearest neighbor table join edges, wherein a first cluster associated with a first table shares a join relationship with a second cluster associated with a second table, wherein the first cluster and the second cluster form a cluster pair;
  
  identify cluster pairs for clusters in the database;
  
  weight the cluster pairs according to a number of records that instantiate the join relationship shared by the first cluster and the second cluster;
  
  apply a weighting threshold to the cluster pairs, wherein cluster pairs weighted below the weighting threshold are excluded from the ordered clusters;
  
  select a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;
  
  select a third table in the database, wherein the third table shares a table join edge with the first table; and
  
  determine a relative frequency, with respect to the first table, with which the second table is reached from the third table.

13. The memory media of 12, wherein the program instructions include instructions to:
- calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk includes a random selection of table join edges to advance between tables, such that a given cluster is not selected more than once for the random walk.

14. The memory media of 13, further including processor executable instructions to:
- compare the join relationships between the first table and the second table based on the probability.

15. The memory media of 13, wherein the instructions executable to determine the relative frequency include:
- instructions to determine the relative frequency based on the probability.

16. The memory media of 13, wherein the program instructions include instructions to:
- repeat the program instructions to select and to calculate, wherein a different target cluster is selected from among the clusters calculate a plurality of probabilities corresponding to the plurality of tables in the database.

17. The memory media of 16, further including processor executable instructions to:
- compare join relationships between the plurality of tables in the database based on the plurality of probabilities.

18. A method for identifying meaningful join paths in a relational database having a plurality of tables, each table having a set of records, comprising:
- identifying clusters in the tables, wherein records within a cluster all instantiate a common set of neighboring join edges;
  
  identifying cluster pairs, wherein a cluster pair includes a first cluster from a first table and a second cluster from a second table and further wherein the first cluster and the second cluster share a join relationship;
  
  for at least some of the identified cluster pairs, determining a number of records in the first cluster and the second cluster that instantiate the join relationship shared;
  
  storing a cluster pair matrix indicative of the cluster pairs and their corresponding numbers of records;
  
  weighting the cluster pairs according to their corresponding number of records, wherein a threshold is applied to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the cluster pair matrix;
  
  selecting a source cluster from a first table and a target cluster from a second table, wherein the first table and second tables are different tables;
  
  selecting a third table in the database, wherein the third table shares a table join edge with the first table; and
  
  determining, based on the probability, a relative frequency, with respect to the first table, with which the second table is reached from the third table.

19. The method of 18, wherein the identifying of clusters in a selected table is performed for a representative sample of the set of records in the given table.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors
Procopiuc, Cecilia Magdalena, Srivastava, Divesh
Primary Examiner(s)
NGUYEN, PHONG H

Application Number

US12/463,161
Publication Number

US 20100274785A1
Time in Patent Office

1,075 Days
Field of Search

707/737, 707/748, 707/798
US Class Current

707/737
CPC Class Codes

G06F 16/22 Indexing; Data structures t...

Database analysis using clusters

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

147 Citations

19 Claims

Specification

Use Cases

Quick Links

Others

Database analysis using clusters

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

147 Citations

19 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others