Database Analysis Using Clusters

US 20100274785A1
Filed: 05/08/2009
Published: 10/28/2010
Est. Priority Date: 04/24/2009
Status: Active Grant

First Claim

Patent Images

1. A method for mapping relationships in a database, comprising:

generating a directed schema graph for the database;

grouping samples of records within tables of the directed schema graph into clusters, wherein said grouping includes approximating classes for said tables indicative of at least one of a table-reach equivalent class and a join-path equivalent class;

based on the classes, identifying cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database;

generating a first mapping of the cluster pairs, wherein the cluster pairs are weighted according to a number of associated records; and

storing the first mapping in a first storage device.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method for mapping relationships in a database results in a cluster graph. A representative sample of records in each of a plurality of tables in the database is analyzed for nearest neighbor join edges instantiated by the record. Records with corresponding nearest neighbor join edges are grouped into clusters. Cluster pairs which share a join relationship between two tables are identified. A weighting may be applied to cluster pairs based on the number of records for the cluster pair. Meaningful cluster pairs above a weighted threshold may be ordered according to table and displayed as a cluster graph. Analyses of the cluster graph may reveal important characteristics of the database.

231 Citations

25 Claims

1. A method for mapping relationships in a database, comprising:
- generating a directed schema graph for the database;
  
  grouping samples of records within tables of the directed schema graph into clusters, wherein said grouping includes approximating classes for said tables indicative of at least one of a table-reach equivalent class and a join-path equivalent class;
  
  based on the classes, identifying cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database;
  
  generating a first mapping of the cluster pairs, wherein the cluster pairs are weighted according to a number of associated records; and
  
  storing the first mapping in a first storage device.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, wherein the sample of records is a random sample of records, and wherein a sample size of the records is respectively proportional to a total number of records for the given table.
  - 3. The method of claim 1, wherein the join relationship is limited to a maximum number of total join edges.
  - 4. The method of claim 1, wherein said generating the first mapping applies a threshold to the weighting, and further wherein cluster pairs weighted below the threshold are not included in the mapping.
  - 5. The method of claim 1, wherein said generating the first mapping applies a threshold to the weighting, and further wherein clusters of the cluster pairs weighted below the threshold are combined into common clusters in individual tables of the database.
  - 6. The method of claim 1, further comprising:
    - generating a second mapping using the first mapping, wherein the cluster pairs are ordered according to the tables of the database in the second mapping; and
      
      displaying the second mapping as a cluster graph on a display device.
  - 7. The method of claim 6, further comprising:
    - selecting a source cluster and a target cluster from the second mapping, wherein the source and target cluster are in different tables of the database;
      
      calculating a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk involves random selection of join relationships to advance between neighboring clusters, such that a given cluster is not selected more than once for the random walk; and
      
      outputting the probability to a processing device.

8. A computer system for generating a schema graph of a database, comprising:
- a processor; and
  
  memory media accessible to the processor, including processor executable instructions to;
  
  determine nearest neighbor join edges for a sample of records in the tables of the database;
  
  group records within tables of the database into clusters based on corresponding nearest neighbor join edges;
  
  identify cluster pairs, wherein a cluster pair represents a join relationship between clusters in two tables of the database; and
  
  order the clusters according to the tables of the database to which the clusters belong, wherein the cluster pairs are weighted according to a number of records in clusters of a cluster pair.
- View Dependent Claims (9)
- - 9. The computer system of claim 8, further comprising:
    - a storage device; and
      
      a display device, and further including processor executable instructions to;
      
      to limit the maximum number of total join edges in the join relationship;
      
      store the ordered clusters as a cluster graph on the storage device; and
      
      display the cluster graph on the display device.

10. The computer system of 8, further including processor executable instructions to:
- apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the ordered clusters.

11. The computer system of 8, wherein the sample of records is a representative sample for individual tables in the database.

12. The computer system of 8, further including processor executable instructions to:
- select a source cluster from a first table and a target cluster from a second table, wherein the first and second tables are different tables in the database; and
  
  calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk involves random selection of join relationships to advance between tables, such that a given cluster is not selected more than once for the random walk.

13. The computer system of 12, further including processor executable instructions to:
- use the calculated probability to compare the join relationships between the first and second tables.

14. The computer system of 12, further including processor executable instructions to:
- select a third table in the database, wherein the third table shares a join relationship with the first table; and
  
  use the calculated probability to determine the relative frequency, with respect to the first table, with which the second table is reached from the third table.

15. Computer-readable memory media, including processor instructions for generating schema graphs of databases, executable to:
- group records within tables of the database into clusters based on matching nearest neighbor join edges for a sample of records in the tables, wherein records in a first cluster belonging to a first table share a join relationship, not exceeding a predetermined number of join edges, with a second cluster belonging to a second table, wherein the first and second clusters form a cluster pair;
  
  identify cluster pairs for clusters in the database; and
  
  order the clusters according to the tables to which the clusters belong, wherein the cluster pairs are weighted according to a number of records in clusters of a cluster pair.

16. The memory media of 15, further including processor executable instructions to:
- apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are excluded from the ordered clusters.

17. The memory media of 15, further including processor executable instructions to:
- apply a threshold to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are combined into a common cluster for respective tables.

18. The memory media of 15, further including processor executable instructions to:
- select a source cluster from a first table and a target cluster from a second table, wherein the first and second tables are different tables in the database; and
  
  calculate a probability that a random walk from the source cluster will reach the target cluster, wherein the random walk involves random selection of join relationships to advance between tables, such that a given cluster is not selected more than once for the random walk.

19. The memory media of 18, further including processor executable instructions to:
- use the calculated probability to compare the join relationships between the first and second tables.

20. The memory media of 18, further including processor executable instructions to:
- select a third table in the database, wherein the third table shares a join relationship with the first table; and
  
  use the calculated probability to determine the relative frequency, with respect to the first table, with which the second table is reached from the third table.

21. The memory media of 18, further including processor executable instructions to:
- repeat the instructions executable to select and to calculate, wherein a different target cluster is selected from among the clusters in a plurality of tables of the database, and wherein a plurality of probabilities corresponding to the plurality of tables in the database is calculated.

22. The memory media of 21, further including processor executable instructions to:
- use the plurality of calculated probabilities to compare the join relationships between the plurality of tables in the database.

23. A method for identifying meaningful join paths in a relational database having a plurality of tables, each table having a plurality of records, comprising:
- identifying clusters in the tables, wherein records within a cluster all instantiate a common set of neighboring join edges;
  
  identifying cluster pairs, wherein a cluster pair includes a first cluster from a first table and a second cluster from a second table and further wherein the first cluster and the second cluster share a join relationship;
  
  for at least some of the identified cluster pairs, determining a number of records in the first and second clusters that instantiate the shared relationship; and
  
  storing a cluster pair matrix indicative of the cluster pairs and their corresponding numbers of records.

24. The method of 23, wherein said identifying is performed for a representative sample of records in the tables.

25. The method of 23, further comprising:
- weighting the cluster pairs according to their corresponding number of records, wherein a threshold is applied to the weighted cluster pairs, wherein weighted cluster pairs below the threshold are not included in the cluster pair matrix.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Original Assignee
AT&T Intellectual Property I LP (AT&T, Inc.)
Inventors
Procopiuc, Cecilia Magdalena, Srivastava, Divesh

Granted Patent

US 8,161,048 B2
Time in Patent Office

Days
Field of Search
US Class Current

707/737
CPC Class Codes

G06F 16/22 Indexing; Data structures t...

Database Analysis Using Clusters

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

231 Citations

25 Claims

Specification

Solutions

Use Cases

Quick Links

Database Analysis Using Clusters

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

231 Citations

25 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links