System for analysing data relationships to support data query execution
First Claim
1. A computer-implemented method of identifying relationships between data tables, each data table comprising a plurality of data records, the method comprising:
evaluating a plurality of candidate relationships, each candidate relationship defined between a first column associated with a first data table and a second column associated with a second data table, the evaluating comprising computing relationship metrics for each candidate relationship, wherein the relationship metrics for a candidate relationship provide a measure of a relationship between the first column and the second column, the computing comprising:
computing a first metric indicating a degree of distinctness of values of at least one of the first and second columns;
computing a second metric indicating a measure of overlap between values of the first column and values of the second column;
the method further comprising identifying one or more relationships between data tables in dependence on the computed relationship metrics;
wherein the method is performed in a plurality of processing stages including:
a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
a third processing stage, comprising computing the relationship metrics based on the output of the second processing stage.
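The three processing stages recited in the claim can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the function names, the dict-of-rows table layout, the decision to ignore nulls and same-table pairs, the formulas (distinctness as distinct values divided by non-null values, overlap as shared distinct values divided by the smaller distinct count), and the 0.9 thresholds. Threads stand in for whatever parallel engine an actual implementation might use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def build_map_table(tables):
    """Stage 1: map each value to the set of (table, column) locations holding it."""
    value_map = defaultdict(set)
    for table_name, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                if value is not None:  # assumption: nulls are ignored
                    value_map[value].add((table_name, column))
    return value_map


def count_partition(entries):
    """Stage 2 worker: tally one partition of the map table."""
    distinct = defaultdict(int)   # column -> number of distinct values
    intersect = defaultdict(int)  # (column, column) -> shared distinct values
    for _value, locations in entries:
        ordered = sorted(locations)
        for column in ordered:
            distinct[column] += 1
        for first, second in combinations(ordered, 2):
            if first[0] != second[0]:  # assumption: cross-table pairs only
                intersect[(first, second)] += 1
    return distinct, intersect


def count_all(value_map, n_partitions=2):
    """Stage 2: process partitions in parallel, then merge the partial counts."""
    entries = list(value_map.items())
    partitions = [entries[i::n_partitions] for i in range(n_partitions)]
    distinct, intersect = defaultdict(int), defaultdict(int)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        for part_distinct, part_intersect in pool.map(count_partition, partitions):
            for key, n in part_distinct.items():
                distinct[key] += n
            for key, n in part_intersect.items():
                intersect[key] += n
    return distinct, intersect


def compute_metrics(tables, distinct, intersect):
    """Stage 3: derive (distinctness, overlap) for each candidate column pair."""
    totals = defaultdict(int)  # column -> number of non-null values
    for table_name, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                if value is not None:
                    totals[(table_name, column)] += 1
    metrics = {}
    for (first, second), shared in intersect.items():
        distinctness = max(distinct[first] / totals[first],
                           distinct[second] / totals[second])
        overlap = shared / min(distinct[first], distinct[second])
        metrics[(first, second)] = (distinctness, overlap)
    return metrics


def identify_relationships(metrics, min_distinctness=0.9, min_overlap=0.9):
    """Keep candidate pairs whose metrics meet both (hypothetical) thresholds."""
    return [pair for pair, (d, o) in metrics.items()
            if d >= min_distinctness and o >= min_overlap]


# Hypothetical example data: orders.customer_id refers to customers.id.
tables = {
    "customers": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}],
    "orders": [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 1},
               {"order_id": 12, "customer_id": 3}],
}
distinct, intersect = count_all(build_map_table(tables))
metrics = compute_metrics(tables, distinct, intersect)
relationships = identify_relationships(metrics)
print(relationships)
```

Because the map table is keyed by value, each value falls into exactly one partition, so the per-partition counts can simply be added together when merging.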
Abstract
A method and software tool for identifying relationships between columns of one or more data tables are disclosed. In the disclosed method, a relationship indicator is computed for each of a plurality of column pairs, each column pair comprising respective first and second columns selected from the one or more data tables. The relationship indicator comprises a measure of a relationship (e.g. indicating a strength or likelihood of a relationship) between data of the first column and data of the second column. Relationships between columns of the data tables are then identified in dependence on the computed relationship indicators. The identified relationships may be used to create and execute data queries.
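As an illustration of the last sentence of the abstract, the following Python sketch turns one identified column relationship into a join query and executes it. It is only a sketch under stated assumptions: the `build_join_query` and `quote_ident` helpers, the table and column names, and the use of an in-memory SQLite database are all hypothetical and are not described in the source.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_join_query(relationship):
    """Build a join query from a ((table, column), (table, column)) relationship."""
    (left_table, left_col), (right_table, right_col) = relationship
    return (
        f"SELECT * FROM {quote_ident(left_table)} "
        f"JOIN {quote_ident(right_table)} "
        f"ON {quote_ident(left_table)}.{quote_ident(left_col)} = "
        f"{quote_ident(right_table)}.{quote_ident(right_col)}"
    )


# Hypothetical tables and data, used only to exercise the generated query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1)])

# Assumed output of the relationship-identification step.
relationship = (("customers", "id"), ("orders", "customer_id"))
query = build_join_query(relationship)
rows = conn.execute(query).fetchall()
print(query)
print(sorted(rows))
```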
20 Claims
1. A computer-implemented method of identifying relationships between data tables, each data table comprising a plurality of data records, the method comprising:
evaluating a plurality of candidate relationships, each candidate relationship defined between a first column associated with a first data table and a second column associated with a second data table, the evaluating comprising computing relationship metrics for each candidate relationship, wherein the relationship metrics for a candidate relationship provide a measure of a relationship between the first column and the second column, the computing comprising:
computing a first metric indicating a degree of distinctness of values of at least one of the first and second columns;
computing a second metric indicating a measure of overlap between values of the first column and values of the second column;
the method further comprising identifying one or more relationships between data tables in dependence on the computed relationship metrics;
wherein the method is performed in a plurality of processing stages including:
a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
a third processing stage, comprising computing the relationship metrics based on the output of the second processing stage.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
19. A data processing system comprising:
data storage for storing data tables; and
a table analyser module configured to:
compute a relationship indicator for each of a plurality of column pairs, each column pair comprising respective first and second candidate key columns selected from the one or more data tables, wherein the relationship indicator comprises a measure of a relationship between data of the first candidate column and data of the second candidate column;
the relationship indicator computed based on a measure of distinctness of values of at least one of the first and second candidate key columns and/or based on a measure of overlap between values of the first candidate column and values of the second candidate column; and
output data specifying one or more relationships between column pairs of the data tables in dependence on the computed relationship indicators;
wherein the table analyser module is configured to execute a plurality of processing stages for computing the relationship indicators, including:
a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
a third processing stage, comprising computing the relationship indicators based on the output of the second processing stage.
20. A tangible computer-readable medium comprising software code adapted, when executed on a data processing apparatus, to perform a method of identifying relationships between data tables, each data table comprising a plurality of data records, the method comprising:
evaluating a plurality of candidate relationships, each candidate relationship defined between a first column associated with a first data table and a second column associated with a second data table, the evaluating comprising computing relationship metrics for each candidate relationship, wherein the relationship metrics for a candidate relationship provide a measure of a relationship between the first column and the second column, the computing comprising:
computing a first metric indicating a degree of distinctness of values of at least one of the first and second columns;
computing a second metric indicating a measure of overlap between values of the first column and values of the second column;
the method further comprising identifying one or more relationships between data tables in dependence on the computed relationship metrics;
wherein the method is performed in a plurality of processing stages including:
a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
a third processing stage, comprising computing the relationship metrics based on the output of the second processing stage.
Specification