System for analysing data relationships to support data query execution

US 10,691,651 B2
Filed: 09/14/2017
Issued: 06/23/2020
Est. Priority Date: 09/15/2016
Status: Active Grant

Abstract

A method and software tool for identifying relationships between columns of one or more data tables are disclosed. In the disclosed method, a relationship indicator is computed for each of a plurality of column pairs, each column pair comprising respective first and second columns selected from the one or more data tables. The relationship indicator comprises a measure of a relationship (e.g. indicating a strength or likelihood of a relationship) between data of the first column and data of the second column. Relationships between columns of the data tables are then identified in dependence on the computed relationship indicators. The identified relationships may be used to create and execute data queries.
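
As a minimal illustration of the abstract's last sentence, an identified relationship between two columns could be turned into a join query roughly as follows. The function name `build_join_query`, the (table, column) tuple layout and the sample table names are assumptions made for this sketch; the source does not say how queries are generated.

```python
def build_join_query(first, second):
    """Build a SQL inner join on two related columns.

    `first` and `second` are hypothetical (table, column) tuples taken from
    an identified relationship. Identifiers are assumed to be trusted names,
    so no quoting or escaping is applied in this sketch.
    """
    (first_table, first_column), (second_table, second_column) = first, second
    return (
        f"SELECT * FROM {first_table} "
        f"JOIN {second_table} "
        f"ON {first_table}.{first_column} = {second_table}.{second_column}"
    )


query = build_join_query(("customers", "id"), ("orders", "customer_id"))
print(query)
```

In practice the table and column names would come from the output of the relationship analysis rather than being typed by hand.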

20 Claims

1. A computer-implemented method of identifying relationships between data tables, each data table comprising a plurality of data records, the method comprising:
- evaluating a plurality of candidate relationships, each candidate relationship defined between a first column associated with a first data table and a second column associated with a second data table, the evaluating comprising computing relationship metrics for each candidate relationship, wherein the relationship metrics for a candidate relationship provide a measure of a relationship between the first column and the second column, the computing comprising:
  
  computing a first metric indicating a degree of distinctness of values of at least one of the first and second columns;
  
  computing a second metric indicating a measure of overlap between values of the first column and values of the second column;
  
  the method further comprising identifying one or more relationships between data tables in dependence on the computed relationship metrics;
  
  wherein the method is performed in a plurality of processing stages including:
  
  a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
  
  a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
  
  a third processing stage, comprising computing the relationship metrics based on the output of the second processing stage.
  - 2. A method according to claim 1, wherein the first and second columns define respective first and second candidate keys of the respective data tables.
  - 3. A method according to claim 1, comprising computing a relationship indicator for one or more of the candidate relationships, wherein the relationship indicator for a candidate relationship is indicative of a strength or likelihood of a relationship between the first and second columns forming the candidate relationship and is computed based on the first and second metric for the candidate relationship.
  - 4. A method according to claim 1, wherein the first metric comprises a key probability indicator indicative of the probability of the first column or second column being a primary key for its data collection.
  - 5. A method according to claim 4, wherein computing a key probability indicator comprises:
    - computing, for the first and second columns, respective first and second probability indicators indicative of the probability of the respective columns being a primary key for its data collection, and
      
      determining the key probability indicator for the candidate relationship based on the first and second probability indicators.
  - 6. A method according to claim 5, comprising computing the key probability for the candidate relationship as the greater of the first and second probability indicators.
  - 7. A method according to claim 4, comprising determining a probability that a column defines a primary key for its data table based on a ratio between a number of distinct values of the column and a total number of values of the column.
  - 8. A method according to claim 1, wherein the second metric comprises an intersection indicator indicative of a degree of intersection between values of the first and second columns.
  - 9. A method according to claim 8, wherein computing the intersection indicator comprises:
    - computing a number of distinct intersecting values between the first and second columns, wherein the intersecting values are values appearing in both the first and second columns;
      
      computing the intersection indicator for the candidate relationship based on a ratio between the number of distinct intersecting values and a total number of distinct values of the first or second column.
  - 10. A method according to claim 9, comprising:
    - computing a first ratio between the number of distinct intersecting values and the total number of distinct values of the first column;
      
      computing a second ratio between the number of distinct intersecting values and the total number of distinct values of the second column; and
      
      computing the intersection indicator in dependence on the first and second ratios.
  - 11. A method according to claim 10, comprising computing the intersection indicator as the greater of the first and second ratios.
  - 12. A method according to claim 3, comprising computing the relationship indicator for a candidate relationship based on the product of key probability indicator and intersection indicator.
  - 13. A method according to claim 1, wherein identifying one or more relationships comprises identifying a possible relationship between columns of respective data tables in response to one or more of the first metric, the second metric and the relationship indicator for a candidate relationship exceeding a respective predetermined threshold.
  - 14. A method according to claim 1, comprising ranking a plurality of candidate relationships in accordance with their relationship indicators, and/or associating a rank value with the candidate relationships.
  - 15. A method according to claim 1, the identifying step comprising generating an output data set comprising information identifying one or more identified relationships, the output data preferably including computed relationship indicators, metrics and/or ranks.
  - 16. A method according to claim 1, wherein one or more of the first, second and third processing stages are executed by a plurality of computing nodes operating in parallel.
  - 17. A method according to claim 16, implemented as a map-reduce algorithm, preferably wherein the first processing stage is implemented using a map operation and the second processing stage is implemented as a reduce operation.
  - 18. A method according to claim 1, comprising using at least one of the identified relationships in the creation and/or execution of a data query to retrieve data from the one or more data tables, the data query preferably specifying a join defined between respective keys of the data tables, the keys corresponding to the columns between which the relationship is defined.
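
Claims 4 to 12 describe the metrics in words. The sketch below is one possible reading in Python, not the patented implementation, and all function names and sample columns are hypothetical. It takes the key probability as the ratio of distinct values to total values (claim 7), the intersection indicator as the greater of the two overlap ratios (claims 9 to 11), and the relationship indicator as the product of the two (claims 6 and 12).

```python
def key_probability(values):
    """Ratio of distinct values to total values in one column."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def intersection_indicator(first, second):
    """Greater of the two ratios of shared distinct values to distinct values."""
    first_distinct, second_distinct = set(first), set(second)
    if not first_distinct or not second_distinct:
        return 0.0
    common = len(first_distinct & second_distinct)
    return max(common / len(first_distinct), common / len(second_distinct))


def relationship_indicator(first, second):
    """Product of the pair's key probability and its intersection indicator."""
    # The pair's key probability is taken as the greater of the two columns'.
    pair_key_probability = max(key_probability(first), key_probability(second))
    return pair_key_probability * intersection_indicator(first, second)


# Hypothetical sample columns: a unique customer id and a repeating reference to it.
customers_id = [1, 2, 3, 4]
orders_customer_id = [1, 1, 2, 5]
score = relationship_indicator(customers_id, orders_customer_id)
print(round(score, 3))
```

Here `customers_id` is fully distinct (key probability 1.0) and two of the three distinct order values also appear in it, so the indicator works out to 1.0 × 2/3.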

19. A data processing system comprising:
- data storage for storing data tables; and
  
  a table analyser module configured to:
  
  compute a relationship indicator for each of a plurality of column pairs, each column pair comprising respective first and second candidate key columns selected from the one or more data tables, wherein the relationship indicator comprises a measure of a relationship between data of the first candidate column and data of the second candidate column;
  
  the relationship indicator computed based on a measure of distinctness of values of at least one of the first and second candidate key columns and/or based on a measure of overlap between values of the first candidate column and values of the second candidate column; and
  
  output data specifying one or more relationships between column pairs of the data tables in dependence on the computed relationship indicators;
  
  wherein the table analyser module is configured to execute a plurality of processing stages for computing the relationship indicators, including:
  
  a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
  
  a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
  
  a third processing stage, comprising computing the relationship indicators based on the output of the second processing stage.
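
The output step of claim 19, combined with the thresholding and ranking of claims 13 to 15, might look like the following sketch. The function name `rank_relationships`, the record layout, the sample scores and the threshold of 0.5 are all assumptions chosen for illustration; the claims only require "a respective predetermined threshold".

```python
def rank_relationships(indicators, threshold=0.5):
    """Keep column pairs whose indicator exceeds the threshold, strongest first.

    `indicators` maps a (first_column, second_column) pair to its relationship
    indicator. The 0.5 default is an arbitrary illustrative value.
    """
    kept = [(pair, score) for pair, score in indicators.items() if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [
        {"rank": rank, "first": pair[0], "second": pair[1], "indicator": score}
        for rank, (pair, score) in enumerate(kept, start=1)
    ]


# Hypothetical indicator values for three candidate relationships.
indicators = {
    ("customers.id", "orders.customer_id"): 0.67,
    ("customers.id", "products.id"): 0.25,
    ("orders.product_id", "products.id"): 0.9,
}
output = rank_relationships(indicators)
for record in output:
    print(record)
```

The weak `customers.id` / `products.id` candidate falls below the threshold and is dropped; the remaining two are emitted with rank values.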

20. A tangible computer-readable medium comprising software code adapted, when executed on a data processing apparatus, to perform a method of identifying relationships between data tables, each data table comprising a plurality of data records, the method comprising:
- evaluating a plurality of candidate relationships, each candidate relationship defined between a first column associated with a first data table and a second column associated with a second data table, the evaluating comprising computing relationship metrics for each candidate relationship, wherein the relationship metrics for a candidate relationship provide a measure of a relationship between the first column and the second column, the computing comprising:
  
  computing a first metric indicating a degree of distinctness of values of at least one of the first and second columns;
  
  computing a second metric indicating a measure of overlap between values of the first column and values of the second column;
  
  the method further comprising identifying one or more relationships between data tables in dependence on the computed relationship metrics;
  
  wherein the method is performed in a plurality of processing stages including:
  
  a first processing stage, comprising generating a map table which maps values appearing in the data tables to column locations of those data values;
  
  a second processing stage, comprising computing numbers of distinct data values for respective columns and numbers of distinct intersecting values for respective column pairs, the second processing stage comprising processing a plurality of partitions of the map table in parallel; and
  
  a third processing stage, comprising computing the relationship metrics based on the output of the second processing stage.
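
The three processing stages recited in the independent claims can be sketched in a few lines of Python. This is an illustrative single-machine reading only: the function names, the row-dictionary table layout and the sample data are assumptions, and a thread pool stands in for the parallel workers (claim 17 mentions a map-reduce implementation, which this sketch does not attempt to reproduce).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def build_map_table(tables):
    """Stage 1: map each value to the set of (table, column) locations holding it."""
    map_table = defaultdict(set)
    for table_name, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                map_table[value].add((table_name, column))
    return map_table


def count_partition(entries):
    """Stage 2 worker: count distinct values per column and per column pair.

    Each map-table entry is one distinct value, so it adds 1 to every column
    it appears in and 1 to every pair of columns that share it.
    """
    distinct, intersecting = defaultdict(int), defaultdict(int)
    for _value, locations in entries:
        for location in locations:
            distinct[location] += 1
        for pair in combinations(sorted(locations), 2):
            intersecting[pair] += 1
    return distinct, intersecting


def analyse(tables, partitions=2):
    entries = list(build_map_table(tables).items())
    chunks = [entries[i::partitions] for i in range(partitions)]
    # Threads stand in for parallel workers; each chunk is one map-table partition.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(count_partition, chunks))

    # Merge the per-partition counts.
    distinct, intersecting = defaultdict(int), defaultdict(int)
    for part_distinct, part_intersecting in partials:
        for location, count in part_distinct.items():
            distinct[location] += count
        for pair, count in part_intersecting.items():
            intersecting[pair] += count

    # Total (non-distinct) value counts per column, used for the distinctness ratio.
    totals = defaultdict(int)
    for table_name, rows in tables.items():
        for row in rows:
            for column in row:
                totals[(table_name, column)] += 1

    # Stage 3: derive the metrics from the stage 2 counts.
    metrics = {}
    for (first, second), common in intersecting.items():
        distinctness = max(distinct[first] / totals[first], distinct[second] / totals[second])
        overlap = max(common / distinct[first], common / distinct[second])
        metrics[(first, second)] = {"distinctness": distinctness, "overlap": overlap}
    return metrics


# Hypothetical sample tables, each a list of row dictionaries.
tables = {
    "customers": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}],
    "orders": [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": 2}, {"customer_id": 5}],
}
metrics = analyse(tables)
print(metrics)
```

Because every distinct value occupies exactly one map-table entry, the partitions can be counted independently and the partial counts simply summed. Only column pairs that share at least one value appear in the result.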

Current Assignee: Hitachi Vantara, LLC (Hitachi, Ltd.)
Original Assignee: GB Gas Holdings Limited (Centrica PLC)
Inventors: Harrison, Stephen; Rehal, Daljit
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Bogacki, Michal

Application Number: US15/704,569
Publication Number: US 20180096000A1
Time in Patent Office: 1,013 Days

CPC Class Codes

G06F 16/211   Schema design and management
G06F 16/2423   Interactive query statement...
G06F 16/24578   using ranking
G06F 16/254   Extract, transform and load...
G06F 16/288   Entity relationship models
