Primary key-foreign key relationship determination through machine learning
Abstract
A method and a machine learning relationship determination system (MLRDS) for determining primary key-foreign key (PK-FK) relationships among data in tables of a target database through machine learning (ML) are provided. The MLRDS selects columns of the tables in the target database and identifies inclusion dependency (ID) pairs from the selected columns. The MLRDS receives training data and validation data from a source database, computes PK-FK features for the inclusion dependency pairs, the training data, and the validation data, and generates trained ML models and validated ML models using the PK-FK features. The MLRDS determines an optimum algorithm decision threshold for a selected machine learning classification algorithm (MLCA), using which the MLRDS determines a resultant on whether the inclusion dependency pair is a PK-FK pair or a non-PK-FK pair. The MLRDS performs majority voting on the resultant for multiple MLCAs to confirm the PK-FK relationships between the inclusion dependency pairs.
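The inclusion dependency step described in the abstract can be illustrated with a minimal sketch. It assumes tables are held in memory as plain dictionaries; the function name `find_inclusion_dependency_pairs` and the sample tables are hypothetical and are not taken from the patent. A pair is reported only when every value of the candidate foreign key column appears in the candidate primary key column, the two columns belong to different tables, and the column names differ, as the claims require.

```python
from itertools import permutations


def find_inclusion_dependency_pairs(tables):
    """Return ((pk_table, pk_column), (fk_table, fk_column)) candidates.

    `tables` maps a table name to a dict of column name -> list of values.
    This in-memory layout is an assumption made for illustration only.
    """
    columns = [
        (table, column, set(values))
        for table, cols in tables.items()
        for column, values in cols.items()
    ]
    pairs = []
    for (pk_t, pk_c, pk_vals), (fk_t, fk_c, fk_vals) in permutations(columns, 2):
        # The claims select columns from different tables with different names.
        if pk_t == fk_t or pk_c == fk_c:
            continue
        # Inclusion dependency: the FK values are contained in the PK values in entirety.
        if fk_vals and fk_vals <= pk_vals:
            pairs.append(((pk_t, pk_c), (fk_t, fk_c)))
    return pairs


# Hypothetical sample data.
tables = {
    "customers": {"cust_id": [1, 2, 3, 4], "region": ["N", "S", "N", "E"]},
    "orders": {"order_id": [10, 11, 12], "customer": [1, 1, 3]},
}
print(find_inclusion_dependency_pairs(tables))
```

An inclusion dependency is only a necessary condition for a PK-FK relationship, which is why the candidates are subsequently passed to the classification steps.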
19 Claims
1. A method for determining primary key-foreign key relationships among data in a plurality of tables of a target database through machine learning, the method employing a machine learning relationship determination system comprising at least one processor configured to execute computer program instructions for performing the method comprising:
selecting a first column of data from a first table among the tables and a second column of data from a second table among the tables for each of the tables in the target database by the machine learning relationship determination system, wherein the first column of data comprises a first column name and the second column of data comprises a second column name different from the first column name;
identifying the selected first column of data as a prospective primary key and the selected second column of data as a prospective foreign key to form an inclusion dependency pair by the machine learning relationship determination system on determining presence of data elements of the selected second column of data in the selected first column of data in entirety;
receiving a plurality of predetermined inclusion dependency pairs comprising primary key-foreign key pairs classified as positive training data and positive validation data, and non-primary key-foreign key pairs classified as negative training data and negative validation data, by the machine learning relationship determination system from a source database, wherein the positive validation data and the negative validation data form a validation data set;
splitting the positive training data and the negative training data into training data sets by the machine learning relationship determination system;
computing a plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key, the training data sets, and the validation data set by the machine learning relationship determination system, wherein the primary key-foreign key features for the training data sets and the validation data set are computed by the machine learning relationship determination system using one of a plurality of items selected from the group consisting of data elements of the predetermined inclusion dependency pairs, a number of unique data elements of foreign keys in the predetermined inclusion dependency pairs, Levenshtein distance between names of primary keys and the foreign keys in the predetermined inclusion dependency pairs, a prefix matching score obtained from the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, sound codes obtained by applying a Metaphone algorithm on the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, patterns of the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, statistical measures, and any combination thereof;
generating trained machine learning models corresponding to the training data sets by the machine learning relationship determination system by training each of one or more machine learning classification algorithms using the training data sets and the computed primary key-foreign key features of the training data sets;
generating validated machine learning models for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system on testing the generated trained machine learning models corresponding to the training data sets with the validation data set using the computed primary key-foreign key features of the validation data set;
determining an optimum algorithm decision threshold for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system using the generated validated machine learning models;
determining a resultant of the inclusion dependency pair being one of a primary key-foreign key pair and a non-primary key-foreign key pair by the machine learning relationship determination system for the each of the one or more machine learning classification algorithms using the determined optimum algorithm decision threshold and the computed primary key-foreign key features of the inclusion dependency pair of the prospective primary key and the prospective foreign key; and
performing majority voting on the determined resultant for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system to determine a primary key-foreign key relationship among the data in the selected first column of data of the first table and the selected second column of data of the second table.
Dependent claims: 2, 3, 4, 5, 6, 7
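Several of the features recited in claim 1 can be sketched as follows. This is an illustrative sketch only: the claim does not define how the prefix matching score or the statistical measures are calculated, so `prefix_score` (shared prefix length divided by the shorter name's length) and `fk_coverage` are assumptions, and all function names are hypothetical. The Metaphone sound codes and name patterns are omitted for brevity.

```python
import os


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def prefix_score(a, b):
    """Assumed definition: shared prefix length over the shorter name's length."""
    if not a or not b:
        return 0.0
    common = len(os.path.commonprefix([a, b]))
    return common / min(len(a), len(b))


def pk_fk_features(pk_name, pk_values, fk_name, fk_values):
    """Compute a small, hypothetical feature vector for one inclusion dependency pair."""
    pk, fk = pk_name.lower(), fk_name.lower()
    pk_unique, fk_unique = set(pk_values), set(fk_values)
    return {
        "name_levenshtein": levenshtein(pk, fk),
        "name_prefix_score": prefix_score(pk, fk),
        "fk_unique_count": len(fk_unique),
        # Assumed statistical measure: share of PK values referenced by the FK.
        "fk_coverage": len(fk_unique) / len(pk_unique) if pk_unique else 0.0,
    }


print(pk_fk_features("cust_id", [1, 2, 3, 4], "customer", [1, 1, 3]))
```

The same function would be applied to the predetermined pairs in the training and validation data and to the prospective pairs from the target database, so that all three share one feature space.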
8. A system for determining primary key-foreign key relationships among data in a plurality of tables of a target database, the system comprising:
a non-transitory computer readable storage medium configured to store computer program instructions; and
at least one processor connected to the non-transitory computer readable storage medium, the computer program instructions when executed by the at least one processor configure the system to:
select a first column of data from a first table among the tables and a second column of data from a second table among the tables for each of the tables in the target database, wherein the first column of data comprises a first column name and the second column of data comprises a second column name different from the first column name;
identify the selected first column of data as a prospective primary key and the selected second column of data as a prospective foreign key to form an inclusion dependency pair on determining presence of data elements of the selected second column of data in the selected first column of data in entirety;
receive a plurality of predetermined inclusion dependency pairs comprising primary key-foreign key pairs classified as positive training data and positive validation data, and non-primary key-foreign key pairs classified as negative training data and negative validation data, from a source database, wherein the positive validation data and the negative validation data form a validation data set;
split the positive training data and the negative training data into training data sets;
compute a plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key, the training data sets, and the validation data set, wherein the system is further configured to compute the primary key-foreign key features for the training data sets and the validation data set using one of a plurality of items selected from the group consisting of data elements of the predetermined inclusion dependency pairs, a number of unique data elements of foreign keys in the predetermined inclusion dependency pairs, Levenshtein distance between names of primary keys and the foreign keys in the predetermined inclusion dependency pairs, a prefix matching score obtained from the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, sound codes obtained by applying a Metaphone algorithm on the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, patterns of the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, statistical measures, and any combination thereof;
generate trained machine learning models corresponding to the training data sets by training each of one or more machine learning classification algorithms using the training data sets and the computed primary key-foreign key features of the training data sets;
generate validated machine learning models for the each of the one or more machine learning classification algorithms on testing the generated trained machine learning models corresponding to the training data sets with the validation data set using the computed primary key-foreign key features of the validation data set;
determine an optimum algorithm decision threshold for the each of the one or more machine learning classification algorithms using the generated validated machine learning models;
determine a resultant of the inclusion dependency pair being one of a primary key-foreign key pair and a non-primary key-foreign key pair for the each of the one or more machine learning classification algorithms using the determined optimum algorithm decision threshold and the computed primary key-foreign key features of the inclusion dependency pair of the prospective primary key and the prospective foreign key; and
perform majority voting on the determined resultant for the each of the one or more machine learning classification algorithms to determine a primary key-foreign key relationship among the data in the selected first column of data of the first table and the selected second column of data of the second table.
Dependent claims: 9, 10, 11, 12, 13, 14
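The claims do not state the criterion by which a decision threshold is judged "optimum". As one possible reading, the sketch below picks, for a single classification algorithm, the threshold that maximizes the F1 score on the validation data set. The F1 criterion, the function names, and the sample scores are all assumptions made for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when pairs scoring at or above `threshold` are called PK-FK."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as a threshold and keep the one with the best F1.

    Ties are broken in favour of the higher (more conservative) threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


# Hypothetical validation output of one trained model:
# predicted PK-FK probabilities and the known labels (1 = PK-FK pair).
val_scores = [0.1, 0.4, 0.35, 0.8, 0.7]
val_labels = [0, 0, 1, 1, 1]
print(best_threshold(val_scores, val_labels))
```

Under this reading the search is repeated for every classification algorithm, so each algorithm ends up with its own threshold.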
15. A non-transitory computer readable storage medium having stored thereon instructions for causing one or more processing units to execute a process for determining primary key-foreign key relationships among data in a plurality of tables of a target database, the process comprising:
selecting a first column of data from a first table among the tables and a second column of data from a second table among the tables for each of the tables in the target database, wherein the first column of data comprises a first column name and the second column of data comprises a second column name different from the first column name;
identifying the selected first column of data as a prospective primary key and the selected second column of data as a prospective foreign key to form an inclusion dependency pair on determining presence of data elements of the selected second column of data in the selected first column of data in entirety;
receiving a plurality of predetermined inclusion dependency pairs comprising primary key-foreign key pairs classified as positive training data and positive validation data, and non-primary key-foreign key pairs classified as negative training data and negative validation data, from a source database, wherein the positive validation data and the negative validation data form a validation data set;
splitting the positive training data and the negative training data into training data sets;
computing a plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key, the training data sets, and the validation data set, wherein the primary key-foreign key features for the training data sets and the validation data set are computed by the machine learning relationship determination system using one of a plurality of items selected from the group consisting of data elements of the predetermined inclusion dependency pairs, a number of unique data elements of foreign keys in the predetermined inclusion dependency pairs, Levenshtein distance between names of primary keys and the foreign keys in the predetermined inclusion dependency pairs, a prefix matching score obtained from the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, sound codes obtained by applying a Metaphone algorithm on the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, patterns of the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, statistical measures, and any combination thereof;
generating trained machine learning models corresponding to the training data sets by training each of one or more machine learning classification algorithms using the training data sets and the computed primary key-foreign key features of the training data sets;
generating validated machine learning models for the each of the one or more machine learning classification algorithms on testing the generated trained machine learning models corresponding to the training data sets with the validation data set using the computed primary key-foreign key features of the validation data set;
determining an optimum algorithm decision threshold for the each of the one or more machine learning classification algorithms using the generated validated machine learning models;
determining a resultant of the inclusion dependency pair being one of a primary key-foreign key pair and a non-primary key-foreign key pair for the each of the one or more machine learning classification algorithms using the determined optimum algorithm decision threshold and the computed primary key-foreign key features of the inclusion dependency pair of the prospective primary key and the prospective foreign key; and
performing majority voting on the determined resultant for the each of the one or more machine learning classification algorithms to determine a primary key-foreign key relationship among the data in the selected first column of data of the first table and the selected second column of data of the second table.
Dependent claims: 16, 17, 18
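The final two steps, the per-algorithm resultant and the majority vote, can be sketched as below. The algorithm names, scores, and thresholds are hypothetical, since the claims do not name specific classification algorithms. Treating a tie as a non-PK-FK outcome is also an assumption.

```python
def majority_vote(scores_by_algorithm, thresholds):
    """Return True when more than half of the algorithms call the pair PK-FK.

    `scores_by_algorithm` maps an algorithm name to the score it gave the
    inclusion dependency pair; `thresholds` maps the same names to the
    decision threshold chosen for that algorithm.
    """
    votes = [
        scores_by_algorithm[name] >= thresholds[name]  # per-algorithm resultant
        for name in scores_by_algorithm
    ]
    # Strict majority; a tie is treated as "not PK-FK" (an assumption).
    return sum(votes) * 2 > len(votes)


# Hypothetical scores for one inclusion dependency pair.
scores = {"random_forest": 0.62, "logistic_regression": 0.41, "naive_bayes": 0.77}
thresholds = {"random_forest": 0.50, "logistic_regression": 0.45, "naive_bayes": 0.70}
print(majority_vote(scores, thresholds))
```

Because each algorithm is compared against its own threshold before voting, a model that systematically produces low scores is not disadvantaged relative to one that produces high scores.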
19. A method for determining primary key-foreign key relationships among data in a plurality of tables of a target database through machine learning, the method employing a machine learning relationship determination system comprising at least one processor configured to execute computer program instructions for performing the method comprising:
selecting a first column of data from a first table among the tables and a second column of data from a second table among the tables for each of the tables in the target database by the machine learning relationship determination system, wherein the first column of data comprises a first column name and the second column of data comprises a second column name different from the first column name;
identifying the selected first column of data as a prospective primary key and the selected second column of data as a prospective foreign key to form an inclusion dependency pair by the machine learning relationship determination system on determining presence of data elements of the selected second column of data in the selected first column of data in entirety;
receiving a plurality of predetermined inclusion dependency pairs comprising primary key-foreign key pairs classified as positive training data and positive validation data, and non-primary key-foreign key pairs classified as negative training data and negative validation data, by the machine learning relationship determination system from a source database, wherein the positive validation data and the negative validation data form a validation data set;
splitting the positive training data and the negative training data into training data sets by the machine learning relationship determination system;
computing a plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key, the training data sets, and the validation data set by the machine learning relationship determination system, wherein the plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key are computed by the machine learning relationship determination system using one of a plurality of items selected from the group consisting of data elements of the prospective primary key identified by the selected first column of data, the data elements of the prospective foreign key identified by the selected second column of data, a number of unique data elements of the prospective foreign key, Levenshtein distance between names of the prospective primary key and the prospective foreign key, a prefix matching score obtained from the names of the prospective primary key and the prospective foreign key, sound codes obtained by applying a Metaphone algorithm on the names of the prospective primary key and the prospective foreign key, patterns of the names of the prospective primary key and the prospective foreign key, statistical measures, and any combination thereof;
generating trained machine learning models corresponding to the training data sets by the machine learning relationship determination system by training each of one or more machine learning classification algorithms using the training data sets and the computed primary key-foreign key features of the training data sets;
generating validated machine learning models for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system on testing the generated trained machine learning models corresponding to the training data sets with the validation data set using the computed primary key-foreign key features of the validation data set;
determining an optimum algorithm decision threshold for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system using the generated validated machine learning models;
determining a resultant of the inclusion dependency pair being one of a primary key-foreign key pair and a non-primary key-foreign key pair by the machine learning relationship determination system for the each of the one or more machine learning classification algorithms using the determined optimum algorithm decision threshold and the computed primary key-foreign key features of the inclusion dependency pair of the prospective primary key and the prospective foreign key; and
performing majority voting on the determined resultant for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system to determine a primary key-foreign key relationship among the data in the selected first column of data of the first table and the selected second column of data of the second table.
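The claims recite splitting the positive and negative training data into training data sets without specifying how the split is made. One simple possibility, shown purely as an assumption, is to shuffle each class and deal its examples round-robin into a fixed number of sets so that every set contains both classes. The function name and the `n_sets` and `seed` parameters are hypothetical.

```python
import random


def split_training_data(positive, negative, n_sets, seed=0):
    """Split labelled pairs into `n_sets` training data sets.

    Each returned set is a list of (pair, label) tuples, with label 1 for
    PK-FK pairs and 0 for non-PK-FK pairs. Round-robin dealing after a
    seeded shuffle is an illustrative choice, not the patented method.
    """
    rng = random.Random(seed)
    pos, neg = list(positive), list(negative)
    rng.shuffle(pos)
    rng.shuffle(neg)
    training_sets = []
    for i in range(n_sets):
        examples = [(pair, 1) for pair in pos[i::n_sets]]
        examples += [(pair, 0) for pair in neg[i::n_sets]]
        training_sets.append(examples)
    return training_sets


# Hypothetical pair identifiers standing in for inclusion dependency pairs.
sets = split_training_data(list("abcdef"), list("uvwxyz"), n_sets=3)
for s in sets:
    print(s)
```

A trained model would then be generated per training data set and per classification algorithm, and each would be tested against the shared validation data set.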
Specification