Primary key-foreign key relationship determination through machine learning

  • US 10,692,015 B2
  • Filed: 07/15/2016
  • Issued: 06/23/2020
  • Est. Priority Date: 07/15/2016
  • Status: Active Grant
First Claim

1. A method for determining primary key-foreign key relationships among data in a plurality of tables of a target database through machine learning, the method employing a machine learning relationship determination system comprising at least one processor configured to execute computer program instructions for performing the method comprising:

    selecting a first column of data from a first table among the tables and a second column of data from a second table among the tables for each of the tables in the target database by the machine learning relationship determination system, wherein the first column of data comprises a first column name and the second column of data comprises a second column name different from the first column name;

    identifying the selected first column of data as a prospective primary key and the selected second column of data as a prospective foreign key to form an inclusion dependency pair by the machine learning relationship determination system on determining presence of data elements of the selected second column of data in the selected first column of data in entirety;

    receiving a plurality of predetermined inclusion dependency pairs comprising primary key-foreign key pairs classified as positive training data and positive validation data, and non-primary key-foreign key pairs classified as negative training data and negative validation data, by the machine learning relationship determination system from a source database, wherein the positive validation data and the negative validation data form a validation data set;

    splitting the positive training data and the negative training data into training data sets by the machine learning relationship determination system;

    computing a plurality of primary key-foreign key features for the inclusion dependency pair of the prospective primary key and the prospective foreign key, the training data sets, and the validation data set by the machine learning relationship determination system, wherein the primary key-foreign key features for the training data sets and the validation data set are computed by the machine learning relationship determination system using one of a plurality of items selected from the group consisting of data elements of the predetermined inclusion dependency pairs, a number of unique data elements of foreign keys in the predetermined inclusion dependency pairs, Levenshtein distance between names of primary keys and the foreign keys in the predetermined inclusion dependency pairs, a prefix matching score obtained from the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, sound codes obtained by applying a Metaphone algorithm on the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, patterns of the names of the primary keys and the foreign keys in the predetermined inclusion dependency pairs, statistical measures, and any combination thereof;

    generating trained machine learning models corresponding to the training data sets by the machine learning relationship determination system by training each of one or more machine learning classification algorithms using the training data sets and the computed primary key-foreign key features of the training data sets;

    generating validated machine learning models for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system on testing the generated trained machine learning models corresponding to the training data sets with the validation data set using the computed primary key-foreign key features of the validation data set;

    determining an optimum algorithm decision threshold for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system using the generated validated machine learning models;

    determining a resultant of the inclusion dependency pair being one of a primary key-foreign key pair and a non-primary key-foreign key pair by the machine learning relationship determination system for the each of the one or more machine learning classification algorithms using the determined optimum algorithm decision threshold and the computed primary key-foreign key features of the inclusion dependency pair of the prospective primary key and the prospective foreign key; and

    performing majority voting on the determined resultant for the each of the one or more machine learning classification algorithms by the machine learning relationship determination system to determine a primary key-foreign key relationship among the data in the selected first column of data of the first table and the selected second column of data of the second table.
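
For illustration only, three of the claimed steps (the inclusion dependency test, two of the name-based features, and the majority vote) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the normalization used for the prefix matching score, the "score meets threshold" voting rule, and the strict-majority tie handling are all hypothetical choices that the claim does not specify. The per-algorithm decision thresholds are assumed to have been determined already.

```python
from typing import Hashable, Iterable, Sequence


def is_inclusion_dependency(fk_values: Iterable[Hashable],
                            pk_values: Iterable[Hashable]) -> bool:
    """True when every value of the prospective foreign key column
    also occurs in the prospective primary key column."""
    return set(fk_values) <= set(pk_values)


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two column names (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def prefix_score(a: str, b: str) -> float:
    """Shared-prefix length divided by the longer name's length.
    The normalization is an assumption; the claim does not define the score."""
    common = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        common += 1
    longest = max(len(a), len(b))
    return common / longest if longest else 0.0


def majority_vote(scores: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Each classifier votes 'primary key-foreign key pair' when its score
    meets its own decision threshold; a strict majority of votes wins.
    Treating ties as 'not a pair' is an assumption."""
    votes = sum(score >= t for score, t in zip(scores, thresholds))
    return votes > len(scores) / 2
```

With hypothetical columns `customer.id = [1, 2, 3, 4]` and `orders.cust_ref = [2, 2, 4]`, `is_inclusion_dependency([2, 2, 4], [1, 2, 3, 4])` holds, so the pair would move on to feature computation. Note that an inclusion dependency alone does not establish a primary key-foreign key relationship, which is why the claim passes the candidate pair through trained classifiers and a vote.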
