Methods, mediums, and systems for data harmonization and data mapping in specified domains

  • US 10,095,716 B1
  • Filed: 04/02/2018
  • Issued: 10/09/2018
  • Est. Priority Date: 04/02/2017
  • Status: Active Grant
First Claim

1. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to:

    receive an identification of a standard that defines one or more domains, the standard representing a collection of recommended names for domain variables pertaining to data associated with each respective domain;

    associate the standard with a collection of data, the collection of data comprising one or more tables, each of the one or more tables comprising one or more data variables having respective variable names;

    read the collection of data into a memory;

    canonicalize the collection of data, the canonicalizing comprising one or more of removing empty records, converting the collection of data to a same letter case, combining columns in the collection of data, or filtering categories in the collection of data to remove categories having observation counts below a predetermined minimum threshold;

    determine a first mapping of a selected table from the one or more tables to a specified domain of the one or more domains, and determining a second mapping of a selected data variable name within the selected table to a specified domain variable name within the specified domain, wherein determining the first mapping and the second mapping comprises:

        analyzing a name of the selected table and the selected data variable name to generate a plurality of n-grams of the selected table name and the selected data variable name,

        storing the plurality of n-grams in a dictionary,

        performing an analysis on the plurality of n-grams to re-scale the dictionary,

        evaluating a plurality of models to determine which of the plurality of models produces a highest weighted accuracy in mapping terms of the dictionary to terms of the standard,

        selecting one of the plurality of models based on the weighted accuracy, and

        identifying the first mapping and the second mapping in the selected model;

    defining a mapping rule based on the first mapping and the second mapping;

    receive a new data set comprising one or more new tables, the one or more new tables each comprising one or more new data variables having respective new variable names; and

    use the mapping rule to predict that a selected one of the new data variables should be mapped to the specified domain variable name within the specified domain based at least in part on:

    the new variable name of the selected one of the new data variables being identical to the selected domain variable name, or the new variable name matching the selected domain variable name within a predetermined threshold closeness value.
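
The canonicalizing step recited in the claim can be illustrated with a minimal sketch. It is not taken from the patent: the function name `canonicalize`, the row-of-dicts layout, and the `min_count` parameter are assumptions made for illustration. The sketch covers three of the listed operations (removing empty records, converting to a same letter case, and filtering low-count categories) and omits column combining.

```python
from collections import Counter


def canonicalize(records, category_key, min_count=2):
    """Canonicalize a list of row dicts (hypothetical helper, not from the patent)."""
    # 1. Remove empty records: rows in which every value is None or blank.
    non_empty = [
        row for row in records
        if any(v is not None and str(v).strip() != "" for v in row.values())
    ]
    # 2. Convert string values to the same letter case (lower case is an assumption).
    lowered = [
        {k: v.strip().lower() if isinstance(v, str) else v for k, v in row.items()}
        for row in non_empty
    ]
    # 3. Drop rows whose category has fewer observations than the minimum threshold.
    counts = Counter(row.get(category_key) for row in lowered)
    return [row for row in lowered if counts[row.get(category_key)] >= min_count]


records = [
    {"SEX": "M", "AGE": 30},
    {"SEX": "m ", "AGE": 41},
    {"SEX": "", "AGE": None},   # empty record, removed in step 1
    {"SEX": "F", "AGE": 25},    # category seen once, removed in step 3
]
cleaned = canonicalize(records, category_key="SEX", min_count=2)
```

Because the claim requires only "one or more of" the listed operations, an implementation could apply any subset of these steps; the order used here is one plausible choice.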

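The n-gram generation and the final prediction step (an identical name, or a match within a predetermined threshold closeness value) can be sketched as follows. The claim does not specify a closeness metric, so Jaccard similarity over character trigrams is an assumption here, as are the function names, the 0.5 threshold, and the example variable names. The dictionary re-scaling and model-selection steps of the claim are not covered.

```python
def char_ngrams(name, n=3):
    """Return the set of character n-grams of a lower-cased name."""
    text = name.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def closeness(a, b, n=3):
    """Jaccard similarity of the two names' n-gram sets (assumed metric)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def predict_mapping(new_name, domain_variables, threshold=0.5):
    """Return the best-matching domain variable name, or None if none is close enough."""
    best_name, best_score = None, 0.0
    for candidate in domain_variables:
        # Identical names map directly (compared case-insensitively, an assumption).
        if new_name.lower() == candidate.lower():
            return candidate
        score = closeness(new_name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    # Otherwise accept the closest candidate only if it meets the threshold.
    return best_name if best_score >= threshold else None


# Hypothetical domain variable names used only for illustration.
domain = ["AGE", "SEX", "BRTHDTC"]
exact = predict_mapping("age", domain)
fuzzy = predict_mapping("BRTHDT", domain)
unmatched = predict_mapping("WEIGHT", domain)
```

A set-based similarity keeps the sketch short; an edit-distance measure or a trained classifier over the n-gram dictionary, as the claim's model-evaluation step suggests, would be alternative choices.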