Methods, mediums, and systems for data harmonization and data mapping in specified domains
Abstract
The techniques described herein automatically and programmatically harmonize data, and map variable names from a dataset to standards of domains for data in the dataset. Each variable may be stored in a table which holds related groups of variables. The variables may be named by defining mappings, each mapping including two mapping rules. A first mapping rule maps a domain of the standard to the table, while a second mapping rule maps a variable within the table to a variable within the domain. When a mapping rule exists that provides an exact match between a variable name and a standard, an auto-mapping feature may be applied that automatically maps the variable name to the standard. If no exact match exists, then an analysis is performed to determine the most likely mapping candidate.
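The two-stage behavior in the abstract (auto-map on an exact match, otherwise look for the most likely candidate) can be sketched as below. This is only an illustration under stated assumptions: the function name `suggest_mapping` is hypothetical, "exact" is taken to mean equal after trimming and lowercasing, and `difflib.SequenceMatcher` stands in for the candidate analysis, which the abstract does not specify.

```python
from difflib import SequenceMatcher


def suggest_mapping(variable_name, standard_names):
    """Return (standard_name, score, auto_mapped) for one dataset variable.

    Hypothetical helper: not taken from the patent text.
    Assumes standard_names is non-empty.
    """
    # Index the standard's names by a case-insensitive key (assumption).
    by_key = {name.lower(): name for name in standard_names}
    key = variable_name.strip().lower()

    # Stage 1: an exact match is auto-mapped with no further analysis.
    if key in by_key:
        return by_key[key], 1.0, True

    # Stage 2: no exact match, so rank every candidate by string similarity
    # (an assumed stand-in for the unspecified analysis) and keep the best.
    scored = [
        (SequenceMatcher(None, key, candidate_key).ratio(), name)
        for candidate_key, name in by_key.items()
    ]
    score, best = max(scored)
    return best, score, False
```

A caller would typically accept auto-mapped results directly and route the remaining suggestions, with their scores, to a reviewer.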
30 Claims
1. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to:
receive an identification of a standard that defines one or more domains, the standard representing a collection of recommended names for domain variables pertaining to data associated with each respective domain;
associate the standard with a collection of data, the collection of data comprising one or more tables, each of the one or more tables comprising one or more data variables having respective variable names;
read the collection of data into a memory;
canonicalize the collection of data, the canonicalizing comprising one or more of removing empty records, converting the collection of data to a same letter case, combining columns in the collection of data, or filtering categories in the collection of data to remove categories having observation counts below a predetermined minimum threshold;
determine a first mapping of a selected table from the one or more tables to a specified domain of the one or more domains, and determining a second mapping of a selected data variable name within the selected table to a specified domain variable name within the specified domain, wherein determining the first mapping and the second mapping comprises:
analyzing a name of the selected table and the selected data variable name to generate a plurality of n-grams of the selected table name and the selected data variable name,
storing the plurality of n-grams in a dictionary,
performing an analysis on the plurality of n-grams to re-scale the dictionary,
evaluating a plurality of models to determine which of the plurality of models produces a highest weighted accuracy in mapping terms of the dictionary to terms of the standard,
selecting one of the plurality of models based on the weighted accuracy, and
identifying the first mapping and the second mapping in the selected model;
defining a mapping rule based on the first mapping and the second mapping;
receive a new data set comprising one or more new tables, the one or more new tables each comprising one or more new data variables having respective new variable names; and
use the mapping rule to predict that a selected one of the new data variables should be mapped to the specified domain variable name within the specified domain based at least in part on:
the new variable name of the selected one of the new data variables being identical to the selected domain variable name, or the new variable name matching the selected domain variable name within a predetermined threshold closeness value.
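As an illustration of the canonicalizing step recited in claim 1, the sketch below applies three of the four listed options: removing empty records, converting to a same letter case, and filtering categories whose observation counts fall below a minimum threshold. The function name `canonicalize`, the list-of-dicts record layout, and the default threshold of 2 are assumptions made for the example, not details from the claim.

```python
from collections import Counter


def _is_blank(value):
    """Treat None and whitespace-only strings as empty."""
    return value is None or (isinstance(value, str) and not value.strip())


def canonicalize(records, category_field, min_count=2):
    """Hypothetical canonicalization over a list of dict records."""
    # 1. Remove empty records (every field blank).
    non_empty = [r for r in records if not all(_is_blank(v) for v in r.values())]

    # 2. Convert field names and string values to the same letter case.
    lowered = [
        {k.lower(): (v.lower() if isinstance(v, str) else v) for k, v in r.items()}
        for r in non_empty
    ]

    # 3. Drop categories observed fewer than min_count times.
    field = category_field.lower()
    counts = Counter(r.get(field) for r in lowered)
    return [r for r in lowered if counts[r.get(field)] >= min_count]
```

Because the claim says "one or more of", a real pipeline could apply any subset of these steps; combining columns is omitted here only to keep the sketch short.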
11. A method performed by an electronic device, the method comprising:
receiving an identification of a standard that defines one or more domains, the standard representing a collection of recommended names for domain variables pertaining to data associated with each respective domain;
associating the standard with a collection of data, the collection of data comprising one or more tables, each of the one or more tables comprising one or more data variables having respective variable names;
reading the collection of data into a memory;
canonicalizing the collection of data, the canonicalizing comprising one or more of removing empty records, converting the collection of data to a same letter case, combining columns in the collection of data, or filtering categories in the collection of data to remove categories having observation counts below a predetermined minimum threshold;
determining a first mapping of a selected table from the one or more tables to a specified domain of the one or more domains, and determining a second mapping of a selected data variable name within the selected table to a specified domain variable name within the specified domain, wherein determining the first mapping and the second mapping comprises:
analyzing a name of the selected table and the selected data variable name to generate a plurality of n-grams of the selected table name and the selected data variable name,
storing the plurality of n-grams in a dictionary,
performing an analysis on the plurality of n-grams to re-scale the dictionary,
evaluating a plurality of models to determine which of the plurality of models produces a highest weighted accuracy in mapping terms of the dictionary to terms of the standard,
selecting one of the plurality of models based on the weighted accuracy, and
identifying the first mapping and the second mapping in the selected model;
defining a mapping rule based on the first mapping and the second mapping;
receiving a new data set comprising one or more new tables, the one or more new tables each comprising one or more new data variables having respective new variable names; and
using the mapping rule to predict that a selected one of the new data variables should be mapped to the specified domain variable name within the specified domain based at least in part on:
the new variable name of the selected one of the new data variables being identical to the selected domain variable name, or the new variable name matching the selected domain variable name within a predetermined threshold closeness value.
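The n-gram steps of claim 11 (generate n-grams from the table and variable names, store them in a dictionary, re-scale the dictionary) might look roughly like the following. The claim does not say which analysis performs the re-scaling; this sketch assumes character trigrams and a smoothed TF-IDF style weight, and the names `char_ngrams` and `build_rescaled_dictionary` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a name; names shorter than n yield themselves."""
    text = text.lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_rescaled_dictionary(names, n=3):
    """names: list of (table_name, variable_name) pairs.

    Returns {ngram: weight}, where weight = raw count * smoothed IDF
    (an assumed re-scaling, not one named in the claim).
    """
    # One bag of n-grams per (table, variable) pair.
    docs = [Counter(char_ngrams(t, n) + char_ngrams(v, n)) for t, v in names]

    # Raw dictionary: total occurrences of each n-gram.
    totals = Counter()
    for doc in docs:
        totals.update(doc)

    # Document frequency: in how many pairs each n-gram appears.
    doc_freq = Counter(gram for doc in docs for gram in doc)

    n_docs = len(docs)
    return {
        gram: count * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0)
        for gram, count in totals.items()
    }
```

Under this weighting, n-grams that occur in many names (such as a shared table name) contribute less per occurrence than n-grams that distinguish one variable from another.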
21. An apparatus comprising:
a hardware processor circuit;
an interface executable on the processor circuit and configured to receive an identification of a standard that defines one or more domains, the standard representing a collection of recommended names for domain variables pertaining to data associated with each respective domain;
association logic executable on the processor circuit and configured to associate the standard with a collection of data, the collection of data comprising one or more tables, each of the one or more tables comprising one or more data variables having respective variable names;
storage logic executable on the processor circuit and configured to read the collection of data into a memory;
canonicalization logic executable on the processor circuit and configured to canonicalize the collection of data, the canonicalizing comprising one or more of removing empty records, converting the collection of data to a same letter case, combining columns in the collection of data, or filtering categories in the collection of data to remove categories having observation counts below a predetermined minimum threshold; and
mapping logic executable on the processor circuit and configured to determine a first mapping of a selected table from the one or more tables to a specified domain of the one or more domains, and determining a second mapping of a selected data variable name within the selected table to a specified domain variable name within the specified domain, wherein determining the first mapping and the second mapping comprises:
analyzing a name of the selected table and the selected data variable name to generate a plurality of n-grams of the selected table name and the selected data variable name,
storing the plurality of n-grams in a dictionary,
performing an analysis on the plurality of n-grams to re-scale the dictionary,
evaluating a plurality of models to determine which of the plurality of models produces a highest weighted accuracy in mapping terms of the dictionary to terms of the standard,
selecting one of the plurality of models based on the weighted accuracy, and
identifying the first mapping and the second mapping in the selected model, wherein:
the interface is further configured to receive a new data set comprising one or more new tables, the one or more new tables each comprising one or more new data variables having respective new variable names, and
the mapping logic is further configured to defining a mapping rule based on the first mapping and the second mapping, and to use the mapping rule to predict that a selected one of the new data variables should be mapped to the specified domain variable name within the specified domain based at least in part on:
the new variable name of the selected one of the new data variables being identical to the selected domain variable name, or the new variable name matching the selected domain variable name within a predetermined threshold closeness value.
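The model-selection step shared by the independent claims (evaluate several models, keep the one with the highest weighted accuracy) can be sketched as follows. The claims do not define how the accuracy is weighted; this example assumes a per-example weight, so that accuracy is the weighted share of correctly mapped names. The names `weighted_accuracy` and `select_model`, and the representation of a model as a plain callable, are hypothetical.

```python
def weighted_accuracy(model, examples):
    """examples: list of (name, expected_standard_term, weight) tuples.

    Assumed definition: sum of weights of correct predictions divided by
    the sum of all weights.
    """
    total = sum(weight for _, _, weight in examples)
    if not total:
        return 0.0
    correct = sum(
        weight for name, expected, weight in examples if model(name) == expected
    )
    return correct / total


def select_model(models, examples):
    """models: {label: callable(name) -> predicted standard term}.

    Returns the label of the best-scoring model and all scores.
    """
    scores = {label: weighted_accuracy(fn, examples) for label, fn in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Once a model is selected, its predictions for each table and variable name would supply the first and second mappings from which the mapping rule is defined.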
Specification