Techniques for relationship discovery between datasets
First Claim
1. A method comprising, at a computer system:
generating first profile metadata for each column of a first plurality of columns in a first dataset stored in a first data source;
generating second profile metadata for each column of a second plurality of columns in a second dataset stored in a second data source;
identifying a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns;
determining one or more column pairs from the plurality of identified column pairs to exclude;
excluding at least one column pair from the one or more determined column pairs;
for each of the one or more column pairs remaining after the excluding step:
based on a type of join specified via a graphical interface, computing a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair;
computing a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and
determining a pair score for the column pair, the pair score being a summation of the plurality of weighted scores;
based on the pair score for each of the one or more column pairs, selecting a first column pair from the one or more column pairs;
generating a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and
generating the graphical interface to display the generated third dataset.
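The steps recited above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the profile fields, the two scoring functions (`value_overlap`, `name_similarity`), the weights 0.7 and 0.3, the type-mismatch exclusion rule, and the sample data are all hypothetical choices made for the example. Only an inner join is implemented for the merge, and the graphical interface is omitted.

```python
from difflib import SequenceMatcher
from itertools import product

def profile(rows, column):
    """Collect simple profile metadata for one column (hypothetical fields)."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "name": column,
        "type": type(values[0]).__name__ if values else None,
        "distinct": set(values),
    }

def value_overlap(a, b, join_type):
    """Fraction of matching distinct values; the join type picks the denominator."""
    common = a["distinct"] & b["distinct"]
    base = a["distinct"] if join_type == "left" else a["distinct"] | b["distinct"]
    return len(common) / len(base) if base else 0.0

def name_similarity(a, b, join_type):
    """String similarity of the column names (independent of the join type)."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

# Assumed scoring functions and weights; the pair score is their weighted sum.
SCORERS = [(value_overlap, 0.7), (name_similarity, 0.3)]

def pair_score(a, b, join_type):
    return sum(weight * scorer(a, b, join_type) for scorer, weight in SCORERS)

def best_pair(left_rows, right_rows, join_type):
    """Profile both datasets, exclude type-mismatched pairs, return the top pair."""
    left = [profile(left_rows, c) for c in left_rows[0]]
    right = [profile(right_rows, c) for c in right_rows[0]]
    remaining = [(a, b) for a, b in product(left, right) if a["type"] == b["type"]]
    scored = [(pair_score(a, b, join_type), a["name"], b["name"]) for a, b in remaining]
    return max(scored, key=lambda item: item[0])  # assumes at least one pair remains

def inner_join(left_rows, right_rows, left_col, right_col):
    """Merge the two datasets on the selected column pair (inner join only)."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_col], []).append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[left_col], [])]

orders = [
    {"order_id": 1, "cust_id": 10, "total": 5.0},
    {"order_id": 2, "cust_id": 11, "total": 7.5},
    {"order_id": 3, "cust_id": 12, "total": 1.0},
]
customers = [
    {"customer_id": 10, "name": "Ann"},
    {"customer_id": 11, "name": "Bob"},
    {"customer_id": 13, "name": "Cy"},
]

score, left_col, right_col = best_pair(orders, customers, "inner")
merged = inner_join(orders, customers, left_col, right_col)
```

In this sample, the pairs whose columns have different value types are excluded before scoring, and `cust_id`/`customer_id` is selected because it leads on both value overlap and name similarity.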
Abstract
The present disclosure relates to techniques for analyzing data from multiple different data sources to determine a relationship between the data (also referred to herein as “data relationship discovery”). The relationships between any two compared datasets may be used to determine one or more recommendations for merging (e.g., joining), or “blending,” the datasets together. Relationship discovery may include determining a relationship between a subset of data, such as a relationship between a pair of columns, or column pair, in which each column belongs to a different one of the compared datasets. Given two datasets to process, relationship discovery may identify and recommend a ranked subset of column pairs between the two compared datasets. The ranked column pairs identified as a relationship may be useful for blending the datasets with respect to those column pairs.
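As a rough illustration of recommending a ranked subset of column pairs, the sketch below ranks every cross-dataset column pair by the Jaccard overlap of their values and keeps the top few. The Jaccard measure, the `top_k` cutoff, the rule that drops zero-overlap pairs, and the sample datasets are assumptions made for this example; the disclosure does not state that this particular measure is used.

```python
def jaccard(a, b):
    """Jaccard overlap of the distinct values in two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_column_pairs(first, second, top_k=3):
    """Return up to top_k (score, first_column, second_column) tuples, best first."""
    scored = [
        (jaccard(values_1, values_2), col_1, col_2)
        for col_1, values_1 in first.items()
        for col_2, values_2 in second.items()
    ]
    scored = [item for item in scored if item[0] > 0]  # drop pairs with no overlap
    return sorted(scored, key=lambda item: item[0], reverse=True)[:top_k]

# Hypothetical datasets, each given as a mapping of column name to values.
sales = {"sku": ["A1", "B2", "C3"], "region": ["east", "west", "east"]}
inventory = {"item_code": ["A1", "B2", "D4"], "warehouse": ["east", "north", "south"]}

ranked = rank_column_pairs(sales, inventory)
# ranked == [(0.5, "sku", "item_code"), (0.25, "region", "warehouse")]
```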
40 Citations
20 Claims
1. A method comprising, at a computer system:
generating first profile metadata for each column of a first plurality of columns in a first dataset stored in a first data source;
generating second profile metadata for each column of a second plurality of columns in a second dataset stored in a second data source;
identifying a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns;
determining one or more column pairs from the plurality of identified column pairs to exclude;
excluding at least one column pair from the one or more determined column pairs;
for each of the one or more column pairs remaining after the excluding step:
based on a type of join specified via a graphical interface, computing a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair;
computing a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and
determining a pair score for the column pair, the pair score being a summation of the plurality of weighted scores;
based on the pair score for each of the one or more column pairs, selecting a first column pair from the one or more column pairs;
generating a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and
generating the graphical interface to display the generated third dataset.
11. A system comprising:
one or more processors; and
a memory accessible to the one or more processors, the memory comprising instructions that, when executed by the one or more processors, cause the one or more processors to:
generate first profile metadata for each column of a first plurality of columns in a first dataset stored in a first data source;
generate second profile metadata for each column of a second plurality of columns in a second dataset stored in a second data source;
identify a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns;
determine one or more column pairs from the plurality of identified column pairs to exclude;
exclude at least one column pair from the one or more determined column pairs;
for each of the one or more column pairs remaining after the excluding step:
based on a type of join specified via a graphical interface, compute a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair;
compute a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and
determine a pair score for the column pair, the pair score being a summation of the plurality of weighted scores;
based on the pair score for each of the one or more column pairs, select a first column pair from the one or more column pairs;
generate a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and
generate the graphical interface to display the generated third dataset.
17. A non-transitory computer readable medium storing one or more instructions that are executable by one or more processors to cause the one or more processors to:
generate first profile metadata for each column of a first plurality of columns in a first dataset stored in a first data source;
generate second profile metadata for each column of a second plurality of columns in a second dataset stored in a second data source;
identify a plurality of column pairs between the first dataset and the second dataset, wherein each column pair in the plurality of column pairs includes a different one of the first plurality of columns and a different one of the second plurality of columns;
determine one or more column pairs from the plurality of identified column pairs;
exclude at least one column pair from the one or more determined column pairs;
for each of the one or more column pairs remaining after the excluding step:
based on a type of join specified via a graphical interface, compute a plurality of scores for the column pair, each of the plurality of scores computed based on a different one of a plurality of scoring functions, the score indicating a measure for joining columns in the column pair;
compute a plurality of weighted scores, each of the plurality of weighted scores computed for a different one of the plurality of scores based on applying one of a plurality of weights to the different one of the plurality of scores; and
determine a pair score for the column pair, the pair score being a summation of the plurality of weighted scores;
based on the pair score for each of the one or more column pairs, select a first column pair from the one or more column pairs;
generate a third dataset based on merging, according to the type of join, the first dataset at a first column within the first column pair with the second dataset at a second column in the first column pair; and
generate the graphical interface to display the generated third dataset.
Specification