Dataset connector and crawler to identify data lineage and segment data
First Claim
1. A dataset connector system comprising:
one or more memory units storing instructions; and
one or more processors that execute the instructions to perform operations comprising:
receiving, by the dataset connector system, a plurality of datasets;
receiving, by the dataset connector system, a request to identify a cluster of connected datasets among the received plurality of datasets;
selecting, by the dataset connector system, a dataset from among the received plurality of datasets;
identifying, by a data profiling model, a data schema of the selected dataset;
determining, by the data profiling model, a statistical metric of the selected dataset;
identifying, by the data profiling model, a plurality of candidate foreign keys of the selected dataset;
determining, by a data mapping model, respective foreign key scores for individual ones of the plurality of candidate foreign keys;
generating, by the data mapping model, a plurality of edges between the selected dataset and the received plurality of datasets based on the foreign key scores, the data schema, a hierarchical relationship, and the statistical metric;
segmenting, by a data classification model, a cluster of connected datasets comprising the selected dataset, the segmenting based on the plurality of edges, wherein segmenting the datasets comprises:
labelling, by the data classification model, data in the cluster of connected datasets, the labelling indicating that associated data comprises at least one of actual data, synthetic data, or derived data; and
removing, by the data classification model, data from the connected datasets that is labelled as at least one of synthetic data or derived data;
returning, by the dataset connector, the segmented cluster of connected datasets; and
updating at least one of the data profiling model, the data mapping model, or the data classification model using the received plurality of datasets.
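The profiling and scoring steps recited above can be illustrated with a minimal sketch. The claim does not say how a foreign key score is computed, so the inclusion ratio used here is an assumption, as are the column-dictionary dataset layout and every function name (`profile`, `foreign_key_score`, `score_candidates`).

```python
from typing import Dict, List, Tuple

# Assumed layout: a dataset maps column names to lists of values.
Dataset = Dict[str, List[object]]


def profile(dataset: Dataset) -> Dict[str, Dict[str, object]]:
    """Return a toy schema (observed type names) and one statistical
    metric (ratio of distinct values) for every column."""
    result = {}
    for column, values in dataset.items():
        type_names = sorted({type(v).__name__ for v in values})
        distinct_ratio = len(set(values)) / len(values) if values else 0.0
        result[column] = {"types": type_names, "distinct_ratio": distinct_ratio}
    return result


def foreign_key_score(child_values: List[object], parent_values: List[object]) -> float:
    """Assumed score: fraction of distinct child values found in the parent column."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


def score_candidates(child: Dataset, parent: Dataset) -> List[Tuple[str, str, float]]:
    """Score each child column against each parent column whose values are all unique."""
    scores = []
    for parent_col, stats in profile(parent).items():
        if stats["distinct_ratio"] < 1.0:
            continue  # only fully unique columns are treated as possible key targets
        for child_col, child_values in child.items():
            score = foreign_key_score(child_values, parent[parent_col])
            scores.append((child_col, parent_col, score))
    # Highest-scoring candidate first; ties keep their insertion order.
    return sorted(scores, key=lambda item: item[2], reverse=True)


customers = {"customer_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]}
orders = {"order_id": [10, 11, 12, 13], "customer_id": [1, 1, 3, 2]}
ranked = score_candidates(orders, customers)
```

In this toy data the `customer_id` column of `orders` is fully contained in the unique `customer_id` column of `customers`, so that pair ranks first. A production data mapping model would presumably weigh more signals than value inclusion alone.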
Abstract
Systems and methods for connecting datasets are disclosed. For example, a system may include a memory unit storing instructions and a processor configured to execute the instructions to perform operations. The operations may include receiving a plurality of datasets and a request to identify a cluster of connected datasets among the received plurality of datasets. The operations may include selecting a dataset. In some embodiments, the operations include identifying a data schema of the selected dataset and determining a statistical metric of the selected dataset. The operations may include identifying foreign key scores. The operations may include generating a plurality of edges between the datasets based on the foreign key scores, the data schema, and the statistical metric. The operations may include segmenting and returning datasets based on the plurality of edges.
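As a rough illustration of generating edges and returning a connected cluster, the sketch below thresholds pairwise scores and walks the resulting graph breadth-first. The abstract bases edges on foreign key scores, the data schema, and a statistical metric; using the score alone, the 0.8 threshold, and the names `build_edges` and `connected_cluster` are simplifying assumptions.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def build_edges(pair_scores: Dict[Edge, float], threshold: float = 0.8) -> Set[Edge]:
    """Keep a dataset pair as an edge when its score meets the assumed threshold."""
    return {pair for pair, score in pair_scores.items() if score >= threshold}


def connected_cluster(selected: str, edges: Iterable[Edge]) -> Set[str]:
    """Return every dataset reachable from the selected one (undirected BFS)."""
    neighbours = defaultdict(set)
    for left, right in edges:
        neighbours[left].add(right)
        neighbours[right].add(left)
    seen = {selected}
    queue = deque([selected])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical scores between dataset names.
pair_scores = {
    ("orders", "customers"): 1.0,
    ("orders", "products"): 0.9,
    ("logs", "customers"): 0.2,
    ("audit", "logs"): 0.95,
}
edges = build_edges(pair_scores)
cluster = connected_cluster("orders", edges)
```

Here `logs` and `audit` are linked to each other but fall outside the cluster for `orders`, because the only edge that would join them scores below the threshold.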
25 Citations
18 Claims
1. A dataset connector system, as recited in full under First Claim above. Dependent claims: 2 through 16.
17. A method for connecting datasets comprising:
receiving, by a dataset connector system, a plurality of datasets;
receiving, by the dataset connector system, a request to identify a cluster of connected datasets among the received plurality of datasets;
selecting, by the dataset connector system, a dataset from among the received plurality of datasets;
identifying, by a data profiling model, a data schema of the selected dataset;
determining, by the data profiling model, a statistical metric of the selected dataset;
identifying, by the data profiling model, a plurality of candidate foreign keys of the selected dataset;
determining, by a data mapping model, respective foreign key scores for individual ones of the plurality of candidate foreign keys;
generating, by the data mapping model, a plurality of edges between the selected dataset and the received plurality of datasets based on the foreign key scores, the data schema, a hierarchical relationship, and the statistical metric;
segmenting, by a data classification model, a cluster of connected datasets comprising the selected dataset, the segmenting based on the plurality of edges, wherein segmenting the datasets comprises:
labelling, by the data classification model, data in the cluster of connected datasets, the labelling indicating that associated data comprises at least one of actual data, synthetic data, or derived data; and
removing, by the data classification model, data from the connected datasets that is labelled as at least one of synthetic data or derived data;
returning, by the dataset connector, the segmented cluster of connected datasets; and
updating at least one of the data profiling model, the data mapping model, or the data classification model using the received plurality of datasets.
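The labelling and removing steps of the segmenting operation might look like the following sketch. The claim attributes labelling to a data classification model; the rule-based `classify_by_metadata` function and the metadata keys `generator` and `derived_from` are hypothetical stand-ins for that model, and the sketch drops both synthetic and derived data.

```python
from typing import Callable, Dict

LABELS = ("actual", "synthetic", "derived")


def classify_by_metadata(metadata: Dict[str, object]) -> str:
    """Toy stand-in for a trained classifier, driven by hypothetical metadata keys."""
    if metadata.get("generator"):
        return "synthetic"
    if metadata.get("derived_from"):
        return "derived"
    return "actual"


def segment_cluster(
    cluster: Dict[str, Dict[str, object]],
    classify: Callable[[Dict[str, object]], str],
) -> Dict[str, dict]:
    """Label every dataset in the cluster, then keep only those labelled 'actual'."""
    labels = {name: classify(metadata) for name, metadata in cluster.items()}
    unexpected = set(labels.values()) - set(LABELS)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    kept = {name: cluster[name] for name, label in labels.items() if label == "actual"}
    return {"labels": labels, "kept": kept}


# Hypothetical cluster: dataset name -> metadata.
cluster = {
    "orders": {},
    "orders_sample": {"generator": "gan-v1"},
    "orders_monthly": {"derived_from": ["orders"]},
}
result = segment_cluster(cluster, classify_by_metadata)
```

Returning the labels alongside the kept datasets leaves a record of why each dataset was removed, which could help when the classification model is later updated.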
18. A method for connecting datasets comprising:
receiving, by a dataset connector system, a plurality of datasets;
receiving, by the dataset connector system, a request to identify a cluster of connected datasets among the received plurality of datasets;
generating, by the dataset connector system, an ephemeral container instance;
selecting, by the ephemeral container instance, a dataset from among the received plurality of datasets;
retrieving, by the ephemeral container instance, a data profiling model from a data storage;
identifying, by the data profiling model, a data schema of the selected dataset;
determining, by the data profiling model, a statistical metric of the selected dataset;
identifying, by the data profiling model, a plurality of candidate foreign keys of the selected dataset;
retrieving, by the ephemeral container instance, a data mapping model from a model storage;
determining, by a data mapping model, respective foreign key scores for individual ones of the plurality of candidate foreign keys;
generating, by the data mapping model, a plurality of edges between the selected dataset and the received plurality of datasets based on the foreign key scores, the data schema, a hierarchical relationship and the statistical metric;
retrieving, by the ephemeral container instance, a data classification model from a model storage;
segmenting, by a data classification model, a cluster of connected datasets comprising the selected dataset, the segmenting based on the plurality of edges, wherein segmenting the datasets comprises:
labelling, by the data classification model, data in the cluster of connected datasets, the labelling indicating that associated data comprises at least one of actual data, synthetic data, or derived data; and
removing, by the data classification model, data from the connected datasets that is labelled as at least one of synthetic data or derived data;
returning, by the dataset connector, the segmented cluster of connected datasets;
updating at least one of the data profiling model, the data mapping model, or the data classification model using the received plurality of datasets; and
terminating, by the dataset connector system, the ephemeral container instance.
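The generate, retrieve, and terminate lifecycle of the ephemeral container instance can be modelled, very loosely, with a context manager that guarantees termination. The `EphemeralContainer` class below is a hypothetical in-process simulation, not a real container runtime, and the dictionary used as model storage is likewise an assumption.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class EphemeralContainer:
    """Hypothetical in-process stand-in for a short-lived container instance."""

    def __init__(self, model_storage: Dict[str, Callable]):
        self.model_storage = model_storage
        self.running = True

    def retrieve_model(self, name: str) -> Callable:
        """Fetch a model from storage; refuse once the container is terminated."""
        if not self.running:
            raise RuntimeError("container has been terminated")
        return self.model_storage[name]

    def terminate(self) -> None:
        self.running = False


@contextmanager
def ephemeral_container(model_storage: Dict[str, Callable]) -> Iterator[EphemeralContainer]:
    """Create a container, hand it to the caller, and always terminate it afterwards."""
    container = EphemeralContainer(model_storage)
    try:
        yield container
    finally:
        container.terminate()


# Hypothetical model storage: "profiling" simply counts rows.
storage = {"profiling": len}
with ephemeral_container(storage) as container:
    profiling_model = container.retrieve_model("profiling")
    row_count = profiling_model([1, 2, 3])
```

Wrapping the lifecycle in `try`/`finally` means the terminating step still runs if any of the intermediate steps raises an error.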
Specification