Dataset connector and crawler to identify data lineage and segment data

  • US 10,459,954 B1
  • Filed: 01/18/2019
  • Issued: 10/29/2019
  • Est. Priority Date: 07/06/2018
  • Status: Active Grant
First Claim

1. A dataset connector system comprising:

  • one or more memory units storing instructions; and

    one or more processors that execute the instructions to perform operations comprising:

    receiving, by the dataset connector system, a plurality of datasets;

    receiving, by the dataset connector system, a request to identify a cluster of connected datasets among the received plurality of datasets;

    selecting, by the dataset connector system, a dataset from among the received plurality of datasets;

    identifying, by a data profiling model, a data schema of the selected dataset;

    determining, by the data profiling model, a statistical metric of the selected dataset;

    identifying, by the data profiling model, a plurality of candidate foreign keys of the selected dataset;

    determining, by a data mapping model, respective foreign key scores for individual ones of the plurality of candidate foreign keys;

    generating, by the data mapping model, a plurality of edges between the selected dataset and the received plurality of datasets based on the foreign key scores, the data schema, a hierarchical relationship, and the statistical metric;

    segmenting, by a data classification model, a cluster of connected datasets comprising the selected dataset, the segmenting based on the plurality of edges, wherein segmenting the datasets comprises:

    labelling, by the data classification model, data in the cluster of connected datasets, the labelling indicating that associated data comprises at least one of actual data, synthetic data, or derived data; and

    removing, by the data classification model, data from the connected datasets that is labelled as at least one of synthetic data or derived data;

    returning, by the dataset connector, the segmented cluster of connected datasets; and

    updating at least one of the data profiling model, the data mapping model, or the data classification model using the received plurality of datasets.
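
The flow recited in the claim (score candidate foreign keys, generate edges, find the connected cluster, then remove synthetic or derived data) can be illustrated with a minimal sketch. This is not the patented implementation. It rests on several assumptions: the claimed data profiling, data mapping, and data classification models are replaced by simple heuristics; the foreign key score is taken to be a value-overlap ratio; edges depend only on that score (the claim also uses the data schema, a hierarchical relationship, and a statistical metric); labels are supplied per dataset rather than predicted; and the model-update step is omitted. All function names, dataset names, and the 0.9 threshold are hypothetical.

```python
from collections import deque


def foreign_key_score(child_values, parent_values):
    """Assumed score: fraction of distinct child values found in the parent column."""
    child = set(child_values)
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


def build_edges(datasets, threshold=0.9):
    """Link two datasets when any column pair scores at or above the threshold."""
    edges = set()
    for a in datasets:
        for b in datasets:
            if a == b:
                continue
            best = max(
                (foreign_key_score(col_a, col_b)
                 for col_a in datasets[a]["columns"].values()
                 for col_b in datasets[b]["columns"].values()),
                default=0.0,
            )
            if best >= threshold:
                edges.add(tuple(sorted((a, b))))  # store as an undirected edge
    return edges


def connected_cluster(start, edges):
    """Breadth-first search for every dataset reachable from the selected one."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def segment(datasets, cluster):
    """Drop cluster members labelled synthetic or derived; keep actual data."""
    return sorted(n for n in cluster if datasets[n]["label"] == "actual")


# Hypothetical input: labels are given here, not produced by a trained model.
datasets = {
    "customers": {"label": "actual", "columns": {"id": [1, 2, 3]}},
    "orders": {"label": "actual",
               "columns": {"customer_id": [1, 2, 2, 3], "order_id": [10, 11, 12, 13]}},
    "orders_synth": {"label": "synthetic",
                     "columns": {"customer_id": [1, 3], "order_id": [90, 91]}},
    "weather": {"label": "actual", "columns": {"station": [501, 502]}},
}

edges = build_edges(datasets)
cluster = connected_cluster("orders", edges)
kept = segment(datasets, cluster)
print(kept)
```

In this toy input, "weather" shares no values with the other datasets, so it falls outside the cluster, and "orders_synth" is connected but removed during segmentation because of its label. A fuller sketch would label and remove data at the record level and combine several signals when generating edges, as the claim recites.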
