Providing full data provenance visualization for versioned datasets

  • US 9,996,595 B2
  • Filed: 08/03/2015
  • Issued: 06/12/2018
  • Est. Priority Date: 08/03/2015
  • Status: Active Grant
First Claim

1. A method, comprising:

  • at one or more computing devices having one or more processors and memory storing one or more programs executed by the one or more processors to perform the method, performing the operations of:

    storing an input dataset and provenance metadata identifying one or more previous versions of the input dataset;

    using a derivation program, transforming data in the input dataset and storing the transformed data as a versioned dataset;

    updating the provenance metadata to identify the input dataset in addition to the one or more previous versions of the input dataset;

    receiving selection of the versioned dataset that is within a data pipeline system;

    determining full data provenance of the selected versioned dataset, the full data provenance comprising a set of versioned datasets, by identifying, in the provenance metadata, at least the input dataset and the one or more previous versions of the input dataset;

    providing for display of a visualization of the full data provenance of the selected versioned dataset, the visualization comprising a graph, the graph comprising a compound node for the selected versioned dataset and a compound node for each versioned dataset in the set of versioned datasets, the graph further comprising edges connecting the compound nodes, each edge representing a derivation dependency between versions of the versioned datasets represented by the compound nodes connected by the edge;

    wherein a sub-entry of the compound node for a particular versioned dataset in the set of versioned datasets is visually distinguished in the graphical user interface from other compound node sub-entries of the graph to indicate that a version of the particular versioned dataset represented by the sub-entry has been flagged in a database as containing invalid data;

    wherein an edge in the graph representing a derivation dependency of a first version of a first versioned dataset in the set of versioned datasets on a second version of a second versioned dataset in the set of versioned datasets is visually distinguished from other edges in the graph to indicate that the first version of the first versioned dataset potentially contains invalid data as a result of the derivation dependency.
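
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `ProvenanceStore`, `record_derivation`, `flag_invalid`, `full_provenance`, and `build_graph` are hypothetical, as are the example datasets. The sketch assumes that provenance metadata is an in-memory mapping from each dataset version to the versions it was derived from, that the derivation graph is acyclic, and that an edge is marked as suspect when the version it points to is flagged invalid or is itself derived from a flagged version.

```python
from collections import defaultdict


class ProvenanceStore:
    """Hypothetical in-memory stand-in for the provenance metadata."""

    def __init__(self):
        # Maps a version, written (dataset_name, version_number),
        # to the set of versions it was derived from.
        self.parents = defaultdict(set)
        # Versions flagged as containing invalid data.
        self.invalid = set()

    def record_derivation(self, output, inputs):
        self.parents[output].update(inputs)

    def flag_invalid(self, version):
        self.invalid.add(version)


def full_provenance(store, selected):
    """Return every version the selected version depends on, transitively."""
    seen, stack = set(), [selected]
    while stack:
        current = stack.pop()
        for parent in store.parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_graph(store, selected):
    """Build compound nodes (one per dataset) and derivation edges."""
    versions = full_provenance(store, selected) | {selected}
    tainted = {}

    def is_tainted(version):
        # Assumption: a version is tainted if it is flagged or if any
        # of its ancestors is flagged. Requires an acyclic graph.
        if version not in tainted:
            tainted[version] = version in store.invalid or any(
                is_tainted(p) for p in store.parents.get(version, ())
            )
        return tainted[version]

    # One compound node per dataset, with one sub-entry per version.
    nodes = defaultdict(list)
    for name, number in sorted(versions):
        nodes[name].append(
            {"version": number, "invalid": (name, number) in store.invalid}
        )

    # One edge per derivation dependency between versions.
    edges = [
        {"from": child, "to": parent, "suspect": is_tainted(parent)}
        for child in sorted(versions)
        for parent in sorted(store.parents.get(child, ()))
    ]
    return {"nodes": dict(nodes), "edges": edges}


# Hypothetical pipeline: raw -> clean -> report.
store = ProvenanceStore()
store.record_derivation(("clean", 1), [("raw", 1)])
store.record_derivation(("clean", 2), [("raw", 2), ("clean", 1)])
store.record_derivation(("report", 1), [("clean", 2)])
store.flag_invalid(("raw", 1))

graph = build_graph(store, ("report", 1))
suspect_edges = {(e["from"], e["to"]) for e in graph["edges"] if e["suspect"]}
```

In this example the `raw` compound node carries a flagged sub-entry for version 1, and every edge on the path from `("report", 1)` back to `("raw", 1)` is marked suspect, while the edge from `("clean", 2)` to `("raw", 2)` is not. A renderer could use the `invalid` and `suspect` fields to apply the visual distinctions the claim describes.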
