Providing full data provenance visualization for versioned datasets
Abstract
Systems and methods for providing full data provenance visualization for versioned datasets. A method includes receiving selection of a versioned dataset that is within a data pipeline system. The method also includes determining the full data provenance of the selected versioned dataset. The full data provenance may comprise a set of versioned datasets. The method further includes providing for display of a visualization of the full data provenance of the selected versioned dataset. The visualization comprises a graph. The graph comprises a compound node for the selected versioned dataset and for each versioned dataset in the set of versioned datasets. The graph further comprises edges connecting the compound nodes. Each edge represents a derivation dependency between versions of the versioned datasets represented by the compound nodes connected by the edge.
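The provenance determination and graph construction summarized above can be illustrated with a minimal sketch. It is not the implementation described in the specification. It assumes that provenance metadata is available as a mapping from each (dataset, version) pair to the pairs it was derived from. The names PROVENANCE, full_provenance, and build_graph, and the sample datasets, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical provenance metadata (an assumption, not from the source):
# each (dataset, version) maps to the (dataset, version) pairs it was derived from.
PROVENANCE = {
    ("report", 2): [("cleaned", 2)],
    ("report", 1): [("cleaned", 1)],
    ("cleaned", 2): [("raw", 2)],
    ("cleaned", 1): [("raw", 1)],
    ("raw", 2): [],
    ("raw", 1): [],
}


def full_provenance(selected, provenance):
    """Walk the metadata upstream from every version of the selected dataset.

    Returns the set of reachable (dataset, version) pairs and the set of
    (child, parent) derivation dependencies between them.
    """
    start = [key for key in provenance if key[0] == selected]
    seen = set(start)
    edges = set()
    queue = deque(start)
    while queue:
        child = queue.popleft()
        for parent in provenance.get(child, []):
            edges.add((child, parent))
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen, edges


def build_graph(selected, provenance):
    """Group versions into one compound node per dataset, with version sub-entries."""
    versions, edges = full_provenance(selected, provenance)
    compound_nodes = defaultdict(list)
    for name, version in sorted(versions):
        compound_nodes[name].append(version)
    return dict(compound_nodes), sorted(edges)
```

In this sketch each compound node is a dataset name with a list of version sub-entries, and each edge connects one version to the version it was derived from, so a renderer could draw edges between sub-entries rather than between whole datasets.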
13 Claims
1. A method, comprising:
- at one or more computing devices having one or more processors and memory storing one or more programs executed by the one or more processors to perform the method, performing the operations of:
storing an input dataset and provenance metadata identifying one or more previous versions of the input dataset;
using a derivation program, transforming data in the input dataset and storing the transformed data as a versioned dataset;
updating the provenance metadata to identify the input dataset in addition to the one or more previous versions of the input dataset;
receiving selection of the versioned dataset that is within a data pipeline system;
determining full data provenance of the selected versioned dataset, the full data provenance comprising a set of versioned datasets, by identifying, in the provenance metadata, at least the input dataset and the one or more previous versions of the input dataset;
providing for display of a visualization of the full data provenance of the selected versioned dataset, the visualization comprising a graph, the graph comprising a compound node for the selected versioned dataset and a compound node for each versioned dataset in the set of versioned datasets, the graph further comprising edges connecting the compound nodes, each edge representing a derivation dependency between versions of the versioned datasets represented by the compound nodes connected by the edge;
wherein a sub-entry of the compound node for a particular versioned dataset in the set of versioned datasets is visually distinguished in the graphical user interface from other compound node sub-entries of the graph to indicate that a version of the particular versioned dataset represented by the sub-entry has been flagged in a database as containing invalid data;
wherein an edge in the graph representing a derivation dependency of a first version of a first versioned dataset in the set of versioned datasets on a second version of a second versioned dataset in the set of versioned datasets is visually distinguished from other edges in the graph to indicate that the first version of the first versioned dataset potentially contains invalid data as a result of the derivation dependency.
Dependent claims: 2, 3, 4, 5
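The two "wherein" clauses of claim 1 can be illustrated with a short sketch under stated assumptions. It assumes that flagged versions are available as a set of (dataset, version) pairs and that an edge is marked when the upstream version is either flagged or itself derived from a flagged version. The transitive propagation, the style labels, and the names EDGES, FLAGGED, and style_graph are assumptions made for illustration and are not taken from the claims.

```python
# Hypothetical derivation dependencies as (child, parent) version pairs.
EDGES = [
    (("report", 2), ("cleaned", 2)),
    (("cleaned", 2), ("raw", 2)),
    (("report", 1), ("cleaned", 1)),
    (("cleaned", 1), ("raw", 1)),
]

# Hypothetical set of versions flagged in a database as containing invalid data.
FLAGGED = {("raw", 2)}


def style_graph(edges, flagged):
    """Return display styles for flagged sub-entries and for suspect edges."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, []).append(parent)

    cache = {}

    def is_suspect(version):
        # A version is suspect if it is flagged or derived from a suspect version.
        if version in flagged:
            return True
        if version not in cache:
            cache[version] = False  # placeholder that also guards against cycles
            cache[version] = any(is_suspect(p) for p in parents.get(version, []))
        return cache[version]

    # Sub-entries for flagged versions are visually distinguished.
    sub_entry_styles = {version: "flagged" for version in flagged}
    # An edge is distinguished when the child may have inherited invalid data.
    edge_styles = {
        (child, parent): "suspect" if is_suspect(parent) else "normal"
        for child, parent in edges
    }
    return sub_entry_styles, edge_styles
```

With the sample data, version 2 of "raw" is the only flagged sub-entry, and both edges on the version 2 chain are marked as suspect while the version 1 chain stays normal.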
6. One or more non-transitory computer-readable media storing one or more programs, the one or more programs comprising instructions for:
storing an input dataset and provenance metadata identifying one or more previous versions of the input dataset;
using a derivation program, transforming data in the input dataset and storing the transformed data as a versioned dataset;
updating the provenance metadata to identify the input dataset in addition to the one or more previous versions of the input dataset;
receiving selection of a versioned dataset that is within a data pipeline system;
determining full data provenance of the selected versioned dataset, the full data provenance comprising a set of versioned datasets, by identifying, in the provenance metadata, at least the input dataset and the one or more previous versions of the input dataset;
providing for display of a visualization of the full data provenance of the selected versioned dataset, the visualization comprising a graph, the graph comprising a compound node for the selected versioned dataset and for each versioned dataset in the set of versioned datasets, the graph further comprising edges connecting the compound nodes, each edge representing a derivation dependency between versions of the versioned datasets represented by the compound nodes connected by the edge;
wherein a sub-entry of the compound node for a particular versioned dataset in the set of versioned datasets is visually distinguished in the graphical user interface from other compound node sub-entries of the graph to indicate that a version of the particular versioned dataset represented by the sub-entry has been flagged in a database as containing invalid data;
wherein an edge in the graph representing a derivation dependency of a first version of a first versioned dataset in the set of versioned datasets on a second version of a second versioned dataset in the set of versioned datasets is visually distinguished from other edges in the graph to indicate that the first version of the first versioned dataset potentially contains invalid data as a result of the derivation dependency.
Dependent claims: 7, 8, 9, 10
11. A system comprising:
memory;
one or more processors;
one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for:
storing an input dataset and provenance metadata identifying one or more previous versions of the input dataset;
using a derivation program, transforming data in the input dataset and storing the transformed data as a versioned dataset;
updating the provenance metadata to identify the input dataset in addition to the one or more previous versions of the input dataset;
receiving selection of a versioned dataset that is within a data pipeline system;
determining full data provenance of the selected versioned dataset, the full data provenance comprising a set of versioned datasets, by identifying, in the provenance metadata, at least the input dataset and the one or more previous versions of the input dataset;
providing for display of a visualization of the full data provenance of the selected versioned dataset, the visualization comprising a graph, the graph comprising a compound node for the selected versioned dataset and for each versioned dataset in the set of versioned datasets, the graph further comprising edges connecting the compound nodes, each edge representing a derivation dependency between versions of the versioned datasets represented by the compound nodes connected by the edge;
wherein a sub-entry of the compound node for a particular versioned dataset in the set of versioned datasets is visually distinguished in the graphical user interface from other compound node sub-entries of the graph to indicate that a version of the particular versioned dataset represented by the sub-entry has been flagged in a database as containing invalid data;
wherein an edge in the graph representing a derivation dependency of a first version of a first versioned dataset in the set of versioned datasets on a second version of a second versioned dataset in the set of versioned datasets is visually distinguished from other edges in the graph to indicate that the first version of the first versioned dataset potentially contains invalid data as a result of the derivation dependency.
Dependent claims: 12, 13
Specification