History preserving data pipeline system and method
Abstract
A history preserving data pipeline computer system and method. In one aspect, the history preserving data pipeline system provides immutable and versioned datasets. Because datasets are immutable and versioned, the system makes it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
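The point-in-time property described in the abstract can be illustrated with a minimal sketch. The class names `DatasetVersion` and `VersionedDataset`, the `as_of` method, and the use of commit timestamps are assumptions made for illustration only; they are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a dataset (hypothetical structure)."""
    version_id: int
    committed_at: datetime
    rows: tuple  # a tuple, so the snapshot's row collection cannot be modified


class VersionedDataset:
    """Append-only history of versions; earlier versions are never overwritten."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: List[DatasetVersion] = []

    def commit(self, rows, committed_at: datetime) -> DatasetVersion:
        """Store a new version alongside all earlier ones."""
        version = DatasetVersion(len(self._versions) + 1, committed_at, tuple(rows))
        self._versions.append(version)
        return version

    def as_of(self, when: datetime) -> Optional[DatasetVersion]:
        """Return the latest version committed at or before `when`, if any."""
        candidates = [v for v in self._versions if v.committed_at <= when]
        return max(candidates, key=lambda v: v.committed_at, default=None)


# Usage: row ("b", 2) is absent from the current version but still recoverable.
orders = VersionedDataset("orders")
orders.commit([("a", 1), ("b", 2)], datetime(2020, 1, 1))
orders.commit([("a", 1)], datetime(2020, 2, 1))
past = orders.as_of(datetime(2020, 1, 15))
```

Here `past` refers to the first version, so data dropped from the current version can still be read as it stood on the earlier date.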
19 Claims
1. A method for preserving history of a derived dataset, the method comprising:
at one or more computing devices comprising one or more processors and storage media storing one or more computer programs executed by the one or more processors to perform the method, performing operations of:
storing a first version of a derived dataset;
wherein the first version of the derived dataset is derived from at least a first version of another dataset by executing a first version of a derivation program associated with the derived dataset;
storing a first build catalog entry, the first build catalog entry associated with the derived dataset and comprising an identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program;
wherein the first build catalog entry comprises a name of the derived dataset and an identifier of the first version of the derived dataset;
updating the other dataset to produce a second version of the other dataset;
storing a second version of the derived dataset;
wherein the second version of the derived dataset is derived from at least the second version of the other dataset by executing the first version of the derivation program associated with the derived dataset;
storing a second build catalog entry, the second build catalog entry associated with the derived dataset and comprising an identifier of the second version of the other dataset and comprising an identifier of the first version of the derivation program; and
wherein the second build catalog entry comprises the name of the derived dataset and an identifier of the second version of the derived dataset.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
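The sequence of operations recited in claim 1 can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the `Pipeline` class, the `BuildCatalogEntry` field names, the in-memory dictionaries, and the example identifiers (`"v1"`, `"d1"`, `"prog-v1"`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BuildCatalogEntry:
    """Record of one build of a derived dataset (hypothetical field names)."""
    dataset_name: str      # name of the derived dataset
    derived_version: str   # identifier of the derived dataset version
    input_version: str     # identifier of the other (input) dataset version
    program_version: str   # identifier of the derivation program version


class Pipeline:
    """In-memory stand-in for dataset storage plus a build catalog."""

    def __init__(self) -> None:
        self.versions: Dict[Tuple[str, str], tuple] = {}
        self.catalog: List[BuildCatalogEntry] = []

    def store(self, name: str, version_id: str, rows) -> None:
        key = (name, version_id)
        if key in self.versions:
            # Versions are immutable: an existing version is never replaced.
            raise ValueError(f"version {version_id} of {name} already exists")
        self.versions[key] = tuple(rows)

    def build(self, derived_name: str, derived_version: str,
              input_name: str, input_version: str,
              program: Callable, program_version: str) -> BuildCatalogEntry:
        """Derive a new version and record what it was derived from."""
        rows = program(self.versions[(input_name, input_version)])
        self.store(derived_name, derived_version, rows)
        entry = BuildCatalogEntry(derived_name, derived_version,
                                  input_version, program_version)
        self.catalog.append(entry)
        return entry


def double(rows):
    """Toy derivation program."""
    return [r * 2 for r in rows]


p = Pipeline()
p.store("raw", "v1", [1, 2])
e1 = p.build("doubled", "d1", "raw", "v1", double, "prog-v1")
p.store("raw", "v2", [1, 2, 3])  # the other dataset is updated
e2 = p.build("doubled", "d2", "raw", "v2", double, "prog-v1")
```

After both builds, the catalog holds two entries that share the derived dataset's name and the program version but differ in the input and derived version identifiers, and the first derived version remains stored.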
18. A history preserving data pipeline system comprising:
one or more computing devices having one or more processors and memory;
means for storing a first version of a derived dataset;
wherein the first version of the derived dataset is derived from at least a first version of another dataset by executing a first version of a derivation program associated with the derived dataset;
means for storing a first build catalog entry, the first build catalog entry associated with the derived dataset and comprising an identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program;
wherein the first build catalog entry comprises a name of the derived dataset and an identifier of the first version of the derived dataset;
means for updating the other dataset to produce a second version of the other dataset;
means for storing a second version of the derived dataset;
wherein the second version of the derived dataset is derived from at least the second version of the other dataset by executing the first version of the derivation program associated with the derived dataset;
means for storing a second build catalog entry, the second build catalog entry associated with the derived dataset and comprising an identifier of the second version of the other dataset and comprising an identifier of the first version of the derivation program; and
wherein the second build catalog entry comprises the name of the derived dataset and an identifier of the second version of the derived dataset.
19. A history preserving data pipeline system comprising:
one or more computing devices having one or more processors and memory;
a data lake for persistently storing a first version of a derived dataset, a second version of the derived dataset, a first version of another dataset, and a second version of the other dataset;
a build service for deriving the first version of the derived dataset from at least the first version of the other dataset by executing a first version of a derivation program associated with the derived dataset, and for deriving the second version of the derived dataset from at least the second version of the other dataset by executing the first version of the derivation program associated with the derived dataset;
a build database comprising a first build catalog entry and a second build catalog entry, the first build catalog entry and the second build catalog entry associated with the derived dataset, the first build catalog entry comprising a first transaction commit identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program, the second build catalog entry comprising a second transaction commit identifier of the second version of the other dataset and comprising the identifier of the first version of the derivation program; and
a transaction service for assigning the first transaction commit identifier of the first version of the other dataset to a first transaction that successfully commits the first version of the other dataset, for assigning the second transaction commit identifier of the second version of the other dataset to a second transaction that successfully commits the second version of the other dataset, for atomically creating a first entry in a transaction database responsive to successfully committing the first transaction, and for atomically creating a second entry in the transaction database responsive to successfully committing the second transaction, the first entry comprising the first transaction commit identifier, the second entry comprising the second transaction commit identifier.
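One possible reading of the transaction service in claim 19 is sketched below, using SQLite from the Python standard library as a stand-in for the transaction database. The `TransactionService` class, the `transactions` table layout, the `write_data` callback, and the dictionary standing in for the data lake are assumptions for illustration; the claim does not prescribe any of them.

```python
import sqlite3


class TransactionService:
    """Assigns commit identifiers and records only successful commits."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS transactions ("
            "commit_id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "dataset TEXT NOT NULL)"
        )

    def commit(self, dataset: str, write_data) -> int:
        """Run `write_data(commit_id)`; keep the entry only if it succeeds."""
        # `with conn` wraps the block in one database transaction: it is
        # committed if the block finishes and rolled back if it raises.
        with self.conn:
            cur = self.conn.execute(
                "INSERT INTO transactions (dataset) VALUES (?)", (dataset,)
            )
            commit_id = cur.lastrowid
            write_data(commit_id)  # hypothetical write of the dataset version
        return commit_id


conn = sqlite3.connect(":memory:")
service = TransactionService(conn)
data_lake = {}  # stand-in for persistent storage, keyed by commit identifier

first_id = service.commit("other", lambda cid: data_lake.update({cid: [1, 2]}))
second_id = service.commit("other", lambda cid: data_lake.update({cid: [1, 2, 3]}))


def failing_write(commit_id):
    raise RuntimeError("write failed")


try:
    service.commit("other", failing_write)
except RuntimeError:
    pass  # the failed transaction leaves no entry behind

entry_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

The two commit identifiers returned here are the kind of values a build catalog entry could record as the identifiers of the first and second versions of the other dataset. Note that the rollback covers only the database entry; a real system would also need to discard any data partially written by a failed transaction.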
Specification