History preserving data pipeline system and method

US 9,229,952 B1
Filed: 11/05/2014
Issued: 01/05/2016
Est. Priority Date: 11/05/2014
Status: Active Grant

Abstract

A history preserving data pipeline computer system and method. In one aspect, the history preserving data pipeline system provides immutable and versioned datasets. Because datasets are immutable and versioned, the system makes it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
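
The point-in-time property described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class `VersionedDataset`, its `commit` and `as_of` methods, and the integer timestamps are all hypothetical names chosen for illustration. The sketch assumes each update appends a read-only snapshot, so earlier versions remain available for lookup.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class VersionedDataset:
    """Hypothetical append-only dataset: each update adds a new immutable version."""
    name: str
    _timestamps: list = field(default_factory=list)  # commit times, ascending
    _versions: list = field(default_factory=list)    # read-only snapshots

    def commit(self, timestamp, rows):
        """Store a new version; earlier versions are never modified."""
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("commit timestamps must be strictly increasing")
        self._timestamps.append(timestamp)
        # Copy the rows and wrap them in a read-only mapping view.
        self._versions.append(MappingProxyType(dict(rows)))
        return len(self._versions) - 1  # index of the new version

    def as_of(self, timestamp):
        """Return the version that was current at `timestamp`."""
        i = bisect_right(self._timestamps, timestamp)
        if i == 0:
            raise LookupError("no version existed at that time")
        return self._versions[i - 1]


ds = VersionedDataset("customers")
ds.commit(100, {"c1": "Alice", "c2": "Bob"})
ds.commit(200, {"c1": "Alice"})  # "c2" is absent from the current version
```

Here `ds.as_of(150)` still contains the `"c2"` row even though the current version no longer does, which is the behavior the abstract attributes to immutable, versioned datasets.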

19 Claims

1. A method for preserving history of a derived dataset, the method comprising:
- at one or more computing devices comprising one or more processors and storage media storing one or more computer programs executed by the one or more processors to perform the method, perform operations of:
  
  storing a first version of a derived dataset;
  
  wherein the first version of the derived dataset is derived from at least a first version of another dataset by executing a first version of derivation program associated with the derived dataset;
  
  storing a first build catalog entry, the first build catalog entry associated with the derived dataset and comprising an identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program;
  
  wherein the first build catalog entry comprises a name of the derived dataset and an identifier of the first version of the derived dataset;
  
  updating the other dataset to produce a second version of the other dataset;
  
  storing a second version of the derived dataset;
  
  wherein the second version of the derived dataset is derived from at least the second version of the other dataset by executing the first version of the derivation program associated with the derived dataset;
  
  storing a second build catalog entry, the second build catalog entry associated with the derived dataset and comprising an identifier of the second version of the other dataset and comprising an identifier of the first version of the derivation program; and
  
  wherein the second build catalog entry comprises the name of the derived dataset and an identifier of the second version of the derived dataset.
- Dependent claims (2 to 17):
  - 2. The method of claim 1, further comprising storing the first version of the derived dataset and the second version of the derived dataset in a data lake.
  - 3. The method of claim 2, wherein the data lake comprises a distributed file system.
  - 4. The method of claim 1, wherein the identifier of the first version of the derived dataset is an identifier assigned to a commit of a transaction that stored the first version of the derived dataset.
  - 5. The method of claim 1, wherein the identifier of the second version of the derived dataset is an identifier assigned to a commit of a transaction that stored the second version of the derived dataset.
  - 6. The method of claim 1, wherein the first version of the derived dataset is stored in a first set of one or more data containers and the second version of the derived dataset is stored in a second set of one or more data containers.
  - 7. The method of claim 5, wherein the second set of one or more data containers comprises delta encodings reflecting deltas between the first version of the derived dataset and the second version of the derived dataset.
  - 8. The method of claim 1, wherein the first version of the derivation program, when executed to produce the first version of the derived dataset, transforms data of the first version of the other dataset to produce data of the first version of the derived dataset.
  - 9. The method of claim 1, wherein the first version of the derivation program, when executed to produce the second version of the derived dataset, transforms data of the second version of the other dataset to produce data of the second version of the derived dataset.
  - 10. The method of claim 1, wherein the operations of storing the first version of the derived dataset and storing the second version of the derived dataset are performed by a data lake.
  - 11. The method of claim 1, wherein the operations of storing the first build catalog entry and storing the second build catalog entry are performed by a build service.
  - 12. The method of claim 1, wherein the operation of updating the other dataset to produce the second version of the other dataset is performed by a transaction service.
  - 13. The method of claim 1, wherein the first build catalog entry and the second build catalog entry are stored in a database.
  - 14. The method of claim 1, further comprising:
    - storing a transaction entry in a database comprising a transaction commit identifier of the first version of the derived dataset;
      
      wherein the first build catalog entry comprises the transaction commit identifier.
  - 15. The method of claim 1, further comprising:
    - storing a transaction entry in a database comprising a transaction commit identifier of the second version of the derived dataset;
      
      wherein the second build catalog entry comprises the transaction commit identifier.
  - 16. The method of claim 1, further comprising:
    - storing a transaction entry in a database comprising a transaction commit identifier of the first version of the other dataset;
      
      wherein the identifier of the first version of the other dataset in the first build catalog entry is the transaction commit identifier.
  - 17. The method of claim 1, further comprising:
    - storing a transaction entry in a database comprising a transaction commit identifier of the second version of the other dataset;
      
      wherein the identifier of the second version of the other dataset in the second build catalog entry is the transaction commit identifier.
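
The sequence of operations recited in claim 1 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: `BuildCatalogEntry`, `Pipeline`, `store`, `build`, and `double_all` are hypothetical names, version identifiers are simple counters, and an in-memory dictionary stands in for whatever storage the system actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildCatalogEntry:
    dataset_name: str     # name of the derived dataset
    dataset_version: str  # identifier of the derived dataset version
    input_version: str    # identifier of the other (input) dataset version
    program_version: str  # identifier of the derivation program version


class Pipeline:
    """Hypothetical in-memory stand-in for dataset storage and the build catalog."""

    def __init__(self):
        self.versions = {}  # (dataset name, version id) -> immutable data
        self.catalog = []   # append-only list of BuildCatalogEntry
        self._next_id = 0

    def store(self, name, data):
        """Store a new dataset version and return its identifier."""
        self._next_id += 1
        version_id = f"v{self._next_id}"
        self.versions[(name, version_id)] = tuple(data)
        return version_id

    def build(self, derived_name, input_name, input_version, program, program_version):
        """Derive a dataset version and record how it was produced."""
        output = program(self.versions[(input_name, input_version)])
        derived_version = self.store(derived_name, output)
        entry = BuildCatalogEntry(derived_name, derived_version, input_version, program_version)
        self.catalog.append(entry)
        return entry


def double_all(rows):
    """Illustrative derivation program, referred to below as version "p1"."""
    return [2 * r for r in rows]


p = Pipeline()
in_v1 = p.store("raw", [1, 2, 3])
first = p.build("doubled", "raw", in_v1, double_all, "p1")
in_v2 = p.store("raw", [1, 2, 3, 4])  # the update produces a second input version
second = p.build("doubled", "raw", in_v2, double_all, "p1")
```

The two catalog entries share the derived dataset name and the derivation program version but differ in the input version and derived version they record, so either build can later be traced back to the exact input and program that produced it.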

18. A history preserving data pipeline system comprising:
- one or more computing devices having one or more processors and memory;
  
  means for storing a first version of a derived dataset;
  
  wherein the first version of the derived dataset is derived from at least a first version of another dataset by executing a first version of derivation program associated with the derived dataset;
  
  means for storing a first build catalog entry, the first build catalog entry associated with the derived dataset and comprising an identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program;
  
  wherein the first build catalog entry comprises a name of the derived dataset and an identifier of the first version of the derived dataset;
  
  means for updating the other dataset to produce a second version of the other dataset;
  
  means for storing a second version of the derived dataset;
  
  wherein the second version of the derived dataset is derived from at least the second version of the other dataset by executing the first version of the derivation program associated with the derived dataset;
  
  means for storing a second build catalog entry, the second build catalog entry associated with the derived dataset and comprising an identifier of the second version of the other dataset and comprising an identifier of the first version of the derivation program; and
  
  wherein the second build catalog entry comprises the name of the derived dataset and an identifier of the second version of the derived dataset.

19. A history preserving data pipeline system comprising:
- one or more computing devices having one or more processors and memory;
  
  a data lake for persistently storing a first version of a derived dataset, a second version of the derived dataset, a first version of another dataset, and a second version of the other dataset;
  
  a build service for deriving the first version of the derived dataset from at least the first version of the other dataset by executing a first version of derivation program associated with the derived dataset, and for deriving the second version of the derived dataset from at least the second version of the other dataset by executing the first version of derivation program associated with the derived dataset;
  
  a build database comprising a first build catalog entry and a second build catalog entry, the first build catalog entry and the second build catalog entry associated with the derived dataset, the first build catalog entry comprising a first transaction commit identifier of the first version of the other dataset and comprising an identifier of the first version of the derivation program, the second build catalog entry comprising a second transaction commit identifier of the second version of the other dataset and comprising the identifier of the first version of the derivation program; and
  
  a transaction service for assigning the first transaction commit identifier of the first version of the other dataset to a first transaction that successfully commits the first version of the other database, for assigning the second transaction commit identifier of the second version of the other dataset to a second transaction that successfully commits the second version of the other database, for atomically creating a first entry in a transaction database responsive to successfully committing the first transaction, and for atomically creating a second entry in the transaction database responsive to successfully committing the second transaction, the first entry comprising the first transaction commit identifier, the second entry comprising the second transaction commit identifier.
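
The transaction service element of claim 19 can be approximated by the sketch below. It is a simplified illustration, not the claimed system: `TransactionService`, `commit`, `entries`, and the `transactions` table are hypothetical names, an in-memory SQLite database stands in for the transaction database, and a random UUID stands in for the transaction commit identifier. The sketch also assumes the identifier is generated before the data is written and recorded only once the write succeeds.

```python
import sqlite3
import uuid


class TransactionService:
    """Hypothetical transaction service backed by an in-memory SQLite database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE transactions (commit_id TEXT PRIMARY KEY, dataset TEXT NOT NULL)"
        )

    def commit(self, dataset, write_data):
        """Run `write_data`; record a commit identifier only if it succeeds."""
        commit_id = uuid.uuid4().hex
        write_data(commit_id)  # raises on failure, so no entry is created
        # The connection context manager commits the insert, or rolls it
        # back if an exception is raised inside the block.
        with self.db:
            self.db.execute(
                "INSERT INTO transactions (commit_id, dataset) VALUES (?, ?)",
                (commit_id, dataset),
            )
        return commit_id

    def entries(self, dataset):
        """Return recorded commit identifiers for `dataset`, oldest first."""
        rows = self.db.execute(
            "SELECT commit_id FROM transactions WHERE dataset = ? ORDER BY rowid",
            (dataset,),
        )
        return [r[0] for r in rows]


lake = {}  # stand-in for the data lake, keyed by commit identifier
svc = TransactionService()
first_id = svc.commit("raw", lambda cid: lake.__setitem__(cid, [1, 2, 3]))
second_id = svc.commit("raw", lambda cid: lake.__setitem__(cid, [1, 2, 3, 4]))
```

A build catalog entry could then reference `first_id` or `second_id` as the identifier of the input dataset version, which is how the claim ties the build database to the transaction database.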

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Harris, Michael; Elliot, Mark; Meacham, Jacob; Brodman, Gustav; Cuthriell, Lynn; Korus, Hannah; Toth, Brian; Hsiao, Jonathan; Schimpf, Brian; Garland, Michael; Nguyen, Evelyn
Primary Examiner(s): Hoang, Son T
Application Number: US14/533,433
Time in Patent Office: 426 Days
Field of Search: None
US Class Current: 1/1

CPC Class Codes

G06F 16/1865   Transactional file systems
G06F 16/1873   Versioning file systems, te...
G06F 16/211   Schema design and management
G06F 16/219   Managing data history or ve...
G06F 16/2365   Ensuring data consistency a...
G06F 16/2386   Bulk updating operations da...
G06F 16/254   Extract, transform and load...
