Universal data pipeline

US 10,191,926 B2
Filed: 03/07/2018
Issued: 01/29/2019
Est. Priority Date: 11/05/2014
Status: Active Grant

7 Assignments

0 Petitions

Abstract

A history preserving data pipeline computer system and method. In one aspect, the history preserving data pipeline system provides immutable and versioned datasets. Because datasets are immutable and versioned, the system makes it possible to determine the data in a dataset at a point in time in the past, even if that data is no longer in the current version of the dataset.
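The point-in-time behavior described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the `VersionedDataset` class, its `commit` and `as_of` methods, and the integer timestamps are all hypothetical names chosen for illustration, and rows are modeled as plain strings.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class VersionedDataset:
    """Append-only history of immutable dataset versions (hypothetical model)."""

    _timestamps: List[int] = field(default_factory=list)            # commit times, ascending
    _versions: List[Tuple[str, ...]] = field(default_factory=list)  # one immutable tuple per version

    def commit(self, timestamp: int, rows: Iterable[str]) -> int:
        """Store a new immutable version; earlier versions are never modified."""
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self._timestamps.append(timestamp)
        self._versions.append(tuple(rows))
        return len(self._versions) - 1  # index of the new version

    def as_of(self, timestamp: int) -> Tuple[str, ...]:
        """Return the latest version committed at or before `timestamp`."""
        i = bisect_right(self._timestamps, timestamp)
        if i == 0:
            raise LookupError("no version existed at that time")
        return self._versions[i - 1]


ds = VersionedDataset()
ds.commit(100, ["alice", "bob"])
ds.commit(200, ["alice"])  # "bob" is no longer in the current version

past = ds.as_of(150)     # still contains "bob"
current = ds.as_of(200)  # only "alice"
```

Because each version is stored as a tuple and never rewritten, a read at an earlier timestamp returns the data as it stood then, even after a later version drops it.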

764 Citations

20 Claims

1. A method comprising:
- at one or more computing devices comprising one or more processors and one or more storage media storing one or more computer programs executed by the one or more processors to perform the method, performing operations comprising:
  
  maintaining a build catalog comprising a plurality of build catalog entries, each build catalog entry comprising
  - an identifier of a version of a derived dataset corresponding to the build catalog entry,
  - one or more dataset build dependencies of the version of the derived dataset corresponding to the build catalog entry, each of the one or more dataset build dependencies comprising an identifier of a version of a child dataset from which the version of the derived dataset corresponding to the build catalog entry is derived, and
  - a derivation program build dependency that is executable to generate the version of the derived dataset corresponding to the build catalog entry;
  
  creating a new version of a particular derived dataset by executing a particular version of a particular derivation program; and
  
  adding a new build catalog entry to the build catalog, the new build catalog entry comprising an identifier of the new version of the particular derived dataset, an identifier of the particular version of the particular derivation program, and at least one identifier of one or more particular child dataset versions that were provided as input to the particular derivation program.
- - 2. The method of claim 1, wherein the derivation program build dependency of the version of the derived dataset corresponding to the build catalog entry comprises an identifier of a version of a derivation program executed to generate the version of the derived dataset corresponding to the build catalog entry.
  - 3. The method of claim 2, further comprising:
    - storing a first version of the derived dataset using a data lake;
      
      updating another dataset to produce a second version of the derived dataset;
      
      storing the second version of the derived dataset in the data lake in context of a successful transaction; and
      
      wherein the data lake comprises a distributed file system.
  - 4. The method of claim 3, wherein an identifier of the first version of the derived dataset is an identifier assigned to a commit of a transaction that stored the first version of the derived dataset; and wherein an identifier of the second version of the derived dataset is an identifier assigned to a commit of a transaction that stored the second version of the derived dataset.
  - 5. The method of claim 3, wherein the first version of the derived dataset is stored in a first set of one or more data containers and the second version of the derived dataset is stored in a second set of one or more data containers.
  - 6. The method of claim 5, wherein the second set of one or more data containers comprises delta encodings reflecting deltas between the first version of the derived dataset and the second version of the derived dataset.
  - 7. The method of claim 3, wherein the first version of the derivation program is executed to produce the first version of the derived dataset.
  - 8. The method of claim 3, wherein the first version of the derivation program is executed to produce the second version of the derived dataset.
  - 9. The method of claim 2, further comprising:
    - creating the new version of the particular derived dataset based on providing one or more particular child dataset versions as input to the executing the particular version of the particular derivation program; and
      
      wherein the new build catalog entry comprises an identifier of each of the one or more particular child dataset versions.
  - 10. The method of claim 9, wherein the creating the new version of the particular derived dataset is based on providing the one or more particular child dataset versions as input to the executing the particular version of the particular derivation program.
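
The build catalog of claims 1, 2, and 9 can be sketched as a small data structure. This is an illustrative reading of the claim language, not the patented implementation: `BuildCatalogEntry`, `BuildCatalog`, `build`, the `name@version` identifier strings, and the in-memory `datasets` dictionary are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class BuildCatalogEntry:
    derived_version_id: str             # identifier of the derived dataset version
    child_version_ids: Tuple[str, ...]  # dataset build dependencies (input versions)
    program_version_id: str             # derivation program build dependency


class BuildCatalog:
    """In-memory catalog keyed by derived dataset version (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, BuildCatalogEntry] = {}

    def add(self, entry: BuildCatalogEntry) -> None:
        if entry.derived_version_id in self._entries:
            raise ValueError("dataset versions are immutable; entry already exists")
        self._entries[entry.derived_version_id] = entry

    def get(self, derived_version_id: str) -> BuildCatalogEntry:
        return self._entries[derived_version_id]


def build(
    catalog: BuildCatalog,
    datasets: Dict[str, List[int]],
    program_version_id: str,
    program: Callable[..., List[int]],
    child_version_ids: Sequence[str],
    new_version_id: str,
) -> BuildCatalogEntry:
    """Run one version of a derivation program and record what it consumed."""
    inputs = [datasets[v] for v in child_version_ids]
    datasets[new_version_id] = program(*inputs)  # create the new derived version
    entry = BuildCatalogEntry(new_version_id, tuple(child_version_ids), program_version_id)
    catalog.add(entry)                           # add the new build catalog entry
    return entry


datasets = {"orders@v1": [3, 5, 8]}
catalog = BuildCatalog()
entry = build(
    catalog,
    datasets,
    program_version_id="sum_orders@v2",
    program=lambda rows: [sum(rows)],
    child_version_ids=["orders@v1"],
    new_version_id="order_totals@v1",
)
```

Recording both the program version and the exact input versions is what lets a later reader trace how `order_totals@v1` was produced, or rebuild it from the same inputs.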

11. A computer system comprising:
- one or more hardware processors;
  
  one or more computer programs; and
  
  one or more storage media storing the one or more computer programs for execution by the one or more hardware processors, the one or more computer programs comprising instructions for performing operations comprising:
  
  maintaining a build catalog comprising a plurality of build catalog entries, each build catalog entry comprising
  - an identifier of a version of a derived dataset corresponding to the build catalog entry,
  - one or more dataset build dependencies of the version of the derived dataset corresponding to the build catalog entry, each of the one or more dataset build dependencies comprising an identifier of a version of a child dataset from which the version of the derived dataset corresponding to the build catalog entry is derived, and
  - a derivation program build dependency that is executable to generate the version of the derived dataset corresponding to the build catalog entry;
  
  creating a new version of a particular derived dataset by executing a particular version of a particular derivation program; and
  
  adding a new build catalog entry to the build catalog, the new build catalog entry comprising an identifier of the new version of the particular derived dataset, an identifier of the particular version of the particular derivation program, and at least one identifier of one or more particular child dataset versions that were provided as input to the particular derivation program.
- - 12. The computer system of claim 11, wherein the derivation program build dependency of the version of the derived dataset corresponding to the build catalog entry comprises an identifier of a version of a derivation program executed to generate the version of the derived dataset corresponding to the build catalog entry.
  - 13. The computer system of claim 12, wherein the one or more storage media stores additional computer programs for performing operations comprising:
    - storing a first version of the derived dataset using a data lake;
      
      updating another dataset to produce a second version of the derived dataset;
      
      storing the second version of the derived dataset in the data lake in context of a successful transaction; and
      
      wherein the data lake comprises a distributed file system.
  - 14. The computer system of claim 13, wherein an identifier of the first version of the derived dataset is an identifier assigned to a commit of a transaction that stored the first version of the derived dataset; and wherein an identifier of the second version of the derived dataset is an identifier assigned to a commit of a transaction that stored the second version of the derived dataset.
  - 15. The computer system of claim 13, wherein the first version of the derived dataset is stored in a first set of one or more data containers and the second version of the derived dataset is stored in a second set of one or more data containers.
  - 16. The computer system of claim 15, wherein the second set of one or more data containers comprises delta encodings reflecting deltas between the first version of the derived dataset and the second version of the derived dataset.
  - 17. The computer system of claim 13, wherein the first version of the derivation program is executed to produce the first version of the derived dataset.
  - 18. The computer system of claim 13, wherein the first version of the derivation program is executed to produce the second version of the derived dataset.
  - 19. The computer system of claim 12, wherein the one or more storage media stores additional computer programs for performing operations comprising:
    - creating the new version of the particular derived dataset based on providing one or more particular child dataset versions as input to the executing the particular version of the particular derivation program; and
      
      wherein the new build catalog entry comprises an identifier of each of the one or more particular child dataset versions.
  - 20. The computer system of claim 19, wherein the creating the new version of the particular derived dataset is based on providing the one or more particular child dataset versions as input to the executing the particular version of the particular derivation program.
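
Claims 3 to 6 (and 13 to 16) describe versions identified by transaction commits and stored in separate data containers, with later containers holding delta encodings. A minimal sketch of that idea follows. It is an assumption-laden illustration rather than the patented design: `DeltaStore` is a hypothetical in-memory stand-in for a data lake, rows are modeled as an unordered set of strings, and a delta is represented simply as the rows added and removed relative to the previous version.

```python
import itertools
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# One container per commit: (parent transaction id, rows added, rows removed).
Container = Tuple[Optional[int], FrozenSet[str], FrozenSet[str]]


class DeltaStore:
    """Hypothetical store where each commit's transaction id names a version."""

    def __init__(self) -> None:
        self._next_txn = itertools.count(1)
        self._containers: Dict[int, Container] = {}
        self.head: Optional[int] = None  # transaction id of the latest version

    def commit(self, rows: Iterable[str]) -> int:
        """Store a new version as a delta against the current head."""
        new_rows = frozenset(rows)
        old_rows = self.read(self.head) if self.head is not None else frozenset()
        txn_id = next(self._next_txn)
        # The container and head are only updated once the delta is computed,
        # so a failure before this point leaves existing versions untouched.
        self._containers[txn_id] = (self.head, new_rows - old_rows, old_rows - new_rows)
        self.head = txn_id
        return txn_id

    def read(self, txn_id: int) -> FrozenSet[str]:
        """Rebuild a version by replaying deltas from the first commit onward."""
        chain = []
        cur: Optional[int] = txn_id
        while cur is not None:
            container = self._containers[cur]
            chain.append(container)
            cur = container[0]
        rows: set = set()
        for _, added, removed in reversed(chain):
            rows -= removed
            rows |= added
        return frozenset(rows)


store = DeltaStore()
v1 = store.commit(["a", "b"])  # first version: every row is an addition
v2 = store.commit(["b", "c"])  # second version: stored as +{"c"}, -{"a"}
```

Here the second container holds only the difference from the first version, yet `store.read(v1)` and `store.read(v2)` both return complete row sets.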

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Meacham, Jacob; Harris, Michael; Brodman, Gustav; Cuthriell, Lynn; Korus, Hannah; Toth, Brian; Hsiao, Jonathan; Elliot, Mark; Schimpf, Brian; Garland, Michael; Nguyen, Evelyn
Primary Examiner(s): Hoang, Son T
Application Number: US15/914,215
Publication Number: US 20180196838A1
Time in Patent Office: 328 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 16/1865   Transactional file systems

G06F 16/1873   Versioning file systems, te...

G06F 16/211   Schema design and management

G06F 16/219   Managing data history or ve...

G06F 16/2365   Ensuring data consistency a...

G06F 16/2386   Bulk updating operations da...

G06F 16/254   Extract, transform and load...
