Dynamically performing data processing in a data pipeline system

US 10,176,217 B1
Filed: 09/07/2017
Issued: 01/08/2019
Est. Priority Date: 07/06/2017
Status: Active Grant

8 Assignments

0 Petitions

Abstract

Techniques for automatically scheduling builds of derived datasets in a distributed database system that supports pipelined data transformations are described herein. In an embodiment, a data processing method comprises, in association with a distributed database system that implements one or more data transformation pipelines, each of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset and dataset dependency and timing metadata, detecting an arrival of a new raw dataset or new derived dataset; in response to the detecting, obtaining from the dataset dependency and timing metadata a dataset subset comprising those datasets that depend on at least the new raw dataset or new derived dataset; for each member dataset in the dataset subset, determining if the member dataset has a dependency on any other dataset that is not yet arrived, and in response to determining that the member dataset does not have a dependency on any other dataset that is not yet arrived: initiating a build of a portion of the data transformation pipeline comprising the member dataset and all other datasets on which the member dataset is dependent, without waiting for arrival of other datasets.
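
The arrival-triggered scheduling described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `PipelineScheduler` class, its `dependencies` mapping (standing in for the dataset dependency and timing metadata), and the dataset names are all hypothetical. As a simplification, "initiating a build" is modeled as appending the dataset name to a list, and a completed build is treated as the arrival of a new derived dataset.

```python
from collections import defaultdict


class PipelineScheduler:
    """Hypothetical scheduler: builds a derived dataset once all its inputs have arrived."""

    def __init__(self, dependencies):
        # dependencies: derived dataset name -> set of direct input dataset names
        self.dependencies = {d: set(inputs) for d, inputs in dependencies.items()}
        # Reverse index: dataset name -> derived datasets that consume it
        self.dependents = defaultdict(set)
        for derived, inputs in self.dependencies.items():
            for name in inputs:
                self.dependents[name].add(derived)
        self.arrived = set()
        self.built = []

    def on_arrival(self, dataset):
        """Record an arrival and return the builds it triggered, in order."""
        self.arrived.add(dataset)
        triggered = []
        # The "dataset subset": datasets that depend on the new arrival
        for member in sorted(self.dependents[dataset]):
            if member in self.arrived:
                continue  # already built earlier
            # Build only if no input of this member is still outstanding
            if self.dependencies[member] <= self.arrived:
                self.built.append(member)
                triggered.append(member)
                # A finished build counts as the arrival of a new derived dataset
                triggered.extend(self.on_arrival(member))
        return triggered


scheduler = PipelineScheduler({
    "clean_orders": {"raw_orders"},
    "report": {"clean_orders", "raw_customers"},
})
print(scheduler.on_arrival("raw_orders"))     # ['clean_orders']
print(scheduler.on_arrival("raw_customers"))  # ['report']
```

In this sketch `clean_orders` is built as soon as `raw_orders` arrives, without waiting for `raw_customers`, while `report` waits until both of its inputs are present.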

64 Citations

16 Claims

1. A computer-implemented method comprising:
- in association with a distributed data processing system that implements one or more data transformation pipelines, each of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset and dataset dependency and timing metadata, detecting an arrival of a new raw dataset or new derived dataset;
  
  in response to the detecting, obtaining from the dataset dependency and timing metadata a dataset subset comprising those datasets that depend on at least the new raw dataset or new derived dataset;
  
  for each member dataset in the dataset subset, determining if the member dataset has a dependency on any other dataset that is not yet arrived, and in response to determining that the member dataset does not have a dependency on any other dataset that is not yet arrived:
  
  initiating a build of a portion of the data transformation pipeline comprising the member dataset and all other datasets on which the member dataset is dependent, without waiting for arrival of other datasets;
  
  detecting that a cutoff time has occurred, and in response thereto:
  
  determining that a particular dataset on which the second derived dataset depends has not arrived;
  
  in response thereto, initiating build operations for all other portions or derived datasets of the data transformation pipeline that have not yet been built but excluding the other portions or derived datasets that depend upon the particular dataset;
  
  wherein the method is performed using one or more processors.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
  - 2. The method of claim 1, further comprising, in response to determining that the member dataset has a dependency on another dataset that is not yet arrived, recording that a partial dependency of the member dataset has been satisfied.
  - 3. The method of claim 1, further comprising detecting that a cutoff time has occurred, and in response thereto, initiating build operations for all other portions or derived datasets of the data transformation pipeline that have not yet been built.
  - 4. The method of claim 1, the first dataset comprising any of a first raw dataset, or a first derived dataset that was derived via a second transformation.
  - 5. The method of claim 1, the transformation comprising any of:
    - creating the derived dataset without a column that is in the raw dataset;
      
      creating the derived dataset with a column that is in the raw dataset and using a different name of the column in the derived dataset.
  - 6. The method of claim 1, further comprising:
    - in response to determining that the particular dataset has not arrived, transmitting a notification to a specified account or address.
  - 7. The method of claim 1, further comprising:
    - detecting that a cutoff time has occurred, and in response thereto:
      
      determining that the particular dataset is marked with a critical dataset flag value;
      
      in response thereto, transmitting a notification to a specified account or address.
  - 8. The method of claim 1, further comprising performing the detecting an arrival of a new raw dataset or new derived dataset only for datasets that are identified in a list of raw datasets to track.
  - 9. The method of claim 1, further comprising performing the detecting an arrival of a new raw dataset or new derived dataset only during an expected arrival period that is defined in stored configuration data.
  - 10. The method of claim 1, in which obtaining the dataset subset from the dataset dependency and timing metadata occurs just after the dataset dependency and timing metadata has been updated.
  - 11. The method of claim 1, wherein detecting an arrival of a new raw dataset or new derived dataset comprises determining that a timestamp of the new raw dataset or new derived dataset is not older, compared to a current time, than a specified recent time.
  - 12. The method of claim 1, wherein initiating a build comprises instantiating a build worker process and instructing the build worker process to build the portion of the data transformation pipeline comprising the member dataset and all other datasets on which the member dataset is dependent.
  - 13. The method of claim 1, the dataset dependency and timing metadata defining a non-directional dependency group of a plurality of datasets that are dependent upon one another, the method further comprising determining whether every dataset in the non-directional dependency group is updated, and initiating build operations for derived datasets depending upon the non-directional dependency group only when all datasets in the non-directional dependency group have received updates.
  - 14. The method of claim 1, the dataset dependency and timing metadata defining a directional dependency group of raw datasets all of which are dependent on a second group of datasets, the method further comprising determining that the first group of datasets is updated only after all datasets in the second group are updated, and initiating build operations for derived datasets depending upon the directional dependency group only when all datasets in the directional dependency group have received updates.
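
The cutoff handling in claim 1, together with the notification in claim 6, can be sketched as follows. This is an illustrative reading rather than the claimed implementation: the function name `plan_cutoff_builds`, its parameters, and the dataset names are assumptions. Raw datasets are taken to be any names that appear only as inputs, and the list of missing datasets is returned so that a caller could send a notification.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_cutoff_builds(dependencies, arrived, already_built):
    """Hypothetical cutoff planner.

    dependencies:  derived dataset name -> iterable of input dataset names
    arrived:       set of raw dataset names that have arrived
    already_built: set of derived dataset names built before the cutoff
    Returns (datasets to build now, missing raw datasets).
    """
    blocked = set()   # missing datasets plus everything downstream of them
    missing = []
    to_build = []
    # static_order() yields every dataset after all of its inputs
    for name in TopologicalSorter(dependencies).static_order():
        if name not in dependencies:
            # Raw dataset: it blocks its dependents only if it never arrived
            if name not in arrived:
                blocked.add(name)
                missing.append(name)
        elif blocked & set(dependencies[name]):
            # Depends, directly or transitively, on a missing dataset: exclude
            blocked.add(name)
        elif name not in already_built:
            to_build.append(name)
    return to_build, sorted(missing)


deps = {
    "clean_orders": ["raw_orders"],
    "clean_customers": ["raw_customers"],
    "order_stats": ["clean_orders"],
    "report": ["clean_orders", "clean_customers"],
}
to_build, missing = plan_cutoff_builds(deps, {"raw_orders"}, {"clean_orders"})
print(to_build)  # ['order_stats']
print(missing)   # ['raw_customers']
```

Processing datasets in topological order means a single pass is enough to propagate the "blocked" state from a missing dataset to everything that depends on it.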

15. A computer-implemented method comprising:
- storing a tree representing dependency information for a plurality of datasets comprising a plurality of nodes, wherein each node of the tree corresponds to a dataset of the plurality of datasets;
  
  storing a lookup table that stores a plurality of entries that correspond to the plurality of datasets, each particular entry of the plurality of entries corresponding to a particular dataset of the plurality of datasets and comprises a first timestamp representing the time that particular dataset was last modified and a second timestamp representing the time that particular dataset was last used for data processing;
  
  detecting a modification to a first dataset of the plurality of datasets;
  
  in response to detecting the modification to the first dataset, updating the first timestamp that corresponds to the first dataset;
  
  in response to detecting the modification, traversing the tree to identify the highest parent node in the tree for which all downstream nodes have a corresponding first timestamp that is later in time than a corresponding second timestamp;
  
  initiating a build of a portion of a data transformation pipeline comprising the identified highest parent node in the tree;
  
  wherein the method is performed using one or more processors.
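
One possible reading of claim 15 is sketched below. The claim does not fix the tree's orientation, so this sketch assumes that each parent node is a derived dataset, that its children are its inputs, and that "downstream nodes" means all descendants of a node in the tree. The `Node` class, the helper names, and the integer timestamps are hypothetical; the lookup table maps each dataset name to a `(last_modified, last_used)` pair.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    name: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def descendants(node):
    """Yield every node below `node` in the tree."""
    for child in node.children:
        yield child
        yield from descendants(child)


def is_fresh(name, table):
    """True if the dataset was modified after it was last used for processing."""
    last_modified, last_used = table[name]
    return last_modified > last_used


def highest_ready_parent(node, table):
    """Walk upward while every descendant of the ancestor is fresh."""
    best = None
    current = node.parent
    while current is not None and all(
        is_fresh(d.name, table) for d in descendants(current)
    ):
        best = current
        current = current.parent
    return best


def on_modified(node, table, now):
    """Update the first timestamp, then find the node whose build to initiate."""
    _, last_used = table[node.name]
    table[node.name] = (now, last_used)
    return highest_ready_parent(node, table)


report = Node("report")
clean_orders = report.add_child(Node("clean_orders"))
raw_orders = clean_orders.add_child(Node("raw_orders"))
clean_customers = report.add_child(Node("clean_customers"))
raw_customers = clean_customers.add_child(Node("raw_customers"))

# Every dataset starts stale: last modified at t=5, last used at t=10
table = {n.name: (5, 10) for n in [report, *descendants(report)]}

print(on_modified(raw_orders, table, now=20).name)  # clean_orders
```

Under this reading, `report` is selected only once every dataset beneath it, including the intermediate derived datasets, carries a modification time later than its last-used time.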

16. A computer system comprising:
- one or more processors;
  
  one or more computer-readable storage media coupled to the one or more processors and storing one or more sequences of instructions which, when executed using the one or more processors, cause the one or more processors to perform:
  
  in association with a distributed database system that implements one or more data transformation pipelines, each of the data transformation pipelines comprising at least a first dataset, a first transformation, a second derived dataset and dataset dependency and timing metadata, detecting an arrival of a new raw dataset or new derived dataset;
  
  in response to the detecting, obtaining from the dataset dependency and timing metadata a dataset subset comprising those datasets that depend on at least the new raw dataset or new derived dataset;
  
  for each member dataset in the dataset subset, determining if the member dataset has a dependency on any other dataset that is not yet arrived, and in response to determining that the member dataset does not have a dependency on any other dataset that is not yet arrived:
  
  initiating a build of a portion of the data transformation pipeline comprising the member dataset and all other datasets on which the member dataset is dependent, without waiting for arrival of other datasets;
  
  detecting that a cutoff time has occurred, and in response thereto:
  
  determining that a particular dataset on which the second derived dataset depends has not arrived;
  
  in response thereto, initiating build operations for all other portions or derived datasets of the data transformation pipeline that have not yet been built but excluding the other portions or derived datasets that depend upon the particular dataset.

Current Assignee: Palantir Technologies Incorporated
Original Assignee: Palantir Technologies Incorporated
Inventors: Dang, Hao; Brodman, Gustav; Xue, Yi; Milspaw, Stacey; Huang, Yifei; Lu, Yanran
Primary Examiner(s): Singh, Amresh

Application Number: US15/698,574
Time in Patent Office: 488 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 16/182   Distributed file systems

G06F 16/2365   Ensuring data consistency a...

G06F 16/24568   Data stream processing; Con...

G06F 16/254   Extract, transform and load...

G06F 9/45533   Hypervisors; Virtual machin...
