In-Memory Dataflow Execution with Dynamic Placement of Cache Operations and Action Execution Ordering
First Claim
1. A method, comprising the steps of:
- obtaining a cost model for the execution of operations of a dataflow in a parallel processing framework with a given infrastructure and input dataset;
obtaining a current cache placement plan for the dataflow, wherein the current cache placement plan comprises a combination of output datasets of a subset of the operations in the dataflow to cache, using one or more cache operations in the dataflow, based on an estimated reduction in a total execution cost for the dataflow in conjunction with the current cache placement plan being implemented given an input dataset;
obtaining a current cache gain estimate for the current cache placement plan;
selecting an action of the dataflow to execute from a plurality of remaining actions in the dataflow based on a predefined next action policy that selects the next action of the dataflow to execute from the plurality of remaining actions in the dataflow;
executing one or more operations in a lineage of the selected action of the dataflow;
determining, using at least one processing device, an alternative cache placement plan for the dataflow following the execution in conjunction with a predefined new plan determination criteria being satisfied, wherein the alternative cache placement plan comprises an alternative combination of output datasets of a second subset of the operations in the dataflow to cache, using one or more alternative cache operations in the dataflow, relative to the current cache placement plan;
obtaining an alternative cache gain estimate for the alternative cache placement plan;
implementing, using the at least one processing device, the alternative cache placement plan in conjunction with the predefined new plan implementation criteria being satisfied; and
selecting a next action of the dataflow to execute from a plurality of remaining actions in the dataflow based on the predefined next action policy.
3 Assignments
0 Petitions
Accused Products
Abstract
A dataflow execution environment is provided with dynamic placement of cache operations and action execution ordering. An exemplary method comprises: obtaining a current cache placement plan for a dataflow comprised of multiple operations and a corresponding current cache gain estimate; selecting an action to execute from a plurality of remaining dataflow actions based on a predefined policy; executing one or more operations in a lineage of the selected action and estimating an error as a difference in an observed execution time and an estimated execution time given by a cost model; obtaining an alternative cache placement plan for the dataflow following the execution in conjunction with a predefined new plan determination criteria being satisfied and a corresponding alternative cache gain estimate; implementing the alternative cache placement plan in conjunction with a predefined new plan implementation criteria being satisfied; and selecting a next action to execute from a plurality of remaining actions in the dataflow based on a predefined policy.
-
Citations
20 Claims
-
1. A method, comprising the steps of:
-
obtaining a cost model for the execution of operations of a dataflow in a parallel processing framework with a given infrastructure and input dataset; obtaining a current cache placement plan for the dataflow, wherein the current cache placement plan comprises a combination of output datasets of a subset of the operations in the dataflow to cache, using one or more cache operations in the dataflow, based on an estimated reduction in a total execution cost for the dataflow in conjunction with the current cache placement plan being implemented given an input dataset; obtaining a current cache gain estimate for the current cache placement plan; selecting an action of the dataflow to execute from a plurality of remaining actions in the dataflow based on a predefined next action policy that selects the next action of the dataflow to execute from the plurality of remaining actions in the dataflow; executing one or more operations in a lineage of the selected action of the dataflow; determining, using at least one processing device, an alternative cache placement plan for the dataflow following the execution in conjunction with a predefined new plan determination criteria being satisfied, wherein the alternative cache placement plan comprises an alternative combination of output datasets of a second subset of the operations in the dataflow to cache, using one or more alternative cache operations in the dataflow, relative to the current cache placement plan; obtaining an alternative cache gain estimate for the alternative cache placement plan; implementing, using the at least one processing device, the alternative cache placement plan in conjunction with the predefined new plan implementation criteria being satisfied; and selecting a next action of the dataflow to execute from a plurality of remaining actions in the dataflow based on the predefined next action policy. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
-
-
12. A computer program product, comprising a tangible machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps:
-
obtaining a cost model for the execution of operations of a dataflow in a parallel processing framework with a given infrastructure and input dataset; obtaining a current cache placement plan for the dataflow, wherein the current cache placement plan comprises a combination of output datasets of a subset of the operations in the dataflow to cache, using one or more cache operations in the dataflow, based on an estimated reduction in a total execution cost for the dataflow in conjunction with the current cache placement plan being implemented given an input dataset; obtaining a current cache gain estimate for the current cache placement plan; selecting an action of the dataflow to execute from a plurality of remaining actions in the dataflow based on a predefined next action next action policy that selects the next action of the dataflow to execute from the plurality of remaining actions in the dataflow; executing one or more operations in a lineage of the selected action of the dataflow; determining an alternative cache placement plan for the dataflow following the execution in conjunction with a predefined new plan determination criteria being satisfied, wherein the alternative cache placement plan comprises an alternative combination of output datasets of a second subset of the operations in the dataflow to cache, using one or more alternative cache operations in the dataflow, relative to the current cache placement plan; obtaining an alternative cache gain estimate for the alternative cache placement plan; implementing, using the at least one processing device, the alternative cache placement plan in conjunction with the predefined new plan implementation criteria being satisfied; and selecting a next action of the dataflow to execute from a plurality of remaining actions in the dataflow based on the predefined next action policy. - View Dependent Claims (13, 14, 15)
-
-
16. A system, comprising:
-
a memory; and at least one processing device, coupled to the memory, operative to implement the following steps; obtaining a cost model for the execution of operations of a dataflow in a parallel processing framework with a given infrastructure and input dataset; obtaining a cost model for the execution of operations of a dataflow in a parallel processing framework with a given infrastructure and input dataset; obtaining a current cache placement plan for the dataflow, wherein the current cache placement plan comprises a combination of output datasets of a subset of the operations in the dataflow to cache, using one or more cache operations in the dataflow, based on an estimated reduction in a total execution cost for the dataflow in conjunction with the current cache placement plan being implemented given an input dataset; obtaining a current cache gain estimate for the current cache placement plan; selecting an action of the dataflow to execute from a plurality of remaining actions in the dataflow based on a predefined next action next action policy that selects the next action of the dataflow to execute from the plurality of remaining actions in the dataflow; executing one or more operations in a lineage of the selected action of the dataflow; determining an alternative cache placement plan for the dataflow following the execution in conjunction with a predefined new plan determination criteria being satisfied, wherein the alternative cache placement plan comprises an alternative combination of output datasets of a second subset of the operations in the dataflow to cache, using one or more alternative cache operations in the dataflow, relative to the current cache placement plan; obtaining an alternative cache gain estimate for the alternative cache placement plan; implementing, using the at least one processing device, the alternative cache placement plan in conjunction with the predefined new plan implementation criteria being satisfied; and selecting a next action of the dataflow to execute from a plurality of remaining actions in the dataflow based on the predefined next action policy. - View Dependent Claims (17, 18, 19, 20)
-
Specification