Dynamic cache management for in-memory data analytic platforms
Abstract
At a cache manager of a directed acyclic graph-based data analytic platform, from each of a plurality of monitor components on a plurality of worker nodes, statistics are obtained for a plurality of tasks, including which of the tasks have been processed and which are in a task queue. Each of the tasks has at least one associated distributed dataset. Each worker has a distributed dataset cache. A current stage directed acyclic graph is obtained from a directed acyclic graph scheduler component. For a given one of the tasks which has been processed, and for which it is determined that no other ones of the tasks depend on the at least one distributed dataset associated with the given one of the tasks, the distributed dataset is evicted from a corresponding one of the distributed dataset caches.
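The eviction rule summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the `Task` record, the `evictable_datasets` helper, and the choice to treat every not-yet-processed task in the current stage graph as a possible dependent are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record: one output dataset plus the datasets it reads."""
    task_id: str
    dataset: str
    depends_on: set = field(default_factory=set)


def evictable_datasets(tasks, processed):
    """Return datasets of processed tasks that no unprocessed task still reads.

    tasks: dict mapping task id -> Task, standing in for the current stage DAG.
    processed: set of task ids reported as already processed.
    """
    # Collect every dataset that a not-yet-processed task depends on.
    still_needed = set()
    for task_id, task in tasks.items():
        if task_id not in processed:
            still_needed |= task.depends_on
    # A processed task's dataset is evictable when nothing pending reads it.
    return sorted(
        tasks[t].dataset for t in processed
        if tasks[t].dataset not in still_needed
    )


# Assumed three-task chain: t1 -> A, t2 reads A -> B, t3 reads B -> C.
stage = {
    "t1": Task("t1", "A"),
    "t2": Task("t2", "B", {"A"}),
    "t3": Task("t3", "C", {"B"}),
}
result = evictable_datasets(stage, processed={"t1", "t2"})
```

In this assumed chain, dataset A can be dropped once t2 has run, while B stays cached because t3 has not yet been processed.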
23 Citations
20 Claims
1. A method comprising:
obtaining, at a cache manager of a directed acyclic graph-based data analytic platform, from each of a plurality of monitor components on a plurality of worker nodes of said directed acyclic graph-based data analytic platform, statistics for a plurality of tasks executing on said worker nodes, said statistics comprising which of said tasks have been processed and which are in a task queue, each of said tasks having at least one distributed dataset associated therewith, each of said worker nodes having a distributed dataset cache;
obtaining, at said cache manager, from a directed acyclic graph scheduler component of said directed acyclic graph-based data analytic platform, a current stage directed acyclic graph;
for a given one of said tasks which has been processed, and for which, based on said current stage directed acyclic graph, it is determined that no other ones of said tasks depend on said at least one distributed dataset associated with said given one of said tasks, evicting said distributed dataset associated with said given one of said tasks from a corresponding one of said distributed dataset caches;
monitoring, with said monitor components, memory usage statistics for said worker nodes of said directed acyclic graph-based data analytic platform; and
increasing a size of a given resilient distributed dataset cache of a plurality of resilient distributed dataset caches if said memory usage statistics indicate that corresponding ones of said tasks are using too little memory, said distributed dataset caches comprising said resilient distributed dataset caches, said memory usage statistics comprising garbage collection time.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
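The final step of claim 1 ties cache growth to memory usage statistics that include garbage collection time. A minimal sketch of one way such a rule might look follows. The function name, the use of a GC-time-to-run-time ratio as the signal, and every threshold and step value are hypothetical; the claim itself does not specify them.

```python
def next_cache_fraction(current, gc_time_ratio,
                        low_gc=0.05, step=0.05, max_fraction=0.8):
    """Return the cache's share of worker memory for the next interval.

    current: present cache share of worker memory (0..1).
    gc_time_ratio: garbage collection time divided by task run time.
    low_gc, step, max_fraction: assumed tuning values, not from the source.
    """
    # A low GC ratio is taken here as a sign that tasks use little memory,
    # so the cache is allowed to grow by one step, up to a fixed ceiling.
    if gc_time_ratio < low_gc:
        return min(max_fraction, round(current + step, 4))
    # Otherwise leave the cache size unchanged.
    return current


grown = next_cache_fraction(0.5, gc_time_ratio=0.01)
unchanged = next_cache_fraction(0.5, gc_time_ratio=0.20)
capped = next_cache_fraction(0.78, gc_time_ratio=0.01)
```

The sketch only grows the cache, mirroring the claim language; a real system would presumably also need a rule for shrinking it.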
10. A directed acyclic graph-based data analytic platform comprising:
a plurality of worker nodes;
a plurality of monitor components on said plurality of worker nodes;
a plurality of distributed dataset caches on said plurality of worker nodes;
a directed acyclic graph scheduler component; and
a cache manager, coupled to said plurality of monitor components and said directed acyclic graph scheduler component, which:
obtains, from each of said plurality of monitor components, statistics for a plurality of tasks executing on said worker nodes, said statistics comprising which of said tasks have been processed and which are in a task queue, each of said tasks having at least one distributed dataset associated therewith;
obtains, from said directed acyclic graph scheduler component, a current stage directed acyclic graph;
for a given one of said tasks which has been processed, and for which, based on said current stage directed acyclic graph, it is determined that no other ones of said tasks depend on said at least one distributed dataset associated with said given one of said tasks, sends instructions to evict said distributed dataset associated with said given one of said tasks from a corresponding one of said distributed dataset caches;
wherein said monitor components monitor memory usage statistics for said worker nodes of said directed acyclic graph-based data analytic platform, and wherein said cache manager facilitates increasing a size of a given resilient distributed dataset cache of a plurality of resilient distributed dataset caches if said memory usage statistics indicate that corresponding ones of said tasks are using too little memory, said distributed dataset caches comprising said resilient distributed dataset caches, said memory usage statistics comprising garbage collection time.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
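Claim 10 describes the same logic as a set of cooperating components: monitors report per-worker statistics, and the cache manager sends eviction instructions to individual workers. The sketch below shows one possible shape for that exchange. The `MonitorReport` record, the `eviction_instructions` function, and the representation of the stage graph as a plain dictionary are all assumptions for illustration, not structures named in the source.

```python
from dataclasses import dataclass


@dataclass
class MonitorReport:
    """Hypothetical statistics message sent by one worker's monitor."""
    worker: str
    processed: set   # task ids finished on this worker
    queued: set      # task ids waiting in this worker's task queue
    cached: dict     # task id -> dataset name held in this worker's cache


def eviction_instructions(reports, stage_dag):
    """Map each worker to the datasets it should evict from its cache.

    stage_dag: dict mapping task id -> set of dataset names the task reads,
    an assumed stand-in for the current stage directed acyclic graph.
    """
    # Merge the per-worker views into one set of processed task ids.
    processed = set().union(*(r.processed for r in reports))
    # Any dataset read by an unprocessed task must stay cached.
    needed = set()
    for task_id, inputs in stage_dag.items():
        if task_id not in processed:
            needed |= inputs
    plan = {}
    for r in reports:
        drop = sorted(ds for t, ds in r.cached.items()
                      if t in r.processed and ds not in needed)
        if drop:
            plan[r.worker] = drop
    return plan


reports = [
    MonitorReport("w1", {"t1"}, {"t3"}, {"t1": "A"}),
    MonitorReport("w2", {"t2"}, set(), {"t2": "B"}),
]
stage_dag = {"t1": set(), "t2": {"A"}, "t3": {"B"}}
plan = eviction_instructions(reports, stage_dag)
```

Here worker w1 is told to evict A, while w2 keeps B because the queued task t3 still reads it.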
19. A non-transitory computer readable medium comprising computer executable instructions which when executed by a computer cause the computer to perform a method comprising:
obtaining, at a cache manager of a directed acyclic graph-based data analytic platform, from each of a plurality of monitor components on a plurality of worker nodes of said directed acyclic graph-based data analytic platform, statistics for a plurality of tasks executing on said worker nodes, said statistics comprising which of said tasks have been processed and which are in a task queue, each of said tasks having at least one distributed dataset associated therewith, each of said worker nodes having a distributed dataset cache;
obtaining, at said cache manager, from a directed acyclic graph scheduler component of said directed acyclic graph-based data analytic platform, a current stage directed acyclic graph;
for a given one of said tasks which has been processed, and for which, based on said current stage directed acyclic graph, it is determined that no other ones of said tasks depend on said at least one distributed dataset associated with said given one of said tasks, evicting said distributed dataset associated with said given one of said tasks from a corresponding one of said distributed dataset caches;
wherein said monitor components monitor memory usage statistics for said worker nodes of said directed acyclic graph-based data analytic platform, and wherein said cache manager facilitates increasing a size of a given resilient distributed dataset cache of a plurality of resilient distributed dataset caches if said memory usage statistics indicate that corresponding ones of said tasks are using too little memory, said distributed dataset caches comprising said resilient distributed dataset caches, said memory usage statistics comprising garbage collection time.
Dependent claims: 20
Specification