Cache management for map-reduce applications
First Claim
1. A method for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the method comprising:
training a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the first parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node;
receiving, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node;
receiving, by the computer, first parameters for processing the map request;
determining, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the optimal cache slice size is determined based on the received first parameters for processing the map request;
reading, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node;
processing, by the computing node, the map request; and
writing, by the computing node, a final result data of the map request processing to the one or more storage medium.
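The training and determining steps of the claim can be illustrated with a small sketch. The claim does not name a model type, so the nearest-neighbour lookup below is only a stand-in chosen for brevity; the names `MapRecord`, `SliceSizeModel`, `train`, and `predict` are hypothetical and do not come from the patent. The sketch stores historical map-task records and, for a new request, returns the slice size that gave the lowest time per byte among the most similar past tasks.

```python
from dataclasses import dataclass
from math import log2
from typing import Iterable, List


@dataclass(frozen=True)
class MapRecord:
    """One historical map task (hypothetical record layout)."""
    total_bytes: int      # total data size that was processed
    record_bytes: int     # size of each data record
    concurrent_maps: int  # map tasks running simultaneously on the node
    slice_bytes: int      # cache slice size that was used
    seconds: float        # observed processing time


class SliceSizeModel:
    """Nearest-neighbour stand-in for the claimed machine learning model."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.history: List[MapRecord] = []

    def train(self, records: Iterable[MapRecord]) -> None:
        # "Training" here just memorises the historical records.
        self.history = list(records)

    @staticmethod
    def _distance(rec: MapRecord, total_bytes: int,
                  record_bytes: int, concurrent_maps: int) -> float:
        # Sizes are compared on a log scale so that 1 GiB vs 2 GiB counts
        # the same as 1 MiB vs 2 MiB. All sizes must be positive.
        return (abs(log2(rec.total_bytes) - log2(total_bytes))
                + abs(log2(rec.record_bytes) - log2(record_bytes))
                + abs(rec.concurrent_maps - concurrent_maps))

    def predict(self, total_bytes: int, record_bytes: int,
                concurrent_maps: int) -> int:
        if not self.history:
            raise ValueError("model has no historical records")
        # Take the k most similar past tasks ...
        nearest = sorted(
            self.history,
            key=lambda r: self._distance(
                r, total_bytes, record_bytes, concurrent_maps),
        )[: self.k]
        # ... and pick the one with the lowest processing time per byte.
        best = min(nearest, key=lambda r: r.seconds / r.total_bytes)
        return best.slice_bytes
```

Normalising by `total_bytes` keeps tasks of different sizes comparable; a regression model fitted on the same three parameters would be an equally plausible reading of the claim.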
1 Assignment
0 Petitions
Abstract
A computer manages a cache for a MapReduce application based on a distributed file system that includes one or more storage medium by receiving a map request and receiving parameters for processing the map request. The parameters include a total data size to be processed, a size of each data record, and a number of map requests executing simultaneously. The computer determines a cache size for processing the map request based on the received parameters and on a machine learning model for a map request cache size. Based on the determined cache size, the computer reads data from the one or more storage medium of the distributed file system into the cache. The computer then processes the map request and writes intermediate result data of the map request processing into the cache, based on the determined cache size.
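The read, process, and write flow in the abstract can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: records are fixed-size, the input behaves like a buffered binary file (so `read(n)` returns `n` bytes until end of file), and an in-memory list stands in for the cache region that holds intermediate results. The function names `read_slices` and `run_map` are hypothetical.

```python
import io
from typing import BinaryIO, Callable, Iterator, List


def read_slices(source: BinaryIO, slice_bytes: int,
                record_bytes: int) -> Iterator[bytes]:
    """Yield successive cache slices aligned to record boundaries."""
    if record_bytes <= 0 or slice_bytes < record_bytes:
        raise ValueError("a slice must hold at least one record")
    # Round the slice down to a whole number of records so that no
    # record is split across two slices.
    aligned = (slice_bytes // record_bytes) * record_bytes
    while True:
        chunk = source.read(aligned)
        if not chunk:
            return
        yield chunk


def run_map(source: BinaryIO, sink: BinaryIO, slice_bytes: int,
            record_bytes: int, map_fn: Callable[[bytes], bytes]) -> int:
    """Process the input slice by slice and return the record count."""
    records_done = 0
    for cache_slice in read_slices(source, slice_bytes, record_bytes):
        # Intermediate results for this slice; a stand-in for the cache.
        intermediate: List[bytes] = []
        for offset in range(0, len(cache_slice), record_bytes):
            record = cache_slice[offset:offset + record_bytes]
            # A trailing record may be shorter if the input length is
            # not a multiple of record_bytes.
            intermediate.append(map_fn(record))
        # Flush this slice's results to the output storage.
        sink.write(b"".join(intermediate))
        records_done += len(intermediate)
    return records_done
```

For example, calling `run_map(io.BytesIO(data), io.BytesIO(), slice_bytes, record_bytes, bytes.upper)` processes `data` in slices of at most `slice_bytes`, where `slice_bytes` would be the value chosen by the model for the current request.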
41 Citations
21 Claims
1. A method for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the method comprising:
training a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the first parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node;
receiving, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node;
receiving, by the computer, first parameters for processing the map request;
determining, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the optimal cache slice size is determined based on the received first parameters for processing the map request;
reading, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node;
processing, by the computing node, the map request; and
writing, by the computing node, a final result data of the map request processing to the one or more storage medium.
8. A computer program product for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the computer program product comprising one or more computer readable storage medium and program instructions stored on at least one of the one or more computer readable storage medium, the program instructions comprising:
program instructions to train a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the first parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node;
program instructions to receive, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node;
program instructions to receive, by the computer, first parameters for processing the map request;
program instructions to determine, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the cache slice size is determined based on the received first parameters for processing the map request;
program instructions to read, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node;
program instructions to process, by the computing node, the map request; and
program instructions to write, by the computing node, a final result data of the map request processing to the one or more storage medium.
15. A computer system for optimizing a cache on a computing node for a MapReduce application on a distributed file system, the computer system comprising one or more processors, one or more computer readable memories, one or more computer readable tangible storage medium, and program instructions stored on at least one of the one or more storage medium for execution by at least one of the one or more processors via at least one of the one or more memories, the program instructions comprising:
program instructions to train a first machine learning model to determine an optimal cache slice size on the computing node for processing a map request in a shortest processing time based on first parameters in historical records for previously executed map tasks on the computing node, the parameters including a first total data size to be processed, a first size of each data record, and a number of map tasks that will execute simultaneously on the computing node;
program instructions to receive, by a computer, the map request for the MapReduce application on the distributed file system that includes one or more storage medium connected to the computing node;
program instructions to receive, by the computer, first parameters for processing the map request;
program instructions to determine, by the trained first machine learning model, the optimal cache slice size for the computing node for processing the map request corresponding to the shortest processing time of the map request, wherein the cache slice size is determined based on the received first parameters for processing the map request;
program instructions to read, by the computing node, based on the determined optimal cache slice size, data from the one or more storage medium of the distributed file system into the cache of the computing node;
program instructions to process, by the computing node, the map request; and
program instructions to write, by the computing node, a final result data of the map request processing to the one or more storage medium.
Specification