TRAINING MACHINE LEARNING MODELS IN DISTRIBUTED COMPUTING SYSTEMS
1 Assignment
0 Petitions
Abstract
Certain aspects of the present disclosure provide methods and systems for training a machine learning model, such as a neural network or deep learning model, in a distributed computing system. In some embodiments, aspects of the machine learning model are trained within containers distributed amongst nodes in the distributed computing environment.
21 Citations
20 Claims
1. A method for training a machine learning model in a distributed computing system, comprising:
receiving a model training request;
receiving a training data set;
determining a processing node available in a distributed computing system;
receiving static status information regarding the processing node;
causing a first container to be installed at the processing node based on the static status information, the first container being configured with a model training application;
causing a second container to be installed at the processing node based on the static status information, the second container being configured with the model training application;
assigning a first layer of a model to be trained by the model training application in the first container;
assigning a second layer of the model to be trained by the model training application in the second container;
receiving parameter data from the model training application in the first container and the model training application in the second container; and
calculating a model parameter based on the parameter data.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9)
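The claims recite the placement steps without fixing a data format or a node-selection policy. The Python sketch below is one illustrative reading, not the patentee's implementation: the `StaticStatus` fields, the one-container-per-GPU cap, the round-robin layer assignment, and the image name `model-trainer:latest` are all hypothetical. Container installation is only simulated with in-memory records, and the received training data set is not modeled.

```python
"""Hypothetical sketch of the container-placement steps recited in claim 1.

Not the patentee's implementation: the status fields, the capacity policy,
and the round-robin layer assignment are assumptions for illustration.
"""
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StaticStatus:
    # Assumed "static" facts about a node: properties fixed for its lifetime.
    node_id: str
    gpu_count: int
    memory_gb: int


@dataclass
class Container:
    # Stand-in for an installed container running the model training application.
    container_id: str
    node_id: str
    image: str
    assigned_layers: List[str] = field(default_factory=list)


def determine_available_node(availability: Dict[str, bool]) -> str:
    """Return the first node flagged as available (hypothetical selection policy)."""
    for node_id, available in availability.items():
        if available:
            return node_id
    raise RuntimeError("no processing node available")


def install_containers(status: StaticStatus, image: str, requested: int) -> List[Container]:
    """Simulate installing containers, capped at one per GPU (assumed policy)."""
    count = min(requested, status.gpu_count)
    if count < 2:
        raise RuntimeError(f"{status.node_id} cannot host two training containers")
    return [
        Container(f"{status.node_id}-c{i}", status.node_id, image)
        for i in range(count)
    ]


def assign_layers(layers: List[str], containers: List[Container]) -> None:
    """Assign layer i to container i mod n, so consecutive layers land in different containers."""
    for i, layer in enumerate(layers):
        containers[i % len(containers)].assigned_layers.append(layer)


def handle_training_request(
    layers: List[str],
    availability: Dict[str, bool],
    statuses: Dict[str, StaticStatus],
    image: str = "model-trainer:latest",  # hypothetical image name
) -> List[Container]:
    """Run the placement steps in claim order: pick a node, read its status, install, assign."""
    node_id = determine_available_node(availability)
    status = statuses[node_id]
    containers = install_containers(status, image, requested=len(layers))
    assign_layers(layers, containers)
    return containers


if __name__ == "__main__":
    availability = {"node-a": False, "node-b": True}
    statuses = {
        "node-a": StaticStatus("node-a", gpu_count=4, memory_gb=128),
        "node-b": StaticStatus("node-b", gpu_count=2, memory_gb=64),
    }
    for c in handle_training_request(["layer1", "layer2"], availability, statuses):
        print(c.container_id, c.assigned_layers)
```

Running the script places both containers on `node-b`, the only node marked available, and prints `node-b-c0 ['layer1']` followed by `node-b-c1 ['layer2']`. Capping the container count by a static property such as GPU count is simply one way to make installation depend on the static status information.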
10. An apparatus for managing deployment of distributed computing resources, comprising:
a memory comprising computer-executable instructions; and
a processor in data communication with the memory and configured to execute the computer-executable instructions and cause the apparatus to perform a method for training a machine learning model in a distributed computing system, the method comprising:
receiving a model training request;
receiving a training data set;
determining a processing node available in a distributed computing system;
receiving static status information regarding the processing node;
causing a first container to be installed at the processing node based on the static status information, the first container being configured with a model training application;
causing a second container to be installed at the processing node based on the static status information, the second container being configured with the model training application;
assigning a first layer of a model to be trained by the model training application in the first container;
assigning a second layer of the model to be trained by the model training application in the second container;
receiving parameter data from the model training application in the first container and the model training application in the second container; and
calculating a model parameter based on the parameter data.
(Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18)
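The claims do not say how the model parameter is calculated from the received parameter data. As an assumption only, the sketch below treats the parameter data as per-layer gradients, averages any reports that cover the same layer, and applies one plain stochastic gradient descent step. The report format and the `aggregate_gradients` and `calculate_parameters` names are hypothetical.

```python
"""Hypothetical sketch of "calculating a model parameter based on the parameter data".

Assumptions (not from the claims): each container reports per-layer gradients
as plain lists of floats, reports for the same layer are averaged, and the
new parameter is produced by one plain SGD step.
"""
from collections import defaultdict
from typing import Dict, List

Report = Dict[str, List[float]]  # layer name -> gradient values from one container


def aggregate_gradients(reports: List[Report]) -> Dict[str, List[float]]:
    """Average the gradients that different containers report for the same layer."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for report in reports:
        for layer, grad in report.items():
            if layer not in sums:
                sums[layer] = [0.0] * len(grad)
            elif len(grad) != len(sums[layer]):
                raise ValueError(f"gradient size mismatch for {layer}")
            sums[layer] = [s + g for s, g in zip(sums[layer], grad)]
            counts[layer] += 1
    return {layer: [s / counts[layer] for s in total] for layer, total in sums.items()}


def calculate_parameters(
    weights: Dict[str, List[float]], reports: List[Report], learning_rate: float = 0.01
) -> Dict[str, List[float]]:
    """Apply one SGD step, w <- w - lr * g, to every layer that received a gradient."""
    updated = dict(weights)  # layers with no report keep their current values
    for layer, grad in aggregate_gradients(reports).items():
        if len(grad) != len(weights[layer]):
            raise ValueError(f"weight/gradient size mismatch for {layer}")
        updated[layer] = [w - learning_rate * g for w, g in zip(weights[layer], grad)]
    return updated


if __name__ == "__main__":
    weights = {"layer1": [1.0, 2.0], "layer2": [0.5]}
    # One report per container: container 0 trained layer1, container 1 trained layer2.
    reports = [{"layer1": [0.5, -1.0]}, {"layer2": [2.0]}]
    print(calculate_parameters(weights, reports, learning_rate=0.5))
```

With a learning rate of 0.5 the example prints `{'layer1': [0.75, 2.5], 'layer2': [-0.5]}`. Because each layer in the claims is trained by a different container, averaging only matters if several containers report the same layer; otherwise aggregation reduces to collecting each container's result.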
19. A non-transitory computer-readable medium comprising instructions for performing a method for training a machine learning model in a distributed computing system, the method comprising:
receiving a model training request;
receiving a training data set;
determining a processing node available in a distributed computing system;
receiving static status information regarding the processing node;
causing a first container to be installed at the processing node based on the static status information, the first container being configured with a model training application;
causing a second container to be installed at the processing node based on the static status information, the second container being configured with the model training application;
assigning a first layer of a model to be trained by the model training application in the first container;
assigning a second layer of the model to be trained by the model training application in the second container;
receiving parameter data from the model training application in the first container and the model training application in the second container; and
calculating a model parameter based on the parameter data.
(Dependent claim: 20)
Specification