IMPLEMENTING PARAMETER SERVER IN NETWORKING INFRASTRUCTURE FOR HIGH-PERFORMANCE COMPUTING
Abstract
Techniques are provided for implementing a parameter server within a networking infrastructure of a computing system to reduce the communication bandwidth and latency for performing communication synchronization operations of the parameter server. For example, a method includes executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system, and executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
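The abstract describes the core data flow of the claimed method: each worker computes local model parameters (e.g., gradients) on its shard of the training data, the parameter server aggregates those contributions, and the aggregate is distributed back to every worker. The Python sketch below illustrates just that aggregate-and-distribute step; the class and method names (ParameterServer, aggregate), the averaging rule, and the fixed learning rate are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

class ParameterServer:
    """Toy parameter server: averages local updates and redistributes parameters."""

    def __init__(self, num_params):
        self.params = np.zeros(num_params)  # global model parameters

    def aggregate(self, local_updates):
        # Average the per-worker updates (e.g., gradients) and apply them
        # to the global parameters; return the result for redistribution.
        mean_update = np.mean(local_updates, axis=0)
        self.params -= 0.01 * mean_update   # assumed fixed learning rate
        return self.params

server = ParameterServer(num_params=4)
# Each worker would compute its update on local data; random stand-ins here.
local_updates = [np.random.randn(4) for _ in range(3)]
print("aggregated parameters:", server.aggregate(local_updates))
```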
Claims (20)

1. A method, comprising:
executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system; and
executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
(Dependent claims 2-10)
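Claim 1 covers both halves of a synchronization round: workers compute local model parameters, and the parameter server aggregates and redistributes them. One conventional realization of that round is a bulk-synchronous push/aggregate/pull loop, simulated below with threads standing in for worker nodes. This is a minimal sketch under assumed synchronous training with an averaging update; it does not reproduce the in-network placement the claim actually recites.

```python
import threading
import numpy as np

NUM_WORKERS = 3
PARAMS = np.zeros(4)                  # global parameters held by the server role
GRADS = [None] * NUM_WORKERS          # one gradient slot per worker
BARRIER = threading.Barrier(NUM_WORKERS)

def server_update():
    """Aggregate the pushed gradients and update the global parameters."""
    global PARAMS
    PARAMS = PARAMS - 0.01 * np.mean(GRADS, axis=0)

def worker(rank, steps=5):
    for _ in range(steps):
        local = PARAMS.copy()          # pull the current parameters
        # A real worker would run forward/backward on its data shard here;
        # we fabricate a gradient that depends on the pulled parameters.
        GRADS[rank] = local + np.random.randn(local.size)  # push local gradient
        if BARRIER.wait() == 0:        # exactly one thread per round gets 0...
            server_update()            # ...and performs the aggregation
        BARRIER.wait()                 # all workers then see the updated params

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("parameters after synchronous training:", PARAMS)
```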
11. An article of manufacture comprising a processor-readable storage medium having stored program code of one or more software programs, wherein the program code is executable by one or more processors to implement method steps comprising:
executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system; and
executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
(Dependent claims 12-15)
16. A computing system, comprising:
a server cluster comprising a plurality of server nodes, wherein the server nodes comprise accelerator devices configured to execute a plurality of worker nodes to perform a distributed deep learning (DL) model training process to train model parameters of a DL model; and
networking infrastructure to network connect the plurality of server nodes within the server cluster, wherein the networking infrastructure is configured to execute a parameter server which aggregates local model parameters computed by the plurality of worker nodes and distributes aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
(Dependent claims 17-20)
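Claim 16 places the parameter server inside the networking infrastructure itself, i.e., in a device such as a switch or smart NIC on the path between the server nodes, rather than on a host. Network devices of this kind generally operate on small fixed-point integers rather than floats, so one plausible realization (an assumption for illustration, not a claim limitation) quantizes each gradient chunk at the worker, sums the integer chunks in the device as they arrive, and multicasts the sum back once all workers have contributed:

```python
import numpy as np

SCALE = 1 << 16  # fixed-point scale; in-network devices typically sum integers

class InNetworkAggregator:
    """Toy model of a switch-hosted aggregator summing quantized chunks."""

    def __init__(self, num_workers, chunk_size):
        self.num_workers = num_workers
        self.slots = np.zeros(chunk_size, dtype=np.int64)  # running chunk sum
        self.seen = 0

    def receive(self, int_chunk):
        # A worker's quantized gradient chunk arrives as a packet payload.
        self.slots += int_chunk
        self.seen += 1
        if self.seen == self.num_workers:      # last contributor triggers
            result = self.slots.copy()         # "multicast" the sum back
            self.slots[:] = 0                  # reset slots for the next chunk
            self.seen = 0
            return result
        return None

def quantize(grad):                            # float -> fixed point, at worker
    return np.round(grad * SCALE).astype(np.int64)

def dequantize(total, num_workers):            # integer sum -> averaged floats
    return total.astype(np.float64) / (SCALE * num_workers)

switch = InNetworkAggregator(num_workers=3, chunk_size=4)
for grad in [np.random.randn(4) for _ in range(3)]:
    summed = switch.receive(quantize(grad))
    if summed is not None:
        print("averaged gradient from the network:", dequantize(summed, 3))
```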
Specification