IMPLEMENTING PARAMETER SERVER IN NETWORKING INFRASTRUCTURE FOR HIGH-PERFORMANCE COMPUTING

US 20190325302A1
Filed: 04/23/2018
Published: 10/24/2019
Est. Priority Date: 04/23/2018
Status: Active Grant

Abstract

Techniques are provided for implementing a parameter server within a networking infrastructure of a computing system to reduce the communication bandwidth and latency for performing communication synchronization operations of the parameter server. For example, a method includes executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system, and executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
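The aggregate-then-distribute cycle described in the abstract can be illustrated with a minimal sketch. It assumes element-wise averaging as the aggregation rule, which the abstract does not specify, and it runs in ordinary host Python rather than on a network device. The names `aggregate`, `distribute`, and `Params` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def aggregate(local_params: List[Params]) -> Params:
    """Combine the workers' local parameters by element-wise averaging.

    Averaging is an assumption made for this sketch; the source only says
    that the parameter server aggregates local model parameters.
    """
    if not local_params:
        raise ValueError("need at least one worker update")
    n = len(local_params)
    return {
        name: [sum(vals) / n for vals in zip(*(p[name] for p in local_params))]
        for name in local_params[0]
    }


def distribute(aggregated: Params, num_workers: int) -> List[Params]:
    """Give every worker its own copy of the aggregated parameters."""
    return [{k: list(v) for k, v in aggregated.items()} for _ in range(num_workers)]


# Two workers report local parameters for a single tensor named "w".
workers = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
global_params = aggregate(workers)
copies = distribute(global_params, len(workers))
```

In the arrangement the abstract describes, both steps would run inside the networking infrastructure rather than on a separate host; the sketch shows only the data flow.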

20 Claims

1. A method, comprising:
- executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system; and
  
  executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
  - 2. The method of claim 1, wherein the plurality of worker nodes comprise virtual worker nodes that execute on hardware accelerator devices.
  - 3. The method of claim 1, wherein executing the parameter server within the networking infrastructure of the computing system comprises executing a parameter server node on a physical network device of the networking infrastructure.
  - 4. The method of claim 3, wherein the physical network device comprises a network interface card installed in a server node.
  - 5. The method of claim 3, wherein the physical network device comprises a computational switch device which is network connected to the one or more server nodes of the computing system.
  - 6. The method of claim 1, wherein executing the parameter server within the networking infrastructure of the computing system comprises executing a parameter server node on a virtual network element connected to or executing on a server node in the computing system.
  - 7. The method of claim 6, wherein the virtual network element comprises one of a virtual network interface card and a virtual switch.
  - 8. The method of claim 1, wherein executing the parameter server within the networking infrastructure of the computing system comprises distributing the parameter server over a plurality of network elements within the networking infrastructure of the computing system.
  - 9. The method of claim 8, wherein distributing the parameter server over the plurality of network elements comprises:
    - logically dividing the parameter server into a plurality of local parameter server nodes; and
      
      executing the local parameter server nodes on different network interface cards installed in different server nodes of the computing system.
  - 10. The method of claim 8, wherein distributing the parameter server over the plurality of network elements comprises:
    - logically dividing the parameter server into a plurality of local parameter server nodes;
      
      executing the local parameter server nodes using different network elements of the networking infrastructure of the computing system;
      
      designating one of the local parameter server nodes of the parameter server to be a master parameter server node; and
      
      utilizing the master parameter server node to aggregate the local model parameters provided by other parameter server nodes of the parameter server, and to distribute the aggregated model parameter to the other parameter server nodes of the parameter server;
      
      wherein at least one set of local model parameters received by the master parameter server node from one other local parameter server node comprises a local-aggregated set of model parameters computed by the other local parameter server node using local model parameters received from two or more of the plurality of worker nodes.
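The two-level aggregation recited in claim 10 can be sketched as follows. The sketch assumes each local parameter server node forwards a sum together with its worker count, so that the master node can compute an exact global mean. That encoding, the averaging rule, and the names `local_aggregate` and `master_aggregate` are illustrative assumptions, not details taken from the claims.

```python
from typing import List, Tuple

Vector = List[float]


def local_aggregate(worker_params: List[Vector]) -> Tuple[Vector, int]:
    """Local parameter server node: sum its workers' vectors and keep the count."""
    summed = [sum(col) for col in zip(*worker_params)]
    return summed, len(worker_params)


def master_aggregate(partials: List[Tuple[Vector, int]]) -> Vector:
    """Master node: merge the locally aggregated sums into a global mean."""
    total_workers = sum(count for _, count in partials)
    summed = [sum(col) for col in zip(*(vec for vec, _ in partials))]
    return [s / total_workers for s in summed]


# Node A serves two workers, node B serves one (hypothetical topology).
node_a = local_aggregate([[1.0, 2.0], [3.0, 4.0]])
node_b = local_aggregate([[5.0, 6.0]])

# The master combines the partial results; it would then send the result
# back to the other local nodes for delivery to their workers.
global_params = master_aggregate([node_a, node_b])
```

Forwarding one locally aggregated set per node, rather than one set per worker, reduces the traffic that reaches the master node.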

11. An article of manufacture comprising a processor-readable storage medium having stored program code of one or more software programs, wherein the program code is executable by one or more processors to implement method steps comprising:
- executing a distributed deep learning (DL) model training process to train model parameters of a DL model using a plurality of worker nodes executing on one or more server nodes of a computing system; and
  
  executing a parameter server within a networking infrastructure of the computing system to aggregate local model parameters computed by the plurality of worker nodes and to distribute aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
  - 12. The article of manufacture of claim 11, wherein the plurality of worker nodes comprise virtual worker nodes that execute on hardware accelerator devices.
  - 13. The article of manufacture of claim 11, wherein executing the parameter server within the networking infrastructure of the computing system comprises executing a parameter server node on a physical network device of the networking infrastructure, wherein the physical network device comprises at least one of a network interface card installed in a server node and a computational switch device which is network connected to the one or more server nodes of the computing system.
  - 14. The article of manufacture of claim 11, wherein executing the parameter server within the networking infrastructure of the computing system comprises executing a parameter server node on a virtual network element connected to or executing on a server node in the computing system, wherein the virtual network element comprises one of a virtual network interface card and a virtual switch.
  - 15. The article of manufacture of claim 11, wherein executing the parameter server within the networking infrastructure of the computing system comprises distributing the parameter server over a plurality of network elements within the networking infrastructure of the computing system.

16. A computing system, comprising:
- a server cluster comprising a plurality of server nodes, wherein the server nodes comprise accelerator devices configured to execute a plurality of worker nodes to perform a distributed deep learning (DL) model training process to train model parameters of a DL model; and
  
  networking infrastructure to network connect the plurality of server nodes within the server cluster, wherein the networking infrastructure is configured to execute a parameter server which aggregates local model parameters computed by the plurality of worker nodes and distributes aggregated model parameters to the plurality of worker nodes using the networking infrastructure of the computing system.
  - 17. The computing system of claim 16, wherein the plurality of worker nodes comprise virtual worker nodes that execute on the accelerator devices.
  - 18. The computing system of claim 16, wherein the parameter server executes on a physical network device of the networking infrastructure, wherein the physical network device comprises at least one of a network interface card installed in a server node and a computational switch device which is network connected to the server nodes.
  - 19. The computing system of claim 16, wherein the parameter server executes on a virtual network element connected to or executing on a server node in the computing system, wherein the virtual network element comprises one of a virtual network interface card and a virtual switch.
  - 20. The computing system of claim 16, wherein the parameter server is distributed over a plurality of network elements of the networking infrastructure of the computing system.

Current Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee
EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors
Savic, Dragan; Zhao, Junping

Granted Patent

US 11,315,013 B2
CPC Class Codes

G06F 2009/45579   I/O management, e.g. provid...

G06F 2009/45595   Network integration; Enabli...

G06F 8/31   Programming languages or pr...

G06F 9/44505   Configuring for program ini...

G06F 9/45541   Bare-metal, i.e. hypervisor...

G06F 9/45558   Hypervisor-specific managem...

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/08   Learning methods

G06N 3/084   Backpropagation, e.g. using...

H04L 67/1095   Replication or mirroring of...
