Asynchronous stochastic gradient descent
Abstract
The example computer-implemented method may comprise computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners. The method may also comprise generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient. The method may further comprise performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence. Additionally, the method may comprise updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence.
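The abstract does not pin down a concrete reduction or update rule, so the following is only a minimal per-learner sketch of one plausible reading: each learner forms its (gradient, weight index, mass) triple, the allreduce sums the mass-weighted gradients and the masses across all peers, and every learner applies the same mass-averaged update. The `compute_gradient` function, the least-squares loss, and the learning rate are hypothetical stand-ins, and mpi4py is assumed to supply the peer-to-peer allreduce; none of these are specified by the patent.

```python
# Minimal per-learner sketch, assuming mpi4py provides the peer-to-peer
# allreduce. compute_gradient(), the least-squares loss, and the learning
# rate are hypothetical stand-ins, not taken from the patent.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD  # all learners are peers; no parameter server

def compute_gradient(w, batch):
    """Hypothetical loss: gradient of mean squared error on one mini-batch."""
    X, y = batch
    return 2.0 * X.T @ (X @ w - y) / len(y)

def asgd_step(w, weight_index, batch, lr=0.01):
    # Generator: compute the local gradient and form the triple
    # (gradient, weight index, mass); with a single mini-batch the mass is
    # just the number of observations in it.
    grad = compute_gradient(w, batch)
    mass = float(len(batch[1]))
    triple = (grad, weight_index, mass)

    # Reconciler: allreduce so every learner holds the same sums of
    # mass-weighted gradients and of masses.
    weighted = triple[0] * triple[2]
    summed = np.empty_like(weighted)
    comm.Allreduce(weighted, summed, op=MPI.SUM)
    total_mass = comm.allreduce(triple[2], op=MPI.SUM)

    # Identical mass-averaged update on every learner; the new current
    # weight (index + 1) is what the next mini-batch is computed against.
    return w - lr * summed / total_mass, weight_index + 1
```

Launched with, e.g., `mpirun -np 4 python learner.py`, each rank plays one learner; the uppercase `Allreduce` reduces the gradient buffer while the lowercase `allreduce` handles the scalar mass.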
18 Claims
1. A computer-implemented method for asynchronous stochastic gradient descent, the method comprising:
computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server;
generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch;
performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and
updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
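Claim 1 fixes the triple arithmetic (the mass equals the number of mini-batches times the number of observations per mini-batch) but not the update rule. Below is a toy, single-process illustration with invented numbers of how such a mass could weight the reconciled gradient average so that every learner derives the same new current weight; the mass-averaged update itself is an assumption, not language from the claim.

```python
# Toy single-process illustration (no MPI, invented numbers) of the triple
# arithmetic: mass = mini-batches x observations per mini-batch, and the
# reconciled update is the mass-weighted average of the learners' gradients.
import numpy as np

# Three learners' triples: (gradient, weight index, mass).
triples = [
    (np.array([0.2, -0.1]), 7, 2 * 32),  # 2 mini-batches of 32 observations
    (np.array([0.4,  0.0]), 7, 1 * 32),  # 1 mini-batch of 32 observations
    (np.array([0.1,  0.3]), 7, 1 * 32),
]

# After the allreduce, every learner holds these same sums.
summed_grad = sum(g * m for g, _, m in triples)
total_mass = sum(m for _, _, m in triples)

w, lr = np.array([1.0, 1.0]), 0.1          # current weight at index 7
w_new = w - lr * summed_grad / total_mass  # identical on every learner
print(w_new, "-> new current weight, index 8")
```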
10. A system for asynchronous stochastic gradient descent, the system comprising:
a processor in communication with one or more types of memory, the processor configured to:
compute, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server;
generate, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch;
perform, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and
update, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor.
Dependent claims: 11, 12, 13, 14, 15, 16, 17.
18. A computer program product for asynchronous stochastic gradient descent, the computer program product comprising:
a non-transitory storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:
computing, by a generator processor on each of a plurality of learners, a gradient for a mini-batch using a current weight at each of the plurality of learners, the current weight being uniquely identified by a weight index of each of the plurality of learners, wherein the plurality of learners are arranged in a peer-to-peer arrangement without a parameter server;
generating, by the generator processor on each of the plurality of learners, a plurality of triples, wherein each of the triples comprises the gradient, the weight index of the current weights used to compute the gradient, and a mass of the gradient, the mass equaling a number of mini-batches used to generate the gradient times a number of observations in the mini-batch;
performing, by a reconciler processor on each of the plurality of learners, an allreduce operation on the plurality of triples to obtain an allreduced triple sequence; and
updating, by the reconciler processor on each of the plurality of learners, the current weight at each of the plurality of learners to a new current weight using the allreduced triple sequence, wherein the new current weight becomes the current weight for a next processing batch to be computed by the generator processor.
Specification