Training Method, Apparatus, and Chip for Neural Network Model
First Claim
1. A training method for a neural network model applied to a training system, wherein the training method comprises:
- determining, by each of at least one of the M processor cores for each layer of the L layers of the neural network model, a model training mode of the layer based on an estimated data volume of a model parameter set of the layer and an estimated data volume of output data of the layer, wherein the training system comprises the M processor cores, and wherein M and L are integers greater than or equal to 1; and
- performing, by each of the M processor cores, training on the layer using the determined model training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode.
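The claim compares two estimated data volumes per layer but does not fix the decision rule. A minimal sketch of one plausible rule, assuming hypothetical byte counts and the simple heuristic that a layer whose parameter set is larger than its output data is cheaper to train model-parallel (exchanging activations costs less than synchronizing parameters), and data-parallel otherwise:

```python
def choose_training_mode(param_bytes: int, output_bytes: int) -> str:
    """Pick a per-layer training mode from estimated data volumes.

    If the model parameter set is larger than the layer's output data,
    synchronizing parameters across cores (data parallel) would cost more
    communication than exchanging activations, so model parallel is
    preferred; otherwise data parallel is preferred.
    """
    return "model_parallel" if param_bytes > output_bytes else "data_parallel"


# Hypothetical estimates for a 3-layer model: (parameter bytes, output bytes).
layers = [
    (4_000_000, 200_000),   # large fully connected layer: many parameters
    (50_000, 1_000_000),    # small convolutional layer: big feature map
    (800_000, 800_000),     # borderline layer: tie goes to data parallel
]

modes = [choose_training_mode(p, o) for p, o in layers]
# → ["model_parallel", "data_parallel", "data_parallel"]
```

The threshold (a strict comparison) and the byte estimates are assumptions for illustration; the claim only requires that the mode be determined from the two estimated volumes.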
Abstract
A training method, apparatus, and chip for a neural network model include determining a model training mode of each layer based on an estimated data volume of a model parameter set and an estimated data volume of output data of the layer, obtaining second output data that is obtained by m worker modules by training a (j−1)th layer, and, when a model parallel training mode is used for a jth layer, directly obtaining, by a worker module, a global gradient of a model parameter by training the model parameter based on the second output data.
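The abstract's distinction is in how the global gradient is reached: under data parallel each worker trains the full parameter set on its own data slice and the local gradients must then be aggregated, whereas under model parallel each worker trains only its own partition of the parameters against the gathered second output data, so its result is directly the global gradient for that partition. A toy sketch with assumed data, worker count, and a simplified stand-in gradient (feature sum, i.e. xᵀ·error with the error fixed at 1):

```python
m = 4  # number of worker modules

# Second output data of the (j-1)th layer: one mini-batch slice per worker,
# each sample a list of 6 feature values (hypothetical toy numbers).
slices = [[[float(w * 10 + s + f) for f in range(6)] for s in range(2)]
          for w in range(m)]


def grad(samples, feature_idx):
    # Simplified gradient of one parameter: sum of its feature over the
    # samples (stands in for x^T @ error with the error signal fixed at 1).
    return sum(sample[feature_idx] for sample in samples)


# Data parallel: each worker computes a local gradient over the FULL
# parameter set from its OWN slice; the global gradient only exists after
# aggregating the local gradients across workers.
local = [[grad(slices[w], f) for f in range(6)] for w in range(m)]
global_dp = [sum(local[w][f] for w in range(m)) for f in range(6)]

# Model parallel: the full batch (all slices gathered) is trained against a
# PARTITION of the parameters per worker; each worker's result is directly
# the global gradient for its partition, with no further aggregation.
full_batch = [s for sl in slices for s in sl]
partitions = [[0, 1], [2], [3, 4], [5]]  # hypothetical parameter split
global_mp = {f: grad(full_batch, f) for part in partitions for f in part}

# Both paths reach the same global gradient.
assert all(abs(global_dp[f] - global_mp[f]) < 1e-9 for f in range(6))
```

The model-parallel path saves the parameter-gradient aggregation step at the cost of gathering the second output data, which is why the mode choice hinges on the relative data volumes.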
20 Claims
1. A training method for a neural network model applied to a training system, wherein the training method comprises:
determining, by each of at least one of the M processor cores for each layer of the L layers of the neural network model, a model training mode of the layer based on an estimated data volume of a model parameter set of the layer and an estimated data volume of output data of the layer, wherein the training system comprises the M processor cores, and wherein M and L are integers greater than or equal to 1; and performing, by each of the M processor cores, training on the layer using the determined model training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode. (Dependent claims: 2, 3, 4, 5, 6, 7)
8. A training apparatus for a neural network model, wherein the training apparatus comprises:
a memory configured to store instructions; a processor coupled to the memory and configured to execute the instructions, wherein the processor comprises at least one processor core; and a transceiver coupled to the processor and the memory, wherein the training apparatus is applicable to a training system that comprises M processor cores, wherein the neural network model comprises L layers, wherein M and L are integers greater than or equal to 1, wherein for each layer of the L layers, the at least one processor core is used to train the layer, wherein the processor is configured to control the transceiver to transmit data to a second processor core in the M processor cores, and wherein the instructions cause each of the at least one processor core to be configured to: determine a model training mode of the layer based on an estimated data volume of a model parameter set of the layer and an estimated data volume of output data of the layer, wherein the training system comprises the M processor cores; and perform training on the layer using the determined model training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode. (Dependent claims: 9, 10, 11, 12, 13, 14)
15. A training chip for a neural network model, applicable to a training system that comprises M chips, wherein the neural network model comprises L layers, wherein each of the M chips comprises at least one processor core, and wherein each of the M chips is configured to:
determine, by each of at least one of the M processor cores for each layer of the L layers of the neural network model, a model training mode of the layer based on an estimated data volume of a model parameter set of the layer and an estimated data volume of output data of the layer, wherein the training system comprises the M processor cores, and wherein M and L are integers greater than or equal to 1; and perform, by each of the M processor cores, training on the layer using the determined model training mode, wherein the determined model training mode comprises at least one of a data parallel training mode or a model parallel training mode. (Dependent claims: 16, 17, 18, 19, 20)
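Because the mode is determined per layer, a single training run can mix both modes across the L layers, which is what "comprises at least one of a data parallel training mode or a model parallel training mode" permits. A sketch of a per-layer plan over hypothetical volume estimates, assuming the same parameter-vs-output comparison as the decision rule and a fixed core count:

```python
def plan_training(layer_estimates, m_cores):
    """Walk the L layers in order and record, for each layer j, the mode
    chosen from its estimated data volumes and the cores assigned to it.
    A planning sketch only; no training is performed here.
    """
    plan = []
    for j, (param_bytes, output_bytes) in enumerate(layer_estimates, start=1):
        mode = ("model_parallel" if param_bytes > output_bytes
                else "data_parallel")
        plan.append((j, mode, m_cores))
    return plan


# Two layers with opposite volume profiles yield a mixed-mode plan.
plan = plan_training([(4_000_000, 200_000), (50_000, 1_000_000)], m_cores=8)
# → [(1, "model_parallel", 8), (2, "data_parallel", 8)]
```

The estimates, core count, and tie-breaking are assumptions for illustration; the claims require only that each layer's mode be determined from the two estimated data volumes and that all M processor cores then train the layer in that mode.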
Specification