HARDWARE ACCELERATOR FOR COMPRESSED RNN ON FPGA
Abstract
The present invention relates to recurrent neural networks (RNNs). In particular, the present invention relates to how to implement and accelerate a recurrent neural network based on an embedded FPGA. Specifically, it proposes an overall processing method covering matrix decoding, matrix-vector multiplication, vector accumulation and activation-function computation. In another aspect, the present invention proposes an overall hardware design to implement and accelerate the above process.
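For orientation, the recurrence being accelerated is the standard RNN hidden-layer update implied by the claims below: ht = f(Whx·xt + Whh·ht−1 + b). A minimal NumPy sketch of one step follows; the activation f, the bias b, and the zero initial state are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def rnn_step(x_t, h_prev, Whx, Whh, b, f=np.tanh):
    """One hidden-layer update: ht = f(Whx @ xt + Whh @ h(t-1) + b)."""
    return f(Whx @ x_t + Whh @ h_prev + b)

# A length-T input sequence x = (x1, ..., xT) yields activations h1, ..., hT.
Whx, Whh, b = np.random.randn(8, 4), np.random.randn(8, 8), np.zeros(8)
h = np.zeros(8)
for x_t in np.random.randn(3, 4):
    h = rnn_step(x_t, h, Whx, Whh, b)
```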
16 Claims
1. A device for implementing compressed RNN (recurrent neural network), said device comprising:
- a receiving unit, which is used to receive a plurality of input vectors and distribute them to a plurality of processing elements (PE);
- a plurality of processing elements (PE), each of which comprises:
  - a reading unit configured to read weight matrices W, wherein said W indicates the weights of said RNN;
  - an ALU configured to perform the multiplication and addition calculations of said weight matrices W;
  - a calculation buffer configured to store intermediate results of matrix-vector multiplication and output results to an assembling unit;
- an assembling unit configured to receive results from the PEs and assemble them into a complete result vector; and
- a controller unit configured for controlling said plurality of processing elements.

Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
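As a rough software model of the claim-1 dataflow, the sketch below assumes the receiving unit broadcasts the input vector while the rows of W are partitioned across the PEs; the PE count and the row-wise partitioning are illustrative assumptions, since the claim does not fix a scheme.

```python
import numpy as np

def device_matvec(W, x, num_pes=4):
    # Receiving unit: broadcast input vector x; split W row-wise over the PEs
    # (row-wise partitioning is an assumption; the claim does not specify one).
    row_blocks = np.array_split(W, num_pes, axis=0)
    partials = []
    for pe_rows in row_blocks:                # one iteration per PE
        acc = np.zeros(pe_rows.shape[0])      # calculation buffer
        for j, x_j in enumerate(x):           # ALU: multiply and add
            acc += pe_rows[:, j] * x_j
        partials.append(acc)                  # buffer output to assembling unit
    # Assembling unit: concatenate PE results into the complete result vector.
    return np.concatenate(partials)
```

Row-wise partitioning lets each PE produce a disjoint slice of the result, so the assembling unit only needs to concatenate the partial vectors.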
10. A method for implementing compressed RNN based on FPGA, comprising:
- a) receiving data from off-chip memory and storing the data into on-chip memory of the FPGA, wherein said data are related to RNN computation, including the input vector, the bias vector and the weight matrices;
- b) decoding the data received in step a) using the FPGA on-chip processor in order to obtain the real weights, and storing the real weights into FPGA on-chip memory;
- c) matrix computing, by performing matrix-vector multiplication using FPGA on-chip processing elements and storing the result into FPGA on-chip memory;
- d) vector accumulating, by performing vector accumulation using FPGA on-chip processing elements and storing the results into FPGA on-chip memory, said vectors including both the resultant vector obtained in step c) and said bias vector;
- e) activation function computing, by performing an activation function on the result of step d) and storing the result into FPGA on-chip memory;
- iterating the above steps a), b), c), d), e) to obtain the RNN's activation sequences, and computing the RNN's output sequence according to the activation sequences.

Dependent claims: 11, 12, 13.
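A software sketch of steps a) through e) follows, using a CSR-style sparse encoding as a stand-in for the patent's compressed weight format and splitting the weights into Whx and Whh as in claim 14; the encoding, names and activation are assumptions, since the actual compressed format is defined in the specification.

```python
import numpy as np

def decode_csr(values, col_idx, row_ptr, shape):
    """Step b) stand-in: expand compressed weights into the 'real' dense matrix."""
    W = np.zeros(shape)
    for i in range(shape[0]):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            W[i, col_idx[k]] = values[k]
    return W

def rnn_forward(enc_Whx, enc_Whh, x_seq, bias, h0, f=np.tanh):
    """enc_* are (values, col_idx, row_ptr, shape) tuples for one matrix each."""
    Whx = decode_csr(*enc_Whx)        # b) decode to obtain the real weights
    Whh = decode_csr(*enc_Whh)
    h, activations = h0, []
    for x_t in x_seq:                 # iterate steps a)-e) over the sequence
        v = Whx @ x_t + Whh @ h       # c) matrix-vector multiplication
        v = v + bias                  # d) vector accumulation with the bias
        h = f(v)                      # e) activation function
        activations.append(h)
    return activations
```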
14. A method for implementing a Recurrent Neural Network (RNN), wherein the weights of said RNN are characterized by Whh and Whx, where Whh is the weight matrix of the hidden layers of said RNN and Whx is the weight matrix applied to the input of the hidden layers, where ht is the activation to be applied to an input vector by said hidden layers, and the input of said RNN is a series of input vectors x = (x1, x2, . . . , xT), said method comprises:
- an initialization step of reading the data necessary for computing Whx·x into an FPGA on-chip memory, said data including the input vectors x and Whx, where Whx is the weight matrix to be applied to said input vector x;
- step 1 of computing Whx·x by processing elements of said FPGA, and reading the data necessary for computing Whh·ht−1 into the FPGA on-chip memory;
- step 2 of computing Whh·ht−1 by processing elements of said FPGA, where ht−1 is the activation applied to the previous input vector by the hidden layer, and reading the data necessary for computing the next Whx·x into the FPGA on-chip memory; and
- iteratively repeating said step 1 and step 2.

Dependent claims: 15, 16.
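Claim 14 overlaps computation with data movement: while the PEs compute one matrix-vector product, the operands of the next product are read into on-chip memory. The sequential sketch below models that ping-pong schedule; on the FPGA the "compute" and "read" halves of each step run concurrently, and the elementwise activation f and zero initial state are assumptions.

```python
import numpy as np

def pipelined_rnn(Whx, Whh, x_seq, f=np.tanh):
    h = np.zeros(Whh.shape[0])
    x_buf = x_seq[0]                 # initialization: read x and Whx data on chip
    activations = []
    for t in range(len(x_seq)):
        wx = Whx @ x_buf             # step 1: compute Whx*x ...
        #   ... while reading data for Whh*h(t-1) into on-chip memory
        wh = Whh @ h                 # step 2: compute Whh*h(t-1) ...
        if t + 1 < len(x_seq):
            x_buf = x_seq[t + 1]     #   ... while prefetching the next x
        h = f(wx + wh)               # activation for this time step
        activations.append(h)
    return activations
```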
Specification