NEURAL NETWORK UNIT WITH MEMORY LAYOUT TO PERFORM EFFICIENT 3-DIMENSIONAL CONVOLUTIONS

US 20180157962A1
Filed: 12/01/2016
Published: 06/07/2018
Est. Priority Date: 12/01/2016
Status: Active Grant

Abstract

A neural network unit convolves an H×W×C input with F R×S×C filters to generate F Q×P outputs. N processing units (PU) each have a register receiving a respective word of an N-word row of a second memory and multiplexed-register selectively receiving a respective word of an N-word row of a first memory or word rotated from an adjacent PU multiplexed-register. H first memory rows hold input blocks of B words each of channels of respective 2-dimensional input row slices. R×S×C second memory rows hold filter blocks of B words each holding P copies of a filter weight. B is the smallest factor of N greater than W. The PU blocks multiply-accumulate input blocks and filter blocks in column-channel-row order; they read a row of input blocks and rotate it around the N PUs while performing multiply-accumulate operations so each PU block receives each input block before reading another row.
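
The block-size rule stated above (B is the smallest factor of N greater than W, with each N-word row split into G blocks of B words) can be illustrated with a short sketch. The function name and the example values of N and W are hypothetical and chosen only for illustration; G = N / B follows from partitioning N words into blocks of B words.

```python
def smallest_factor_greater_than(n: int, w: int) -> int:
    """Return the smallest factor of n that is strictly greater than w."""
    for b in range(w + 1, n + 1):
        if n % b == 0:
            return b
    raise ValueError("w must be smaller than n")


# Hypothetical example values (the claims only require N to be at least 512).
N, W = 1024, 12
B = smallest_factor_greater_than(N, W)  # words per block
G = N // B                              # blocks per N-word row
print(f"B = {B}, G = {G}")
```

Because B is a factor of N, the G blocks tile each memory row exactly, and because B exceeds W, one block has room for a full row of W input words of a single channel.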

21 Claims

1. A neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising:
- a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each;
  
  a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each;
  
  wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512;
  
  an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each;
  
  wherein the input blocks are held in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
  
  wherein the filter blocks are held in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
  
  wherein to convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, wherein the G PU blocks read a row of the H rows of the at least C input blocks from the first memory and rotate the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
  - 2. The neural network unit of claim 1, wherein to rotate the row around the N PUs while performing the portion of the multiply-accumulate operations such that each of the F PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory, the G PU blocks:
    - (a) perform S rotate-by-one and multiply-accumulate operations to accumulate a column-sum in each of the N accumulators;
      
      (b) perform one or more rotates to align the input blocks with the next adjacent PU block; and
      
      (c) repeat operations (a) and (b) for each of the at least C input blocks of the row to accumulate a column-channel-sum in each of the N accumulators.
  - 3. The neural network unit of claim 2, wherein to perform the multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, the G PU blocks further:
    - (d) perform the read of a row of the H rows of the at least C input blocks from the first memory;
      
      (e) perform operations (a), (b) and (c);
      
      (f) repeat operations (d) and (e) R times for a group of R of the H rows to accumulate a column-channel-row-sum in each of the N accumulators; and
      
      (g) write the N column-channel-row-sums to a row of the first or second memory.
  - 4. The neural network unit of claim 3, wherein to perform the multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, the G PU blocks further:
    - (h) repeat operations (d) through (g) Q times for Q different groups of R of the H rows.
  - 5. The neural network unit of claim 2, wherein to perform each of the S multiply-accumulate operations of operation (a), the G PU blocks further:
    - read a row of weights from the second memory for use in the multiply-accumulate operation.
  - 6. The neural network unit of claim 5, wherein for each PU block of the F PU blocks, the respective channel that specifies the row of words of the 2-dimensional slice held by the input block on which the S rotate-by-one and multiply accumulate operations of operation (a) are being performed corresponds to the respective channel of the filter block of the F of the G filter blocks of the row read from the second memory that holds the P copies of the weight.
  - 7. The neural network unit of claim 5, wherein the respective channel of each filter block of the F of the G filter blocks of the row of weights read from the second memory for use in the S multiply-accumulate operations of operation (a) corresponds to the respective channel of the input block rotated to align with the PU block performing the S multiply-accumulate operations of operation (a).
  - 8. The neural network unit of claim 1, wherein the at least C input blocks of the row include C input blocks each having a respective channel of the C channels and J gap input blocks;
    - and wherein J is G modulo C.
  - 9. The neural network unit of claim 8, wherein in addition to the R×S×C rows of filter blocks, the second memory holds an additional R×S×J rows of filter blocks;
    - and wherein the R×S×(C+J) rows of filter blocks include R×S×J×F gap filter blocks.
  - 10. The neural network unit of claim 9, wherein the gap filter blocks have zero values to cause zero contribution to the multiply-accumulate operations.
  - 11. The neural network unit of claim 8, wherein the J gap input blocks have zero values to cause zero contribution to the multiply-accumulate operations.
  - 12. The neural network unit of claim 1, wherein the input blocks of each of the H rows includes K copies of C input blocks each having a respective channel of the C channels;
    - and wherein K is a floor function of a quotient of G divided by C.
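
The block counts in claims 8 and 12 (J = G modulo C gap input blocks, K = floor(G / C) copies of the C channel blocks) can be checked with a small sketch. Combining both claims in one row, placing the gap blocks after the copies, and the `ch…`/`gap` labels are assumptions made here for illustration; the claims fix the counts, not the positions.

```python
def block_row_layout(G: int, C: int) -> list:
    """Label the G input blocks of one first-memory row.

    Assumption: K copies of the C channel blocks come first and the
    J gap blocks trail them; only the counts come from the claims.
    """
    K = G // C  # copies of the C channel blocks (claim 12)
    J = G % C   # gap input blocks (claim 8)
    row = [f"ch{c}" for _ in range(K) for c in range(C)] + ["gap"] * J
    assert len(row) == G  # K * C + J always equals G
    return row


# Hypothetical example: 64 blocks per row, 20 channels.
row = block_row_layout(G=64, C=20)
print(row.count("ch0"), row.count("gap"))  # copies of channel 0, gap blocks
```

Since K × C + J = G for any G and C, the channel copies and gap blocks together always fill the row of G blocks exactly.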

13. A method for operating a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each, a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each, wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512, and an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each, the method comprising:
- storing the input blocks in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
  
  storing the filter blocks in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
  
  wherein said convolving the input with the filters comprises:
  
  performing, by the G PU blocks, multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order; and
  
  reading, by the G PU blocks, a row of the H rows of the at least C input blocks from the first memory and rotating the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
  - 14. The method of claim 13, wherein said rotating the row around the N PUs while performing the portion of the multiply-accumulate operations such that each of the F PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory comprises:
    - (a) performing, by the G PU blocks, S rotate-by-one and multiply-accumulate operations to accumulate a column-sum in each of the N accumulators;
      
      (b) performing, by the G PU blocks, one or more rotates to align the input blocks with the next adjacent PU block; and
      
      (c) repeating operations (a) and (b) for each of the at least C input blocks of the row to accumulate a column-channel-sum in each of the N accumulators.
  - 15. The method of claim 14, wherein said performing the multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, further comprises:
    - (d) performing, by the G PU blocks, the read of a row of the H rows of the at least C input blocks from the first memory;
      
      (e) performing operations (a), (b) and (c);
      
      (f) repeating operations (d) and (e) R times for a group of R of the H rows to accumulate a column-channel-row-sum in each of the N accumulators; and
      
      (g) writing the N column-channel-row-sums to a row of the first or second memory.
  - 16. The method of claim 15, wherein said performing the multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order further comprises:
    - (h) repeating operations (d) through (g) Q times for Q different groups of R of the H rows.
  - 17. The method of claim 14, wherein said performing each of the S multiply-accumulate operations of operation (a) further comprises:
    - reading, by the G PU blocks, a row of weights from the second memory for use in the multiply-accumulate operation.
  - 18. The method of claim 17, wherein for each PU block of the F PU blocks, the respective channel that specifies the row of words of the 2-dimensional slice held by the input block on which the S rotate-by-one and multiply accumulate operations of operation (a) are being performed corresponds to the respective channel of the filter block of the F of the G filter blocks of the row read from the second memory that holds the P copies of the weight.
  - 19. The method of claim 17, wherein the respective channel of each filter block of the F of the G filter blocks of the row of weights read from the second memory for use in the S multiply-accumulate operations of operation (a) corresponds to the respective channel of the input block rotated to align with the PU block performing the S multiply-accumulate operations of operation (a).
  - 20. The method of claim 13, wherein the at least C input blocks of the row include C input blocks each having a respective channel of the C channels and J gap input blocks;
    - and wherein J is G modulo C.
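
Operations (a) through (h) of claims 14 to 16 can be modeled in software as a rough sketch. The simulation below is not the patented hardware: it assumes G equals C (so no gap blocks), F ≤ G, stride 1 with no padding (so Q = H − R + 1 and P = W − S + 1), a rotate direction in which each PU takes the word of the PU with the next higher index, and toy sizes far below the claimed N ≥ 512. The names `rotate_by_one` and `nnu_convolve` are hypothetical.

```python
def rotate_by_one(words):
    """Each PU takes the word held by its neighbor at index + 1 (wrapping)."""
    return words[1:] + words[:1]


def nnu_convolve(inp, filters, N, B):
    """inp[h][w][c], filters[f][r][s][c] -> out[f][q][p] (assumptions above)."""
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    F, R, S = len(filters), len(filters[0]), len(filters[0][0])
    G = N // B
    assert N % B == 0 and W < B and G == C and F <= G
    Q, P = H - R + 1, W - S + 1

    # First memory: row h holds one zero-padded B-word block per channel.
    first_memory = [
        [inp[h][w][c] if w < W else 0 for c in range(C) for w in range(B)]
        for h in range(H)
    ]

    out = [[None] * Q for _ in range(F)]
    for q in range(Q):                        # (h) Q groups of R input rows
        acc = [0] * N                         # one accumulator per PU
        for r in range(R):                    # (f) R rows per group
            mux = list(first_memory[q + r])   # (d) read one input row
            for k in range(C):                # (c) once per input block
                for s in range(S):            # (a) S MACs, each with a rotate
                    for f in range(F):
                        # PU block f currently sees channel (f + k) % C.
                        weight = filters[f][r][s][(f + k) % C]
                        for p in range(P):    # P PUs share the same weight
                            acc[f * B + p] += mux[f * B + p] * weight
                    mux = rotate_by_one(mux)
                for _ in range(B - S):        # (b) align with the next block
                    mux = rotate_by_one(mux)
        for f in range(F):                    # (g) write the finished sums
            out[f][q] = acc[f * B : f * B + P]
    return out
```

In this model each block alignment costs exactly B rotates (S during the multiply-accumulates plus B − S to realign), so after C alignments every PU block has seen every channel block of the row once, which is the condition the claims place on the rotation before the next row is read.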

21. A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, the computer program product comprising:
- computer usable program code embodied in said medium, for specifying a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the computer usable program code comprising:
  
  first program code for specifying a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each;
  
  second program code for specifying a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each;
  
  wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512;
  
  third program code for specifying an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each;
  
  wherein the input blocks are held in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
  
  wherein the filter blocks are held in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
  
  wherein to convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, wherein the G PU blocks read a row of the H rows of the at least C input blocks from the first memory and rotate the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
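
The second-memory layout recited in the claims (R×S×C rows, each holding filter blocks with P copies of one weight) can be sketched as follows. This is one possible arrangement under stated assumptions, not the layout of the specification: G equals C, the first F blocks of each row carry the F filters, unused words are zero, rows are ordered by filter row, then block alignment k, then filter column, and the channel offset `(g + k) % C` is modeled on the channel correspondence described in claims 6 and 7. The function name `build_filter_rows` is hypothetical.

```python
def build_filter_rows(filters, N, B, P):
    """filters[f][r][s][c] -> list of R*S*C rows of N words each."""
    F, R = len(filters), len(filters[0])
    S, C = len(filters[0][0]), len(filters[0][0][0])
    G = N // B
    assert N % B == 0 and P <= B and F <= G and G == C

    rows = []
    for r in range(R):              # filter row
        for k in range(C):          # k-th alignment of the rotating input row
            for s in range(S):      # filter column
                row = []
                for g in range(G):
                    if g < F:
                        # Assumed: block g is paired with channel (g + k) % C.
                        weight = filters[g][r][s][(g + k) % C]
                        row += [weight] * P + [0] * (B - P)
                    else:
                        row += [0] * B  # block not assigned to any filter
                rows.append(row)
    return rows
```

Storing P copies of each weight lets the P processing units that produce one output row of a filter read the same weight in the same cycle, at the cost of using R×S×C full rows of the second memory.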

Current Assignee: Via Alliance Semiconductor Co., Ltd.
Original Assignee: Via Alliance Semiconductor Co., Ltd.
Inventors: HENRY, G. GLENN; HOUCK, KIM C.

Granted Patent: US 10,438,115 B2

CPC Class Codes

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/063   using electronic means
