NEURAL NETWORK UNIT WITH MEMORY LAYOUT TO PERFORM EFFICIENT 3-DIMENSIONAL CONVOLUTIONS
Abstract
A neural network unit convolves an H×W×C input with F R×S×C filters to generate F Q×P outputs. Each of N processing units (PU) has a register that receives a respective word of an N-word row of a second memory and a multiplexed-register that selectively receives either a respective word of an N-word row of a first memory or a word rotated from the multiplexed-register of an adjacent PU. H first-memory rows hold input blocks of B words each, one block per channel of the respective 2-dimensional input row slice. R×S×C second-memory rows hold filter blocks of B words each, each block holding P copies of a filter weight. B is the smallest factor of N greater than W. The PU blocks multiply-accumulate input blocks and filter blocks in column-channel-row order: they read a row of input blocks and rotate it around the N PUs while performing multiply-accumulate operations, so that each PU block receives each input block before another row is read.
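The block geometry described in the abstract can be illustrated with a short calculation. The sketch below is not taken from the patent: the function names are hypothetical, the sample values (N = 1024, a 12×12 input, 5×5 filters) are chosen only for illustration, and the relations Q = H - R + 1 and P = W - S + 1 assume a stride-1 convolution without padding, which the text above does not state.

```python
def smallest_factor_greater_than(n: int, w: int) -> int:
    """Return the smallest factor of n that is strictly greater than w."""
    for b in range(w + 1, n + 1):
        if n % b == 0:
            return b
    raise ValueError("n has no factor greater than w")


def block_geometry(n: int, w: int) -> tuple:
    """Return (B, G): words per block and number of blocks per N-word row."""
    b = smallest_factor_greater_than(n, w)
    return b, n // b


def output_shape(h: int, w: int, r: int, s: int) -> tuple:
    """Return (Q, P), ASSUMING stride 1 and no padding (not stated in the source)."""
    return h - r + 1, w - s + 1


if __name__ == "__main__":
    # Illustrative values only.
    print(block_geometry(1024, 12))    # B and G for N = 1024, W = 12
    print(output_shape(12, 12, 5, 5))  # Q and P for a 12x12 input, 5x5 filter
```

Because B is a factor of N, each N-word row divides into exactly G = N / B blocks, and because B exceeds W, one row of one channel fits in a single block with at least one word to spare.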
21 Claims
1. A neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising:
a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each;
a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each;
wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512;
an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each;
wherein the input blocks are held in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
wherein the filter blocks are held in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
wherein to convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, wherein the G PU blocks read a row of the H rows of the at least C input blocks from the first memory and rotate the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
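The rotation property recited in the last element of claim 1 (each PU block receives each input block before another row is read) can be checked with a small model. This is a sketch under assumptions, not the claimed hardware: the helper names are hypothetical, each word is replaced by the index of the input block it belongs to, and the rotation direction (each PU taking the word of its higher-numbered neighbor) is an arbitrary choice, since the claim only says "logically adjacent".

```python
def rotate_left(words: list, k: int) -> list:
    """Return a copy in which position j holds words[(j + k) % len(words)]."""
    k %= len(words)
    return words[k:] + words[:k]


def blocks_seen_by_each_pu_block(n: int, b: int) -> list:
    """For each PU block, collect the input-block indices it receives
    while one N-word row is rotated all the way around the N PUs."""
    g_count = n // b
    # Tag every word of the row with the index of its input block.
    row = [j // b for j in range(n)]
    seen = [set() for _ in range(g_count)]
    for g in range(g_count):
        # After g * b single-word rotations the row has advanced by g blocks.
        mux = rotate_left(row, g * b)
        for f in range(g_count):
            seen[f].add(mux[f * b])
    return seen


if __name__ == "__main__":
    seen = blocks_seen_by_each_pu_block(1024, 16)  # illustrative N and B
    print(all(s == set(range(64)) for s in seen))
```

In this model a full trip around the array costs N single-word rotations and a single read of the first memory, which is the data-reuse property the claim describes.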
13. A method for operating a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each, a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each, wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512, and an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each, the method comprising:
storing the input blocks in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
storing the filter blocks in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
wherein said convolving the input with the filters comprises:
performing, by the G PU blocks, multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order; and
reading, by the G PU blocks, a row of the H rows of the at least C input blocks from the first memory and rotating the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
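The storing, rotating and multiply-accumulate steps of claim 13 can be modeled end to end in software. The following is a simplified sketch, not the patented implementation, and every name in it is hypothetical. It assumes a stride-1 convolution without padding, a leftward rotation, and one weight row per (filter row, block rotation, column) step with zero weights wherever a PU block currently holds an unused input block. That gives R×S×G weight rows, which matches the claimed R×S×C only when C equals G; how the patent handles unused blocks is not reproduced here. A tiny N is used for readability even though the claim requires N of at least 512.

```python
def smallest_factor_greater_than(n, w):
    for b in range(w + 1, n + 1):
        if n % b == 0:
            return b
    raise ValueError("n has no factor greater than w")


def rotate_left(words, k):
    k %= len(words)
    return words[k:] + words[:k]


def simulate_nnu_conv(x, filt, n):
    """x[h][w][c] is the input, filt[f][r][s][c] the filters; returns out[f][q][p]."""
    H, W, C = len(x), len(x[0]), len(x[0][0])
    F, R, S = len(filt), len(filt[0]), len(filt[0][0])
    B = smallest_factor_greater_than(n, W)
    G = n // B
    if C > G or F > G:
        raise ValueError("this sketch needs C <= G and F <= G")
    Q, P = H - R + 1, W - S + 1  # ASSUMPTION: stride 1, no padding

    # First memory: row h holds, in block c, row h of channel c (zero padded).
    data_mem = [[0] * n for _ in range(H)]
    for h in range(H):
        for c in range(C):
            for w in range(W):
                data_mem[h][c * B + w] = x[h][w][c]

    # Second memory: block f of row (r, g, s) holds P copies of the weight of
    # filter f for the channel that PU block f holds after g block rotations.
    weight_mem = {}
    for r in range(R):
        for g in range(G):
            for s in range(S):
                row = [0] * n
                for f in range(F):
                    c = (f + g) % G
                    if c < C:
                        for p in range(P):
                            row[f * B + p] = filt[f][r][s][c]
                weight_mem[(r, g, s)] = row

    out = [[[0] * P for _ in range(Q)] for _ in range(F)]
    for q in range(Q):
        acc = [0] * n  # one accumulator per PU
        for r in range(R):          # row: slowest loop
            row = data_mem[q + r]   # single read of the first memory
            for g in range(G):      # channel (block rotation)
                for s in range(S):  # column: fastest loop
                    mux = rotate_left(row, g * B + s)
                    wrow = weight_mem[(r, g, s)]
                    for j in range(n):
                        acc[j] += wrow[j] * mux[j]
        for f in range(F):
            out[f][q] = acc[f * B:f * B + P]
    return out


def direct_conv(x, filt):
    """Reference result computed straight from the definition."""
    H, W, C = len(x), len(x[0]), len(x[0][0])
    F, R, S = len(filt), len(filt[0]), len(filt[0][0])
    return [[[sum(filt[f][r][s][c] * x[q + r][p + s][c]
                  for r in range(R) for s in range(S) for c in range(C))
              for p in range(W - S + 1)]
             for q in range(H - R + 1)]
            for f in range(F)]
```

The condition B > W is what keeps the model correct: the largest word offset a PU ever needs within a block is (P - 1) + (S - 1) = W - 1, so a column shift never pulls in a word from the neighboring block. Comparing `simulate_nnu_conv(x, filt, 32)` against `direct_conv(x, filt)` on small integer inputs is a quick way to check the layout.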
21. A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, the computer program product comprising:
computer usable program code embodied in said medium, for specifying a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the computer usable program code comprising:
first program code for specifying a first memory configured to hold rows of N words logically partitioned as G input blocks of B words each;
second program code for specifying a second memory configured to hold rows of N words logically partitioned as G filter blocks of B words each;
wherein B is the smallest factor of N that is greater than W, and wherein N is at least 512;
third program code for specifying an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the second memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the first memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G PU blocks of B PUs each;
wherein the input blocks are held in H rows of the first memory, wherein each row of the H rows of the first memory holds a respective 2-dimensional slice of a corresponding row of the H rows of the input, wherein the respective 2-dimensional slice is held within at least C input blocks of the G input blocks, wherein each input block of the at least C input blocks holds a row of words of the 2-dimensional slice specified by a respective channel of the C channels;
wherein the filter blocks are held in R×S×C rows of the second memory, wherein each filter block of F of the G filter blocks of each row of the R×S×C rows of the second memory holds P copies of a weight of a corresponding filter of the F filters at a respective row and a respective column and a respective channel of the corresponding filter; and
wherein to convolve the input with the filters, the G PU blocks perform multiply-accumulate operations on the input blocks and filter blocks in a column-channel-row order, wherein the G PU blocks read a row of the H rows of the at least C input blocks from the first memory and rotate the row around the N PUs while performing a portion of the multiply-accumulate operations such that each of F of the G PU blocks receives each of the at least C input blocks of the row before reading another row of the H rows from the first memory.
Specification