NEURAL NETWORK UNIT THAT PERFORMS EFFICIENT 3-DIMENSIONAL CONVOLUTIONS
First Claim
1. A neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising:
at least one memory that outputs a row of N words, wherein N is at least 512;
an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register;
wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W;
for each output row of the Q output rows:
    for each filter row of the R filter rows:
        the NNU reads into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
        for at least each channel of the C channels:
            for each filter column of the S filter columns:
                the NNU reads into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
                each PU of the array multiplies the register and the multiplexed-register to generate a product and accumulates the product with the accumulator; and
                the NNU rotates the multiplexed-registers by one; and
            the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
    the NNU writes the N accumulators to the at least one memory.
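The partitioning rule in the claim ("B is a smallest factor of N that is at least as great as W") can be illustrated with a short sketch. The function name `partition` and the sample values of N and W are assumptions chosen for illustration; only the rule itself comes from the claim.

```python
def partition(n_pus: int, width: int) -> tuple[int, int]:
    """Return (B, G) for N = n_pus processing units and input width W = width.

    B is the smallest factor of N that is >= W, and G = N // B is the
    number of blocks. Hypothetical helper, not part of the patent text.
    """
    if not 1 <= width <= n_pus:
        raise ValueError("W must satisfy 1 <= W <= N")
    for b in range(width, n_pus + 1):
        if n_pus % b == 0:  # b divides N, so it is a factor of N
            return b, n_pus // b
    raise AssertionError("unreachable: N is always a factor of itself")


# Illustrative values only: N = 1024 PUs, input rows of W = 12 columns.
print(partition(1024, 12))  # (16, 64)
```

With a power-of-two N, B is simply W rounded up to the next power of two, so up to B - W PUs at the end of each block hold no input word.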
Abstract
A neural network unit convolves an H×W×C input with F R×S×C filters to generate F Q×P outputs. N processing units (PU) each have a register receiving a memory word and a multiplexed-register selectively receiving a memory word or a word rotated from an adjacent PU multiplexed-register. The N PUs are logically partitioned as G blocks each of B PUs. The PUs convolve in a column-channel-row order. For each filter column: the N registers read a memory row, each PU multiplies the register and the multiplexed-register to generate a product to accumulate, and the multiplexed-registers are rotated by one; the multiplexed-registers are then rotated to align the input blocks with the adjacent PU block. This is performed for each channel. For each filter row, the N multiplexed-registers read a memory row for the multiply-accumulations, F column-channel-row-sums are generated and written to the memory, then all steps are performed for each output row.
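The column-channel-row order described above can be checked against a plain nested-loop convolution with a small software model. This is a minimal sketch under several assumptions that the text does not state: stride 1 with no padding (so Q = H - R + 1 and P = W - S + 1), a rotation direction in which each PU receives the word of its higher-indexed neighbor, an alignment rotation of B - S further positions, zero-filled unused blocks, accumulators cleared at the start of each output row, and a toy N of 32 instead of the claimed minimum of 512. All function and variable names are hypothetical.

```python
def rotate(words, k):
    """Rotate so that position i receives the word previously at (i + k) mod N."""
    k %= len(words)
    return words[k:] + words[:k]


def direct_convolve(inp, filt):
    """Reference result: plain nested-loop 3-D convolution, stride 1, no padding."""
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    F, R, S = len(filt), len(filt[0]), len(filt[0][0])
    Q, P = H - R + 1, W - S + 1
    return [[[sum(inp[q + r][p + s][c] * filt[f][r][s][c]
                  for r in range(R) for s in range(S) for c in range(C))
              for p in range(P)] for q in range(Q)] for f in range(F)]


def nnu_convolve(inp, filt, n_pus):
    """Model of the rotating PU array; inp[h][w][c], filt[f][r][s][c]."""
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    F, R, S = len(filt), len(filt[0]), len(filt[0][0])
    Q, P = H - R + 1, W - S + 1
    B = next(b for b in range(W, n_pus + 1) if n_pus % b == 0)
    G = n_pus // B
    assert G >= max(C, F), "sketch assumes one block per channel and per filter"
    out = [[None] * Q for _ in range(F)]
    for q in range(Q):                      # each output row
        acc = [0] * n_pus                   # assumption: accumulators cleared here
        for r in range(R):                  # each filter row
            # Data row: block c holds row q + r of channel c, zero elsewhere.
            mux = [0] * n_pus
            for c in range(C):
                for w in range(W):
                    mux[c * B + w] = inp[q + r][w][c]
            for k in range(G):              # "at least each channel": G passes
                for s in range(S):          # each filter column
                    # Weight row: block f holds copies of one weight of filter f
                    # for the channel currently sitting in PU block f.
                    reg = [0] * n_pus
                    for f in range(F):
                        c = (f + k) % G
                        if c < C:
                            reg[f * B:(f + 1) * B] = [filt[f][r][s][c]] * B
                    acc = [a + x * y for a, x, y in zip(acc, reg, mux)]
                    mux = rotate(mux, 1)    # rotate by one
                mux = rotate(mux, B - S)    # align with the adjacent PU block
        for f in range(F):                  # write back the valid P sums per filter
            out[f][q] = acc[f * B:f * B + P]
    return out


# Small deterministic example: 4x5x3 input, two 2x2x3 filters, N = 32 PUs.
H, W, C, F, R, S = 4, 5, 3, 2, 2, 2
inp = [[[(3 * h + 5 * w + 7 * c) % 11 - 5 for c in range(C)]
        for w in range(W)] for h in range(H)]
filt = [[[[(2 * f + 3 * r + s + 4 * c) % 7 - 3 for c in range(C)]
          for s in range(S)] for r in range(R)] for f in range(F)]
assert nnu_convolve(inp, filt, 32) == direct_convolve(inp, filt)
```

In this model only the first P PUs of each block produce meaningful sums. The remaining PUs of a block see words that wrap in from the neighboring block during the rotate-by-one steps, so their accumulators are ignored.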
19 Citations
31 Claims
1. A neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising:
at least one memory that outputs a row of N words, wherein N is at least 512;
an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register;
wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W;
for each output row of the Q output rows:
    for each filter row of the R filter rows:
        the NNU reads into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
        for at least each channel of the C channels:
            for each filter column of the S filter columns:
                the NNU reads into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
                each PU of the array multiplies the register and the multiplexed-register to generate a product and accumulates the product with the accumulator; and
                the NNU rotates the multiplexed-registers by one; and
            the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
    the NNU writes the N accumulators to the at least one memory.
17. A method for operating a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising at least one memory that outputs a row of N words, wherein N is at least 512, and an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W, the method comprising:
for each output row of the Q output rows:
    for each filter row of the R filter rows:
        reading, by the NNU, into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
        for at least each channel of the C channels:
            for each filter column of the S filter columns:
                reading, by the NNU, into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
                multiplying, by each PU of the array, the register and the multiplexed-register to generate a product and accumulating the product with the accumulator; and
                rotating, by the NNU, the multiplexed-registers by one; and
            rotating, by the NNU, the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
    writing, by the NNU, the N accumulators to the at least one memory.
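The loop nesting of the method fixes how often each step runs, which can be tallied directly. The sketch below is only an iteration count read off the claim language. It assumes the channel loop runs a caller-supplied number of passes (`channel_iters`, at least C per "for at least each channel"), and it counts each alignment rotation as a single step regardless of how many positions the hardware moves. The function name and dictionary keys are hypothetical.

```python
def step_counts(Q: int, R: int, S: int, channel_iters: int) -> dict:
    """Count how many times each claimed step executes for one convolution."""
    data_row_reads = Q * R                  # one data row per (output row, filter row)
    inner = Q * R * channel_iters * S       # innermost loop body executions
    return {
        "data_row_reads": data_row_reads,
        "weight_row_reads": inner,          # one weight row per inner iteration
        "multiply_accumulates_per_pu": inner,
        "rotate_by_one_steps": inner,
        "align_rotations": Q * R * channel_iters,
        "accumulator_row_writes": Q,        # one write per output row
    }


# Illustrative sizes only: 8 output rows, 5x5 filters, 20 channel passes.
counts = step_counts(Q=8, R=5, S=5, channel_iters=20)
print(counts["weight_row_reads"])  # 4000
```

The tally shows that weight-row reads outnumber data-row reads by a factor of `channel_iters * S`, since the input row stays in the multiplexed-registers and is only rotated between weight reads.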
31. A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, the computer program product comprising:
computer usable program code embodied in said medium, for specifying a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the computer usable program code comprising:
first program code for specifying at least one memory that outputs a row of N words, wherein N is at least 512;
second program code for specifying an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register;
wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W;
for each output row of the Q output rows:
    for each filter row of the R filter rows:
        the NNU reads into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
        for at least each channel of the C channels:
            for each filter column of the S filter columns:
                the NNU reads into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
                each PU of the array multiplies the register and the multiplexed-register to generate a product and accumulates the product with the accumulator; and
                the NNU rotates the multiplexed-registers by one; and
            the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
    the NNU writes the N accumulators to the at least one memory.
Specification