NEURAL NETWORK UNIT THAT PERFORMS EFFICIENT 3-DIMENSIONAL CONVOLUTIONS

US 20180157966A1
Filed: 12/01/2016
Published: 06/07/2018
Est. Priority Date: 12/01/2016
Status: Active Grant

Abstract

A neural network unit convolves an H×W×C input with F R×S×C filters to generate F Q×P outputs. N processing units (PU) each have a register receiving a memory word and a multiplexed-register selectively receiving a memory word or word rotated from an adjacent PU multiplexed-register. The N PUs are logically partitioned as G blocks each of B PUs. The PUs convolve in a column-channel-row order. For each filter column: the N registers read a memory row, each PU multiplies the register and the multiplexed-register to generate a product to accumulate, and the multiplexed-registers are rotated by one; the multiplexed-registers are rotated to align the input blocks with the adjacent PU block. This is performed for each channel. For each filter row, N multiplexed-registers read a memory row for the multiply-accumulations, F column-channel-row-sums are generated and written to the memory, then all steps are performed for each output row.
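As a point of reference for the abstract, the following is a minimal sketch of the arithmetic result the unit is described as producing, written as a plain loop nest with no hardware modelling. It assumes unit stride and no padding, so that Q = H - R + 1 and P = W - S + 1, and it assumes the input column index is the output column plus the filter column (the source states the analogous relation only for rows). The function name and the nested-list layouts `inp[h][w][c]` and `filters[f][r][s][c]` are illustrative choices, not taken from the patent.

```python
def conv3d_reference(inp, filters):
    """Direct H x W x C convolution with F filters of R x S x C.

    inp[h][w][c] and filters[f][r][s][c] are nested lists (assumed layout).
    Returns out[f][q][p] with Q = H - R + 1 and P = W - S + 1
    (assumption: stride 1, no padding).
    """
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    F, R, S = len(filters), len(filters[0]), len(filters[0][0])
    Q, P = H - R + 1, W - S + 1
    out = [[[0] * P for _ in range(Q)] for _ in range(F)]
    for f in range(F):
        for q in range(Q):
            for p in range(P):
                acc = 0
                # Same summation order as the abstract: row, channel, column.
                for r in range(R):
                    for c in range(C):
                        for s in range(S):
                            acc += inp[q + r][p + s][c] * filters[f][r][s][c]
                out[f][q][p] = acc
    return out
```

Each output element is one column-channel-row sum; the claimed unit computes many of these in parallel, one per processing unit.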

19 Citations

31 Claims

1. A neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising:
- at least one memory that outputs a row of N words, wherein N is at least 512;
  
  an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register;
  
  wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W;
  
  for each output row of the Q output rows;
  
  for each filter row of the R filter rows;
  
  the NNU reads into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
  
  for at least each channel of the C channels;
  
  for each filter column of the S filter columns;
  
  the NNU reads into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
  
  each PU of the array multiplies the register and the multiplexed-register to generate a product and accumulates the product with the accumulator; and
  
  the NNU rotates the multiplexed-registers by one; and
  
  the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
  
  the NNU writes the N accumulators to the at least one memory.
  - 2. The neural network unit of claim 1, further comprising:
    - the NNU clears the accumulators prior to performing any of the multiplies of the register and the multiplexed-register for said each output row of the Q output rows.
  - 3. The neural network unit of claim 1, further comprising:
    - wherein the multiplexed-register is further configured to selectively receive a word rotated from the multiplexed-register of one or more PUs other than the logically adjacent PU; and
      
      wherein to rotate the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs, the neural network unit causes the multiplexed-registers to receive the word rotated from the multiplexed-register of the one or more PUs other than the logically adjacent PU.
  - 4. The neural network unit of claim 1, further comprising:
    - wherein the at least one memory comprises;
      
      a first memory coupled to the N multiplexed-registers in which are stored the N-word wide rows of the input; and
      
      a second memory coupled to the N registers in which are stored the N-word wide rows of the weights.
  - 5. The neural network unit of claim 1, further comprising:
    - wherein the NNU is configured to read the at least one memory only as an entire row of the N words, not individual subunits of the N words within a row.
  - 6. The neural network unit of claim 1, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but W is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs, and therefore B and G are dynamic.
  - 7. The neural network unit of claim 1, further comprising:
    - wherein each of the G input blocks includes a row of a channel of the C channels of the input.
  - 8. The neural network unit of claim 1, further comprising:
    - wherein some of the G input blocks do not include a row of a respective channel of the C channels of the input but instead include null values such that the product generated is a null value accumulated with the accumulator.
  - 9. The neural network unit of claim 8, further comprising:
    - wherein said for at least each channel of the C channels comprises for C+X channels, wherein X is the number of the some of the G input blocks that include null values.
  - 10. The neural network unit of claim 1, further comprising:
    - wherein X of the G input blocks do not include a row of a respective channel of the C channels of the input; and
      
      wherein X of the G filter blocks corresponding to the X input blocks include null values such that the product generated is a null value accumulated with the accumulator.
  - 11. The neural network unit of claim 1, further comprising:
    - wherein with respect to said for each output row of the Q output rows, the output row has a zero-based output row index;
      
      wherein with respect to said for each filter row of the R filter rows, the filter row has a zero-based filter row index; and
      
      wherein the row of the respective channel of the input that the NNU reads into the multiplexed-registers has a zero-based index that is a sum of the output row index and the filter row index.
  - 12. The neural network unit of claim 1, further comprising:
    - wherein said the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of N PUs comprises the NNU rotates the multiplexed-registers by B minus S plus one.
  - 13. The neural network unit of claim 1, further comprising:
    - wherein each of the N words read into the N multiplexed-registers from the at least one memory has a first bit width, and wherein each of the N accumulators has a second bit width that is larger than the first bit width; and
      
      wherein said the NNU writes the N accumulators to the at least one memory comprises the NNU writes to the at least one memory N words of the first bit width that are a lesser precision representation of the corresponding N accumulators of the second bit width.
  - 14. The neural network unit of claim 1, further comprising:
    - wherein said for each output row of the Q output rows the NNU writes the N accumulators to the at least one memory without losing precision attributable to writing to the at least one memory intermediate partial sums and subsequently reading from the at least one memory the intermediate partial sums in order to generate the output row of the Q rows of the output.
  - 15. The neural network unit of claim 1, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but F is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs; and
      
      wherein when F is greater than G, then said for each output row of the Q output rows is performed T times, wherein T is a ceiling function of a quotient of F divided by G.
  - 16. The neural network unit of claim 1, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but C is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs;
      
      wherein when C is less than half G, V different groups of C of the G input blocks include a row of a respective channel of the C channels of the input; and
      
      wherein V is a floor function of a quotient of G divided by C.
  - 18. The method of claim 17, further comprising:
    - clearing, by the NNU, the accumulators prior to performing any of the multiplies of the register and the multiplexed-register for said each output row of the Q output rows.
  - 19. The method of claim 17, further comprising:
    - selectively receiving, by the multiplexed-register, a word rotated from the multiplexed-register of one or more PUs other than the logically adjacent PU; and
      
      wherein said rotating the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs comprises receiving, by the multiplexed-registers, the word rotated from the multiplexed-register of the one or more PUs other than the logically adjacent PU.
  - 20. The method of claim 17, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but W is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs, and therefore B and G are dynamic.
  - 21. The method of claim 17, further comprising:
    - wherein each of the G input blocks includes a row of a channel of the C channels of the input.
  - 22. The method of claim 17, further comprising:
    - wherein some of the G input blocks do not include a row of a respective channel of the C channels of the input but instead include null values such that the product generated is a null value accumulated with the accumulator.
  - 23. The method of claim 22, further comprising:
    - wherein said for at least each channel of the C channels comprises for C+X channels, wherein X is the number of the some of the G input blocks that include null values.
  - 24. The method of claim 17, further comprising:
    - wherein X of the G input blocks do not include a row of a respective channel of the C channels of the input; and
      
      wherein X of the G filter blocks corresponding to the X input blocks include null values such that the product generated is a null value accumulated with the accumulator.
  - 25. The method of claim 17, further comprising:
    - wherein with respect to said for each output row of the Q output rows, the output row has a zero-based output row index;
      
      wherein with respect to said for each filter row of the R filter rows, the filter row has a zero-based filter row index; and
      
      wherein the row of the respective channel of the input that the NNU reads into the multiplexed-registers has a zero-based index that is a sum of the output row index and the filter row index.
  - 26. The method of claim 17, further comprising:
    - wherein said rotating the multiplexed-registers to align the G input blocks with the adjacent G blocks of N PUs comprises rotating the multiplexed-registers by B minus S plus one.
  - 27. The method of claim 17, further comprising:
    - wherein each of the N words read into the N multiplexed-registers from the at least one memory has a first bit width, and wherein each of the N accumulators has a second bit width that is larger than the first bit width; and
      
      wherein said writing the N accumulators to the at least one memory comprises writing to the at least one memory N words of the first bit width that are a lesser precision representation of the corresponding N accumulators of the second bit width.
  - 28. The method of claim 17, further comprising:
    - wherein said for each output row of the Q output rows writing the N accumulators to the at least one memory without losing precision attributable to writing to the at least one memory intermediate partial sums and subsequently reading from the at least one memory the intermediate partial sums in order to generate the output row of the Q rows of the output.
  - 29. The method of claim 17, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but F is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs; and
      
      wherein when F is greater than G, then said for each output row of the Q output rows is performed T times, wherein T is a ceiling function of a quotient of F divided by G.
  - 30. The method of claim 17, further comprising:
    - wherein N, the number of PUs, is static based on hardware of the NNU, but C is a hyper-parameter of neural networks whose inputs the NNU convolves with the filters to generate the outputs;
      
      wherein when C is less than half G, V different groups of C of the G input blocks include a row of a respective channel of the C channels of the input; and
      
      wherein V is a floor function of a quotient of G divided by C.
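
The derived quantities recited in the claims above (B and G in claim 1, the alignment rotation in claim 12, T in claim 15, V in claim 16) can be sketched as small helper functions. This is an illustrative sketch only: the function names are hypothetical, and `channel_groups` simply evaluates the floor quotient that claim 16 recites for the case where C is less than half of G.

```python
def partition(n, w):
    """Return (B, G): B is the smallest factor of n that is >= w, and G = n // B."""
    if not 1 <= w <= n:
        raise ValueError("W must be between 1 and N")
    for b in range(w, n + 1):
        if n % b == 0:
            return b, n // b


def align_rotation(b, s):
    """Rotation count recited in claim 12: B minus S plus one."""
    return b - s + 1


def filter_passes(f, g):
    """T from claim 15: ceiling of F divided by G (integer ceiling division)."""
    return -(-f // g)


def channel_groups(g, c):
    """V from claim 16: floor of G divided by C (recited for C < G / 2)."""
    return g // c
```

With the hypothetical values N = 1024 and W = 12, `partition(1024, 12)` returns B = 16 and G = 64, because 1024 has only power-of-two factors and 16 is the smallest of them that is at least 12.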

17. A method for operating a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the neural network unit comprising at least one memory that outputs a row of N words, wherein N is at least 512, and an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register, wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W, the method comprising:
- for each output row of the Q output rows;
  
  for each filter row of the R filter rows;
  
  reading, by the NNU, into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
  
  for at least each channel of the C channels;
  
  for each filter column of the S filter columns;
  
  reading, by the NNU, into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
  
  multiplying, by each PU of the array, the register and the multiplexed-register to generate a product and accumulating the product with the accumulator; and
  
  rotating, by the NNU, the multiplexed-registers by one; and
  
  rotating, by the NNU, the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
  
  writing, by the NNU, the N accumulators to the at least one memory.
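
The loop nest of claim 17 can be illustrated with a small functional simulation. This is a sketch under several assumptions that the claim text does not fix, each repeated in the code comments: a rotate by k makes PU j take the word held by PU j + k; the single-word rotate is applied between successive filter columns (S - 1 times), so that together with the B minus S plus one alignment rotate the net shift per channel step is exactly B; the channel loop runs over all G blocks, treating the G - C unused input blocks as null (zero) blocks; stride is 1 with no padding; and F and C do not exceed G. All names are hypothetical.

```python
def rotate(regs, k):
    """Assumed direction: after the rotate, PU j holds the word PU j + k held."""
    return regs[k:] + regs[:k]


def nnu_convolve(inp, filters, n):
    """Simulate the claimed loop nest on n PUs; returns out[f][q][p]."""
    H, W, C = len(inp), len(inp[0]), len(inp[0][0])
    F, R, S = len(filters), len(filters[0]), len(filters[0][0])
    Q, P = H - R + 1, W - S + 1            # assumption: stride 1, no padding
    b = next(k for k in range(W, n + 1) if n % k == 0)  # smallest factor >= W
    g = n // b
    assert C <= g and F <= g, "sketch covers a single pass only"
    out = [[[0] * P for _ in range(Q)] for _ in range(F)]
    for q in range(Q):
        acc = [0] * n                      # accumulators cleared per output row
        for r in range(R):
            # Input row q + r: block ch holds that row of channel ch,
            # zero-padded to b words; blocks beyond C are null blocks.
            mux = []
            for blk in range(g):
                row = [inp[q + r][w][blk] for w in range(W)] if blk < C else []
                mux += row + [0] * (b - len(row))
            for step in range(g):          # C channels plus g - C null blocks
                for s in range(S):
                    # Weight row: filter block f is filled with copies of the
                    # weight for the channel currently aligned with PU block f.
                    reg = []
                    for blk in range(g):
                        ch = (blk + step) % g
                        wgt = filters[blk][r][s][ch] if blk < F and ch < C else 0
                        reg += [wgt] * b
                    acc = [a + x * y for a, x, y in zip(acc, reg, mux)]
                    if s < S - 1:
                        mux = rotate(mux, 1)   # assumption: S - 1 single rotates
                mux = rotate(mux, b - S + 1)   # align with the adjacent block
        for f in range(F):
            out[f][q] = acc[f * b:f * b + P]   # first P PUs of block f are valid
    return out
```

Only the first P accumulators of each PU block hold meaningful sums in this sketch; the remaining PUs of a block accumulate values that straddle block boundaries and are discarded.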

31. A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, the computer program product comprising:
- computer usable program code embodied in said medium, for specifying a neural network unit (NNU) configured to convolve an input of H rows by W columns by C channels with F filters each of R rows by S columns by C channels to generate F outputs each of Q rows by P columns, the computer usable program code comprising;
  
  first program code for specifying at least one memory that outputs a row of N words, wherein N is at least 512;
  
  second program code for specifying an array of N processing units (PU), wherein each PU of the array has an accumulator, a register configured to receive a respective word of the N words from a row of the at least one memory, a multiplexed-register configured to selectively receive a respective word of the N words from a row of the at least one memory or a word rotated from the multiplexed-register of a logically adjacent PU, and an arithmetic logic unit coupled to the accumulator, register and multiplexed-register;
  
  wherein the N PUs are logically partitioned as G blocks each of B respective PUs, wherein B is a smallest factor of N that is at least as great as W;
  
  for each output row of the Q output rows;
  
  for each filter row of the R filter rows;
  
  the NNU reads into the N multiplexed-registers from the at least one memory a row of N words logically partitioned as G input blocks corresponding to the G blocks of PUs, wherein at least C of the G input blocks include a row of a respective channel of the C channels of the input; and
  
  for at least each channel of the C channels;
  
  for each filter column of the S filter columns;
  
  the NNU reads into the N registers from the at least one memory a row of N words logically partitioned as G filter blocks corresponding to the G input blocks, wherein each of F filter blocks of the G filter blocks corresponds to a respective filter of the F filters and comprises at least Q copies of a weight of the respective filter at the filter column and the filter row and the respective channel of the corresponding input block;
  
  each PU of the array multiplies the register and the multiplexed-register to generate a product and accumulates the product with the accumulator; and
  
  the NNU rotates the multiplexed-registers by one; and
  
  the NNU rotates the multiplexed-registers to align the G input blocks with the adjacent G blocks of B PUs; and
  
  the NNU writes the N accumulators to the at least one memory.

Current Assignee: Via Alliance Semiconductor Co., Ltd.
Original Assignee: Via Alliance Semiconductor Co., Ltd.
Inventors: G. Glenn Henry; Kim C. Houck
Granted Patent: US 10,417,560 B2
CPC Class Codes

G06F 17/153 Multidimensional correlatio...

G06N 3/063 using electronic means
