NEURAL NETWORK PROCESSOR WITH A WINDOW EXPANDER CIRCUIT

Abstract
Neural network processors including a window expander circuit and related methods are provided. The window expander circuit may include a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories. The window expander circuit may further include a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the at least the subset of the input data from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor.
20 Claims
 1. A neural network processor configured to perform convolution operations on input data and N by N matrices, wherein N is a positive integer greater than one, the neural network processor comprising:
a plurality of multiplier circuits;
a window expander circuit comprising:
a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N, and
a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the subset of the input data from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits. View Dependent Claims (2, 3, 4, 5, 6, 7)
 8. A method in a neural network processor configured to perform convolution operations on input data and N by N matrices, wherein N is a positive integer greater than one, wherein the neural network comprises a plurality of multiply circuits, the method comprising:
automatically determining whether the input data received by the neural network processor requires expansion; and when the input data requires the expansion:
(1) storing a first set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the first set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N,
(2) shifting the first set of data elements from the Q number of logical memories into a first column of an array structure and storing a second set of data elements, corresponding to the subset of the input data, in the Q number of logical memories,
(3) shifting the first set of the data elements from the first column of the array structure into a second column of the array structure and shifting the second set of data elements from the Q number of logical memories into the first column of the array structure, and
(4) repeating storing and shifting steps using additional data elements corresponding to the subset of the input data until the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits. View Dependent Claims (9, 10, 11, 12, 13, 14)
 15. A neural network processor configured to perform convolution operations on input data and N by N matrices, wherein N is a positive integer greater than one, the neural network processor comprising:
a plurality of multiplier circuits;
a window expander circuit comprising:
a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N, and
a second logic circuit configured to receive the first set of data elements from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor, wherein the second logic circuit comprises a rotate circuit and an array structure. View Dependent Claims (16, 17, 18, 19, 20)
Specification
Neural network technology is used to perform complex tasks such as reading comprehension, language translation, image recognition, or speech recognition. Machine learning services, such as those based on Recurrent Neural Networks (RNNs), Convolution Neural Networks (CNNs), Long Short Term Memory (LSTM) neural networks, or Gated Recurrent Units (GRUs) have been deployed to perform such complex tasks. While these types of neural networks have been deployed, there is a need for continued improvement in the underlying architecture and corresponding instructions to perform these complex tasks.
In one example, the present disclosure relates to a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one. The neural network processor may include a plurality of multiplier circuits. The neural network processor may further include a window expander circuit. The window expander circuit may include a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N. The window expander circuit may further include a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the at least the subset of the input data from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits.
In another example, the present disclosure relates to a method in a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one, where the neural network comprises a plurality of multiply circuits. The method may include automatically determining whether the input data received by the neural network processor requires expansion. The method may further include when the input data requires the expansion: (1) storing a first set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the first set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N, (2) shifting the first set of data elements from the Q number of logical memories into a first column of an array structure and storing a second set of data elements, corresponding to the subset of the input data, in the Q number of logical memories, (3) shifting the first set of the data elements from the first column of the array structure into a second column of the array structure and shifting the second set of data elements from the Q number of logical memories into the first column of the array structure, and (4) repeating storing and shifting steps using additional data elements corresponding to the subset of the input data until the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits.
In yet another example, the present disclosure relates to a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one. The neural network processor may include a plurality of multiplier circuits and a window expander circuit. The window expander circuit may include a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N.
The window expander circuit may further include a second logic circuit configured to receive the first set of data elements from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor, where the second logic circuit comprises a rotate circuit and an array structure.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The present disclosure is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
Examples disclosed in the present disclosure relate to neural network processors that include a window expander circuit. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are both widely used computational kernels in mainstream machine learning. CNNs and RNNs may be efficiently expressed in terms of matrix-vector multiplication; however, the parallelism and data structures inherent to each differ significantly. Therefore, it is challenging to produce a single teraflop-scale computer architecture that efficiently computes both CNNs and RNNs. This problem is compounded when real-time latency requirements are placed on the design. As a result, previous solutions have specialized for CNNs or RNNs without prioritizing strong performance on both. Certain examples disclosed in the present disclosure relate to using systems, methods, and components that provide for efficient computation for both CNNs and RNNs. In particular, certain examples relate to the use of a hardware window expander circuit that can be used to expand input data received by the neural network processor.
As an example, the present disclosure describes a neural network processor that leverages the parallelism between individual output activations in a CNN to perform a limited form of matrix-matrix multiplication within an individual CNN evaluation. This parallelism is mapped onto a circuit in the form of an array of quasi-independent matrix-vector multiplication tile engines that receive the same matrix data but different vector data. This approach allows for high utilization at batch=1 for CNN inputs, which in turn delivers high throughput at low latency. This approach is also enabled by a CNN-aware instruction set architecture (ISA) that provides an information-dense expression of CNNs in the same assembly-level code that can be used to express RNNs.
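The tile-engine arrangement described above — one shared matrix, a different vector per engine — can be sketched behaviorally as follows. The function and variable names here are illustrative, not from the disclosure:

```python
def tiled_matmat(matrix, vectors):
    """Each 'tile engine' multiplies the shared matrix by its own vector;
    together the engines compute a limited matrix-matrix product."""
    def mvm(M, v):
        # one matrix-vector multiplication, as performed by a single tile engine
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    return [mvm(matrix, v) for v in vectors]  # one result vector per engine
```

Because every engine reuses the same matrix data, the matrix needs to be fetched only once while many independent vectors are processed, which is the source of the high utilization at batch=1.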
The neural network processors described in this disclosure may be implemented using portions or combinations of Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Erasable and/or Complex programmable logic devices (PLDs), Programmable Array Logic (PAL) devices, and Generic Array Logic (GAL) devices. An image file may be used to configure or reconfigure neural network processors, such as FPGAs. The image file or similar file or program may be delivered via a network link or a local link (e.g., PCIe) from a host CPU. Information included in an image file can be used to program hardware blocks of a neural network processor (e.g., logic blocks and reconfigurable interconnects of an FPGA) to implement desired functionality. Desired functionality can be implemented to support any service that can be offered via a combination of computing, networking, and storage resources such as via a data center or other infrastructure for delivering a service.
In one example, neural network processors (e.g., FPGAs) or groups of such neural network processors may be coupled to each other via a low-latency network. A converged platform leveraging hundreds to thousands of such neural network processors (e.g., FPGAs) may advantageously offer: (1) significantly reduced training times from exploiting parallelism across hundreds of thousands of nodes, (2) new training scenarios such as online learning in situ on live data, and (3) training models of unprecedented scale while leveraging flexible and fungible homogeneous FPGA resources in a hyperscale datacenter spanning hundreds of thousands of servers. In one example, such advantages may be obtained by exploiting unconventional data representations that may leverage the architecture of neural network processors, such as FPGAs.
The described aspects can also be implemented in cloud computing environments. Cloud computing may refer to a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly. A cloud computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud computing model may be used to expose various service models, such as, for example, Hardware as a Service (“HaaS”), Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). A cloud computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
Machine learning services, such as those based on Recurrent Neural Networks (RNNs), Convolution Neural Networks (CNNs), Long Short Term Memory (LSTM) neural networks, or Gated Recurrent Units (GRUs) may be implemented using the neural network processors described in this disclosure. In one example, the service-related content or other information, such as words, sentences, images, videos, or other such content/information may be translated into a vector representation. The vector representation may correspond to techniques such as RNN, CNN, LSTM, or GRU. The deep learning models may be trained offline before service initialization and then may be deployed using the systems and neural network processors described in this disclosure.
In one example, the neural network model may comprise many layers and each layer may be encoded as matrices or vectors of weights expressed in the form of coefficients or constants that have been obtained via offline training of a neural network. Programmable hardware logic blocks in the nodes may process the matrices or vectors to perform various operations, including multiply, add, and other operations against input vectors representing encoded information related to the service. In one example, the matrices or vectors of weights may be partitioned and pinned across multiple nodes by using techniques such as graph partitioning. As part of this process, a large neural network may be translated into an intermediate representation (e.g., a graph) and then the intermediate representation may be carved into smaller representations (e.g., subgraphs) and each of the matrices of weights corresponding to each subgraph may be pinned to a node's on-chip memories. In one example, the models may be translated into fixed-size matrices and vectors. This way, the nodes' resources may operate on the fixed-size matrices and vectors in parallel.
Taking the LSTM example, an LSTM network may comprise a sequence of repeating RNN layers or other types of layers. Each layer of the LSTM network may consume an input at a given time step, e.g., a layer's state from a previous time step, and may produce a new set of outputs or states. When using an LSTM, a single chunk of content may be encoded into a single vector or multiple vectors. As an example, a word or a combination of words (e.g., a phrase, a sentence, or a paragraph) may be encoded as a single vector. Each chunk may be encoded into an individual layer (e.g., a particular time step) of an LSTM network. An LSTM layer may be described using a set of equations, such as the ones below:
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f)
c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_xc·x_t + W_hc·h_{t-1} + b_c)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_t + b_o)
h_t = o_t ∘ tanh(c_t)
In this example, inside each LSTM layer, the inputs and hidden states may be processed using a combination of vector operations (e.g., dot-product, inner product, or vector addition) and nonlinear functions (e.g., sigmoids and hyperbolic tangents). In certain cases, the most compute-intensive operations may arise from the dot products, which may be implemented using dense matrix-vector and matrix-matrix multiplication routines. In one example, the processing of the vector operations and nonlinear functions may be performed in parallel.
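As a reference point for the equations above, a single LSTM step can be sketched in scalar form. The weight names follow the equations; the dictionary-based parameterization and scalar dimensions are purely illustrative (vectors generalize element-wise):

```python
import math

def sigmoid(z):
    # logistic function used for the input, forget, and output gates
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single unit; p maps the weight names from the
    equations above to scalar values."""
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t
```

In the hardware, the weight terms become the dense matrix-vector products noted above, while the gate nonlinearities and element-wise products map to the vector/scalar function units.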
In one example, individual neural network processors may send messages comprising packets directly to each other, which may allow the partitioning of even a single neural network across multiple neural network processors without incurring unacceptable latencies. For communicating, the neural network processors may use a lightweight protocol, including, for example, RDMA. Parallelization could also be performed within a layer of a neural network by splitting neural weights across multiple neural network processors. As an example, a single CNN or RNN model (e.g., including LSTM weight matrices) may be partitioned and processed using neural network processors.
With continued reference to
MVM 110 may include a vector register file (VRF) 112, a matrix register file (MRF) 120, and tile engines (e.g., tile engines 114, 116, and 118). Tile engines may receive input matrix and input vector data from VRF 112. MVM 110 may further include format converters, as needed, including block floating point (BFP) to floating point (FP) converters. In one example, two internal BFP formats may be used by MVM 110 for expressing its input and output: BFP short, for vector and matrix storage, and BFP long for accumulation. In one example of MVM 110, BFP short may use q1.15 fixed-point values with a shared 5-bit exponent, and BFP long may use q34.40 fixed-point values with a shared 5-bit exponent. In this example, the matrix-vector multiplication may result in BFP long, which may be converted back to a floating-point format as a final output stage. Thus, the example MVM 110 shown in
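The shared-exponent idea behind the BFP formats can be sketched as follows. This is only a behavioral approximation: the exact q1.15/q34.40 bit layouts, rounding modes, and exponent widths are hardware details not specified here:

```python
import math

def to_bfp(values, mantissa_bits=16):
    """Quantize a vector to block floating point: all elements share one
    exponent and are stored as fixed-point mantissas."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return 0, [0] * len(values)
    shared_exp = math.floor(math.log2(peak)) + 1  # one exponent for the block
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    limit = (1 << (mantissa_bits - 1)) - 1
    # clamp so the rounded mantissa always fits in the fixed-point range
    mantissas = [max(-limit - 1, min(limit, round(v * scale))) for v in values]
    return shared_exp, mantissas

def from_bfp(shared_exp, mantissas, mantissa_bits=16):
    """Reconstruct the floating-point values from a BFP block."""
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    return [m / scale for m in mantissas]
```

Sharing one exponent across a block is what lets the multipliers operate on plain fixed-point mantissas, with the exponent handled once per block rather than once per element.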
With continued reference to
Still referring to
Neural network processor 100 may be used to enable issuance of instructions that can trigger millions of operations using a small number of instructions. As an example, Table 1 below shows instructions corresponding to a fully parameterized LSTM:
Although Table 1 shows a certain number of instructions having a certain format, neural network processor 100 may execute more or fewer instructions having a different format to accomplish the same objectives.
Table 2 below shows how to compute a 1×1 convolution as part of a CNN evaluation.
As shown in the table above, the number of iterations over a chain of instructions for the computation may be specified. Next, as needed, the native dimension of each instruction chain may be scaled by a column scaling factor. And after reading the vector data from the vector register file it may be multiplied with the weights retrieved from the matrix register file. After performing additional operations as required by the CNN evaluation, the output may be provided. As an example, a pointwise Rectified Linear Unit (ReLU) operation may be performed for each element of the vector data.
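The 1×1 convolution described above is exactly a per-pixel matrix-vector multiply over the input channels, followed by the pointwise ReLU. A minimal sketch, with a nested-list data layout assumed purely for illustration:

```python
def conv1x1(weights, image):
    """weights: C_out x C_in matrix; image: H x W x C_in nested lists.
    Returns H x W x C_out with a pointwise ReLU applied."""
    out = []
    for row in image:
        out_row = []
        for pixel in row:  # pixel is the C_in feature vector at this location
            # matrix-vector multiply: weights retrieved from the matrix
            # register file times the vector read from the vector register file
            acts = [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
            out_row.append([max(0.0, a) for a in acts])  # pointwise ReLU
        out.append(out_row)
    return out
```

Because no spatial window is involved, the 1×1 case needs no window expansion — each pixel's channel vector is already a native input to the matrix-vector unit.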
Table 3 below shows how to compute an N×N convolution as part of a CNN evaluation. The instructions below that are similar to the 1×1 convolution are not described again. The Set2dWindows instruction may be used to set the total window size and then the SetIterations instruction may be used to slide that window across the input volume. The *_inc instructions (e.g., v_rd_inc and v_add_inc) may be used to increment the instruction's address based on the stride. As an example, a stride of 2 may result in the skipping of every other vector in the vector register file that is used to store vector data for operations, such as addition.
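The strided-increment behavior of the *_inc instructions can be sketched as an address increment over a vector register file. This is a behavioral model of the addressing only, not the hardware implementation:

```python
def read_with_stride(vrf, start, count, stride):
    """Read `count` vectors from a vector register file, advancing the
    address by `stride` each read; stride 2 skips every other vector."""
    return [vrf[start + i * stride] for i in range(count)]
```

With stride 2, every other entry is skipped, matching the description above of how strided convolutions read the vector register file.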
MRF 220 may include several matrix register files that may be configured to supply matrix data or elements to dot product units within each tile. Each multiplier may receive one vector element from VRF 210 per cycle and one matrix element from one of the matrix register files per cycle. The matrix elements may be delivered by a dedicated port of the matrix register file positioned adjacent to that multiplier. MRF 220 may be organized as follows: stored matrices may be divided into native-sized tiles and each tile may be stored in only a single tile engine. The matrix stored in a given tile engine may be viewed as an MRF bank. Each dot product unit may be associated with a sub-bank of the MRF that holds one row of each matrix tile in that MRF bank. Rows may be statically assigned to dot product units, such that the first dot product unit contains the first row of every matrix tile in the MRF bank. Finally, the elements of the row may be interleaved in an SRAM such that the SRAM read port can be directly connected to multiplier lanes by wires alone. The writes to the matrix register file may be handled differently since matrix data for writing to MRF 220 may come from off-chip memory, such as DRAM. Although
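The static row-to-unit assignment can be sketched as follows, assuming for illustration that the native tile height equals the number of dot product units:

```python
def assign_rows_to_units(tiles, num_units):
    """Sketch of the static layout: dot product unit u's MRF sub-bank holds
    row u of every matrix tile stored in that bank."""
    subbanks = [[] for _ in range(num_units)]
    for tile in tiles:  # each tile is a list of `num_units` rows
        for u, row in enumerate(tile):
            subbanks[u].append(row)  # row u always lands in unit u's sub-bank
    return subbanks
```

Because the assignment is static, each dot product unit's SRAM read port can be wired straight to its multiplier lanes, as described above, with no routing logic in between.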
From an operational point of view, as described above, MVM 200 instantiates a series of matrix-vector tiles, each of which is designed to accelerate a native-sized MVM. In turn, each tile engine includes a series of dot product engines. In one example, this may be accomplished using a hierarchical decode and dispatch architecture. Thus, in a case where neural network processor 100 is implemented based on an FPGA, a control processor may be realized using an off-the-shelf Nios II/f processor that is paired with custom code. A top-level scheduler associated with the control processor may receive a stream of instructions that may be grouped in chains. After decoding the instructions, the top-level scheduler may dispatch distributed control signals to a set of second-level schedulers and to another set of second-level decoders. These second-level schedulers and decoders may dispatch additional distributed control signals to the lowest-level decoders. In the example implementation using the Nios processor, the Nios processor may stream T iterations of N instructions into the top-level scheduler. Next, the top-level scheduler may dispatch the MVM-specific portion of instructions to a second-level scheduler, which may expand operations along the target matrix's N rows and N columns. These MVM schedules may be mapped to matrix-vector tile engines and the operations may be dispatched to a set of decoders for the tile engines and their associated vector register files and accumulation units. The set of decoders may generate control signals that fan out into the data plane, with each tile engine dispatcher fanning out to hundreds of dot product units that may read the vector data from the vector register file and write the vector data back to the vector register file.
Native vectors may have a size of 1 by N and native matrices may have a size of N by N, and all instructions for neural network processor 100 may operate only on native-sized data. Logical vectors and matrices in applications may often be larger than the native size; in these cases, the vectors and matrices may be broken up into native-sized tiles. Conversely, in certain instances logical vectors may be much smaller than the native size. As an example, image classification inference models, such as ImageNet models (e.g., ResNet50) may have layers where the logical vectors are relatively smaller than the native-sized data. As an example,
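Breaking a logical matrix into native-sized tiles, zero-padding the ragged edges so every tile is exactly native-sized, can be sketched as:

```python
def tile_matrix(M, n):
    """Break a logical matrix (nested lists) into an r x c grid of n-by-n
    native tiles, zero-padding where the logical matrix runs out."""
    rows, cols = len(M), len(M[0])
    tiles = []
    for r0 in range(0, rows, n):
        tile_row = []
        for c0 in range(0, cols, n):
            tile = [[M[r][c] if r < rows and c < cols else 0
                     for c in range(c0, c0 + n)]
                    for r in range(r0, r0 + n)]
            tile_row.append(tile)
        tiles.append(tile_row)
    return tiles
```

Each resulting tile can then be stored in a single tile engine and processed by native-sized instructions; padding wastes some multiplier work, which is why undersized logical vectors are the interesting case the window expander addresses.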
Custom convolution hardware requires many dedicated multiply-accumulate (MAC) units for high performance, but these units would otherwise sit idle when operations other than convolutions are executed. Instead, certain example neural network processors described in the present disclosure move only the window expansion component to custom hardware and transform the convolution operation into a GEMM operation, allowing existing MAC units to be used and enabling hardware optimized for GEMMs to execute convolutions efficiently. One key aspect of this approach is the design of the window expansion array, which allows for high-throughput expansion (up to 1 expanded window per cycle) at custom window sizes and strides with low area cost.
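The window-expansion-then-GEMM transformation described above is essentially the familiar im2col approach: flatten each spatial window into one column so the whole convolution becomes a single matrix multiply. A minimal software sketch, with nested-list layouts assumed for illustration:

```python
def expand_windows(image, k, stride):
    """im2col-style window expansion: flatten each k-by-k window (all
    channels) into one vector, one vector per output position."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    cols = []
    for r in range(0, H - k + 1, stride):
        for c in range(0, W - k + 1, stride):
            window = [image[r + dr][c + dc][ch]
                      for dr in range(k) for dc in range(k)
                      for ch in range(C)]
            cols.append(window)
    return cols

def conv_as_gemm(kernels, image, k, stride):
    """kernels: C_out rows, each of length k*k*C_in matching the window
    layout above. The convolution reduces to dot products over the
    expanded windows, i.e., a GEMM on existing MAC units."""
    cols = expand_windows(image, k, stride)
    return [[sum(w * x for w, x in zip(kern, col)) for kern in kernels]
            for col in cols]
```

The hardware window expander performs the `expand_windows` role at up to one expanded window per cycle, so the MAC array only ever sees GEMM-shaped work.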
With continued reference to
Window expander circuit 700 may include a finite state machine (FSM) 710 configured to receive input data from either a high-speed link (e.g., PCI express) or another source. For convolution operations based on assumptions shown in Table 4, window expander circuit 700 may further include seven logical SRAMs (SRAM 0, SRAM 1, SRAM 2, SRAM 3, SRAM 4, SRAM 5, and SRAM 6). Two features per entry may be stored in each of these SRAMs to handle stride 2 (e.g., in a similar fashion as explained with respect to
The operation of window expander circuit 700 may be explained by taking an example of an image processing application in which RGB pixel values are the features that are being convolved with kernels. Also, for this example, it is assumed that each pixel is expressed in 16-bit values. Once the pipeline including the pixel values is full, each cycle the window expander circuit may produce one expanded window input based on the assumptions for the convolution operation as shown in Table 4. In this example, the expanded image input data (e.g., pixel values) may be equal to 7×7×3=147 16-bit values. For the window expander circuit 700 shown in
Step 1620 may include when the input data requires the expansion: (1) storing a first set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the first set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N, (2) shifting the first set of data elements from the Q number of logical memories into a first column of an array structure and storing a second set of data elements, corresponding to the subset of the input data, in the Q number of logical memories, (3) shifting the first set of the data elements from the first column of the array structure into a second column of the array structure and shifting the second set of data elements from the Q number of logical memories into the first column of the array structure, and (4) repeating storing and shifting steps using additional data elements corresponding to the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits.
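The four storing/shifting steps above can be simulated as a sliding window of array columns. This is a behavioral sketch that abstracts away the rotate circuit and the hardware's exact cycle timing:

```python
def expand_with_array(column_sets, q, width):
    """Simulate the store-and-shift pipeline: each step, a Q-element set is
    stored in the logical memories, then shifted into the first column of
    the array while earlier columns shift onward; once `width` columns are
    full, every further step emits one expanded window."""
    array = []    # array-structure columns, newest first
    windows = []  # expanded windows, one per fully populated array
    for s in column_sets:
        assert len(s) == q         # each set fills the Q logical memories
        memories = list(s)         # step (1): store the set in the memories
        array.insert(0, memories)  # steps (2)-(3): memories -> first column,
        if len(array) > width:     # older columns shift toward the back
            array.pop()            # the oldest column retires
        if len(array) == width:    # step (4): a full expanded window is ready
            windows.append([v for col in reversed(array) for v in col])
    return windows
```

Note how consecutive windows overlap in their shared columns; the array reuses the data already shifted in rather than re-reading it, which is what makes one expanded window per cycle achievable.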
In this example, this step may be performed using window expander circuit 720. Thus, as explained earlier with respect to
In conclusion, the present disclosure relates to a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one. The neural network processor may include a plurality of multiplier circuits. The neural network processor may further include a window expander circuit. The window expander circuit may include a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N. The window expander circuit may further include a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the subset of the input data from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits.
In this example, the first logic circuit may include a finite state machine configured to store the data elements corresponding to the at least the subset of the input data into each of the Q logical memories. Each of the Q logical memories may comprise a random-access memory.
The second logic circuit may comprise a rotate circuit and an array structure. In addition, the vector register file may be configured to store expanded data.
In this example, the neural network processor may further be configured to receive the input data via a PCI express bus. In another example, the neural network processor may further be configured to receive the input data from a vector data memory, where the vector data memory is configured to receive the input data via a PCI express bus.
In another example, the present disclosure relates to a method in a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one, where the neural network comprises a plurality of multiply circuits. The method may include automatically determining whether the input data received by the neural network processor requires expansion. The method may further include when the input data requires the expansion: (1) storing a first set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the first set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N, (2) shifting the first set of data elements from the Q number of logical memories into a first column of an array structure and storing a second set of data elements, corresponding to the subset of the input data, in the Q number of logical memories, (3) shifting the first set of the data elements from the first column of the array structure into a second column of the array structure and shifting the second set of data elements from the Q number of logical memories into the first column of the array structure, and (4) repeating storing and shifting steps using additional data elements corresponding to the subset of the input data until the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits.
The storing and the shifting steps may be performed using a window expander circuit comprising a first logic circuit, where the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the at least the subset of the input data into each of the Q logical memories. Each of the Q logical memories may include a random-access memory. The window expander circuit may further include a rotate circuit coupled between the Q logical memories and the array structure.
The method may further include storing expanded data into a vector register file corresponding to the neural network processor. The method may further include receiving the input data via a PCI Express bus. In another example, the method may include receiving the input data from a vector data memory, where the vector data memory is configured to receive the input data via a PCI Express bus.
In yet another example, the present disclosure relates to a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one. The neural network processor may include a plurality of multiplier circuits and a window expander circuit. The window expander circuit may include a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, where each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, where P is an integer equal to or greater than one and Q is an integer equal to or greater than N.
The window expander circuit may further include a second logic circuit configured to receive the set of data elements from the Q number of logical memories and expand the at least the subset of the input data until the at least the subset of the input data is expanded based on a predetermined factor, where the second logic circuit comprises a rotate circuit and an array structure.
The first logic circuit may comprise a finite state machine configured to store the data elements corresponding to the subset of the input data into each of the Q logical memories. Each of the Q logical memories may comprise a random-access memory.
In addition, the rotate circuit may be configured to selectively rotate the at least the subset of the input data before providing the at least the subset of the input data to the array structure. The extent of the rotation of the at least the subset of the input data may be determined based on a stride associated with the convolution operations.
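One way to picture the rotate circuit's role is as a circular shift applied to each column of data elements before it enters the array structure. The sketch below is a hypothetical illustration: the specification states only that the extent of rotation is determined by the stride, so the particular rotation amount (stride times the step index, modulo the column length) and the name `rotate_for_stride` are assumptions made for this example.

```python
def rotate_for_stride(column, stride, step_index):
    """Hypothetical model of the rotate circuit: circularly rotate a column
    of data elements by an amount derived from the convolution stride.

    column: the Q data elements read from the logical memories.
    stride: the stride associated with the convolution operations.
    step_index: which shift step this column corresponds to.
    """
    # Assumed rotation extent; the specification leaves this unspecified
    # beyond its dependence on the stride.
    k = (stride * step_index) % len(column)
    return column[k:] + column[:k]
```

For example, with a stride of two, the second column (`step_index=1`) of `[0, 1, 2, 3]` would be rotated by two positions to `[2, 3, 0, 1]` before being provided to the array structure, while a stride of one at step zero leaves the column unchanged.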
It is to be understood that the methods, modules, and components depicted herein are merely exemplary. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “coupled,” to each other to achieve the desired functionality.
The functionality associated with some examples described in this disclosure can also include instructions stored in a non-transitory media. The term “non-transitory media” as used herein refers to any media storing data and/or instructions that cause a machine to operate in a specific manner. Exemplary non-transitory media include non-volatile media and/or volatile media. Non-volatile media include, for example, a hard disk, a solid-state drive, a magnetic disk or tape, an optical disk or tape, a flash memory, an EPROM, NVRAM, PRAM, or other such media, or networked versions of such media. Volatile media include, for example, dynamic memory, such as DRAM, SRAM, a cache, or other such media. Non-transitory media is distinct from, but can be used in conjunction with, transmission media. Transmission media is used for transferring data and/or instructions to or from a machine. Exemplary transmission media include coaxial cables, fiber-optic cables, copper wires, and wireless media, such as radio waves.
Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above-described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
Although the disclosure provides specific examples, various modifications and changes can be made without departing from the scope of the disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Any benefits, advantages, or solutions to problems that are described herein with regard to a specific example are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.