SYSTOLIC CONVOLUTIONAL NEURAL NETWORK

First Claim
1. A circuit for performing convolutional neural network computations for a neural network, the circuit comprising:
 a transposing buffer configured to receive actuation feature vectors along a first dimension of the transposing buffer and to output feature component vectors along a second dimension of the transposing buffer;
a weight buffer configured to store kernel weight vectors along a first dimension of the weight buffer and further configured to output kernel component vectors along a second dimension of the weight buffer; and
a systolic array configured to receive the kernel weight vectors along a first dimension of the systolic array and to receive the feature component vectors along a second dimension of the systolic array, where the systolic array comprises an array of multiply and accumulate (MAC) processing cells.
Abstract
A circuit and method are provided for performing convolutional neural network computations for a neural network. The circuit includes a transposing buffer configured to receive actuation feature vectors along a first dimension and to output feature component vectors along a second dimension, a weight buffer configured to store kernel weight vectors along a first dimension and further configured to output kernel component vectors along a second dimension, and a systolic array configured to receive the kernel weight vectors along a first dimension and to receive the feature component vectors along a second dimension. The systolic array includes an array of multiply and accumulate (MAC) processing cells. Each processing cell is associated with an output value. The actuation feature vectors may be shifted into the transposing buffer along the first dimension and output feature component vectors may be shifted out of the transposing buffer along the second dimension, providing efficient dataflow.
19 Claims
 1. A circuit for performing convolutional neural network computations for a neural network, the circuit comprising:
a transposing buffer configured to receive actuation feature vectors along a first dimension of the transposing buffer and to output feature component vectors along a second dimension of the transposing buffer; a weight buffer configured to store kernel weight vectors along a first dimension of the weight buffer and further configured to output kernel component vectors along a second dimension of the weight buffer; and a systolic array configured to receive the kernel weight vectors along a first dimension of the systolic array and to receive the feature component vectors along a second dimension of the systolic array, where the systolic array comprises an array of multiply and accumulate (MAC) processing cells.
 13. A method for performing convolution neural network computations for a neural network, the method comprising:
loading input feature vectors into a transposing buffer along a first dimension of the transposing buffer; loading kernel weight vectors along a first dimension of a weight buffer; for each of a plurality of processing cycles: outputting kernel component vectors from a second dimension of the weight buffer to a first dimension of a systolic array, where the second dimension is perpendicular to the first dimension; outputting feature component vectors from a second dimension of the transposing buffer to a second dimension of the systolic array, where the second dimension is perpendicular to the first dimension and where the first dimension is perpendicular to the second dimension; and in each cell of the systolic array, accumulating a product of a feature component and a kernel component; and outputting accumulated products of the cells of the systolic array to an output layer of the neural network.
1 Specification
Artificial neural networks (ANNs) have found application in many areas, from the Internet of Things (IoT) to large datacenters. ANNs can be computationally intensive, which has motivated the development of specialized hardware accelerators for ANNs. These accelerators have the potential for lower power and higher performance. In particular, convolutional neural networks (CNNs) have proven to be useful for a wide range of classification and regression applications, most notably object classification in natural images. The core computation required for CNNs is a three-dimensional (3D) convolution, which may be implemented in software using an image-to-column (IM2COL) transformation of image data followed by a general matrix multiplication (GEMM) operation.
A hardware accelerator for a CNN may follow a similar approach, providing hardware acceleration for the IM2COL transformation and the GEMM operation. The most common approach for the GEMM operation is to use a systolic array, which consists of a two-dimensional (2D) grid of multiply-accumulate (MAC) units, each connected to its neighbors to pass operands and results in a regular fashion. Systolic arrays are efficient because the communication is kept local (register to register) for as long as possible, which reduces the number of static random access memory (SRAM) and main memory accesses. This approach is often referred to as ‘operand reuse’.
However, one of the challenges with using a systolic array is designing hardware to perform the “dataflow” needed to rearrange the input data (and output data) in a suitable pattern such that the correct computation is performed. The IM2COL operation is part of this data rearrangement. However, for application in small devices, such as IoT devices, there is a desire that the data rearrangement should have a simple memory layout in order to facilitate a practical implementation. In addition, operand reuse should be maximized where possible.
The accompanying drawings provide visual representations which will be used to more fully describe various representative embodiments and can be used by those skilled in the art to better understand the representative embodiments disclosed and their inherent advantages. In these drawings, like reference numerals identify corresponding elements.
The various apparatus and devices described herein provide a hardware accelerator for a convolutional neural network (CNN) that provides efficient operand reuse without requiring a complicated memory layout.
In accordance with certain representative embodiments of the present disclosure, there is provided a circuit for performing convolutional neural network computations for a neural network.
A major application of convolutional neural networks (CNNs) is the recognition of features in a digital image consisting of a matrix of pixel values. This may be a three-dimensional matrix for a color image or a two-dimensional matrix for a grayscale image. A CNN extracts features from an input image by performing a convolution operation between the image matrix and a matrix of values termed a ‘filter’, ‘kernel’ or ‘feature detector’. The convolution operation preserves the spatial relationship between pixels in the image by learning image features using small patches (usually squares) of input data. Thus, the core computation required for CNNs is a three-dimensional (3D) convolution, which may be implemented in software using an image-to-column (IM2COL) transformation of image data followed by a general matrix multiplication (GEMM) operation.
A hardware accelerator for a CNN may follow a similar approach, providing hardware acceleration for the IM2COL transformation and the GEMM operation. The most common approach for performing the GEMM operation is to use a systolic array, which consists of a two-dimensional (2D) grid of multiply-accumulate (MAC) units, each connected to its neighbors to pass operands and results in a regular fashion. Systolic arrays are efficient because the communication is kept local (register to register) for as long as possible, which reduces the number of static random access memory (SRAM) and main memory accesses. This approach is often referred to as ‘operand reuse’.
In the embodiment shown, accelerator 100 includes a direct memory access (DMA) unit 112 that retrieves data to be processed from a host data processing system via memory management unit (MMU) 114 of the host data processing unit. MMU 114 may provide a Translation Buffer Unit (TBU) and/or Translation Lookaside Buffer (TLB) for memory address translation and caching memory page tables. MMU 114 may provide a data-streaming interface, for example, to receive image data and return inferred features. DMA unit 112 also retrieves commands that are stored in command buffer 116. The commands control operation of sequencer 118. In turn, sequencer 118 synchronizes operation of DMA unit 112, data transformer 104, systolic array 106 and output layer 110. Upon completion of an operation, sequencer 118 may signal a host data processing system via an interrupt request (IRQ) signal 120.
Control registers 122 may be provided. In the embodiment shown, these registers may be connected to the host data processing system via a bus 124. This may be an Advanced High-performance Bus (AHB), for example. Thus, accelerator circuit 100 provides an interface to a host data processing system, through DMA unit 112 and bus 124, for example, that enables the accelerator to exchange data and commands with the host data processing system.
In general, circuit 100 is configured for performing convolutional neural network computations for a neural network. The circuit includes buffers 104 that include a transposing buffer and a weight buffer. The transposing buffer is a two-dimensional buffer that receives actuation feature vectors (patch values) along a first dimension and outputs feature component vectors along a second dimension to systolic array 106. The weight buffer is also a two-dimensional buffer that is configured to store kernel weight vectors along a first dimension and to output kernel component vectors along a second dimension. As will be discussed below, systolic array 106 is configured to receive the kernel weight vectors along a first dimension and to receive the feature component vectors along a second dimension. The systolic array comprises an array of multiply and accumulate (MAC) processing cells.
By way of explanation, some implementations of the convolution accelerator 100 are described below with reference to a simple example. As will be apparent to those of ordinary skill in the art, this simple example may be expanded to higher dimensions.
Each layer of a CNN receives, as input, a set of actuation features. These will be referred to herein as input features or as actuation features. For image processing, the actuation features comprise the image data itself. Each pixel in the image comprises one or more values. For example, an N×M image A may be written as an array of pixel feature vectors
where, for a color image with R, G, B components or channels, the pixel feature vector at position (x, y) is given by
a(x, y) = [a_{xy}^{R}, a_{xy}^{G}, a_{xy}^{B}]^{T}. (2)
The terms a_{xy}^{R}, a_{xy}^{G} and a_{xy}^{B} are the three components or channels of the feature a(x, y). The image is an example of an input feature map. More generally, an input feature map may have any number of components or channels, where the components are values of feature components. The disclosed hardware accelerator is not limited to the processing of image data and may be used generally to process data arrays of the form given in equation (1). The accelerator produces an output feature map, which may be processed further in the output processor before being fed back into the accelerator to iteratively produce new output feature maps.
By way of example, the description below is based upon image processing using a number of kernels or filters each with dimension 2×2×L. However, the extension to larger kernels will be apparent to those of ordinary skill in the art. The n^{th} kernel may be written as
and where f_{i,j}^{n} are weight component values. The terms f_{i}^{n} will be referred to herein as ‘kernel weight vectors’. Other size kernels may be represented in a similar manner. Herein, lowercase bold type will be used to denote vector quantities, while uppercase bold type is used to denote matrix quantities. Each location in an image or feature map comprises a number of components or layers. A patch in the image or feature map is a collection of neighboring locations and is represented by a three-dimensional block of data. The output is obtained by applying the kernel to patches in the image or feature map. In this simple example, the n^{th} component of the output is obtained by applying the n^{th} kernel to patch (p, q). The output is
where the indices i_{p}^{n}=i_{p}+i_{n}, j_{q}^{n}=j_{q}+j_{n} denote coordinates of the elements in the patch, and where i_{n} and j_{n} are offsets from a reference position (i_{p}, j_{q}) for the patch. The vectors in the inner product on the right side of (4) are given by
Multiple patches and L features may be computed together as a general matrix multiplication written as
It is noted that, since v_{p,q}^{T}k_{n}=k_{n}^{T}v_{p,q}, the roles of k and v may be reversed in the description that follows. The columns of the actuation matrix V (rows of V^{T}) are referred to herein as ‘actuation vectors’, while the rows of matrix V (columns of V^{T}) are referred to herein as ‘feature component vectors’. Similarly, the columns of the kernel matrix K are referred to herein as ‘kernel weight vectors’ while the rows are referred to as ‘kernel component vectors’.
The computation has two parts: (i) rearrangement of the image matrix (or input feature map) A into the actuation matrix V^{T} and (ii) computation of the matrix product P=V^{T}K. In accordance with the present disclosure, the rearrangement is performed by a data transpose buffer and the matrix product is computed by a corresponding systolic array. It is noted that the data transpose buffer and systolic array work together and their structures are interdependent. The number of rows in the actuation matrix V^{T} and the number of kernels that can be computed in one complete operation are determined by the dimensions of the systolic array. When the input feature map or number of kernels is larger than can be computed by the hardware, the computation may be broken down into a number of smaller computations.
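The two parts of the computation can be illustrated in software. The following is a minimal sketch, not the disclosed hardware: it assumes a feature map stored as nested Python lists indexed [row][column][channel], a 2×2 kernel with a stride of one, and a particular flattening order within each patch. The function names (im2col_rows, kernel_matrix, matmul, direct_conv) are hypothetical and chosen for illustration only.

```python
def im2col_rows(feature_map, kh=2, kw=2):
    """Part (i): build the rows of V^T, one flattened patch per row."""
    rows = []
    for p in range(len(feature_map) - kh + 1):
        for q in range(len(feature_map[0]) - kw + 1):
            row = []
            for i in range(kh):
                for j in range(kw):
                    # All channels of one map location are kept together.
                    row.extend(feature_map[p + i][q + j])
            rows.append(row)
    return rows


def kernel_matrix(kernels):
    """Flatten each kernel in the same order and use it as a column of K."""
    columns = [[w for k_row in k for loc in k_row for w in loc] for k in kernels]
    return [list(r) for r in zip(*columns)]


def matmul(vt, k):
    """Part (ii): the matrix product P = V^T K."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*k)]
            for row in vt]


def direct_conv(feature_map, kernels, kh=2, kw=2):
    """Reference result computed patch by patch, without rearrangement."""
    out = []
    for p in range(len(feature_map) - kh + 1):
        for q in range(len(feature_map[0]) - kw + 1):
            out.append([
                sum(feature_map[p + i][q + j][c] * k[i][j][c]
                    for i in range(kh) for j in range(kw)
                    for c in range(len(k[i][j])))
                for k in kernels
            ])
    return out


# Illustrative data: a 3x3 map with two channels and two 2x2x2 kernels.
image = [[[r + c, r * c] for c in range(3)] for r in range(3)]
kernels = [
    [[[1, 0], [0, 1]], [[1, 1], [0, 0]]],
    [[[2, -1], [0, 0]], [[0, 3], [1, 1]]],
]
vt = im2col_rows(image)          # 4 patches, 8 values per patch
k_mat = kernel_matrix(kernels)   # 8 rows, one column per kernel
p_mat = matmul(vt, k_mat)        # one row per patch, one column per kernel
```

Each row of vt plays the role of an actuation vector and each column of k_mat the role of a kernel weight vector, so p_mat agrees with direct_conv(image, kernels).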
Matrix Product
Operation of the systolic array is described below with reference to
The systolic array 106 uses data pipelining. Matrix A, stored in buffer 202, is clocked into the array from the left and moves one cell to the right at each clock cycle, while matrix B, stored in buffer 204, is clocked in from above and moves one cell down the array at each clock cycle. Thus, at time t=1, the accumulator in cell (i, j) is initialized as S_{ij}^{1}=0, then performs the MAC operation S_{ij}^{t+1}=S_{ij}^{t}+a_{i,t-i}b_{t-j,j}, for t=1, 2, . . . , 6.
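The pipelined dataflow can be modelled cycle by cycle in software. The sketch below is an illustration under stated assumptions rather than the disclosed circuit: indices are zero-based, row i of A is delayed by i cycles and column j of B by j cycles before entering the array, and zeros are fed when no valid operand is available. With that skew, cell (i, j) sees operand index t−i−j at cycle t; this indexing is a choice made for the sketch. The function name systolic_matmul is hypothetical.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary, pipelined systolic array computing A x B."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]      # one accumulator per cell
    a_reg = [[0] * cols for _ in range(rows)]    # operands moving to the right
    b_reg = [[0] * cols for _ in range(rows)]    # operands moving downwards
    cycles = inner + rows + cols - 2             # cycles until the last cell finishes
    for t in range(cycles):
        # Shift A one cell to the right and feed a skewed element at the left edge.
        for i in range(rows):
            for j in range(cols - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = a[i][k] if 0 <= k < inner else 0
        # Shift B one cell down and feed a skewed element at the top edge.
        for j in range(cols):
            for i in range(rows - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = b[k][j] if 0 <= k < inner else 0
        # Every cell performs one multiply-accumulate on its local operands.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


a_demo = [[1, 2, 3], [4, 5, 6]]
b_demo = [[7, 8], [9, 10], [11, 12]]
product, n_cycles = systolic_matmul(a_demo, b_demo)
```

In this model each output element is produced entirely within one cell, and operands only move between neighboring registers.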
In a further broadcast approach, elements could be copied from cell to cell prior to the MAC operation. This approach may be implemented using simpler hardware, but requires many more clock cycles.
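For comparison, a broadcast dataflow can be sketched as follows. This is a simplified model under an assumption not spelled out in the text: in each cycle, one element of A is delivered to every cell of its row and one element of B to every cell of its column, so that no input skew is needed. The cell-to-cell copying variant mentioned above is not modelled. The function name broadcast_matmul is hypothetical.

```python
def broadcast_matmul(a, b):
    """Model a broadcast dataflow: one inner-dimension index per cycle."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for t in range(inner):
        a_col = [a[i][t] for i in range(rows)]   # value broadcast along row i
        b_row = b[t]                             # value broadcast down column j
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a_col[i] * b_row[j]
    return acc, inner                            # result and cycle count


bc_product, bc_cycles = broadcast_matmul([[1, 2, 3], [4, 5, 6]],
                                         [[7, 8], [9, 10], [11, 12]])
```

Under this assumption the number of accumulation cycles equals the inner dimension of the product.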
In each of the systolic arrays described above, the output has a stationary dataflow. That is, the computation for a given output is computed by a single processing element of the array. This is achieved by implementing the accumulator function inside the processing element, as discussed below with reference to
A systolic array may have any size and may be square or rectangular. For example, an array may comprise 32×32 cells or 256×256 cells arranged in a square. The transposing buffer may be sized to match the systolic array.
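When the actuation matrix has more rows, or there are more kernels, than the array has cells along the corresponding dimension, the product may be computed as a number of smaller products. The sketch below shows one possible tiling, assuming each pass handles at most array_rows patches and array_cols kernels; the name tiled_matmul and the pass ordering are illustrative assumptions.

```python
def tiled_matmul(vt, k, array_rows, array_cols):
    """Compute P = V^T K as a sequence of array-sized sub-products."""
    n_patches, n_kernels = len(vt), len(k[0])
    p = [[0] * n_kernels for _ in range(n_patches)]
    passes = 0
    for r0 in range(0, n_patches, array_rows):
        for c0 in range(0, n_kernels, array_cols):
            vt_tile = vt[r0:r0 + array_rows]                 # patches in this pass
            k_tile = [row[c0:c0 + array_cols] for row in k]  # kernels in this pass
            for i, v_row in enumerate(vt_tile):
                for j, k_col in enumerate(zip(*k_tile)):
                    p[r0 + i][c0 + j] = sum(x * y for x, y in zip(v_row, k_col))
            passes += 1
    return p, passes


vt_demo = [[i + j for j in range(3)] for i in range(5)]   # 5 patches, 3 values each
k_demo = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]                # 3 kernels as columns
p_demo, passes_demo = tiled_matmul(vt_demo, k_demo, 2, 2)
```

With five patches, three kernels and a 2×2 array, this ordering uses six passes.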
At the start of a computation of an element of the output feature map, partial sum of products register 514 is set to zero. Within a computation, multiplexer 520 is set to update the register with the new accumulated value 522. After completion of the computation, register 514 contains the accumulated sum of products of the cell while register 526 contains the accumulated sum of products of a neighboring cell. Multiplexer 520 may be set to enable accumulated sum of products, such as 524 received from register 526 of a neighboring cell, to be shifted out of the systolic array.
Operation of the cell may be controlled by control unit 528. The control unit may receive a control signal on line 530 from a register 532 of a neighboring cell. The control signal may be stored in register 534 for passing to a neighboring cell to the right. The control signal may be, for example, a ‘clear’ signal to cause the partial sum of products register 514 to be reset to zero, or a signal to control multiplexer 520.
Inputs may be tagged with an additional bit to indicate when the data is invalid. The tag bit may be used in a cell, for example, to indicate when a partial sum of products register should be reset to zero.
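The behavior of a single processing cell can be sketched as a small state machine. The model below is illustrative only: the class name MacCell, the argument names, and the left-to-right drain direction are assumptions, and the priority between the shift, clear and accumulate modes is a choice made for the sketch.

```python
class MacCell:
    """Behavioral model of one multiply-accumulate cell."""

    def __init__(self):
        self.acc = 0  # partial sum of products register

    def step(self, a, b, valid=True, clear=False, shift_in=None):
        """Advance one clock cycle; return the value held before the clock edge."""
        previous = self.acc
        if shift_in is not None:
            self.acc = shift_in      # multiplexer selects the neighbor's sum
        elif clear or not valid:
            self.acc = 0             # control signal or invalid tag resets the sum
        else:
            self.acc += a * b        # normal multiply-accumulate
        return previous


def drain_row(cells):
    """Shift the accumulated sums of a row of cells out of one edge."""
    outputs = []
    for _ in range(len(cells)):
        carry = 0                    # value entering the first cell
        for cell in cells:
            carry = cell.step(0, 0, shift_in=carry)
        outputs.append(carry)        # value leaving the last cell
    return outputs


cell = MacCell()
cell.step(2, 3)
cell.step(4, 5)
sum_before_reset = cell.acc          # 2*3 + 4*5
cell.step(1, 1, valid=False)         # invalid data resets the accumulator

row = [MacCell() for _ in range(3)]
for c, (a, b) in zip(row, [(1, 5), (1, 7), (3, 3)]):
    c.step(a, b)
drained = drain_row(row)             # sums leave in edge-first order
```

In this model the sum held by the cell nearest the output edge appears first, followed by the sums of cells further away.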
Data Rearrangement
As discussed above, the convolution computation has two parts: (i) rearrangement of the image matrix (or input feature map) A into the actuation matrix V^{T} and (ii) computation of the matrix product P=V^{T}K. The systolic array described above performs the matrix product. The description below describes various embodiments of the rearrangement of an image or other feature map.
The kernels are applied, in turn, to blocks of data called patches. In some applications, the patches overlap one another in the input feature map. For example, for four overlapping patches along a row, the actuation matrix is
Each row of the matrix corresponds to one patch. In equation (8), the indices denote offsets from a common patch reference position. It is also noted that each entry in the matrix is a component row vector. For example, with R, G, B components the matrix has 4 rows (one for each patch) and 12 columns (3 components for each of the 4 map locations). Each column is referred to herein as a feature component vector. Since the matrix is fed into the systolic array one column at a time (i.e. one feature component vector at a time), the result can be obtained by consecutively feeding in the matrices
Again, the rows of V^{T} are referred to herein as ‘actuation vectors’, while the columns of V^{T} are referred to herein as ‘feature component vectors’. In equation (9), it is noted that V_{2}^{T} may be obtained from V_{1}^{T} by shifting out the first row and shifting in the new last row. Thus, the matrices may be conveniently implemented as a two-dimensional array of registers that allow register-to-register shifting in both dimensions. This transposing buffer is filled row-by-row by shifting in image data and then emptied column-by-column by shifting out into the systolic array. The accumulators of the systolic array are reset after all columns of both V_{1}^{T} and V_{2}^{T} have been loaded and processed. As mentioned above, the roles of k and v may be reversed. However, in either case, the feature components are stored along one dimension of the buffer while image data (or other input feature map data) is loaded along the other dimension.
Much of the data in matrix V_{1}^{T} is reused in matrix V_{2}^{T}. This provides improved efficiency of operation and reduces the amount of data access/movement needed.
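The reuse between the two staged matrices can be illustrated with a sliding window of rows. Because equations (8) and (9) are not reproduced in this text, the layout below is an assumption: each row holds the channel values of the two vertically adjacent locations in one column of the feature map, and the second matrix is the first shifted by one row. The function names column_vector and staged_matrices are hypothetical.

```python
from collections import deque


def column_vector(feature_map, col, kernel_height=2):
    """Channel values of the locations in one map column, channels first."""
    return [v for r in range(kernel_height) for v in feature_map[r][col]]


def staged_matrices(feature_map, n_patches=4, kernel_width=2):
    """Build the consecutive input matrices by shifting one row at a time."""
    window = deque(column_vector(feature_map, c) for c in range(n_patches))
    matrices = [[list(r) for r in window]]
    loads = n_patches                       # rows fetched so far
    for step in range(1, kernel_width):
        window.popleft()                    # shift out the first row
        window.append(column_vector(feature_map, n_patches + step - 1))
        loads += 1                          # only one new row is fetched
        matrices.append([list(r) for r in window])
    return matrices, loads


# Illustrative 2-row, 5-column map with three channels per location.
fmap = [[[100 * r + 10 * c + ch for ch in range(3)] for c in range(5)]
        for r in range(2)]
(v1, v2), rows_loaded = staged_matrices(fmap)
combined = [r1 + r2 for r1, r2 in zip(v1, v2)]   # 4 patches, 12 columns
```

Under this layout three of the four rows of the second matrix are already present in the first, so five rows are fetched in total rather than eight.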
The second input matrix 604 is obtained by shifting the rows upwards and inserting a new row at the bottom of the matrix. A single data staging operation is required, where a vector of z-first data is loaded into a first-in, first-out (FIFO) for each row. As will be discussed in more detail below with reference to
The top row of the combined matrix corresponds to the first patch 606 in the input feature map 608 (before rearrangement), while the bottom row corresponds to the patch 610 in the input feature map 608. In operation, the matrix 602 is clocked into one edge of systolic array 612 one column at a time, starting with column 614. The matrix of kernels or filter weights 616 is clocked into systolic array 612 along a second edge. Each column of array 616 corresponds to a set of filter weights. For example, the leftmost column contains filter weights from filter 618 and the next column contains filter weights from filter 620.
Each cell of systolic array 612 accumulates one element of output feature map 622. For example, first column 624 provides a first row of the first component of output feature map 622, while the last column 626 provides the first row of the last component of output feature map 622.
With non-overlapping patches, the first four patches along a row of the image are computed from the actuation matrix
While there is no data reuse, the rearranged data may be efficiently loaded from a transposing buffer. The buffer may hold the complete actuation matrix, or sequential submatrices such as
The data rearrangement or staging required for the input feature map in the dataflow described above can be implemented as a simple transposing buffer structure. In particular, a transposing FIFO may be used. Unlike prior approaches, it is not necessary to use a multi-ported register file structure and address generators.
It is noted that the terms ‘row’ and ‘column’ may be interchanged in other embodiments. In general, data is loaded along a first dimension and output along a second dimension, perpendicular to the first dimension.
When patches are overlapping, transpose buffer 700 may be operated as a FIFO buffer. One or more new data rows are shifted into row 702, and each of the data columns is shifted out from row 704. Each row may be implemented as a circular buffer, with columns output from the left-hand edge of buffer 700 reinserted at the right-hand edge, as depicted by arrow 706. When the patches are non-overlapping, all of the data may be shifted out and new data shifted in after a computation is completed.
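A transposing buffer of this kind can be modelled as a two-dimensional array that shifts in one dimension on load and in the other on output. The sketch below is a behavioral illustration, assuming rows are loaded at the bottom edge and columns leave from the left edge; the class name TransposingBuffer and its method names are hypothetical.

```python
class TransposingBuffer:
    """Load whole rows along one dimension, read whole columns along the other."""

    def __init__(self, n_rows, n_cols):
        self.n_cols = n_cols
        self.rows = [[0] * n_cols for _ in range(n_rows)]

    def shift_in_row(self, vector):
        """Shift stored rows up by one and load a new row at the bottom edge."""
        if len(vector) != self.n_cols:
            raise ValueError("row length must match the buffer width")
        self.rows = self.rows[1:] + [list(vector)]

    def shift_out_column(self, recirculate=True):
        """Shift data left by one and return the column leaving the left edge."""
        column = [row[0] for row in self.rows]
        for row, value in zip(self.rows, column):
            del row[0]
            # Circular operation reinserts the value at the right-hand edge.
            row.append(value if recirculate else 0)
        return column


buf = TransposingBuffer(3, 2)
for vec in ([1, 2], [3, 4], [5, 6]):
    buf.shift_in_row(vec)
first_pass = [buf.shift_out_column(), buf.shift_out_column()]
buf.shift_in_row([7, 8])              # overlapping patches: one new row only
second_pass = [buf.shift_out_column(), buf.shift_out_column()]
```

After a full pass with recirculation the stored rows are back in their original positions, so only the newly shifted-in row differs between consecutive passes.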
The transposing buffer 700 may be implemented in hardware using flip-flops or other suitable electronic elements in an integrated circuit.
The weight matrix may be stored and operated in a similar manner, with weights for each filter loaded along one dimension and output to the systolic array along the other dimension.
Method of Operation
Other elements may be included in the operation of the accelerator, including checking valid bits, responding to commands, resetting accumulated values, etc. These are omitted from
While the present disclosure is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.
As used herein, the term processor, controller or the like may encompass a processor, controller, microcontroller unit (MCU), microprocessor, and other suitable control elements. It will be appreciated that embodiments of the disclosure described herein may be implemented in an integrated circuit. Some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used. Thus, methods and means for these functions have been described herein. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation.
A convolutional neural network consistent with the architecture disclosed above may be described by instructions of a Hardware Description Language. These instructions may be stored on a non-transitory computer readable medium. This enables distribution of the instructions. The instructions may be combined with instructions that describe other components of a data processing system to enable design and manufacture of hardware of the system. The disclosed architecture may also be described by a netlist representation that, again, may be stored on a non-transitory computer readable medium.
Accordingly, some aspects and features of the disclosed embodiments are set out in the following numbered items:
1. A circuit for performing convolutional neural network computations for a neural network, the circuit comprising: a transposing buffer configured to receive actuation feature vectors along a first dimension of the transposing buffer and to output feature component vectors along a second dimension of the transposing buffer; a weight buffer configured to store kernel weight vectors along a first dimension of the weight buffer and further configured to output kernel component vectors along a second dimension of the weight buffer; and a systolic array configured to receive the kernel weight vectors along a first dimension of the systolic array and to receive the feature component vectors along a second dimension of the systolic array, where the systolic array comprises an array of multiply and accumulate (MAC) processing cells.
2. The circuit of item 1, where the feature component vectors and the kernel component vectors are pipelined into the systolic array.
3. The circuit of item 1, where the feature component vectors and the kernel component vectors are broadcast into the systolic array.
4. The circuit of item 1, where the actuation feature vectors are shifted into the transposing buffer along the first dimension of the transposing buffer and output feature component vectors are shifted out of the transposing buffer along the second dimension.
5. The circuit of item 1, where the systolic array is further configured to pass the kernel weight vectors to neighboring processing cells in the second dimension of the systolic array and to pass the feature component vectors to neighboring processing cells in the first dimension of the systolic array.
6. The circuit of item 1, where the systolic array is further configured to output values accumulated in the processing cells, where each processing cell is associated with an output value.
7. The circuit of item 1, further comprising an output layer configured to receive accumulated values from the MAC processing cells of the systolic array and to perform at least one non-linear, pooling or normalization operation on the received accumulated values.
8. The circuit of item 1, where the values of the feature component vectors or the kernel component vectors are tagged with validity bits, indicative of data validity, and where an accumulator of a MAC processing cell is set to zero when data tagged as invalid is received.
9. The circuit of item 1, further comprising a control line coupled to the MAC processing cells, where an accumulator of a MAC processing cell is set to zero in response to a signal on the control line.
10. The circuit of item 1, further comprising an interface to a host data processing system, where the circuit is configured to receive data and commands from the host data processing system via the interface.
11. A non-transitory computer readable medium containing instructions of a hardware description language that define the circuit of item 1.
12. A non-transitory computer readable medium comprising a netlist representative of the circuit of item 1.
13. A method for performing convolution neural network computations for a neural network, the method comprising: loading input feature vectors into a transposing buffer along a first dimension of the transposing buffer; loading kernel weight vectors along a first dimension of a weight buffer; for each of a plurality of processing cycles: outputting kernel component vectors from a second dimension of the weight buffer to a first dimension of a systolic array, where the second dimension is perpendicular to the first dimension; outputting feature component vectors from a second dimension of the transposing buffer to a second dimension of the systolic array, where the second dimension is perpendicular to the first dimension and where the first dimension is perpendicular to the second dimension; and in each cell of the systolic array, accumulating a product of a feature component and a kernel component; and outputting accumulated products of the cells of the systolic array to an output layer of the neural network.
14. The method of item 13, further comprising, for each of the plurality of processing cycles: passing the kernel weight vectors to neighboring cells in the second dimension of the systolic array; and passing the feature component vectors to neighboring cells in the first dimension of the systolic array.
15. The method of item 13, further comprising, for each of the plurality of processing cycles: broadcasting the kernel weight vectors to cells in the second dimension of the systolic array; and broadcasting the feature component vectors to cells in the first dimension of the systolic array.
16. The method of item 13, where loading input feature vectors into the transposing buffer along a first dimension of the transposing buffer comprises: shifting data stored in the transposing buffer in the second dimension; and loading an input feature vector along an edge of the transposing buffer in the first dimension.
17. The method of item 13, where outputting feature component vectors from the second dimension of the transposing buffer to the second dimension of the systolic array comprises: shifting data stored in the transposing buffer in the first dimension; and outputting a feature component vector along an edge of the transposing buffer in the second dimension.
18. The method of item 13, where a kernel weight vector is applied to a patch of pixels in an image, and where an input feature vector comprises color components of pixels in the patch; and a feature component vector comprises a color component of a corresponding pixel in each of a plurality of patches.
19. The method of item 13, where outputting accumulated sum of products of the cells of the systolic array to an output layer of the neural network comprises passing accumulated sum of products between neighboring cells of the systolic array to an edge of the systolic array.
The various representative embodiments, which have been described in detail herein, have been presented by way of example and not by way of limitation. It will be understood by those skilled in the art that various changes may be made in the form and details of the described embodiments resulting in equivalent embodiments that remain within the scope of the appended claims.