ARITHMETIC PROCESSING DEVICE, LEARNING PROGRAM, AND LEARNING METHOD

Abstract
An arithmetic processing device includes an arithmetic circuit; a register storing operation output data; a statistics acquisition circuit generating, from subject data being either the operation output data or normalization subject data, a bit pattern indicating a position of a leftmost set bit for positive number or a position of a leftmost zero bit for negative number of the subject data, the leftmost bit being a bit different from a sign bit; and a statistics aggregation circuit generating either positive or negative statistical information, or both positive and negative statistical information, by separately adding up a first number at respective bit positions of the leftmost set bit indicated by the bit pattern of each of a plurality of subject data having a positive sign bit and a second number at respective bit positions of the leftmost zero bit indicated by the bit pattern of each of a plurality of subject data having a negative sign bit.
12 Claims
 1. An arithmetic processing device comprising:
an arithmetic circuit;
a register which stores operation output data that is output by the arithmetic circuit;
a statistics acquisition circuit which generates, from subject data that is either the operation output data or normalization subject data, a bit pattern indicating a position of a leftmost set bit for positive number or a position of a leftmost zero bit for negative number of the subject data; and
a statistics aggregation circuit which generates either positive statistical information or negative statistical information, or both positive and negative statistical information, by separately adding up a first number at respective bit positions of the leftmost set bit indicated by the bit pattern of each of a plurality of subject data having a positive sign bit and a second number at respective bit positions of leftmost zero bit indicated by the bit pattern of each of a plurality of subject data having a negative sign bit.
 11. A non-transitory computer-readable storage medium storing therein a learning program for causing a computer to execute a learning process in a deep neural network, the learning process comprising:
reading, from a memory, statistical data of a histogram having, as a number of respective bins, a number at respective bit positions of a leftmost set bit for positive number or a position of a leftmost zero bit for negative number within subject data that is either a plurality of operation output data output by an arithmetic circuit or a plurality of normalization subject data, calculating a mean value and a variance value of the subject data on the basis of the number of the respective bins, and approximate values each corresponding to the position of the leftmost set bit for positive number or a position of a leftmost zero bit for negative number of the subject data, and performing a normalization operation on the subject data on the basis of the mean value and the variance value.
 12. A learning method for causing a processor to execute a learning process in a deep neural network, the learning process comprising:
reading, from a memory, statistical data of a histogram having, as a number of respective bins, a number at respective bit positions of a leftmost set bit for positive number or a position of a leftmost zero bit for negative number within subject data that is either a plurality of operation output data output by an arithmetic circuit or normalization subject data, calculating a mean value and a variance value of the subject data on the basis of the number of the respective bins and approximate values each corresponding to the position of the leftmost set bit for positive number or a position of a leftmost zero bit for negative number of the subject data, and performing a normalization operation on the subject data on the basis of the mean value and the variance value.
1 Specification
This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2018-200993, filed on Oct. 25, 2018, the entire contents of which are incorporated herein by reference.
The present invention relates to an arithmetic processing device, a learning program, and a learning method.
Deep learning (abbreviated to DL hereafter) is machine learning using a multilayer neural network. A deep neural network (abbreviated to DNN hereafter) is a network on which an input layer, a plurality of hidden layers, and an output layer are arranged sequentially. Each layer carries a single node or a plurality of nodes, and each node carries a value. The nodes on a certain layer and the nodes of the next layer are joined by edges, and each edge carries a variable (a parameter) known as a weight or a bias.
In a DNN, the values of the nodes on the respective layers are determined by executing predetermined arithmetic based on the values of the nodes on the preceding layer, the weights of the edges, and so on. When input data are input into the nodes of the input layer, the values of the nodes on the next layer are determined by a first predetermined arithmetic, whereupon the values of the nodes on the layer after that are determined by a second predetermined arithmetic using the data determined by the first predetermined arithmetic as input. The values of the nodes on the output layer, i.e. the final layer, serve as output data in relation to the input data.
In a DNN, batch normalization is performed: a normalization layer, which normalizes the output data of the preceding layer on the basis of the mean and the variance thereof, is inserted between the current layer and the preceding layer, and the output data are normalized in learning processing units (minibatch units). By inserting a normalization layer, bias in the distribution of the output data is corrected, and as a result, learning over the entire DNN proceeds efficiently. For example, in a DNN in which image data are used as the input data, a normalization layer is often provided after a convolution layer on which a convolution operation is performed on the image data.
Further, in a DNN, the input data are also normalized. In this case, a normalization layer is provided immediately after the input layer, the input data are normalized in learning units, and learning is executed on the normalized input data. In so doing, bias in the distribution of the input data is corrected, and as a result, learning over the entire DNN proceeds efficiently.
DNNs are disclosed in Japanese Laid-open Patent Publication No. 2017-120609, Japanese Laid-open Patent Publication No. H07-121656, and Japanese Laid-open Patent Publication No. 2018-124681.
In recent DNNs, the amount of learning data tends to increase in order to improve the recognition performance or the accuracy of the DNN. As a result of this increase, the calculation load on the DNN increases, leading to an increase in learning time and an increase in the load on a memory of a computer that executes operations in the DNN.
This problem applies similarly to the operation load of the normalization layer. For example, in a divisive normalization operation, the mean of the data values is determined, the variance of the data values is determined on the basis of the mean, and a normalization operation based on the mean and the variance is performed on the data values. When the number of minibatches increases in accordance with an increase in learning data, the resulting increase in the calculation load of the normalization operation leads to an increase in learning time and so on.
One aspect of the present embodiment is an arithmetic processing device including an arithmetic circuit; a register which stores operation output data that is output by the arithmetic circuit; a statistics acquisition circuit which generates, from subject data that is either the operation output data or normalization subject data, a bit pattern indicating a position of a leftmost set bit for positive number or a position of a leftmost zero bit for negative number of the subject data; and a statistics aggregation circuit which generates either positive statistical information or negative statistical information, or both positive and negative statistical information, by separately adding up a first number at respective bit positions of the leftmost set bit indicated by the bit pattern of each of a plurality of subject data having a positive sign bit and a second number at respective bit positions of leftmost zero bit indicated by the bit pattern of each of a plurality of subject data having a negative sign bit.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
The convolution layer 11 performs a multiply-and-accumulate operation, for example multiplying internode weights or the like by pixel data of an image input into the plurality of nodes in the input layer 10 and accumulating the multiplied values, and outputs pixel data of an output image having the features of the image to each of a plurality of nodes in the convolution layer 11.
The batch normalization layer 12 normalizes the pixel data of the output image output to the plurality of nodes in the convolution layer 11 in order to suppress distribution bias, for example. The activation function layer 13 then inputs the normalized pixel data into an activation function and generates corresponding output. The batch normalization layer 16 performs a similar normalization operation as well.
As described above, by normalizing the distribution of the pixel data of the output image, bias in the distribution of the pixel data is corrected, and as a result, learning over the entire DNN proceeds efficiently.
As illustrated in FIG. 2, as preparation, the plurality of training data are rearranged (S1) and the plurality of rearranged training data are divided into a plurality of minibatches (S2). Then, in the learning processing, forward propagation processing S4, error evaluation S5, backpropagation processing S6, and parameter update processing S7 are executed repeatedly on the plurality of divided minibatches (NO in S3). When processing of all of the minibatches is complete (YES in S3), a learning rate of the learning processing is updated (S8), whereupon the processing of S1 to S7 is executed repeatedly on the same training data until a specified number of times is reached (NO in S9).
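The flow of S1 to S9 can be sketched in Python as follows. This is a minimal sketch, not the embodiment itself: it assumes a hypothetical one-parameter model y = w * x in place of a DNN, and the names `train`, `batch_size`, `lr`, and `decay` are illustrative and do not appear in the specification.

```python
import random

def train(data, epochs=50, batch_size=4, lr=0.5, decay=0.95, seed=0):
    """Toy learning loop for y = w * x that follows steps S1 to S9."""
    rng = random.Random(seed)
    data = list(data)
    w, loss = 0.0, 0.0
    for _ in range(epochs):                       # S9: repeat a specified number of times
        rng.shuffle(data)                         # S1: rearrange the training data
        batches = [data[i:i + batch_size]         # S2: divide into minibatches
                   for i in range(0, len(data), batch_size)]
        for batch in batches:                     # S3: until all minibatches are processed
            outputs = [w * x for x, _ in batch]   # S4: forward propagation
            errors = [o - t for o, (_, t) in zip(outputs, batch)]
            loss = sum(e * e for e in errors)     # S5: sum of squares of the error
            grad = sum(2 * e * x for e, (x, _) in zip(errors, batch))  # S6: backpropagation
            w -= lr * grad / len(batch)           # S7: gradient-descent parameter update
        lr *= decay                               # S8: update the learning rate
    return w, loss

# Hypothetical training data generated from the target relation y = 3x.
samples = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
w, loss = train(samples)
```

With these assumed settings the learned weight approaches 3, the slope used to generate the samples.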
Further, rather than repeating the processing of S1 to S7 on the same learning data until the specified number of times is reached, the learning processing may instead be terminated when an evaluation value of the learning result, for example the sum of squares of the difference (the error) between the output data and the correct data, converges within a fixed range.
In the forward propagation processing S4, operations are executed on each layer in order from the input side to the output side of the DNN. To illustrate this using
Next, in the error evaluation processing S5, the sum of squares of the difference between the output data of the DNN and the correct data is calculated as an error. The error is then backpropagated from the output side to the input side of the DNN (S6). In the parameter update processing S7, the weights and so on of each layer are optimized in order to minimize the backpropagated error of each layer. Optimization of the weights and so on is implemented by varying the weights and so on using a gradient descent method.
In the DNN, the plurality of layers may be formed from hardware circuits so that the operations of the respective layers are executed by the hardware circuits. Alternatively, the DNN may be formed by causing a processor to execute a program for executing the operations of the respective layers of the DNN.
According to the convolution arithmetic expression depicted in
Input images are input into the input layer of the DNN in a number corresponding to the number of channels C, and as a result of the operation performed on the convolution layer, output images are output in a number corresponding to the number of filters d and the number of biases d. Similarly, on a convolution layer provided on an intermediate layer of the DNN, images are input into the preceding layer in a number corresponding to the number of channels C, and as a result of the operation performed on the convolution layer, output images are output in a number corresponding to the number of filters d and the number of biases d.
In the DNN, the normalization layer is a layer for normalizing the plurality of output data from the layer prior to the normalization layer on the basis of the mean and the variance thereof. In the normalization of the example depicted in
In
Hence, during batch normalization, a large number of operations are performed, leading to an increase in the overall number of learning operations. For example, when the number of output data samples is M, addition (including subtraction) is performed M times and division is performed once in the operation for determining the mean. Further, in the operation for determining the variance, addition is performed 2M times, multiplication is performed M times, and division is performed once. Then, to normalize the M samples of output data on the basis of the mean and the variance, subtraction and division are each performed M times, while square root determination is performed once.
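For reference, the conventional normalization whose operation count is described above can be sketched as follows. This is a minimal sketch; the function name `normalize_conventional` and the small constant `eps` added to the variance are assumptions for illustration.

```python
import math

def normalize_conventional(xs, eps=1e-5):
    """Normalize M samples using their exact mean and variance."""
    m = len(xs)
    mean = sum(xs) / m                           # M additions, one division
    var = sum((x - mean) ** 2 for x in xs) / m   # 2M additions, M multiplications, one division
    std = math.sqrt(var + eps)                   # one square root
    normalized = [(x - mean) / std for x in xs]  # M subtractions, M divisions
    return normalized, mean, var

normalized, mean, var = normalize_conventional([1.0, 2.0, 3.0, 4.0])
```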
Further, when the image size is H×H, the number of channels is D, and the number of images in the batch is K, the total number of output data samples to be normalized is H*H*D*K, leading to a dramatic increase in the number of the operations described above.
Note that normalization processing may be performed on the input data of the learning data as well as on the output data of the convolution layer of the DNN and so on. In this case, the total number of input data samples is H*H*C*K, which is a number acquired by multiplying the number of pixels H*H of a number of input images corresponding to the number of channels C of the training data by the number of training data samples K.
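The growth in the operation count can be illustrated with a short calculation. This is a sketch under assumed values: the image size, channel count, and batch size below are hypothetical, and the helper names are not part of the specification. The per-step counts are those given above for M samples.

```python
def total_samples(h, d, k):
    """Number of samples to normalize: H*H pixels, D channels, K images."""
    return h * h * d * k

def normalization_op_counts(m):
    """Operation counts for normalizing M samples, as listed in the text."""
    return {
        "mean": {"add": m, "div": 1},
        "variance": {"add": 2 * m, "mul": m, "div": 1},
        "normalize": {"sub": m, "div": m, "sqrt": 1},
    }

# Hypothetical layer: 32x32 output images, 64 channels, 128 images per batch.
m = total_samples(h=32, d=64, k=128)
counts = normalization_op_counts(m)
total_ops = sum(sum(step.values()) for step in counts.values())
```

The total is 6M + 3 operations, so it scales linearly with H*H*D*K.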
In this embodiment, either operation output data generated by an arithmetic unit or normalization subject data such as input data will be referred to as subject data. In this embodiment, statistical information about the subject data is acquired in order to simplify the normalization operation.
An embodiment described below relates to a method for reducing the number of operations performed during normalization.
In the operation S14 performed on the convolution layer and the batch normalization layer, a convolution operation for determining the value (the output data) of each pixel of all of the output images in one minibatch is repeated a number of times corresponding to the number of output data samples in one minibatch (S141). Here, the number of output data (samples) in one minibatch is the number of pixels in all of the output images generated from the input images of the plurality of training data in one minibatch.
First, the scalar arithmetic unit provided in the DL execution processor executes a convolution operation between an input data sample, which is a pixel value of an input image, and the weight of a filter using a bias, thereby calculating the value (the operation output data) of one pixel of the output image (S142). Next, the DL execution processor acquires statistical information relating to positive operation output data and negative operation output data and adds the acquired positive and negative statistical information respectively to cumulative addition values of acquired positive and negative statistical information (S143). The convolution operation S142 and the operation S143 for acquiring and cumulatively adding the statistical information described above are performed by hardware such as the scalar arithmetic unit of the DL execution processor on the basis of a DNN operation program.
Once the processing of S142, S143 has been performed a number of times corresponding to the number of output data (samples) in one minibatch, the DL execution processor replaces the respective values of the operation output data with approximate values of respective bins of the statistical information, executes a normalization operation, and outputs the normalized output data (S144). Since the values of the operation output data belonging to the same bin are replaced with an approximate value of the corresponding bin, the mean and the variance of the output data, which are used during normalization, can be calculated easily on the basis of the approximate values and the number of data samples belonging to the bins. The processing of S144 constitutes the operation performed on the batch normalization layer.
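The calculation of the mean and the variance from the bins can be sketched as follows. This is a minimal sketch under stated assumptions: bin i is taken to have the approximate value 2^(e+i) for positive data and −2^(e+i) for negative data, data with no non-sign bit are ignored, and the names `stats_from_histogram`, `s_pos`, `s_neg`, and `e` are illustrative. The exact arithmetic expressions of the embodiment may differ.

```python
def stats_from_histogram(s_pos, s_neg, e):
    """Mean and variance from per-bin counts of positive and negative samples.

    s_pos[i], s_neg[i]: number of positive / negative samples in bin i.
    e: exponent of the lowest bin (assumed), so bin i is approximated by 2**(e + i).
    """
    total = sum(s_pos) + sum(s_neg)
    if total == 0:
        raise ValueError("histogram is empty")
    approx = [2.0 ** (e + i) for i in range(len(s_pos))]
    # Every sample in a bin is replaced by the approximate value of that bin.
    mean = sum((p - n) * a for p, n, a in zip(s_pos, s_neg, approx)) / total
    var = sum(p * (a - mean) ** 2 + n * (-a - mean) ** 2
              for p, n, a in zip(s_pos, s_neg, approx)) / total
    return mean, var

# 1647 positive samples in bin 3, each approximated by 2**3 = 8.
mean3, var3 = stats_from_histogram([0, 0, 0, 1647], [0, 0, 0, 0], e=0)
# Four samples near +8 and four near -8.
mean_sym, var_sym = stats_from_histogram([0, 0, 0, 4], [0, 0, 0, 4], e=0)
```

The loops run over the number of bins rather than the number of samples, which is where the reduction in operations comes from.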
For example, 20 (−8 to +11), which is the number of bins on the horizontal axis, corresponds to 20 bits of binary operation output data. Data samples within “0 0000 0000 1000.0000 0000 to 0 0000 0000 1111.1111 1111”, among operation output data (a fixed-point number) to which a sign bit has been added, are included in bin number “3” on the horizontal axis. In this case, the position of the leftmost set bit for positive number or the leftmost zero bit for negative number of the operation output data corresponds to “3”. For example, an approximate value of the operation output data in bin number “3” is 2^{3} (= 8 in base 10), i.e., the minimum value of “0 0000 0000 1000.0000 0000 to 0 0000 0000 1111.1111 1111”.
The leftmost set bit for positive number or the leftmost zero bit for negative number may be called the leftmost non-sign bit. Here, a non-sign bit is a bit whose value differs from that of the sign bit, which is 0 (positive) or 1 (negative). In a positive number, the sign bit is 0, and therefore the non-sign bit is 1. In a negative number, the sign bit is 1, and therefore the non-sign bit is 0.
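Detection of the leftmost non-sign bit can be sketched in software as follows. This is a minimal sketch, assuming a two's-complement fixed-point value held in a Python integer with 8 fractional bits, as in the bit strings above; the function name `leftmost_nonsign_bit` and the parameter `frac_bits` are illustrative, and the embodiment performs this detection in hardware.

```python
def leftmost_nonsign_bit(raw, frac_bits=8):
    """Return the bin number of a two's-complement fixed-point value, or None.

    raw: signed integer holding the fixed-point bit pattern.
    frac_bits: number of bits to the right of the binary point (assumed).
    """
    # For a negative number, inverting the bits turns the leftmost zero bit
    # into the leftmost set bit, so both signs share one search.
    bits = raw if raw >= 0 else ~raw
    if bits == 0:
        return None  # 0 or -1: every bit equals the sign bit
    return bits.bit_length() - 1 - frac_bits

low = leftmost_nonsign_bit(0b1000_0000_0000)    # 8.0, lower end of bin 3
high = leftmost_nonsign_bit(0b1111_1111_1111)   # just under 16.0, upper end of bin 3
neg = leftmost_nonsign_bit(-9 * 256)            # -9.0
smallest = leftmost_nonsign_bit(1)              # least significant bit only
```

Both ends of the range quoted above fall in bin 3, and the least significant bit alone falls in bin −8, the lowest bin on the horizontal axis.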
When the operation output data are expressed as a fixed-point number, each of the bins on the horizontal axis of the histogram corresponds to a position of the leftmost set bit for positive number or the leftmost zero bit for negative number. In this case, the bin to which each operation output data sample belongs can easily be detected simply by detecting the leftmost set bit for positive number or the leftmost zero bit for negative number of the operation output data sample. When the operation output data are expressed as a floating-point number, on the other hand, each of the bins on the horizontal axis of the histogram corresponds to the value (the number of digits) of the significand. In this case also, the bin to which each operation output data sample belongs can easily be detected.
In this embodiment, the number of samples (or data) in each bin on the histogram, corresponding to the digits of the output data, as illustrated in
When the output data samples belonging to bin “3” of the histogram depicted in
Σ(2^{3} ≤ X < 2^{4}) = 1647 × 2^{3}
The histogram (the numbers of data (or samples) belonging to the bins) S_{p}[N] of the positive subject data denotes the number of data (or samples) belonging to
2^{e+i}≤X<2^{e+i+1},
Further, the histogram (the numbers of data (or samples) belonging to the bins) S_{n}[N] of the negative subject data denotes the number of data samples belonging to
−2^{e+i+1} < X ≤ −2^{e+i},
Next, the processor determines the mean of the minibatch of data (S21). An arithmetic expression for determining the mean μ is illustrated in S21 of
The processor also determines the variance σ^{2} of the minibatch of data (S22). An arithmetic expression for determining the variance is illustrated in S22 of
The processor then normalizes the subject data x_{i} on the basis of the mean μ and the variance σ^{2} using the arithmetic expression illustrated in S23 of
Hence, in
In
As illustrated in
The host machine 30 executes a program acquired by expanding a program stored in the auxiliary storage device 35 to the main memory 33. As illustrated in the figure, a DL execution program and training data are stored in the auxiliary storage device 35. The processor 31 transmits the DL execution program and the training data to the DL execution machine so as to cause the DL execution machine to execute the program.
The high-speed input/output interface 32 is an interface such as PCI Express for connecting the processor 31 to hardware of the DL execution machine. The main memory 33 is an SDRAM, for example, that stores a program executed by the processor and data.
The internal bus 34 connects a peripheral device having a lower speed than the processor to the processor in order to relay communication therebetween. The low-speed input/output interface 36 is an interface such as USB, for example, for establishing a connection with a keyboard or a mouse of the user interface or establishing a connection with the Internet.
The DL execution processor 43 executes deep learning processing by executing a program on the basis of the DL execution program and data transmitted from the host machine. The high-speed input/output interface 41 is PCI Express, for example, for relaying communication with the host machine 30.
The control unit 42 stores the program and data transmitted from the host machine in the memory 45 and, in response to an instruction from the host machine, instructs the DL execution processor to execute the program. The memory access controller 44 controls processing for accessing the memory 45 in response to an access request from the control unit 42 and an access request from the DL execution processor 43.
The internal memory 45 stores the program executed by the DL execution processor, processing subject data, processing result data, and so on. The internal memory 45 is an SDRAM, a high-speed GDDR5, a broadband HBM2, or the like, for example.
As illustrated in
In response to the transmissions, the DL execution machine 40 stores the input data and the execution program in the internal memory 45, and in response to the program execution instruction, the DL execution machine 40 executes the execution program (the learning program) on the input data stored in the memory 45 (S40). In the meantime, the host machine 30 waits for the DL execution machine to finish executing the learning program (S33).
After completing execution of the deep learning program, the DL execution machine 40 transmits a notification of the completion of program execution to the host machine 30 (S41) and transmits the output data to the host machine 30 (S42). When the output data are output data from the DNN, the host machine 30 executes processing for optimizing the parameters (the weights and so on) of the DNN in order to reduce the error between the output data and the correct data. Alternatively, in a case where the DL execution machine 40 executes the processing for optimizing the parameters of the DNN so that the output data transmitted by the DL execution machine include the optimized DNN parameters (weights and so on), the host machine 30 stores the optimized parameters.
Further, an instruction memory 45_1 and a data memory 45_2 are connected to the DL execution processor 43 via the memory access controller (MAC) 44. The MAC 44 includes an instruction MAC 44_1 and a data MAC 44_2.
The instruction control unit INST_CON includes a program counter PC, an instruction decoder DEC, and so on, for example. The instruction control unit fetches an instruction from the instruction memory 45_1 on the basis of an address in the program counter PC, whereupon the instruction decoder DEC decodes the fetched instruction and issues the decoded instruction to an arithmetic unit.
The scalar arithmetic unit SC_AR_UNIT includes a group formed from an integer arithmetic unit INT, a data converter D_CNV, and a statistical information acquisition device ST_AC. The data converter converts fixed-point number output data output by the integer arithmetic unit INT to a floating-point number. The scalar arithmetic unit SC_AR_UNIT executes an operation using scalar registers SR0-SR31 in a scalar register file SC_REG_FL and a scalar accumulate register SC_ACC. For example, the integer arithmetic unit INT performs a calculation on the input data stored in one of the scalar registers SR0-SR31 and stores the resulting output data in a different register. Further, when executing a multiply-and-accumulate operation, the integer arithmetic unit INT stores the multiply-and-accumulate result in the scalar accumulate register SC_ACC.
The register file REG_FL includes the aforementioned scalar register file SC_REG_FL and scalar accumulate register SC_ACC used by the scalar arithmetic unit SC_AR_UNIT. The register file REG_FL also includes a vector register file VC_REG_FL and a vector accumulate register VC_ACC used by the vector arithmetic unit VC_AR_UNIT.
The scalar register file SC_REG_FL includes the scalar registers SR0-SR31, each of which has 32 bits, for example, and the scalar accumulate registers SC_ACC, each of which has 32×2 bits+α bits, for example.
The vector register file VC_REG_FL includes eight sets REG00-REG07 to REG70-REG77 of 32-bit registers REGn0-REGn7, each set having eight elements, for example. Further, the vector accumulate register VC_ACC includes registers A_REG0 to A_REG7 constituting eight elements, each element having 32×2 bits+α bits, for example.
The vector arithmetic unit VC_AR_UNIT includes arithmetic units EL0-EL7 constituting eight elements. Each of the elements EL0-EL7 includes an integer arithmetic unit INT, a floating-point arithmetic unit FP, and a data converter D_CNV. For example, the vector arithmetic unit receives as input the registers REGn0-REGn7 constituting the eight elements of one of the sets in the vector register file VC_REG_FL, whereupon operations are executed in parallel by the arithmetic units of the eight elements and the operation results are stored in the registers REGn0-REGn7 constituting the eight elements of another set.
Further, the vector arithmetic unit executes multiply-and-accumulate operations using the arithmetic units of the eight elements and stores the multiply-and-accumulate results in the registers A_REG0 to A_REG7 constituting the eight elements of the vector accumulate register VC_ACC.
The number of arithmetic unit elements in the vector registers REGn0-REGn7 and the vector accumulate registers A_REG0 to A_REG7 is 8, 16, or 32 elements, depending on whether the number of bits of the operation subject data is 32, 16, or 8 bits, respectively.
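The relation between data width and element count can be sketched as follows. This is a sketch assuming a fixed total width of 256 bits per register set (eight 32-bit elements); the total width and the function name `vector_elements` are assumptions, not values stated in the specification.

```python
def vector_elements(data_bits, register_bits=256):
    """Number of elements processed in parallel for a given data width."""
    if data_bits <= 0 or register_bits % data_bits != 0:
        raise ValueError("data width must divide the register width")
    return register_bits // data_bits

element_counts = [vector_elements(bits) for bits in (32, 16, 8)]
```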
The vector arithmetic unit includes eight statistical information acquisition devices or circuits ST_AC for respectively acquiring statistical information about the output data from the integer arithmetic units INT of the eight elements. The statistical information is information indicating the positions of the leftmost set bit for positive number or the leftmost zero bit for negative number in the output data of the integer arithmetic units INT. The statistical information is acquired in the form of a bit pattern to be described below using
As illustrated in
Addresses, the parameters of the DNN, and so on, for example, are stored in the scalar registers SR0-SR31. Further, operation data from the vector arithmetic units are stored in the vector registers REG00-REG07 to REG70-REG77. Multiplication results and addition results between vector registers are stored in the vector accumulate register VC_ACC. Numbers of data (or samples) belonging to pluralities of bins of a maximum of eight types of histograms are stored in the statistical information registers STR0_0-STR0_39 to STR7_0-STR7_39 shown in
The scalar arithmetic unit SC_AR_UNIT executes arithmetic operations, shift operations, branching, loading and storage, and so on. As described above, the scalar arithmetic unit includes the statistical information acquisition device ST_AC for acquiring statistical information including the positions of the bins of the histogram from the output data of the integer arithmetic unit INT.
The vector arithmetic unit VC_AR_UNIT executes floating-point operations, integer operations, multiply-and-accumulate operations using the vector accumulate register, and so on. Further, the vector arithmetic unit executes operations to clear the vector accumulate register, multiply-and-accumulate (MAC) operations, cumulative addition, transfer to the vector registers, and so on. The vector arithmetic unit also executes loading and storage. As described above, the vector arithmetic unit includes the statistical information acquisition device ST_AC for acquiring statistical information including the positions of the bins of the histogram from the output data of the respective integer arithmetic units INT of the eight elements.
Convolution and Normalization Operations Executed by DL Execution Processor
The DL execution processor clears the positivevalue statistical information and negativevalue statistical information stored in the register sets in the statistical information register file ST_REG_FL (S50). The DL execution processor then updates the positivevalue statistical information and negativevalue statistical information of the convolution operation output data while forwardpropagating through the plurality of layers of the DNN, for example while executing a convolution operation (S51).
The convolution operation is executed by, for example, the integer arithmetic units INT of the eight elements in the vector arithmetic unit and the vector accumulate register VC_ACC. The integer arithmetic units INT repeatedly execute the multiplyandaccumulate operation of the convolution operation and store the resulting operation output data in the accumulate register. The convolution operation may also be executed by the integer arithmetic unit INT in the scalar arithmetic unit SC_AR_UNIT and the scalar accumulate register SC_ACC.
The statistical information acquisition device ST_AC outputs a bit pattern indicating the bit positions of the leftmost set bit for positive number or the leftmost zero bit for negative number in the output data of the convolution operation, output from the integer arithmetic unit INT. Further, the statistical information aggregators ST_AC_1, ST_AC_2 add together the numbers of leftmost set bits for positive values at every bit positions of the operation output data, add together the numbers of the leftmost zero bits for negative values at every bit positions of the operation output data, and store the resulting cumulative addition values in one set of registers STRn_0STRn_39 in
Next, the DL execution processor executes normalization operations of S52, S53, S54. The DL execution processor determines the mean and the variance of the operation output data from the positive-value and negative-value statistical information (S52). The mean and the variance are calculated as illustrated in
Next, the DL execution processor calculates normalized output data by subtracting the mean from each output data sample of the convolution operation and dividing the result by the square root of the variance +ε (S53). This normalization operation is likewise performed as illustrated in
Further, the DL execution processor multiplies a learned parameter γ by each of the normalized output data samples determined in S53, adds a learned parameter β thereto, and then returns the distribution to the original scale (S54).
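The normalization steps S53 and S54 described above can be sketched in software as follows. This is an illustrative sketch only and not a description of the circuit. The function name batch_normalize and the default value of eps are assumptions introduced for this example; the mean and variance are taken as already determined in S52.

```python
import math


def batch_normalize(xs, mean, variance, gamma, beta, eps=1e-5):
    """Normalize samples (S53), then scale by gamma and shift by beta (S54).

    eps corresponds to the small constant added to the variance;
    its default value here is an assumption for illustration.
    """
    scale = 1.0 / math.sqrt(variance + eps)
    normalized = [(x - mean) * scale for x in xs]      # S53
    return [gamma * n + beta for n in normalized]      # S54


if __name__ == "__main__":
    print(batch_normalize([1.0, 3.0], mean=2.0, variance=1.0, gamma=2.0, beta=1.0))
```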
The DL execution processor repeats the processing of S61, S62, and S63 until all of the output data of the convolution operation in one mini-batch are generated (S60). In the DL execution processor, the integer arithmetic units INT of the eight elements EL0 to EL7 in the vector arithmetic unit execute convolution operations respectively in the eight elements of the vector register and store eight sets of operation output data in the eight elements of the vector accumulate register VC_ACC (S61).
Next, the eight statistical information acquisition devices ST_AC of the eight elements EL0 to EL7 in the vector arithmetic unit and the statistical information aggregators ST_AGR_1, ST_AGR_2 aggregate the statistical information relating to the positive output data, among the eight sets of output data stored in the accumulate register, add the result to a value in one statistical information register in the statistical information register file ST_REG_FL, and store the result (S62).
Similarly, the eight statistical information acquisition devices ST_AC of the eight elements EL0 to EL7 in the vector arithmetic unit and the statistical information aggregators ST_AGR_1, ST_AGR_2 aggregate the statistical information relating to the negative output data, among the eight sets of output data stored in the accumulate register, add the result to a value in one statistical information register in the statistical information register file ST_REG_FL, and store the result (S63).
By repeating the processing of S61, S62, and S63, described above, until all of the output data of the convolution operation in one mini-batch have been generated, the DL execution processor tallies, for each bit position and with respect to all of the output data, the number of output data whose leftmost set bit (for positive numbers) or leftmost zero bit (for negative numbers) lies at that position. As a result, as illustrated in
Acquisition, Aggregation, and Storage of Statistical Information
Next, acquisition, aggregation, and storage of the statistical information relating to the operation output data by the DL execution processor will be described. The statistical information is acquired, aggregated, and stored using an instruction transmitted from the host processor and executed by the DL execution processor as a trigger. Hence, the host processor transmits an instruction to acquire, aggregate, and store the statistical information to the DL execution processor in addition to the operation instructions relating to the respective layers of the DNN.
Next, a statistical information aggregator ST_AGR_1 adds together, and thereby aggregates, the “1”s of the respective bits of the eight bit patterns for either the positive sign or the negative sign. Alternatively, the statistical information aggregator ST_AGR_1 adds together, and thereby aggregates, the “1”s of the respective bits of the eight bit patterns for both the positive sign and the negative sign (S71).
Further, a statistical information aggregator ST_AGR_2 adds the value added and aggregated in S71 to the value in a statistical information register of the statistical information register file ST_REG_FL and stores the result in the statistical information register (S72).
The processing of S70, S71, and S72, described above, is repeated every time operation output data are generated as the result of the convolution operations performed by the eight elements EL0 to EL7 in the vector arithmetic unit. Once all of the operation output data in one mini-batch have been generated and the processing described above for acquiring, aggregating, and storing the statistical information is complete, statistical information constituted by the numbers of samples in the bins of the histograms of the leftmost set bit for positive numbers or the leftmost zero bit for negative numbers of all of the operation output data in one mini-batch is generated in the statistical information registers. As a result, the number of occurrences of each position of the leftmost set bit for positive numbers or the leftmost zero bit for negative numbers of the operation output data in one mini-batch is tallied for each bit
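The repeated acquire, aggregate, and store processing can be modeled in software roughly as follows. This is a hedged sketch under stated assumptions rather than the hardware behavior itself: the function name histogram_minibatch is hypothetical, the operation output data are assumed to fit in a 40-bit signed integer, and Python's int.bit_length is used in place of the logic circuit. Data in which no bit differs from the sign bit (0 or -1) are counted in the most significant bin, following the truth table described below.

```python
def histogram_minibatch(outputs, lanes=8, width=40):
    """Build positive and negative leftmost-bit histograms, `lanes` samples at a time."""
    pos_hist = [0] * width   # models one register set for positive data
    neg_hist = [0] * width   # models one register set for negative data
    for start in range(0, len(outputs), lanes):
        group = outputs[start:start + lanes]   # outputs of one vector operation
        pos_part = [0] * width
        neg_part = [0] * width
        for value in group:
            # For negative data, the leftmost zero bit is the leftmost set bit of ~value.
            magnitude = value if value >= 0 else ~value
            position = magnitude.bit_length() - 1
            if position < 0:
                position = width - 1   # no bit differs from the sign bit
            if value >= 0:
                pos_part[position] += 1
            else:
                neg_part[position] += 1
        for i in range(width):                  # add the partial counts to the running totals
            pos_hist[i] += pos_part[i]
            neg_hist[i] += neg_part[i]
    return pos_hist, neg_hist


if __name__ == "__main__":
    pos, neg = histogram_minibatch([1, 2, 3, -1, -2, -4, 0, 5, 8, -9])
    print(pos[:4], neg[:4])
```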
Acquisition of Statistical Information
As illustrated in
On this truth table, the first two rows depict an example in which all of the bits of the input in[39:0] match the sign bit “1”, “0”, and therefore the most significant bit out[39] of the output out[39:0] takes “1” (0x8000000000). The next two rows depict an example in which bit 38 in[38] of the input in[39:0] is different from the sign bit “1”, “0”, and therefore bit 38 out[38] of the output out[39:0] takes “1” and all the other bits take “0”. The bottom two rows depict an example in which bit 0 in[0] of the input in[39:0] is different from the sign bit “1”, “0”, and therefore bit 0 out[0] of the output out[39:0] takes “1” and all the other bits take “0”.
The logic circuit illustrated in
Further, when the sign bit in[39] matches in[38] but does not match in[37], the output of EOR38 takes “0” and the output of EOR37 takes “1”, whereby the output out[37] takes “1”. When the output of EOR37 is “1”, the other outputs out[39:38] and out[36:0] take “0” through the logical sums OR36 to OR0, the logical products AND36 to AND0, and the invert gate INV. This pattern applies likewise thereafter.
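The behavior of the truth table and logic circuit described above can be modeled in software as a scan from bit 38 downward for the first bit that differs from the sign bit. The following is a minimal sketch, assuming a 40-bit input; the function name leftmost_bit_pattern is hypothetical and the loop merely stands in for the EOR, OR, and AND gates.

```python
WIDTH = 40  # width of in[39:0] and out[39:0] in the description


def leftmost_bit_pattern(value, width=WIDTH):
    """Return a one-hot pattern marking the highest bit that differs from the sign bit.

    If every bit matches the sign bit, only the most significant bit is set,
    as in the first two rows of the truth table.
    """
    value &= (1 << width) - 1               # keep only in[width-1:0]
    sign = (value >> (width - 1)) & 1       # in[width-1]
    for pos in range(width - 2, -1, -1):    # scan in[width-2] down to in[0]
        if ((value >> pos) & 1) != sign:
            return 1 << pos
    return 1 << (width - 1)


if __name__ == "__main__":
    print(hex(leftmost_bit_pattern(0)))
    print(hex(leftmost_bit_pattern(0x4000000000)))
```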
As is evident from
Aggregation of Statistical Information
A sign bit s[0] is added to each bit pattern BP. In
As shown in
On this logical value table, when the sign select control value sel=0 and the all select control value all=0, the statistical information aggregator ST_AGR_1 cumulatively adds the number of 1s of the bits in the positive-value bit patterns BP having a sign s=0 that matches the control value sel=0 and outputs an aggregate value of the statistical information as the output [39:0]. When, on the other hand, the sign select control value sel=1 and the all select control value all=0, the statistical information aggregator ST_AGR_1 cumulatively adds the number of 1s of the bits in the negative-value bit patterns BP having a sign s=1 that matches the control value sel=1 and outputs an aggregate value of the statistical information as the output [39:0]. Furthermore, when the all select control value all=1, the statistical information aggregator cumulatively adds the number of 1s of the bits in all of the bit patterns BP and outputs an aggregate value of the statistical information as the output [39:0].
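The selection behavior of the logical value table can be illustrated with the following sketch. It is a software model under assumptions, not the circuit: the function name aggregate_patterns and the representation of the output as a list of per-bit counts are introduced only for this example, and the parameter all_ carries a trailing underscore to avoid shadowing a Python built-in.

```python
def aggregate_patterns(patterns, signs, sel, all_, width=40):
    """Count the 1s at each bit position over the selected bit patterns.

    patterns: one-hot bit patterns BP, one per element
    signs:    sign bit s of each pattern (0 = positive, 1 = negative)
    sel:      sign select control value
    all_:     all select control value (1 = ignore the sign)
    """
    counts = [0] * width
    for bp, s in zip(patterns, signs):
        if all_ == 1 or s == sel:
            for pos in range(width):
                counts[pos] += (bp >> pos) & 1
    return counts


if __name__ == "__main__":
    bps = [0b001, 0b001, 0b010, 0b100]
    ss = [0, 1, 0, 1]
    print(aggregate_patterns(bps, ss, sel=0, all_=0)[:3])
```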
As illustrated on the logical circuit in
As indicated by the output in
The statistical information register file ST_REG_FL includes n sets (n=0 to 7) of 40 32-bit registers STRn_39 to STRn_0, for example, and is therefore capable of storing the numbers of data (or samples) in 40 bins of each of n histograms. It is assumed here that the aggregation subject statistical information is stored in the 40 32-bit registers STR0_39 to STR0_0 of n=0. The second statistical information aggregator ST_AGR_2 includes adders ADD_39 to ADD_0 for adding the aggregated values in[39:0] aggregated by the first statistical information aggregator ST_AGR_1 respectively to the cumulatively added values stored in the 40 32-bit registers STR0_39 to STR0_0. The outputs of the adders ADD_39 to ADD_0 are then stored again in the 40 32-bit registers STR0_39 to STR0_0. As a result, the numbers of samples in each of the bins of the subject histograms are stored in the 40 32-bit registers STR0_39 to STR0_0.
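The accumulation performed by the second statistical information aggregator can be sketched as follows. The names make_register_file and accumulate are hypothetical. Wrapping the sums at 32 bits is an assumption made here to reflect the stated register width; the description does not state how overflow is handled.

```python
NUM_SETS = 8             # n = 0 to 7
NUM_BINS = 40            # registers STRn_39 to STRn_0
REG_MASK = 0xFFFFFFFF    # 32-bit registers (wrap-around is an assumption)


def make_register_file():
    """Return a cleared model of the statistical information register file."""
    return [[0] * NUM_BINS for _ in range(NUM_SETS)]


def accumulate(reg_file, set_index, aggregated):
    """Add the per-bin aggregated values to one register set and store the sums back."""
    regs = reg_file[set_index]
    for pos in range(NUM_BINS):
        regs[pos] = (regs[pos] + aggregated[pos]) & REG_MASK


if __name__ == "__main__":
    rf = make_register_file()
    step = [0] * NUM_BINS
    step[3] = 5
    accumulate(rf, 0, step)
    accumulate(rf, 0, step)
    print(rf[0][3])
```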
Using the hardware circuits of the statistical information acquisition device ST_AC and the statistical information aggregators ST_AGR_1, ST_AGR_2 provided in the arithmetic units illustrated in
Examples of Calculation of Mean and Variance
Examples of calculation of the mean and the variance of the operation output data by the vector arithmetic unit will be described below. As an example, the vector arithmetic unit includes eight elements of arithmetic units and therefore calculates eight elements of data in parallel. Further, in this embodiment, the mean and the variance are calculated using the approximate values +2^{e+i}, −2^{e+i} corresponding to the bit position “i” of the leftmost set bit for positive number or the leftmost zero bit for negative number as the values of the operation output data. The arithmetic expressions for calculating the mean and the variance are as described in S21 and S22 of
Next, the DL execution processor executes the following processing until calculation has been completed with respect to all of the statistical information (NO in S72). First, the DL execution processor loads the eight elements on the smallest bit side of the positive-value statistical information to a floating point vector register B1 (S73) and loads the eight elements on the smallest bit side of the negative-value statistical information to a floating point vector register B2 (S74).
The histogram (statistical information) depicted in
The floating point arithmetic units FP of the eight elements of the vector arithmetic unit VC_AR_UNIT then calculate A×(B1−B2) in relation to the data in the eight elements of the registers A, B1, B2 and add the calculation results of the eight elements to the values in the respective elements of the floating point vector register C (S75). At that point, calculation with respect to the eight bins on the smallest bit side of the histogram is complete.
Hence, in order to perform calculation with respect to the next eight bins (the eight bins 0 to +7) of the histogram, the DL execution processor multiplies the values in the respective elements of the floating point vector register A by 2^{8} using the floating point arithmetic units of the eight elements of the vector arithmetic unit (S76) and stores the result in the respective elements of the floating point vector register A. The DL execution processor then executes the processing of S72 to S76. In the processing of S73 and S74, the next eight elements (the numbers of samples in the next eight bins) of the positive-value statistical information and the next eight elements (the numbers of samples in the next eight bins) of the negative-value statistical information are loaded respectively to the registers B1, B2.
In the example of
The operations described above are performed using the eight elements of floating point arithmetic units FP in the vector arithmetic unit, but when a sufficient number of bits can be processed using the eight elements of integer arithmetic units INT in the vector arithmetic unit, the operations may be performed using the integer arithmetic units.
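The mean calculation loop of S73 to S76 can be sketched in software as follows. This is an illustrative model only. The function name approx_mean is hypothetical, and the final steps (summing the eight elements of register C and dividing by the total number of samples) are assumptions based on the ordinary definition of a mean, since they are not spelled out in the passage above.

```python
def approx_mean(pos_hist, neg_hist, e, lanes=8):
    """Approximate the mean from the histograms using the bin values +/-2^(e+i)."""
    a = [2.0 ** (e + i) for i in range(lanes)]   # register A: approximate values
    c = [0.0] * lanes                            # register C: running sums
    for start in range(0, len(pos_hist), lanes):
        b1 = pos_hist[start:start + lanes]       # S73: positive-value counts
        b2 = neg_hist[start:start + lanes]       # S74: negative-value counts
        for k in range(len(b1)):
            c[k] += a[k] * (b1[k] - b2[k])       # S75: C += A x (B1 - B2)
        a = [x * 2.0 ** lanes for x in a]        # S76: advance A by 2^8 when lanes = 8
    total = sum(pos_hist) + sum(neg_hist)        # assumed: number of samples
    return sum(c) / total                        # assumed: final reduction and division


if __name__ == "__main__":
    pos = [0] * 40
    neg = [0] * 40
    pos[0], pos[1], neg[0] = 2, 1, 1
    print(approx_mean(pos, neg, e=0))
```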
Next, the DL execution processor executes the following processing until calculation has been completed with respect to all of the statistical information (NO in S82). First, the DL execution processor squares the respective differences between the eight approximate values A in the register A and the mean value, and stores the calculation results in the eight elements of a floating point vector register A1 (S83). Further, the DL execution processor squares the respective differences between negatives −A of the eight approximate values A in the register A and the mean value, and stores the calculation results in the eight elements of a floating point vector register A2 (S84).
The DL execution processor then loads the eight elements on the smallest bit side of the positive-value statistical information to the floating point vector register B1 (S85) and loads the eight elements on the smallest bit side of the negative-value statistical information to the floating point vector register B2 (S86).
Further, in the DL execution processor, the eight elements of floating point arithmetic units in the vector arithmetic unit multiply the data in the eight elements of the registers A1 and B1, multiply the data in the eight elements of the registers A2 and B2, add together the respective multiplication values, add the addition results of the eight elements respectively to the data in the eight elements of the register C, and store the results respectively in the eight elements of the register C (S87). At that point, calculation with respect to the eight bins on the smallest bit side of the histogram is complete.
Hence, in order to perform calculation with respect to the next eight bins (the eight bins 0 to +7) of the histogram, the DL execution processor multiplies the respective values in the elements of the floating point vector register A by 2^{8} using the eight elements of floating point arithmetic units in the vector arithmetic unit (S88). The DL execution processor then executes the processing of S82 to S88. In the processing of S83 and S84, calculations are performed with respect to new approximate values 2^{e+8}, 2^{e+9}, . . . , 2^{e+15} in the register A. Further, in the processing of S85 and S86, the next eight elements (the numbers of samples in the next eight bins) of the positive-value statistical information and the next eight elements (the numbers of samples in the next eight bins) of the negative-value statistical information are loaded respectively to the registers B1, B2.
In the example of
The operations described above are also performed using the eight elements of floating point arithmetic units FP in the vector arithmetic unit, but when a sufficient number of bits can be processed using the eight elements of integer arithmetic units INT in the vector arithmetic unit, the operations may be performed using the integer arithmetic units.
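The variance calculation loop of S83 to S88 can be sketched in the same manner. The function name approx_variance is hypothetical, and, as with the mean, the final summation over the elements of register C and the division by the total number of samples are assumptions based on the ordinary definition of a variance.

```python
def approx_variance(pos_hist, neg_hist, e, mean, lanes=8):
    """Approximate the variance from the histograms using the bin values +/-2^(e+i)."""
    a = [2.0 ** (e + i) for i in range(lanes)]     # register A: approximate values
    c = [0.0] * lanes                              # register C: running sums
    for start in range(0, len(pos_hist), lanes):
        a1 = [(x - mean) ** 2 for x in a]          # S83: (A - mean)^2
        a2 = [(-x - mean) ** 2 for x in a]         # S84: (-A - mean)^2
        b1 = pos_hist[start:start + lanes]         # S85: positive-value counts
        b2 = neg_hist[start:start + lanes]         # S86: negative-value counts
        for k in range(len(b1)):
            c[k] += a1[k] * b1[k] + a2[k] * b2[k]  # S87
        a = [x * 2.0 ** lanes for x in a]          # S88: advance A by 2^8 when lanes = 8
    total = sum(pos_hist) + sum(neg_hist)          # assumed: number of samples
    return sum(c) / total                          # assumed: final reduction and division


if __name__ == "__main__":
    pos = [0] * 40
    neg = [0] * 40
    pos[0], neg[0] = 1, 1
    print(approx_variance(pos, neg, e=0, mean=0.0))
```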
Finally, the eight elements of floating point arithmetic units FP in the vector arithmetic unit execute the normalization operation illustrated in the processing of S13 in
Modified Example of Normalization Operation
In the above embodiment, divisive normalization, in which the mean and variance of the operation output data x are determined, the mean is subtracted from the operation output data x, and the result is divided by the square root of the variance (the standard deviation), was described as an example of the normalization operation. As another example of the normalization operation, however, this embodiment may also be applied to subtractive normalization, in which the mean of the operation output data is determined and the mean is subtracted from the operation output data.
Example of Data Subject to Normalization Operation
In the above embodiment, an example of normalization of the operation output data x of an arithmetic unit was described. However, this embodiment may also be applied to normalization of a plurality of input data of a mini-batch.
In this case, calculation of the mean value can be simplified using the numbers of samples and the approximate values of the bins of a histogram obtained by acquiring and aggregating the statistical information of a plurality of input data.
In this specification, the normalization subject data (the normalization subject data or the subject data) include operation output data, input data, and so on.
Example of Bins of Histogram
In the above embodiment, a logarithm (log_{2}X) of the operation output data X to base 2 was set as the unit of the bins. However, a multiple of two of the above logarithm (2×log_{2}X) may be set as the unit of the bins. In this case, a distribution (a histogram) of the leftmost even-numbered set bits for positive number or the leftmost even-numbered zero bits for negative number of the operation output data X is acquired as the statistical information such that the range of the bins is 2^{e+2i} to 2^{e+2(i+1)} (where i is an integer of 0 or more) and the approximate value is 2^{e+2i}.
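One way to picture the coarser bins is to merge adjacent pairs of the single-bit bins, since the range 2^{e+2i} to 2^{e+2(i+1)} covers bit positions 2i and 2i+1. The following sketch takes that view as an assumption; the embodiment may instead acquire the coarser histogram directly. The function names merge_bin_pairs and coarse_value are hypothetical, and an even number of bins is assumed.

```python
def merge_bin_pairs(hist):
    """Merge bins 2i and 2i+1 of a single-bit histogram into coarse bin i.

    Assumes len(hist) is even.
    """
    return [hist[i] + hist[i + 1] for i in range(0, len(hist) - 1, 2)]


def coarse_value(e, i):
    """Approximate value 2^(e+2i) of coarse bin i."""
    return 2.0 ** (e + 2 * i)


if __name__ == "__main__":
    print(merge_bin_pairs([1, 2, 3, 4]))
    print(coarse_value(0, 1))
```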
Example of Approximate Value
In the above embodiment, the approximate value of each bin is set at the value 2^{e+i} of the leftmost set bit for positive number or the leftmost zero bit for negative number. However, when the range of the bins is 2^{e+i} to 2^{e+i+1} (where i is an integer of 0 or more), the approximate value may be set at (2^{e+i}+2^{e+i+1})/2.
According to this embodiment, as described above, a distribution (a histogram) of the leftmost set bit for positive number or the leftmost zero bit for negative number of input data or intermediate data (operation output data) in a DNN can be acquired as statistical information, and the mean and variance determined in a normalization operation can be calculated easily using approximate values +2^{e+i}, −2^{e+i} of the respective bins of the histogram and the numbers of data samples in the respective bins. As a result, reductions can be achieved in the amount of power consumed by a processor during the normalization operation and the amount of time used for learning.
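The two choices of approximate value can be compared with a short sketch. The function name bin_value and its midpoint flag are introduced only for this illustration. Note that the midpoint (2^{e+i}+2^{e+i+1})/2 equals 1.5×2^{e+i}.

```python
def bin_value(e, i, midpoint=False):
    """Return the approximate value of bin i.

    midpoint=False: lower edge 2^(e+i), as in the above embodiment
    midpoint=True:  (2^(e+i) + 2^(e+i+1)) / 2, the alternative described above
    """
    low = 2.0 ** (e + i)
    if midpoint:
        high = 2.0 ** (e + i + 1)
        return (low + high) / 2
    return low


if __name__ == "__main__":
    print(bin_value(0, 3), bin_value(0, 3, midpoint=True))
```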
According to the present embodiment, a normalization operation can be accelerated.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.