REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER

Abstract
Techniques for a recursive deep-learning approach for performing speech synthesis using a repeatable structure that splits an input tensor into a left half and a right half, similar to the operation of the Fast Fourier Transform, performs a 1D convolution on each respective half, performs a summation, and then applies a post-processing function. The repeatable structure may be utilized in a series configuration to operate as a vocoder or perform other speech processing functions.
0 Citations
No References
20 Claims
 1. A method for generating speech samples, the method comprising:
receiving an input tensor; splitting said received input tensor into a first portion and a second portion; performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result; summing said first intermediate result and said second intermediate result to generate a third intermediate result; applying a post-processing function on said third intermediate result to generate a fourth intermediate result; computing an output tensor by summing said received input tensor with said fourth intermediate result; recursing by setting said input tensor to said output tensor until said output tensor is of size one in a predetermined dimension; and performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension. View Dependent Claims (2, 3, 4, 5, 6, 7)
 8. A system for generating speech samples comprising:
a plurality of FFTNet blocks arranged in series, wherein each FFTNet block includes a splitter module, a convolution module, a summation block, and a post-processing module, wherein said post-processing module generates an output based upon said composite tensor; a fully connected layer, wherein said fully connected layer is coupled to a last FFTNet block in said series; and a softmax classifier coupled to an output of said fully connected layer. View Dependent Claims (9, 10, 11, 12, 13, 14)
 15. A computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating speech samples, the process comprising:
receiving an input tensor; splitting said received input tensor into a first portion and a second portion; performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result; summing said first intermediate result and said second intermediate result to generate a third intermediate result; applying a post-processing function on said third intermediate result to generate a fourth intermediate result; computing an output tensor by summing said received input tensor with said fourth intermediate result; recursing by setting said input tensor to said output tensor until said output tensor is of size one in a predetermined dimension; and performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension. View Dependent Claims (16, 17, 18, 19, 20)
Specification
This disclosure relates to techniques for performing real-time speech synthesis. In particular, this disclosure relates to techniques for performing real-time speech synthesis in the voice of a particular person.
Synthesizing realistic-sounding human speech in real time based upon linguistic features and F0 is a challenging problem. The application of deep learning to speech synthesis, such as the WaveNet project, has produced promising results. Deep-learning approaches to speech synthesis such as WaveNet have many applications, including the classical text-to-speech ("TTS") problem. While initially WaveNet and others addressed TTS starting from linguistic features, ensuing work showed that speech could be synthesized directly from input text. The approach has also been adapted to other problems, including voice conversion, speech enhancement, and musical instrument synthesis.
Despite the impressive quality of the synthesized waveform, deep-learning techniques such as WaveNet still suffer from several drawbacks. In particular, these approaches require a substantial training corpus (roughly 30 hours), the synthesis process is slow (about 40 minutes to produce a second of audio), and the result contains audible noise.
More recent work showed that WaveNet could also be used as a vocoder, which generates a waveform from acoustic features rather than linguistic features. Working from acoustic features, the training process is effective with a substantially smaller corpus (roughly one hour) while still producing higher-quality speech than baseline vocoders like mel-log spectrum approximation (MLSA). Several research efforts have addressed the problem of computational cost, including algorithmic improvements for the same architecture, called Fast WaveNet, which can synthesize a second of audio in roughly a minute. Other efforts have been able to achieve real-time synthesis by reducing the WaveNet model size significantly, but at the expense of noticeably worse voice quality. Still other efforts have facilitated parallelization of WaveNet for GPU computing, allowing real-time operation with some GPU clusters. However, this method does not reduce actual computational costs, but instead demands a far costlier hardware solution.
In general, deep-learning techniques for performing speech synthesis such as WaveNet suffer from significant drawbacks, namely requiring a large training corpus and having slow synthesis times, and therefore new approaches are necessary. Further, known methods such as the WaveNet model suffer from high computational complexity due to the employment of a dilated convolution and gated filter structure. Thus, deep-learning techniques for performing speech synthesis that achieve a large receptive field, correlating audio samples far in the past with a current input sample, without imposing significant computational penalties are required.
The present disclosure describes a deep-learning approach for performing speech synthesis herein referred to as a Fast Fourier Transform ("FFT") neural network, or "FFTNet" for short. According to one embodiment of the present disclosure, FFTNet may be employed as a synthesizer that transforms audio features to a speech audio signal. The audio features may be inferred from other processes. For example, FFTNet may be used to synthesize speech audio with high quality.
FFTNet may also be employed to perform signal compression. Because speech may be synthesized from low-dimensional features, it is possible to transmit the audio features instead of the signal for audio transmission. Thus, for example, 10 kbps audio features may be transmitted to a receiver where they are decoded using FFTNet, which achieves better perceptual quality than 32 kbps MP3.
According to one embodiment of the present disclosure, an FFTNet may be trained using backpropagation and gradient descent. In particular, according to one embodiment of the present disclosure, training sets may be generated using audio samples of the same speaker. Then, acoustic features such as F0 and MCC may be extracted and interpolated to match the audio samples.
FFTNet provides an alternative deep-learning architecture, coupled with several improved techniques for training and synthesis. In contrast to conventional approaches that downsample audio via dilated convolution in a process that resembles wavelet analysis, the FFTNet architecture resembles a classical FFT, achieves far greater computational efficiency, and uses substantially fewer parameters than the WaveNet model. According to one embodiment of the present disclosure, a deep-learning speech synthesis technique is performed utilizing a recursive algorithm that splits each successive input by a factor of 2. A 1×1 convolution is applied to each half of the input block, whereupon the convolved portions are summed. The model architecture of FFTNet substantially reduces the computational complexity of known deep-learning speech synthesis methods such as WaveNet that rely upon a dilated convolution and gated filter structure. According to one such embodiment, the recursive structure of the FFT is utilized. Further, the FFT kernel e^{−2πink/N }is replaced with a small network structure that learns a nonlinear transformation and employs a 1×1 convolution. In other words, the FFT may be understood as performing a linear transformation with respect to each point due to the multiplication of each time point by the FFT kernel. According to one such embodiment, the FFT kernel is replaced by a 1×1 convolution and a nonlinear network herein referred to as a post-processing block. By replacing the FFT kernel with a small network structure as described herein, the computational complexity of the WaveNet model, which requires a gated filter structure, skip layers and other architectural details, is significantly reduced. This allows, for example, the generation of synthesized speech in real time or near real time.
According to some such embodiments, FFTNet models produce audio more quickly (>70× faster) than the Fast WaveNet formulation, thereby enabling real-time synthesis applications. Moreover, when used as a vocoder, FFTNet produces higher-quality synthetic voices, as measured by a "mean opinion score" test, than conventional approaches. The FFTNet training and synthesis techniques can also improve the original WaveNet approach such that the quality of the synthesized voice is on par with that of the FFTNet architecture (albeit much slower to synthesize). The FFTNet architecture may also be leveraged in a variety of other deep-learning problems such as classification tasks and autoencoders. Numerous other embodiments and variations will be appreciated in light of this disclosure.
General Overview
Existing techniques for speech synthesis using deep learning such as WaveNet model the probability of a speech waveform as follows:
p(x)=Π_{t=1}^{T }p(x_{t}|x_{1}, . . . , x_{t−1})
That is, the joint probability of a waveform x={x_{1}, . . . , x_{T}} is factorized as a product of conditional probabilities as shown above. Each audio speech sample x_{t }is conditioned on the samples at all previous timesteps. Similar to PixelCNNs, the conditional probability distribution is modeled by a stack of convolutional layers. The model outputs a categorical distribution over the next value x_{t }with a softmax layer and is optimized to maximize the log-likelihood of the data with respect to the parameters. A dilated causal convolution structure is utilized that allows for a larger receptive field. Further, similar to PixelCNN, in order to simulate the behavior of LSTM ("Long Short-Term Memory") networks, gated activation functions are utilized. Further, residual and skip connections are utilized.
Existing methods such as WaveNet rely upon a dilated convolution structure such that an n-layer network has a receptive field of 2^{n}, meaning that as many as 2^{n }previous samples can influence the synthesis of the current sample, which leads to superior synthesis quality. However, with these types of techniques, only one sample is generated per iteration, and thus to generate one second of audio sampled at 16 kHz, the causal dilated network needs to be applied 16,000 times. Faster methods have been proposed, which can produce 200 samples per second, but the performance is still far from real time on personal computers. With dilated convolution, the nodes that influence the prediction of a new sample may be represented as an inverted binary tree structure. Thus, dilated convolution resembles wavelet analysis in that each filtering step is followed by downsampling. The causal dilated convolutional structure, gated activation functions and skip connections of known techniques such as WaveNet introduce significant computational complexity.
In contrast, and according to one embodiment of the present disclosure, the recursive structure of the Cooley-Tukey Fast Fourier Transform provides an alternative model for providing the effect of a dilated convolution by increasing the receptive field. A number of benefits flow from using an FFT-based alternative structure, as will be appreciated in light of this disclosure. The FFT computes the kth frequency component f_{k }from the time-domain series x_{0 }. . . x_{N−1 }as follows:
f_{k}=Σ_{n=0}^{N−1 }x_{n}e^{−2πink/N}
and the above equation can be split into even and odd terms as:
f_{k}=Σ_{n=0}^{N/2−1 }x_{2n}e^{−2πi(2n)k/N}+Σ_{n=0}^{N/2−1 }x_{2n+1}e^{−2πi(2n+1)k/N}
According to some such embodiments, x_{n }may be interpreted as a node with K channels corresponding to quantization levels (e.g., 256 quantization channels). The FFT kernel e^{−2πi(2n)k/N }may be interpreted as a transformation function. In this context, each term x_{2n}e^{−2πi(2n)k/N}+x_{2n+1}e^{−2πi(2n+1)k/N }is analogous to applying a transformation to previous nodes x_{2n }and x_{2n+1 }and summing up the results. In the classical FFT, the FFT kernel operates as a linear transformation on the input samples. According to one embodiment of the present disclosure, the classical FFT kernel is replaced by a small network structure that performs a 1×1 convolution in conjunction with a post-processing block that may perform a nonlinear transformation.
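The even/odd split described above can be checked numerically. The following sketch (illustrative only, not from the patent) builds the two kernel terms for nodes x_{2n} and x_{2n+1}, sums the transformed halves, and confirms the result matches a standard FFT:

```python
import numpy as np

# Verify the Cooley-Tukey identity: f_k equals a transformed sum over
# the even-indexed and odd-indexed halves of the input series.
N = 8
x = np.random.default_rng(0).standard_normal(N)
k = np.arange(N)[:, None]          # frequency index, shape [N, 1]
n = np.arange(N // 2)[None, :]     # time index within each half, [1, N/2]

# FFT kernels acting as transformations on nodes x_{2n} and x_{2n+1}
even_kernel = np.exp(-2j * np.pi * (2 * n) * k / N)
odd_kernel = np.exp(-2j * np.pi * (2 * n + 1) * k / N)

# Apply each transformation and sum the results, as in the text
f = even_kernel @ x[0::2] + odd_kernel @ x[1::2]

assert np.allclose(f, np.fft.fft(x))
```

FFTNet replaces these fixed linear kernels with learned 1×1 convolutions followed by a nonlinearity, as described below.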
According to one such embodiment, given an input x_{0:N }defined as the 1D series (x_{0}, x_{1}, . . . x_{N−1}), a series of layers or blocks, herein referred to as FFTNet blocks, clips or segments the input into two halves (herein referred to as a left half and a right half) as follows:
(x_{L}=x_{0:N/2 }and x_{R}=x_{N/2:N})
and then sums up the results:
z=W_{L}*x_{L}+W_{R}*x_{R}
where W_{L }and W_{R }are 1D convolution weights for x_{L }and x_{R}. Each FFTNet block further incorporates a nonlinear activation function, which may be a ReLU activation function followed by a 1D convolution to produce inputs for the next layer according to the relation:
x=ReLU(conv1×1(ReLU(z)))
Replacing the classical FFT kernel with this FFTNet block achieves the same increase in receptive field as with conventional techniques such as WaveNet, but at the same time, obviates the need for gated activation functions and skip layers, which would otherwise increase the computational complexity.
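The block defined by the two equations above can be sketched in a few lines. The following is a minimal NumPy illustration under assumed shapes (an [N, C] input, 1×1 convolutions realized as per-timestep matrix multiplies); the function and weight names are not from the patent:

```python
import numpy as np

def fftnet_block(x, W_L, W_R, W_post):
    """One FFTNet block: split, 1x1-convolve each half, sum, post-process."""
    N = x.shape[0]
    x_L, x_R = x[: N // 2], x[N // 2 :]   # left and right halves
    z = x_L @ W_L + x_R @ W_R             # z = W_L * x_L + W_R * x_R
    # Post-processing: x = ReLU(conv1x1(ReLU(z)))
    return np.maximum(np.maximum(z, 0.0) @ W_post, 0.0)

rng = np.random.default_rng(0)
C = 4                                      # assumed channel count
x = rng.standard_normal((8, C))
W_L, W_R, W_post = (rng.standard_normal((C, C)) for _ in range(3))
y = fftnet_block(x, W_L, W_R, W_post)
assert y.shape == (4, C)                   # block-size dimension halves: 8 -> 4
```

Note the output has half the block size of the input, which is what allows a stack of these blocks to recurse down to a single sample.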
According to some such embodiments, auxiliary conditions such as linguistic features are transformed by the 1D convolution and added to z, as follows:
z=(W_{L}*x_{L}+W_{R}*x_{R})+(V_{L}*h_{L}+V_{R}*h_{R}),
where h_{L }and h_{R }are the two halves of the condition vector h, and V_{L }and V_{R }are 1D convolution weights. In some such cases, note that if the condition information is stationary along the time axis, the condition term becomes V*h_{N }instead of (V_{L}*h_{L}+V_{R}*h_{R}).
Various uses of the FFTNet architecture as provided herein will be apparent. For example, according to one embodiment of the present disclosure, an FFTNet may be utilized as a vocoder. In particular, h_{t }may be F0 (pitch) and MCC ("Mel-Cepstral Coefficient") features at time t. To generate the current sample x_{t}, the previously generated samples x_{t−N:t }and auxiliary condition h_{t−N+1:t+1 }(shifted forward by 1) are utilized as the network input. According to one specific example embodiment, the auxiliary condition is obtained as follows. An analysis window of size 400 is applied every 160 samples. The MCC and F0 features are extracted for each overlapping window. For the h_{t }corresponding to the window centers, the computed MCC and F0 values (26 dimensions in total) are assigned. For the h_{t }that are not located at the window centers, linear interpolation is utilized to obtain values based on the h_{t }assigned in the last step. Numerous other use cases and applications will be appreciated in light of this disclosure, and this disclosure is not intended to be limited to the specific details of any such illustrative examples.
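The conditioning-feature preparation described above (frame values assigned at window centers, then linearly interpolated to every sample) can be sketched as follows. This is an assumed illustration with random stand-in features; the window size 400, hop 160, and 26 dimensions come from the text:

```python
import numpy as np

num_samples, hop, win, dims = 8000, 160, 400, 26

# Window centers at which frame-level F0 + MCC values are assigned
centers = np.arange(win // 2, num_samples - win // 2 + 1, hop)
frame_feats = np.random.default_rng(0).standard_normal((len(centers), dims))

# Linearly interpolate each feature dimension to every sample position
t = np.arange(num_samples)
h = np.stack(
    [np.interp(t, centers, frame_feats[:, d]) for d in range(dims)], axis=1
)
assert h.shape == (num_samples, dims)   # one 26-dim condition vector per sample
```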
According to one further specific example embodiment, and as discussed in more detail below, an FFTNet utilizes a fully connected layer followed by a softmax layer (size 1 with K=256 channels) as the last two layers to produce a posterior distribution of the new sample's quantized values. To determine the final value of the current sample, either an argmax or random sampling may be performed on the posterior distribution.
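The two prediction strategies named above can be illustrated as follows. This sketch (assumed, with random logits) contrasts argmax, which picks the most probable of the K=256 quantized values, with random sampling, which draws from the posterior itself:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(256)

# Softmax over the 256 quantization channels -> posterior distribution
posterior = np.exp(logits - logits.max())
posterior /= posterior.sum()

deterministic = int(np.argmax(posterior))     # argmax strategy
sampled = int(rng.choice(256, p=posterior))   # random-sampling strategy
```

As discussed later, argmax minimizes training error but flattens genuine noise, so sampling from the posterior is preferred for unvoiced (noise-like) sounds.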
FFTNet Methodology and Architecture
The process is initiated in 102. In 103, a 1×1 convolution is performed. This 1×1 convolution layer transforms an input (e.g., a 256-channel one-hot encoded u-law-quantized signal) into the right number of channels for FFTNet (e.g., 128) before FFTNet operation starts.
As will be described in detail below, according to one embodiment of the present disclosure, an FFTNet may comprise a plurality of layers of a repeatable structure that are successively computed to generate voice output samples. According to one embodiment, the number of layers comprising an FFTNet may be log_{2}(N), where N is the size of a tensor dimension for an audio block input. In 104, it is determined whether all layers have been computed. If so (‘Yes’ branch of 104), in 120 a fully connected layer 250 is applied to the current output. In 122, a softmax classifier is applied to the output of the fully connected layer to generate an output sample. The process ends in 116.
If all layers have not been computed ('No' branch of 104), flow continues with 124, whereby the layer input is set: if this is the first layer, the input is set to the original input to the FFTNet, while if, instead, the current layer is not the first layer, the layer input is set to the previous layer's output. In 106, the layer's input is split evenly into right and left halves. Thus, if the input to the layer is of size N, the left and right halves are of size N/2. In 108, a 1×1 convolution is performed separately on the right and left halves. A method for generating a convolution kernel is described below. In 110, the convolved right and left halves are summed to generate a composite tensor. Thus, after the summing operation, the resulting tensor has a dimension of size N/2.
In 112, a first activation function is applied to the composite tensor. In 114, a 1×1 convolution is applied. In 118, a second activation function is applied. Flow then continues with 124. According to one embodiment of the present disclosure and as discussed in more detail below, the first and second activation functions may be ReLU (“Rectified Linear Unit”) functions.
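The overall control flow described above (apply log2(N) FFTNet layers until the block-size dimension is 1, then a fully connected layer and a softmax classifier) can be sketched as follows. This is an assumed illustration with random weights, not the patent's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 8, 16                        # assumed channel count and block size

def layer(x, W_L, W_R, W_post):
    """One FFTNet layer: split, 1x1-convolve halves, sum, ReLU-conv-ReLU."""
    half = x.shape[0] // 2
    z = x[:half] @ W_L + x[half:] @ W_R
    return np.maximum(np.maximum(z, 0.0) @ W_post, 0.0)

x = rng.standard_normal((N, C))
while x.shape[0] > 1:               # recurse until size one in the block dim
    Ws = [rng.standard_normal((C, C)) for _ in range(3)]
    x = layer(x, *Ws)

logits = x[0] @ rng.standard_normal((C, 256))    # fully connected layer
p = np.exp(logits - logits.max())
p /= p.sum()                                     # softmax -> posterior
sample = int(np.argmax(p))                       # predicted quantized value
```

With N=16 this loop runs log2(16)=4 layers, mirroring the 'No' branch of 104 repeating until the 'Yes' branch reaches 120 and 122.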
The operation of an FFTNet block or layer will now be described. For purposes of the present discussion the terms FFTNet block and FFTNet layer will be utilized interchangeably. As previously mentioned, each FFTNet block/layer may comprise a repeatable structure that receives an input tensor and generates an output tensor.
According to one embodiment of the present disclosure, a skip-layer implementation may be utilized, in which case the input is summed with the output at each iteration.
As described in detail below, input tensor 230(1) and output tensor 230(2) may also comprise a first dimension that encodes the value of audio samples in a quantized fashion. This quantized dimension is referred to as "channels", a term well understood in the context of deep neural networks. In particular, in the context of image data, the channels typically encode the red, blue and green components of a pixel, and therefore there are typically 3 channels, which may vary across a convolutional neural network structure. In the context of the present disclosure, however, the channels encode the quantization level of an audio sample. In particular, according to one embodiment of the present disclosure, the values of input tensor 230(1) may be quantized into an arbitrary number of bins (e.g., 256). In this case, input tensor 230(1) and output tensor 230(2) may include a channel dimension of size 256. According to one embodiment of the present disclosure, the input data (real-valued audio samples) is quantized into a particular number of channels. The channel size may then be reduced to accommodate the number of channels used in the FFTNet using a 1×1 convolutional layer.
For example, assume the input data audio samples are of size [8000, 1]. The audio samples may be quantized using u-law to obtain quantized audio samples of size [8000, 256], for example. Suppose for purposes of this example that the FFTNet utilizes 128 channels instead of 256. In order to accommodate the 128 channels, a 1×1 convolutional layer may be utilized to transform the 256-channel audio samples into 128-channel audio samples with a resulting data size of [8000, 128]. In this example, the kernel dimension for the 1×1 convolution is [1, 1, 256, 128], with the first two dimensions being the convolution size (1×1) and the last two a fully connected network that transforms 256 channels into 128 channels.
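The shape arithmetic in this example can be verified directly. In the sketch below (illustrative, with random data), the 1×1 convolution over [8000, 256] data reduces to a matrix multiply with the trailing [256, 128] fully connected part of the kernel:

```python
import numpy as np

# Quantized input: 8000 samples, 256 one-hot-style channels (random stand-in)
quantized = np.random.default_rng(0).standard_normal((8000, 256))

# 1x1 convolution kernel of dimension [1, 1, 256, 128]
kernel = np.random.default_rng(1).standard_normal((1, 1, 256, 128))

# A 1x1 convolution acts independently per timestep: [8000, 256] @ [256, 128]
out = quantized @ kernel[0, 0]
assert out.shape == (8000, 128)
```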
A second tensor dimension encoded in input tensor 230(1) comprises a block size, which is an even number. For purposes of the present discussion, this dimension will be referred to as the block size dimension. In particular, because, as will be described below, each FFTNet block divides its input tensor 230(1) into a left and right half and generates an output tensor 230(2) having a block size dimension that is ½ the size of the block size dimension of input tensor 230(1), it must be an even number.
Referring now to the operation of FFTNet block 220, input tensor 230(1) is received by splitter 260, which splits input tensor 230(1) into a left input tensor 240(1) and a right input tensor 240(2) with respect to one of its tensor dimensions. Left input tensor 240(1) and right input tensor 240(2) are then provided to respective 1D convolvers 222, which respectively perform convolution on left input tensor 240(1) and right input tensor 240(2) to generate respective left convolved tensor 246(1) and right convolved tensor 246(2).
According to one embodiment of the present disclosure, postprocessing block 236 may further comprise a first ReLU (“Rectified Linear Unit”), 1×1 convolution block and a second ReLU. As will be appreciated, according to one embodiment of the present disclosure, first and second ReLU may implement an activation function according to the following relationship:
f(x)=x^{+}=max(x,0)
According to alternative embodiments, post-processing block 236 may implement any type of activation function(s), including a sigmoid function or tanh function. The 1×1 block may perform a 1D convolution operation.
Thus, as previously described, FFTNet block 220 receives an input tensor 230(1) having a block size dimension of size N and outputs output tensor 230(2) with the corresponding block size dimension of size N/2. A series of FFTNet blocks 220 may be arranged to perform a recursive operation in which an input tensor with block size dimension of N is processed repeatedly until the block size dimension is of size 1.
As shown in FIG. 2b, input sample block 228 may be encoded as a tensor of dimension:
 [1, BLOCK_SIZE, QUANTIZATION_LEVELS]
where BLOCK_SIZE is the number of samples processed during each iteration and QUANTIZATION_LEVELS is the number of quantization levels for quantizing audio samples. Although FIG. 2b does not depict the conversion of input sample block 228 into a tensor of the dimensionality described, it will be understood that such a conversion may take place, and according to other embodiments any other arbitrary tensor dimension may be utilized to encode the input sample block.
FFTNet block 220(1) generates output tensor 230(1), which is then provided as input to FFTNet block 220(2), which generates output tensor 230(2) as previously described with respect to FIG. 2a. A similar operation will occur with respect to each succeeding FFTNet block 220(i). Thus, each FFTNet block 220(i) receives as input the output tensor 230(i−1) of a previous FFTNet block 220(i−1) and generates output tensor 230(i). Final FFTNet block 220(N) receives output tensor 230(N−1) from FFTNet block 220(N−1) (not shown in FIG. 2b) and processes this to generate output tensor 230(N).
Output tensor 230(N) from the final FFTNet block 220(N) is provided to fully connected layer 250, which may comprise a single fully connected layer of artificial neural network nodes. Fully connected layer 250 generates fully connected layer output 234, which is provided to softmax classifier 224. Softmax classifier 224 processes fully connected layer output 234 to generate final output 232, which, according to one embodiment of the present disclosure, may comprise a single audio sample.
As previously described, input/output tensors 230(1)-230(N) of FFTNet blocks 220(1)-220(N), fully connected layer output 234 and softmax classifier output 232 may comprise tensors of a particular dimension. As will be appreciated in the field of deep learning and deep neural networks, a tensor may comprise a multidimensional array. Example tensor dimensions for input sample block 228, output tensors 230(1)-230(N), fully connected layer output 234 and final output 232 are described below.
Similar to the operation of FFTNet block 220(1), input tensor 230(2) is received by FFTNet block 220(2). In this case, input tensor 230(2) has a block size dimension of size 4. Input tensor 230(2) is split into a left tensor and a right tensor having a block size dimension of size 2, which are respectively processed by 1D convolvers 222 to generate left convolved tensor 246(3) and right convolved tensor 246(4). Left and right convolved tensors 246(3)-246(4) are summed by summer 242 to generate a composite tensor (not shown).
Training
According to one embodiment of the present disclosure, an FFTNet may be trained using backpropagation and gradient descent with an Adam optimizer in conjunction with mini-batches. In particular, according to one embodiment of the present disclosure, training sets may be generated using audio samples of the same speaker. The acoustic features F0 and MCC are extracted and interpolated to match the audio samples. At training time, batches of size [6, 4000] may be utilized in which 6 utterances are randomly selected. For each utterance, a length of 4000 audio samples is selected together with the corresponding F0 and MCC as input data. According to one embodiment of the present disclosure, the training data size is [6, 4000] for the audio samples, [6, 4000, 1] for pitch and [6, 4000, 26] for MCC.
Further, according to one embodiment of the present disclosure, 10 FFTNet blocks are utilized, resulting in a receptive field of 2048. To perform efficient training, the split-summation structure of FFTNet is utilized in conjunction with zero-padding.
Tensor Dimensions
According to one embodiment of the present disclosure, an identical operation is applied for each batch at training time. Input sample block 228 may comprise previously generated samples of dimensions [batch_size, 1024, 1] in floating-point format, where 1024 is the block size. According to one embodiment of the present disclosure, the floating-point input samples are quantized to [batch_size, 1024, 256] (i.e., 256 quantization bins), where the third dimension is the channel dimension of size 256.
For example, assume the utilization of 128 channels. Prior to the application of the first FFTNet block 220, a 1×1 convolver 222 transforms 256 bins into 128 channels:
 [batch_size, 1024, 128]
Each FFTNet block 220 reduces the length by a factor of two, so after each FFTNet block the tensor dimensions appear as follows:
 [batch_size, 512, 128]
 [batch_size, 256, 128]
 [batch_size, 128, 128]
 . . .
 [batch_size, 4, 128]
 [batch_size, 2, 128]
 [batch_size, 1, 128]
Now, the 2nd dimension can be extracted such that the tensor dimensions become:
 [batch_size, 128]
Fully connected layer 250 may then be applied. According to one embodiment of the present disclosure, fully connected layer 250 may be equivalent to a 1×1 convolution. Fully connected layer 250 may transform the FFTNet block output into 256 channels because the output is the posterior distribution of 256 quantized value bins:
 [batch_size, 256]
The final output (after another fully connected layer) may be of dimension:
 [batch_size, 256]
According to one embodiment of the present disclosure, output samples are fed back as input to the FFTNet 200 in input sample block 228. For example, assuming a sample input block 228 of size 1024, samples [1, 2, 3, . . . , 1024] are used as input to produce output sample [1025]. In the next step, the inputs [2, 3, . . . , 1025] are utilized to produce sample [1026].
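The sliding-window feedback just described can be sketched with a fixed-length buffer; the predictor below is a trivial stand-in for the FFTNet forward pass, used only to show the window mechanics:

```python
from collections import deque

# Input window of size 1024 holding samples [1, 2, ..., 1024]
window = deque(range(1, 1025), maxlen=1024)

def predict_next(samples):
    # Stand-in for the FFTNet forward pass (assumed, for illustration)
    return samples[-1] + 1

for _ in range(2):
    nxt = predict_next(list(window))
    window.append(nxt)            # maxlen automatically drops the oldest sample

# After two steps the window holds [3, 4, ..., 1026]
assert list(window)[:3] == [3, 4, 5] and list(window)[-1] == 1026
```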
According to one embodiment of the present disclosure, softmax classifier 224 may utilize a cross-entropy loss function for training. The cross-entropy loss function may be expressed as:
E=−Σ_{t }y_{t }log(ŷ_{t})
where y_{t }is the target (correct) word at each time step t and ŷ_{t }is the prediction. Typically, the full sequence may be treated as a single training example so that the total error is the sum of errors at each time step.
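A small numeric sketch of this summed cross-entropy loss (assumed toy values, with targets one-hot over three categories for brevity):

```python
import numpy as np

# Targets y_t (one-hot) and predictions y_hat_t over T=2 time steps
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

# Total error is the sum of per-timestep cross-entropy errors
loss = -np.sum(y * np.log(y_hat))
assert np.isclose(loss, -(np.log(0.7) + np.log(0.8)))
```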
According to one embodiment of the present disclosure, softmax classifier 224 may be represented as:
σ(z)_{j}=e^{z_{j}}/Σ_{k=1}^{K }e^{z_{k}}, for j=1, . . . , K
Effectively, softmax classifier 224 maps a K-dimensional vector z to a K-dimensional vector σ(z) of real values in the range [0, 1] that add up to 1, so that σ(z) exhibits the properties of a probability mass function.
Zero Padding
According to one embodiment of the present disclosure, an FFTNet 200 may employ zero-padding, which achieves the effect of a dilated convolution. In particular, given a sequence of length M, the input x_{1:M }is shifted to the right by N samples with zero padding. The N padded zeros are denoted as x_{−N:0 }where ∀j&lt;0, x_{j}=0. The equation describing each FFTNet block then becomes:
z_{0:M}=W_{L}*x_{−N:M−N}+W_{R}*x_{0:M}
According to some embodiments, experimental results demonstrate that without zero-padding, an FFTNet 200 tends to produce noise or get stuck (outputting zeros) when the inputs are near silence. Zero-padding during training allows the network to generalize to partial input. According to some embodiments, training sequences of length between 2N and 3N are utilized so that a significant number (33% to 50%) of training samples are partial sequences.
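The zero-padding equation above can be sketched concretely. In this assumed toy example (M=6, N=2, single channel, scalar weights), the left branch sees the shifted, zero-padded series x_{−N:M−N} while the right branch sees x_{0:M}:

```python
import numpy as np

M, N, C = 6, 2, 1
x = np.arange(1.0, M + 1).reshape(M, C)            # x_{0:M} = 1..6
padded = np.concatenate([np.zeros((N, C)), x])     # prepend N zeros: x_{-N:M}

W_L = np.array([[2.0]])                            # toy 1x1 conv weights
W_R = np.array([[3.0]])

# z_{0:M} = W_L * x_{-N:M-N} + W_R * x_{0:M}
z = padded[:M] @ W_L + padded[N:] @ W_R
assert z.shape == (M, C)
# e.g. z[0] = 2*0 + 3*1 = 3 (left branch still sees padding zeros)
```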
Conditional Sampling
As FFTNet 200 includes a softmax classifier 224 as the final processing element, the prediction error comes from two sources: training error and true error. The true error corresponds to noise that mostly resides in the unvoiced signal. According to one embodiment of the present disclosure, to synthesize noise, an FFTNet 200 may learn the noise's distribution via the output posterior distribution, on which random sampling may be employed to obtain the sample's value. Training error comes from the model itself. The prediction strategy that provides the minimal training error is argmax. However, argmax is not suitable for simulating signals that contain true noise, since it always chooses the center of a noise distribution, leading to zero noise in the synthesis output. Instead of using argmax universally, according to some embodiments of the present disclosure, different prediction strategies are utilized for voiced and unvoiced sounds.
Injected Noise
Because of training error, the synthesized samples always contain some amount of noise; during synthesis, the network will generate samples that get noisier over time. The output samples serve as network input to generate the next sample, adding more and more randomness to the network. When the noise builds up, the output sample might drift, leading to clicking artifacts. According to one embodiment of the present disclosure, to avoid such drift, an FFTNet 200 may be modified to be robust to noisy input samples. In particular, this is achieved by injecting random noise into the input x_{0:M} during training. According to one embodiment of the present disclosure, the amount of noise to inject into the input is based on the amount of noise the network is likely to produce. According to one embodiment of the present disclosure, based upon the observation that the prediction is often one category (out of 256) higher or lower than the ground-truth category, Gaussian noise centered at 0 with a standard deviation of 1/256 (based on 8-bit quantization) is injected.
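The injection step itself is small; a sketch under the stated parameters (zero mean, standard deviation 1/256) might look like the following. The function name is invented for the example.

```python
import numpy as np

def inject_noise(x, rng, std=1.0 / 256.0):
    """Add zero-mean Gaussian noise to training inputs; the default std
    equals one quantization step of an 8-bit (256-category) signal."""
    return x + rng.normal(0.0, std, size=x.shape)

rng = np.random.default_rng(0)
clean = np.zeros(100_000)
noisy = inject_noise(clean, rng)
```

With a large sample, the empirical mean and standard deviation of the injected noise match the specified parameters closely.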
Post-Synthesis Denoising
Experiments show that injected noise eliminates clicking artifacts almost perfectly but introduces a small amount of random noise to voiced samples. According to one embodiment of the present disclosure, spectral subtraction noise reduction is employed to reduce the injected noise for the voiced samples. The amount of reduction is proportional to the amount of noise injected during training. It is possible to apply noise reduction to the unvoiced samples too, but it may result in artifacts.
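A bare-bones, per-frame sketch of spectral subtraction is shown below, assuming a known noise magnitude spectrum; the subtraction amount `alpha` stands in for the "proportional to injected noise" rule, whose exact form the document does not give.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=1.0):
    """Minimal spectral subtraction on one frame: subtract an estimated
    noise magnitude spectrum, floor at zero, reconstruct with the
    original phase."""
    spec = np.fft.rfft(frame)
    mag = np.maximum(np.abs(spec) - alpha * noise_mag, 0.0)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(frame))

frame = np.sin(np.linspace(0.0, 10.0, 64))
identity = spectral_subtract(frame, noise_mag=0.0)  # no-op when no noise assumed
```

Real denoisers estimate `noise_mag` from silent regions and smooth across frames; this sketch only shows the core subtract-and-floor operation.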
Inference
Once trained, the FFTNet may be utilized in an inferencing application, such as voice synthesis, according to some embodiments.
Tensor Dimensions
According to one embodiment of the present disclosure, at inference time, the tensor dimensions described above with respect to training time are preserved, except that the batch size is 1.
Experimental Results
According to one embodiment of the present disclosure, four voices, two male (BDL, RMS) and two female (SLT, CLB), from the CMU Arctic dataset were used in experiments. The first 1032 utterances (out of 1132) were used for training and the remaining were used for evaluation. The waveforms were quantized to 256 categorical values based on μ-law. 25-coefficient Mel cepstral coefficients (with energy) and F0 were extracted from the original samples.
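The μ-law quantization into 256 categories can be sketched as follows (the standard companding formula for samples in [−1, 1]; the function name is invented for the example).

```python
import numpy as np

def mulaw_quantize(x, mu=255):
    """mu-law compand samples in [-1, 1] and quantize to mu+1 = 256
    categorical values (0..255)."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compand to [-1, 1]
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)      # map to 0..255

q = mulaw_quantize(np.array([-1.0, 0.0, 1.0]))
```

The mapping is monotonic, with −1 → category 0, 0 → category 128, and +1 → category 255, so small amplitudes get finer resolution than large ones.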
Four networks were constructed for each voice, 2 WaveNets and 2 FFTNets 200. For each type of network, two training strategies were employed:
Strategy One: Zero Padding Only
Strategy Two: All Training Techniques (described above)
For comparison, a WaveNet was implemented containing two stacks of 10-layer dilated convolution (d = 2^0, 2^1, . . . , 2^9) with 256 dilation channels and 128 skip channels. The total receptive field was 2048 samples. Varying numbers of channels were tested and an optimal configuration for performing vocoding was determined.
According to one embodiment of the present disclosure, an FFTNet implementation 200 utilizing 11 FFT-layers with 256 channels and a receptive field of 2048 was utilized. Such an FFTNet configuration has fewer than 1M parameters and, with proper caching, the computation cost for generating one second of audio (16 kHz) is approximately 16 GFLOPs. This means that a modern CPU could generate audio samples in real time. In each training step, a minibatch of 5×5000-sample sequences was fed to the network, optimized by the Adam algorithm with a learning rate of 0.001. The variance of injected noise was set to 1/256. In each minibatch, all sequences were drawn from different utterances.
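The 11-layer / 2048-sample relationship follows from each FFTNet layer halving the time dimension, so L layers cover 2^L input samples. A small sanity-check of that arithmetic:

```python
import math

def receptive_field(num_layers):
    """Each FFTNet layer halves the time dimension, so L layers
    cover 2**L input samples."""
    return 2 ** num_layers

def layers_for_receptive_field(target_samples):
    """Smallest layer count whose receptive field covers the target."""
    return math.ceil(math.log2(target_samples))
```

So an 11-layer FFTNet has the same 2048-sample receptive field as the two-stack, 20-layer WaveNet described above.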
The WaveNet was trained using 200,000 steps. The FFTNet was trained with 100,000 steps to ensure convergence. Based upon experiments, synthesis using FFTNet was more than 70 times faster than Fast WaveNet, requiring only 0.81 second to generate 1 second of audio on a laptop CPU (2.5 GHz Intel Core i7).
Subjective Evaluation
A Mean Opinion Score (MOS) test that asks subjects to rate the quality of the synthetic utterances was performed. Participants from the United States who had an approval rate over 90% were recruited to ensure the reliability of the study results. A validation test to ensure a subject was paying attention was also performed. Six conditions were established for each utterance as follows:
In each task (called a HIT), a subject was presented with 32 different sentences, 24 of which were composed of 4 instances from each of the above 6 conditions. From a held-out set of sentences, 4 more instances of the “Real” condition and 4 more instances of a badly edited “Fake” (3-bit A-law encoded) condition were drawn to validate that the subject was paying attention and not guessing randomly. For the data to be retained, the subject was allowed to make at most one mistake on these validation tests, by either rating <3 on “Real” examples or >3 on “Fake” examples. 480 HITs (120 per voice) were launched and 446 were retained after validation.
Objective Evaluation
A distortion measurement between the original and the synthesized speech using RMSE and MCD was performed. RMSE measures the frequency-domain difference between two signals, and MCD measures the difference in the cepstral domain, which reflects whether the synthesized speech captures the characteristics of the original speech. Both measurements are in dB. The result is shown in the following table:
The result shows that MLSA tends to preserve most of the cepstral and spectral structure while the MOS test puts it in a significantly lower tier as it generates audible oversmoothing artifacts. The training techniques described above do not reduce distortion in WaveNet, but they significantly improve FFTNet in both metrics. WaveNet with the proposed techniques performs significantly better in subjective evaluation than the one without.
Integration in Computing System and Network Environment
It will be understood that network 510 may comprise any type of public and/or private network, including the Internet, LANs, WANs, or some combination of such networks. In this example case, computing device 500 is a server computer, and client application 512 may be any typical personal computing platform.
As will be further appreciated, computing device 500, whether the one shown in
In some example embodiments of the present disclosure, the various functional modules described herein, and specifically training and/or testing of network 340, may be implemented in software, such as a set of instructions (e.g., HTML, XML, C, C++, Objective-C, JavaScript, Java, BASIC, etc.) encoded on any non-transitory computer readable medium or computer program product (e.g., hard drive, server, disc, or other suitable non-transitory memory or set of memories), that when executed by one or more processors, cause the various creator recommendation methodologies provided herein to be carried out.
In still other embodiments, the techniques provided herein are implemented using softwarebased engines. In such embodiments, an engine is a functional unit including one or more processors programmed or otherwise configured with instructions encoding a creator recommendation process as variously provided herein. In this way, a softwarebased engine is a functional circuit.
In still other embodiments, the techniques provided herein are implemented with hardware circuits, such as gate-level logic (FPGA) or a purpose-built semiconductor (e.g., application specific integrated circuit, or ASIC). Still other embodiments are implemented with a microcontroller having a processor, a number of input/output ports for receiving and outputting data, and a number of embedded routines executed by the processor for carrying out the functionality provided herein. In a more general sense, any suitable combination of hardware, software, and firmware can be used, as will be apparent. As used herein, a circuit is one or more physical components and is functional to carry out a task. For instance, a circuit may be one or more processors programmed or otherwise configured with a software module, or a logic-based hardware circuit that provides a set of outputs in response to a certain set of input stimuli. Numerous configurations will be apparent.
The foregoing description of example embodiments of the disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims appended hereto.
The following examples pertain to further embodiments, from which numerous permutations and configurations will be apparent.
Example 1 is a method for generating speech samples, the method comprising receiving an input tensor, splitting said received input tensor into a first portion and a second portion, performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result, summing said first intermediate result and said second intermediate result to generate a third intermediate result, applying a postprocessing function on said third intermediate result to generate a fourth intermediate result, computing an output tensor by summing said received input tensor with said fourth intermediate result, recursing by setting said input tensor to said output tensor until said output tensor is of size one in a predetermined dimension, and, performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension.
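The split / 1×1-convolve / sum / postprocess recursion of Example 1 can be sketched as below. This is a hedged illustration with invented names and weight shapes; the claimed residual summation of the input tensor with the fourth intermediate result is omitted here, because the halves differ in length from the input and the claim does not specify how that dimension mismatch is resolved.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def fftnet_recurse(x, W_L, W_R, W_post):
    """Sketch of Example 1's recursion (without the residual step):
    split the (C, M) input in half along time, apply a 1x1 convolution
    to each half, sum, postprocess (ReLU -> 1x1 conv -> ReLU), and
    recurse until the time dimension has size one."""
    while x.shape[1] > 1:
        half = x.shape[1] // 2
        left, right = x[:, :half], x[:, half:]   # first / second portion
        z = W_L @ left + W_R @ right             # 1x1 convs + summation
        x = relu(W_post @ relu(z))               # postprocessing function
    return x  # size one in the time dimension

C, M = 4, 8
rng = np.random.default_rng(1)
x = rng.standard_normal((C, M))
W_L = rng.standard_normal((C, C))
W_R = rng.standard_normal((C, C))
W_post = rng.standard_normal((C, C))
out = fftnet_recurse(x, W_L, W_R, W_post)
```

In a full system, this size-one output would feed the fully connected layer and softmax classifier of Example 2 to predict the next speech sample.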
Example 2 includes the subject matter of Example 1, wherein performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension further comprises processing said output tensor by a fully connected neural network layer to generate a fifth intermediate result, and, applying a softmax classifier to said fifth intermediate result to generate a speech sample.
Example 3 includes the subject matter of Example 1 or 2, wherein said input tensor comprises a onehot vector comprising a plurality of channels, wherein each channel is set to 0 except for a single channel corresponding to a quantization value of an audio signal.
Example 4 includes the subject matter of Example 1, 2 or 3 wherein said postprocessing function comprises a first nonlinear activation function followed by a 1×1 convolution followed by a second nonlinear activation function.
Example 5 includes the subject matter of Example 4, wherein said first and second nonlinear activation functions are ReLU (“Rectified Linear Unit”) activation functions.
Example 6 includes the subject matter of Example 1, 2, 3, 4 or 5 further comprising during a training operation performing a zeropadding operation.
Example 7 includes the subject matter of Example 6, wherein said zeropadding operation comprises shifting said input tensor to the right by N samples, wherein said N samples are set to 0.
Example 8 is a system for generating speech samples comprising a plurality of FFTNet blocks arranged in series, wherein each FFTNet block includes a splitter module, a convolution module, a summation block, and a postprocessing module, wherein said postprocessing module generates an output based upon said composite tensor, a fully connected layer, wherein said fully connected layer is coupled to a last FFTNet block in said series; and, a softmax classifier coupled to an output of said fully connected layer.
Example 9 includes the subject matter of Example 8, wherein said postprocessing module comprises a first activation function block followed by a 1×1 convolution block followed by a second activation block.
Example 10 includes the subject matter of Example 9, wherein said first and second activation blocks implement a ReLU activation function.
Example 11 includes the subject matter of Example 8, wherein said convolution module performs a 1×1 convolution.
Example 12 includes the subject matter of Example 8, wherein said splitter module splits an input tensor into a left tensor and a right tensor.
Example 13 includes the subject matter of Example 12 wherein said convolution module performs a convolution upon said left tensor and said right tensor to generate a respective convolved left tensor and a convolved right tensor.
Example 14 includes the subject matter of Example 13, wherein said summation block generates a composite tensor based upon the convolved left tensor and the convolved right tensor.
Example 15 is a computer program product including one or more nontransitory machinereadable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating speech samples, the process comprising receiving an input tensor, splitting said received input tensor into a first portion and a second portion, performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result, summing said first intermediate result and said second intermediate result to generate a third intermediate result, applying a postprocessing function on said third intermediate result to generate a fourth intermediate result, computing an output tensor by summing said received input tensor with said fourth intermediate result, recursing by setting said input tensor to said output tensor until said output tensor is of size one in a predetermined dimension, and, performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension.
Example 16 includes the subject matter of Example 15, wherein performing a prediction of a speech sample using said output tensor of size one in a predetermined dimension further comprises processing said output tensor by a fully connected neural network layer to generate a fifth intermediate result, and, applying a softmax classifier to said fifth intermediate result to generate a speech sample.
Example 17 includes the subject matter of Example 15 or 16, wherein said input tensor comprises a onehot vector comprising a plurality of channels, wherein each channel is set to 0 except for a single channel corresponding to a quantization value of an audio signal.
Example 18 includes the subject matter of Example 15, 16 or 17, wherein said postprocessing function comprises a first nonlinear activation function followed by a 1×1 convolution followed by a second nonlinear activation function.
Example 19 includes the subject matter of Example 18, wherein said first and second nonlinear activation functions are ReLU (“Rectified Linear Unit”) activation functions.
Example 20 includes the subject matter of Example 15, 16, 17, 18 or 19 further comprising during a training operation performing a zeropadding operation.