Real-time speaker-dependent neural vocoder
First Claim
1. A method for generating speech samples, the method comprising:
- receiving an input tensor;
- splitting said received input tensor into a first portion and a second portion;
- performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
- summing said first intermediate result and said second intermediate result to generate a third intermediate result;
- applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
- computing an output tensor by summing said received input tensor with said fourth intermediate result;
- recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
- performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
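The steps recited in claim 1 can be illustrated with a minimal sketch in plain Python. It is not the patented implementation, and several details are assumptions that the claim does not state: tensors are nested lists shaped `[channels][time]`, the time length is a power of two, the post-processing function is a ReLU, and the residual sum adds the second portion of the input so that the shapes match. The names `conv1x1`, `fftnet_step`, and `reduce_to_size_one` are hypothetical.

```python
def conv1x1(weight, x):
    """1x1 convolution: mix channels independently at every time step.

    weight is [out_channels][in_channels]; x is [in_channels][time].
    """
    steps = len(x[0])
    return [[sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
            for row in weight]


def relu(x):
    # Assumed post-processing function; the claim leaves it unspecified.
    return [[max(0.0, v) for v in row] for row in x]


def add(a, b):
    """Element-wise sum of two equally shaped [channels][time] tensors."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def fftnet_step(x, w_first, w_second):
    """One pass of split, 1x1 convolution, sum, post-process, residual sum."""
    half = len(x[0]) // 2
    first = [row[:half] for row in x]    # first portion
    second = [row[half:] for row in x]   # second portion
    third = add(conv1x1(w_first, first), conv1x1(w_second, second))
    fourth = relu(third)
    # Assumption: the residual uses the second portion of the input,
    # because the full input is twice as long as `fourth`.
    return add(second, fourth)


def reduce_to_size_one(x, layer_weights):
    """Repeat the step until the time dimension has size one."""
    layer = 0
    while len(x[0]) > 1:
        w_first, w_second = layer_weights[layer]
        x = fftnet_step(x, w_first, w_second)
        layer += 1
    return x


# One channel, four time steps, identity weights in both layers.
identity = [[1.0]]
result = reduce_to_size_one([[1, 2, 3, 4]], [(identity, identity)] * 2)
print(result)  # → [[27.0]]
```

Each pass halves the time dimension, so an input of length N needs log2(N) passes. The size-one result would then feed whatever predictor produces the speech sample.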
3 Assignments
0 Petitions
Abstract
Techniques are disclosed for a recursive deep-learning approach to speech synthesis. The approach uses a repeatable structure that splits an input tensor into a left half and a right half, similar to the operation of the Fast Fourier Transform, performs a 1-D convolution on each half, sums the results, and then applies a post-processing function. The repeatable structure may be used in a series configuration to operate as a vocoder or to perform other speech processing functions.
26 Citations
17 Claims
1. A method for generating speech samples, the method comprising:
- receiving an input tensor;
- splitting said received input tensor into a first portion and a second portion;
- performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
- summing said first intermediate result and said second intermediate result to generate a third intermediate result;
- applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
- computing an output tensor by summing said received input tensor with said fourth intermediate result;
- recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
- performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
Dependent claims: 2, 3, 4, 5, 6, 7.
8. A system for generating speech samples comprising:
- a plurality of FFTNet blocks arranged in series, wherein each FFTNet block includes a splitter module that splits an input tensor into left and right tensors, a convolution module that performs a convolution upon said left and right tensors to generate respective convolved left and right tensors, a summation block that generates a composite tensor based on the convolved left and right tensors, and a post-processing module, wherein said post-processing module generates an output tensor based upon said composite tensor, and wherein said plurality of FFTNet blocks recurse by setting said input tensor of one of said FFTNet blocks to said output tensor of another of said FFTNet blocks;
- a fully connected layer, wherein said fully connected layer is coupled to a last FFTNet block in said series; and
- a softmax classifier coupled to an output of said fully connected layer.
Dependent claims: 9, 10, 11.
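The arrangement recited in claim 8 can be sketched roughly as follows, again in plain Python and under stated assumptions rather than as the patented design. The class names `FFTNetBlock` and `VocoderSketch`, the random weight initialization, the ReLU post-processing, the use of a 1×1 convolution, and the `num_classes` parameter are all illustrative choices that the claim does not specify. The sketch also assumes the number of blocks equals log2 of the input length, so the last block emits a single time step.

```python
import math
import random


class FFTNetBlock:
    """Splitter, convolution, summation and post-processing in one unit."""

    def __init__(self, channels, rng):
        def init():
            return [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
                    for _ in range(channels)]
        self.w_left = init()
        self.w_right = init()

    @staticmethod
    def _conv(weight, x):
        # Assumed 1x1 convolution over a [channels][time] tensor.
        steps = len(x[0])
        return [[sum(w * x[c][t] for c, w in enumerate(row))
                 for t in range(steps)] for row in weight]

    def __call__(self, x):
        half = len(x[0]) // 2
        left = [row[:half] for row in x]            # splitter module
        right = [row[half:] for row in x]
        conv_l = self._conv(self.w_left, left)      # convolution module
        conv_r = self._conv(self.w_right, right)
        composite = [[a + b for a, b in zip(rl, rr)]  # summation block
                     for rl, rr in zip(conv_l, conv_r)]
        # Post-processing module (assumed ReLU).
        return [[max(0.0, v) for v in row] for row in composite]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class VocoderSketch:
    """FFTNet blocks in series, then a fully connected layer and softmax."""

    def __init__(self, channels, num_blocks, num_classes, seed=0):
        rng = random.Random(seed)
        self.blocks = [FFTNetBlock(channels, rng) for _ in range(num_blocks)]
        self.fc = [[rng.uniform(-0.5, 0.5) for _ in range(channels)]
                   for _ in range(num_classes)]

    def __call__(self, x):
        for block in self.blocks:  # each block's output feeds the next block
            x = block(x)
        features = [row[0] for row in x]  # time dimension is now size one
        logits = [sum(w * f for w, f in zip(row, features)) for row in self.fc]
        return softmax(logits)


rng = random.Random(1)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
model = VocoderSketch(channels=4, num_blocks=3, num_classes=8)
probabilities = model(frames)
print(len(probabilities), round(sum(probabilities), 6))
```

The softmax output is a probability distribution over `num_classes` candidate values, from which a speech sample could be chosen, for example by sampling or by taking the most likely class.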
12. A computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating speech samples, the process comprising:
- receiving an input tensor;
- splitting said received input tensor into a first portion and a second portion;
- performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
- summing said first intermediate result and said second intermediate result to generate a third intermediate result;
- applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
- computing an output tensor by summing said received input tensor with said fourth intermediate result;
- recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
- performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
Dependent claims: 13, 14, 15, 16, 17.
Specification