REAL-TIME SPEAKER-DEPENDENT NEURAL VOCODER
Abstract
Techniques are disclosed for a recursive deep-learning approach to speech synthesis. The approach uses a repeatable structure that splits an input tensor into a left half and a right half, similar to the operation of the Fast Fourier Transform, performs a 1-D convolution on each respective half, sums the results, and then applies a post-processing function. The repeatable structure may be used in a series configuration to operate as a vocoder or to perform other speech processing functions.
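As a rough illustration of the repeatable structure described in the abstract, the following NumPy sketch splits a tensor into left and right halves, applies a separate 1×1 convolution to each half, sums the two results, and applies a post-processing function. The function name `fftnet_block`, the (channels, time) tensor layout, the random untrained weights, and the use of ReLU as the post-processing function are all assumptions made for illustration; none of them is taken from the specification. Because each step halves the time dimension, an input of length 2^k reaches size one after k steps.

```python
import numpy as np

def fftnet_block(x, w_left, w_right):
    """One split / 1x1-conv / sum / post-process step (illustrative only).

    x       : array of shape (channels, time); time must be even
    w_left  : (channels, channels) weights applied to the left half
    w_right : (channels, channels) weights applied to the right half
    """
    half = x.shape[1] // 2
    left, right = x[:, :half], x[:, half:]
    # A 1x1 convolution mixes channels independently at each time step,
    # so it reduces to a matrix product over the channel axis.
    z = w_left @ left + w_right @ right
    # Assumed post-processing function: ReLU.
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
channels, time = 4, 8
x = rng.standard_normal((channels, time))

# Apply the block repeatedly until the time dimension has size one.
steps = 0
while x.shape[1] > 1:
    w_l = rng.standard_normal((channels, channels))
    w_r = rng.standard_normal((channels, channels))
    x = fftnet_block(x, w_l, w_r)
    steps += 1

print(x.shape, steps)  # (4, 1) 3
```

Each pass uses fresh weights here, which corresponds to a series of distinct blocks rather than one shared block; that choice is also an assumption of the sketch.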
20 Claims
1. A method for generating speech samples, the method comprising:
receiving an input tensor;
splitting said received input tensor into a first portion and a second portion;
performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
summing said first intermediate result and said second intermediate result to generate a third intermediate result;
applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
computing an output tensor by summing said received input tensor with said fourth intermediate result;
recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for generating speech samples comprising:
a plurality of FFTNet blocks arranged in series, wherein each FFTNet block includes a splitter module, a convolution module, a summation block, and a post-processing module, wherein said post-processing module generates an output based upon said composite tensor;
a fully connected layer, wherein said fully connected layer is coupled to a last FFTNet block in said series; and
a softmax classifier coupled to an output of said fully connected layer.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating speech samples, the process comprising:
receiving an input tensor;
splitting said received input tensor into a first portion and a second portion;
performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
summing said first intermediate result and said second intermediate result to generate a third intermediate result;
applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
computing an output tensor by summing said received input tensor with said fourth intermediate result;
recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
Dependent claims: 16, 17, 18, 19, 20
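The system arrangement recited in claim 8 (blocks in series, followed by a fully connected layer and a softmax classifier) might be sketched as follows. Every name here (`FFTNetBlock`, `TinyVocoder`, `predict`) is hypothetical, as are the (channels, time) layout, the random untrained weights, the ReLU post-processing, and the 256 output classes, which simply stand in for some quantized sample alphabet. The claims do not specify these details, so this is a sketch under stated assumptions rather than the patented implementation.

```python
import numpy as np

class FFTNetBlock:
    """Splitter, per-half 1x1 convolution, summation, post-processing."""

    def __init__(self, channels, rng):
        scale = 1.0 / np.sqrt(channels)
        self.w_left = rng.standard_normal((channels, channels)) * scale
        self.w_right = rng.standard_normal((channels, channels)) * scale

    def __call__(self, x):
        half = x.shape[1] // 2
        left, right = x[:, :half], x[:, half:]         # splitter module
        z = self.w_left @ left + self.w_right @ right  # convolution + summation
        return np.maximum(z, 0.0)                      # assumed post-processing (ReLU)

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

class TinyVocoder:
    """Blocks in series, then a fully connected layer and a softmax."""

    def __init__(self, channels, receptive_field, n_classes, seed=0):
        # The halving scheme needs a power-of-two input length.
        assert receptive_field > 0
        assert receptive_field & (receptive_field - 1) == 0
        rng = np.random.default_rng(seed)
        n_blocks = receptive_field.bit_length() - 1  # log2 of a power of two
        self.blocks = [FFTNetBlock(channels, rng) for _ in range(n_blocks)]
        self.w_fc = rng.standard_normal((n_classes, channels)) * 0.1
        self.b_fc = np.zeros(n_classes)

    def predict(self, x):
        for block in self.blocks:   # series arrangement
            x = block(x)
        features = x[:, 0]          # time dimension is now of size one
        return softmax(self.w_fc @ features + self.b_fc)

model = TinyVocoder(channels=4, receptive_field=16, n_classes=256)
context = np.random.default_rng(1).standard_normal((4, 16))
probs = model.predict(context)
sample_class = int(np.argmax(probs))  # index of the most likely class
print(probs.shape, sample_class)
```

Taking the argmax is only one way to turn the class probabilities into a predicted sample; sampling from the distribution would be an equally valid choice for this sketch.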
Specification