Real-time speaker-dependent neural vocoder

US 10,770,063 B2
Filed: 08/22/2018
Issued: 09/08/2020
Est. Priority Date: 04/13/2018
Status: Active Grant

Abstract

Techniques are disclosed for a recursive deep-learning approach to speech synthesis. The approach uses a repeatable structure that splits an input tensor into a left half and a right half, similar to the operation of the Fast Fourier Transform, performs a 1-D convolution on each respective half, sums the results, and then applies a post-processing function. The repeatable structure may be utilized in a series configuration to operate as a vocoder or to perform other speech processing functions.

26 Citations

17 Claims

1. A method for generating speech samples, the method comprising:
- receiving an input tensor;
- splitting said received input tensor into a first portion and a second portion;
- performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
- summing said first intermediate result and said second intermediate result to generate a third intermediate result;
- applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
- computing an output tensor by summing said received input tensor with said fourth intermediate result;
- recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
- performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method according to claim 1, wherein performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension further comprises:
    - processing said output tensor by a fully connected neural network layer to generate a fifth intermediate result; and
    - applying a softmax classifier to said fifth intermediate result to generate a speech sample.
  - 3. The method according to claim 1, wherein said input tensor comprises a one-hot vector comprising a plurality of channels, wherein each channel is set to 0 except for a single channel corresponding to a quantization value of an audio signal.
  - 4. The method according to claim 1, wherein said post-processing function comprises a first non-linear activation function followed by a 1×1 convolution followed by a second non-linear activation function.
  - 5. The method according to claim 4, wherein said first and second non-linear activation functions are ReLU (“Rectified Linear Unit”) activation functions.
  - 6. The method according to claim 1, further comprising during a training operation performing a zero-padding operation.
  - 7. The method according to claim 6, wherein said zero-padding operation comprises shifting said input tensor to the right by N samples, wherein said N samples are set to 0.
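
The steps recited in claims 1, 4 and 5 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the `(channels, time)` layout, the function names, the random placeholder weights and the power-of-two time length are all assumptions made for illustration. In particular, the claim sums the input tensor with the fourth intermediate result without saying how the differing lengths are reconciled; the sketch assumes the second (right) portion of the input is used for that sum.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, weight, bias):
    """1x1 convolution on a (channels, time) tensor: one linear map applied at every time step."""
    return weight @ x + bias[:, None]

def make_params(channels, num_blocks, rng):
    """Random placeholder weights (hypothetical; a trained model would supply these)."""
    def layer():
        return rng.standard_normal((channels, channels)) * 0.1, np.zeros(channels)
    return [{"left": layer(), "right": layer(), "post": layer()} for _ in range(num_blocks)]

def block(x, p):
    """One pass of the claimed steps; halves the time dimension."""
    if x.shape[1] % 2:
        raise ValueError("time length must be even")
    half = x.shape[1] // 2
    first, second = x[:, :half], x[:, half:]                           # split
    third = conv1x1(first, *p["left"]) + conv1x1(second, *p["right"])  # 1x1 convs, then sum
    fourth = relu(conv1x1(relu(third), *p["post"]))                    # post-processing (claims 4, 5)
    # ASSUMPTION: the residual sum uses the second portion so the shapes agree.
    return second + fourth

def recurse_to_size_one(x, params):
    """Feed each output back in as the next input until the time dimension has size one."""
    depth = 0
    while x.shape[1] > 1:
        x = block(x, params[depth])
        depth += 1
    return x

rng = np.random.default_rng(0)
params = make_params(channels=4, num_blocks=3, rng=rng)
x = rng.standard_normal((4, 8))          # 8 time steps need log2(8) = 3 blocks
out = recurse_to_size_one(x, params)
print(out.shape)  # (4, 1)
```

Under these assumptions an input of 2^K time steps reaches size one after K passes, which mirrors the FFT-like halving described in the abstract.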

8. A system for generating speech samples comprising:
- a plurality of FFTNet blocks arranged in series, wherein each FFTNet block includes
  - a splitter module that splits an input tensor into left and right tensors,
  - a convolution module that performs a convolution upon said left and right tensors to generate respective convolved left and right tensors,
  - a summation block that generates a composite tensor based on the convolved left and right tensors, and
  - a post-processing module, wherein said post-processing module generates an output tensor based upon said composite tensor,

  and wherein said plurality of FFTNet blocks recurse by setting said input tensor of one of said FFTNet blocks to said output tensor of another of said FFTNet blocks;
- a fully connected layer, wherein said fully connected layer is coupled to a last FFTNet block in said series; and
- a softmax classifier coupled to an output of said fully connected layer.
- Dependent claims (9, 10, 11):
  - 9. The system according to claim 8, wherein said post-processing module comprises a first activation function block followed by a 1×1 convolution block followed by a second activation block.
  - 10. The system according to claim 9, wherein said first and second activation blocks implement a ReLU activation function.
  - 11. The system according to claim 8, wherein said convolution module performs a 1×1 convolution.
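
The fully connected layer and softmax classifier of claim 8 (also recited in claim 2) could be sketched as follows. This is an illustrative assumption rather than the claimed system: the weights are random placeholders, 256 output classes (one per quantization value) is an assumed figure, and the claims do not say whether the sample is taken by argmax or by drawing from the distribution, so the hypothetical `predict_sample` supports both.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def predict_sample(features, w_fc, b_fc, rng=None):
    """Map the last block's output (size one in time) to a quantization index.

    features: (channels,) vector; w_fc: (num_levels, channels); b_fc: (num_levels,).
    ASSUMPTION: argmax when rng is None, otherwise sample from the softmax distribution.
    """
    probs = softmax(w_fc @ features + b_fc)   # fully connected layer, then softmax
    if rng is None:
        index = int(np.argmax(probs))
    else:
        index = int(rng.choice(len(probs), p=probs))
    return index, probs

rng = np.random.default_rng(0)
channels, num_levels = 4, 256                 # 256 levels is an assumed quantization depth
last_output = rng.standard_normal((channels, 1))
w_fc = rng.standard_normal((num_levels, channels)) * 0.1
b_fc = np.zeros(num_levels)
index, probs = predict_sample(last_output[:, 0], w_fc, b_fc, rng=rng)
print(index, probs.shape)
```

Sampling keeps the stochastic character of the predicted distribution, while argmax gives a deterministic result; which one suits a given vocoder is a design choice the claims leave open.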

12. A computer program product including one or more non-transitory machine-readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating speech samples, the process comprising:
- receiving an input tensor;
- splitting said received input tensor into a first portion and a second portion;
- performing a 1×1 convolution respectively on said first portion and said second portion to generate a respective first intermediate result and a second intermediate result;
- summing said first intermediate result and said second intermediate result to generate a third intermediate result;
- applying a post-processing function on said third intermediate result to generate a fourth intermediate result;
- computing an output tensor by summing said received input tensor with said fourth intermediate result;
- recursing by setting said input tensor to said output tensor until said output tensor is of size one in a pre-determined dimension; and
- performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension.
- Dependent claims (13, 14, 15, 16, 17):
  - 13. The computer program product according to claim 12, wherein performing a prediction of a speech sample using said output tensor of size one in a pre-determined dimension further comprises:
    - processing said output tensor by a fully connected neural network layer to generate a fifth intermediate result; and
    - applying a softmax classifier to said fifth intermediate result to generate a speech sample.
  - 14. The computer program product according to claim 12, wherein said input tensor comprises a one-hot vector comprising a plurality of channels, wherein each channel is set to 0 except for a single channel corresponding to a quantization value of an audio signal.
  - 15. The computer program product according to claim 12, wherein said post-processing function comprises a first non-linear activation function followed by a 1×1 convolution followed by a second non-linear activation function.
  - 16. The computer program product according to claim 15, wherein said first and second non-linear activation functions are ReLU (“Rectified Linear Unit”) activation functions.
  - 17. The computer program product according to claim 12, further comprising during a training operation performing a zero-padding operation.
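
The one-hot input of claims 3 and 14 and the zero-padding of claims 6, 7 and 17 can be sketched in NumPy as below. The function names, the `(channels, time)` layout and the choice of 4 quantization levels are illustrative assumptions. The claims do not say how the audio is quantized, so the sketch starts from integer quantization indices. It also assumes that shifting right by N samples means prepending N zero columns and keeping every original sample; truncating back to the original length would be another reading.

```python
import numpy as np

def one_hot(indices, num_levels):
    """Encode quantization indices as a (num_levels, time) tensor.

    Each column is 0 in every channel except the one matching the quantization value.
    """
    indices = np.asarray(indices)
    encoded = np.zeros((num_levels, indices.size))
    encoded[indices, np.arange(indices.size)] = 1.0
    return encoded

def shift_right_with_zeros(x, n):
    """Shift a (channels, time) tensor right by n samples, filling the first n with 0.

    ASSUMPTION: the tensor grows by n columns; no original samples are dropped.
    """
    return np.pad(x, ((0, 0), (n, 0)))  # default mode pads with constant zeros

quantized = [2, 0, 1]                    # hypothetical quantization indices
x = one_hot(quantized, num_levels=4)     # shape (4, 3)
padded = shift_right_with_zeros(x, n=2)  # shape (4, 5)
print(x.shape, padded.shape)  # (4, 3) (4, 5)
```

Note that the padded columns are all-zero rather than one-hot, so they match no quantization value.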


Current Assignee: Adobe Inc.; The Trustees of Princeton University (Princeton University)
Original Assignee: Adobe Inc.
Inventors: Jin, Zeyu; Mysore, Gautham J.; Lu, Jingwan; Finkelstein, Adam
Primary Examiner(s): Baker, Matthew H
Application Number: US16/108,996
Publication Number: US 20190318726A1
Time in Patent Office: 748 Days
CPC Class Codes
- G06F 17/142   Fast Fourier transforms, e....
- G06N 3/04   Architecture, e.g. intercon...
- G06N 3/045   Combinations of networks
- G06N 3/048   Activation functions
- G06N 3/08   Learning methods
- G06N 3/084   Backpropagation, e.g. using...
- G10L 13/02   Methods for producing synth...
- G10L 15/16   using artificial neural net...
- G10L 15/22   Procedures used during a sp...
