Artificial intelligence-based text-to-speech system and method
Abstract
A technique improves the training and speech quality of a text-to-speech (TTS) system that uses artificial intelligence, such as a neural network. The TTS system is organized as a front-end subsystem and a back-end subsystem. The front-end subsystem is configured to provide analysis and conversion of text into input vectors, each having at least a base frequency, f0, a phoneme duration, and a phoneme sequence, that are processed by a signal generation unit of the back-end subsystem. The signal generation unit includes the neural network interacting with a pre-existing knowledgebase of phonemes to generate audible speech from the input vectors. The technique applies an error signal from the neural network to correct imperfections of the pre-existing knowledgebase of phonemes when generating audible speech signals. Speech-signal-specific modelling techniques, combined with applied psychoacoustic principles, improve the training efficiency of the neural network, with a positive impact on the quality of the generated speech signals.
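The error-signal correction described in the abstract can be pictured as a residual added to a baseline signal. The following is only a minimal sketch under stated assumptions, not the patented implementation: the names `correct_with_error_signal` and `toy_error_model` are hypothetical, signals are plain lists of samples, and the neural network is replaced by a toy function that happens to know the baseline is attenuated by half.

```python
from typing import Callable, List, Sequence

Signal = List[float]


def correct_with_error_signal(
    baseline: Sequence[float],
    predict_error: Callable[[Sequence[float]], Sequence[float]],
) -> Signal:
    """Add a predicted error signal to a baseline signal, sample by sample.

    `baseline` stands in for the signal produced from the knowledgebase of
    phonemes; `predict_error` stands in for the neural network (assumption).
    """
    error = predict_error(baseline)
    if len(error) != len(baseline):
        raise ValueError("error signal must have the same length as the baseline")
    return [b + e for b, e in zip(baseline, error)]


def toy_error_model(baseline: Sequence[float]) -> Signal:
    """Hypothetical stand-in for a trained network.

    It assumes the baseline is the target attenuated by half, so the
    missing part (the error) equals the baseline itself.
    """
    return list(baseline)


# Baseline is a half-amplitude version of the target [0.0, 1.0, 0.0, -1.0].
baseline = [0.0, 0.5, 0.0, -0.5]
corrected = correct_with_error_signal(baseline, toy_error_model)
print(corrected)  # [0.0, 1.0, 0.0, -1.0]
```

In a real system the error model would be learned from data rather than written by hand; the sketch only shows where such a correction would enter the signal path.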
20 Claims
1. A text-to-speech (TTS) system comprising:
a front-end subsystem configured to provide analysis and conversion of text into an input vector having a base frequency for a phoneme, a phoneme duration, and a phoneme sequence; and
a back-end subsystem coupled to the front-end subsystem and configured to convert the input vector of the base frequency, the phoneme duration and the phoneme sequence into an intermediate vector for processing by a signal generation unit of the back-end subsystem, the signal generation unit having a neural network interacting with a pre-existing knowledgebase of phonemes, wherein the signal generation unit is configured to use the neural network interacting with the pre-existing knowledgebase of phonemes to apply an error signal to correct for speech signal distortions of the pre-existing knowledgebase of phonemes to generate the speech signal.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

11. A method of processing text-to-speech comprising:
receiving an input vector having a base frequency for a phoneme, a phoneme duration for the phoneme, and a phoneme sequence;
upsampling the input vector of the base frequency, the phoneme duration and the phoneme sequence into an intermediate vector;
generating a speech signal from the intermediate vector using a pre-existing knowledgebase of phonemes; and
applying an error signal from a neural network to correct for speech signal distortions of the speech signal based on an interaction between the neural network and the pre-existing knowledgebase.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19

20. A non-transitory computer-readable medium having program instructions which, when executed across one or more processors, cause at least a portion of the one or more processors to perform operations comprising:
receiving an input vector having a base frequency for a phoneme, a phoneme duration for the phoneme, and a phoneme sequence;
upsampling the input vector of the base frequency, the phoneme duration and the phoneme sequence into an intermediate vector;
generating a speech signal from the intermediate vector using a pre-existing knowledgebase of phonemes; and
applying an error signal from a neural network to correct for speech signal distortions of the speech signal based on interactions between the neural network and the pre-existing knowledgebase.
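
The upsampling step recited in claims 11 and 20 turns a phoneme-level input vector into a finer-grained intermediate vector. The claims do not say how this is done, so the following is only an illustrative sketch of one common approach, repeating each phoneme's features once per fixed-length frame. The names `PhonemeInput`, `Frame`, and `upsample`, the 5 ms frame length, and the relative-position feature are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PhonemeInput:
    """One phoneme-level entry of the input vector (hypothetical layout)."""
    symbol: str       # phoneme label from the phoneme sequence
    duration_ms: int  # phoneme duration in milliseconds
    f0_hz: float      # base frequency f0 for the phoneme (0.0 if unvoiced)


@dataclass(frozen=True)
class Frame:
    """One frame-level entry of the intermediate vector (hypothetical layout)."""
    symbol: str
    f0_hz: float
    position: float   # relative position within the phoneme, in [0, 1)


def upsample(inputs: List[PhonemeInput], frame_ms: int = 5) -> List[Frame]:
    """Expand phoneme-level inputs to frame level by duration-based repetition."""
    frames: List[Frame] = []
    for p in inputs:
        # Whole frames only (integer division); every phoneme gets at least one.
        n = max(1, p.duration_ms // frame_ms)
        for i in range(n):
            frames.append(Frame(p.symbol, p.f0_hz, i / n))
    return frames


# Two phonemes, 10 ms and 20 ms long, become 2 + 4 = 6 frames of 5 ms each.
frames = upsample([PhonemeInput("h", 10, 0.0), PhonemeInput("a", 20, 120.0)])
print(len(frames))  # 6
```

A frame-level sequence of this kind is what a signal generation stage would then consume; a real system would likely carry richer per-frame features than the three shown here.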
Specification