Artificial intelligence-based text-to-speech system and method
Abstract
A technique improves training and speech quality of a text-to-speech (TTS) system having an artificial intelligence, such as a neural network. The TTS system is organized as a front-end subsystem and a back-end subsystem. The front-end subsystem is configured to provide analysis and conversion of text into input vectors, each having at least a base frequency, f0, a phoneme duration, and a phoneme sequence that is processed by a signal generation unit of the back-end subsystem. The signal generation unit includes the neural network interacting with a pre-existing knowledgebase of phonemes to generate audible speech from the input vectors. The technique applies an error signal from the neural network to correct imperfections of the pre-existing knowledgebase of phonemes to generate audible speech signals. A back-end training system is configured to train the signal generation unit by applying psychoacoustic principles to improve quality of the generated audible speech signals.
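The data flow described in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the names `InputVector`, `raw_signal`, `corrected_signal`, and `error_model` are hypothetical, durations are assumed to be counted in samples, the knowledgebase is reduced to a dictionary of short sample lists, and "applying" the error signal is assumed to mean adding it sample by sample.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class InputVector:
    """Hypothetical front-end output: base frequency, phoneme sequence, durations."""
    f0_hz: float
    phonemes: List[str]
    durations: List[int]  # one duration per phoneme, in samples (assumed unit)


# Toy stand-in for the pre-existing knowledgebase: phoneme -> stored samples.
Knowledgebase = Dict[str, List[float]]


def raw_signal(vec: InputVector, kb: Knowledgebase) -> List[float]:
    """Concatenate knowledgebase units, repeating each unit to fill its duration."""
    out: List[float] = []
    for phoneme, n_samples in zip(vec.phonemes, vec.durations):
        unit = kb[phoneme]
        out.extend(unit[i % len(unit)] for i in range(n_samples))
    return out


def corrected_signal(
    vec: InputVector,
    kb: Knowledgebase,
    error_model: Callable[[InputVector, List[float]], List[float]],
) -> List[float]:
    """Add the model's error signal to the raw knowledgebase output (assumed additive)."""
    raw = raw_signal(vec, kb)
    error = error_model(vec, raw)  # stands in for the neural network
    return [r + e for r, e in zip(raw, error)]


kb = {"t": [0.5], "a": [0.0, 1.0]}
vec = InputVector(f0_hz=120.0, phonemes=["t", "a"], durations=[2, 3])
# A constant offset plays the role of a trained error signal here.
print(corrected_signal(vec, kb, lambda v, raw: [-0.25] * len(raw)))
```

In this sketch the neural network is replaced by any callable that returns one correction value per sample, which keeps the example runnable without a machine learning library.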
20 Claims
1. A text-to-speech (TTS) training system comprising:
a subsystem configured to receive an input vector from conversion of text, the subsystem including a neural network interacting with a pre-existing knowledgebase of phonemes to apply an error signal to correct for speech signal distortions of the pre-existing knowledgebase of phonemes to generate a corrected speech signal; and
a training subsystem coupled to the subsystem, the training subsystem configured to iteratively correct the subsystem for the speech signal distortions of the pre-existing knowledgebase of phonemes based on psychoacoustic processing for the subsystem to apply the error signal to correct for the speech signal distortions of the pre-existing knowledgebase of phonemes to generate the corrected speech signal, wherein the training subsystem is further configured to ignore inaudible errors of the corrected speech signal based on masking.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method of training text-to-speech (TTS) processing comprising:
receiving, by a subsystem, an input vector from conversion of text;
interacting, by a neural network of the subsystem, with a pre-existing knowledgebase of phonemes to apply an error signal to correct for speech signal distortions of the pre-existing knowledgebase of phonemes to generate a corrected speech signal;
iteratively correcting, by a training subsystem coupled to the subsystem, the subsystem for the speech signal distortions of the pre-existing knowledgebase of phonemes based on psychoacoustic processing for the subsystem to apply the error signal to correct for the speech signal distortions of the pre-existing knowledgebase of phonemes to generate the corrected speech signal; and
ignoring, by the training subsystem, inaudible errors of the corrected speech signal based on masking.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A non-transitory computer-readable medium having program instructions for training text-to-speech (TTS) processing which, when executed across one or more processors, causes at least a portion of the one or more processors to perform operations comprising:
receiving, by a subsystem, an input vector from conversion of text;
interacting, by a neural network of the subsystem, with a pre-existing knowledgebase of phonemes to apply an error signal to correct for speech signal distortions of the pre-existing knowledgebase of phonemes to generate a corrected speech signal;
iteratively correcting, by a training subsystem coupled to the subsystem, the subsystem for the speech signal distortions of the pre-existing knowledgebase of phonemes based on psychoacoustic processing to apply the error signal to correct for the speech signal distortions of the pre-existing knowledgebase of phonemes to generate the corrected speech signal; and
ignoring, by the training subsystem, inaudible errors of the corrected speech signal based on masking.
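The training steps recited in the independent claims (iteratively correcting, and ignoring inaudible errors based on masking) can be illustrated with a toy loop. This is a sketch under assumptions, not the claimed method: the per-band representation, the fixed masking thresholds, the learning rate, and the names `audible_error` and `train_correction` are all hypothetical, and a real psychoacoustic model would compute thresholds from the signal itself.

```python
from typing import List


def audible_error(error: List[float], mask_threshold: List[float]) -> List[float]:
    """Keep only error components above the masking threshold; zero the rest."""
    return [e if abs(e) > t else 0.0 for e, t in zip(error, mask_threshold)]


def train_correction(
    kb_band: List[float],
    target_band: List[float],
    mask_threshold: List[float],
    lr: float = 0.5,
    steps: int = 20,
) -> List[float]:
    """Iteratively adjust a per-band correction, driven only by audible errors."""
    correction = [0.0] * len(kb_band)
    for _ in range(steps):
        # Error between the target and the corrected knowledgebase output.
        error = [t - (k + c) for t, k, c in zip(target_band, kb_band, correction)]
        audible = audible_error(error, mask_threshold)
        if not any(audible):
            break  # every remaining error is masked, so nothing is left to correct
        correction = [c + lr * a for c, a in zip(correction, audible)]
    return correction


# Band 0 starts with a large error; band 1's error (about 0.05) is below its threshold.
print(train_correction([0.0, 0.2], [1.0, 0.25], [0.1, 0.1]))  # [0.9375, 0.0]
```

The loop stops correcting band 0 once its residual error drops to 0.0625, below the assumed threshold of 0.1, and it never touches band 1, which is the effect of treating masked errors as zero.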
Specification