Speech synthesis using deep neural networks
Abstract
A method and system are disclosed for speech synthesis using deep neural networks. A neural network may be trained to map input phonetic transcriptions of training-time text strings into sequences of acoustic feature vectors, which yield predefined speech waveforms when processed by a signal generation module. The training-time text strings may correspond to written transcriptions of speech carried in the predefined speech waveforms. Subsequent to training, a run-time text string may be translated to a run-time phonetic transcription, which may include a run-time sequence of phonetic-context descriptors, each of which contains a phonetic speech unit, data indicating phonetic context, and data indicating time duration of the respective phonetic speech unit. The trained neural network may then map the run-time sequence of the phonetic-context descriptors to run-time predicted feature vectors, which may in turn be translated into synthesized speech by the signal generation module.
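The descriptor described in the abstract can be pictured as a small record. The following is a minimal sketch, not the patented implementation: the class name, the field names, and the choice of a single left and right neighbor as the "phonetic context" are all assumptions made for illustration, and the durations are made-up values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PhoneticContextDescriptor:
    """One element of a run-time sequence (all field names are hypothetical)."""
    label: str                    # identifies the phonetic speech unit
    left_context: Optional[str]   # preceding unit, None at the start (assumed context form)
    right_context: Optional[str]  # following unit, None at the end (assumed context form)
    duration_ms: float            # time duration of the identified unit


def build_descriptors(phones: List[str],
                      durations_ms: List[float]) -> List[PhoneticContextDescriptor]:
    """Pair each phonetic unit with its neighbors and its duration."""
    if len(phones) != len(durations_ms):
        raise ValueError("phones and durations_ms must have the same length")
    descriptors = []
    for i, phone in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i + 1 < len(phones) else None
        descriptors.append(
            PhoneticContextDescriptor(phone, left, right, durations_ms[i]))
    return descriptors


# Illustrative input: the word "cat" transcribed as k / ae / t, durations invented.
seq = build_descriptors(["k", "ae", "t"], [60.0, 120.0, 80.0])
```

A real text analysis module would presumably carry richer context than one neighbor on each side; the sketch only shows how label, context, and duration travel together in one descriptor.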
26 Claims
1. A method comprising:
training a neural network implemented by one or more processors of a system to map one or more training-time sequences of phonetic-context descriptors received by the neural network into training-time predicted feature vectors that correspond to acoustic properties of predefined speech waveforms, wherein the one or more training-time sequences of phonetic-context descriptors correspond to phonetic transcriptions of training-time text strings, and the training-time text strings correspond to written transcriptions of speech carried in the predefined speech waveforms;
receiving, by a text analysis module implemented by the one or more processors, a run-time text string;
processing the received run-time text string with the text analysis module to generate a run-time sequence of phonetic-context descriptors that corresponds to a phonetic transcription of the run-time text string, wherein each phonetic-context descriptor of the run-time sequence includes a respective label identifying a phonetic speech unit of a plurality of phonetic speech units, data indicating phonetic context of the identified phonetic speech unit, and data indicating time duration of the identified phonetic speech unit;
processing the run-time sequence of the phonetic-context descriptors with the trained neural network in a corresponding sequence of neural network time steps to generate one or more run-time predicted feature vectors; and
processing the one or more run-time predicted feature vectors with a signal generation module to produce and output a run-time speech waveform corresponding to a spoken rendering of the received run-time text string,
wherein processing the received run-time text string with the text analysis module to generate the run-time sequence of phonetic-context descriptors comprises:
generating a run-time transcription sequence of phonetic speech units that corresponds to the phonetic transcription of the run-time text string; and
determining a respective number of consecutive phonetic-context descriptors to generate for each of the phonetic speech units of the run-time transcription sequence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
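The final step of claim 1, determining how many consecutive descriptors to generate per phonetic speech unit, can be illustrated as follows. The claim does not state how that number is chosen; this sketch assumes it is derived from the unit's duration and a fixed frame shift, so that each descriptor corresponds to one neural network time step. The 5 ms frame shift, the function names, and the tuple layout are hypothetical.

```python
import math
from typing import List, Tuple

FRAME_SHIFT_MS = 5.0  # assumed value; the source gives no frame shift


def frames_per_unit(duration_ms: float,
                    frame_shift_ms: float = FRAME_SHIFT_MS) -> int:
    """Number of consecutive descriptors for one unit (at least one)."""
    if duration_ms <= 0 or frame_shift_ms <= 0:
        raise ValueError("duration and frame shift must be positive")
    return max(1, math.ceil(duration_ms / frame_shift_ms))


def expand(units: List[Tuple[str, float]],
           frame_shift_ms: float = FRAME_SHIFT_MS) -> List[Tuple[str, int, int]]:
    """Return one (label, position_in_unit, frames_in_unit) tuple per time step."""
    steps: List[Tuple[str, int, int]] = []
    for label, duration_ms in units:
        n = frames_per_unit(duration_ms, frame_shift_ms)
        steps.extend((label, pos, n) for pos in range(n))
    return steps


# Two units with invented durations: 10 ms gives 2 steps, 22 ms gives 5 steps.
steps = expand([("k", 10.0), ("ae", 22.0)])
```

Rounding up keeps every unit represented by at least one time step; rounding to nearest would be an equally plausible choice under the same assumption.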
11. A system comprising:
one or more processors;
memory; and
machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out operations comprising:
training a neural network implemented by the system to map one or more training-time sequences of phonetic-context descriptors received by the neural network into training-time predicted feature vectors that correspond to acoustic properties of predefined speech waveforms, wherein the one or more training-time sequences of phonetic-context descriptors correspond to phonetic transcriptions of training-time text strings, and the training-time text strings correspond to written transcriptions of speech carried in the predefined speech waveforms,
receiving, by a text analysis module implemented by the system, a run-time text string,
processing the received run-time text string with the text analysis module to generate a run-time sequence of phonetic-context descriptors that corresponds to a phonetic transcription of the run-time text string, wherein each phonetic-context descriptor of the run-time sequence includes a respective label identifying a phonetic speech unit of a plurality of phonetic speech units, data indicating phonetic context of the identified phonetic speech unit, and data indicating time duration of the identified phonetic speech unit,
processing the run-time sequence of the phonetic-context descriptors with the trained neural network in a corresponding sequence of neural network time steps to generate one or more run-time predicted feature vectors, and
processing the one or more run-time predicted feature vectors with a signal generation module to produce and output a run-time speech waveform corresponding to a spoken rendering of the received run-time text string,
wherein processing the received run-time text string with the text analysis module to generate the run-time sequence of phonetic-context descriptors comprises:
generating a run-time transcription sequence of phonetic speech units that corresponds to the phonetic transcription of the run-time text string; and
determining a respective number of consecutive phonetic-context descriptors to generate for each of the phonetic speech units of the run-time transcription sequence.
Dependent claims: 12, 13, 14, 15, 16, 17, 18
19. An article of manufacture including a computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
training a neural network implemented by the system to map one or more training-time sequences of phonetic-context descriptors received by the neural network into training-time predicted feature vectors that correspond to acoustic properties of predefined speech waveforms, wherein the one or more training-time sequences of phonetic-context descriptors correspond to phonetic transcriptions of training-time text strings, and the training-time text strings correspond to written transcriptions of speech carried in the predefined speech waveforms;
receiving, by a text analysis module implemented by the system, a run-time text string;
processing the received run-time text string with the text analysis module to generate a run-time sequence of phonetic-context descriptors that corresponds to a phonetic transcription of the run-time text string, wherein each phonetic-context descriptor of the run-time sequence includes a respective label identifying a phonetic speech unit of a plurality of phonetic speech units, data indicating phonetic context of the identified phonetic speech unit, and data indicating time duration of the identified phonetic speech unit;
processing the run-time sequence of the phonetic-context descriptors with the trained neural network in a corresponding sequence of neural network time steps to generate one or more run-time predicted feature vectors; and
processing the one or more run-time predicted feature vectors with a signal generation module to produce and output a run-time speech waveform corresponding to a spoken rendering of the received run-time text string;
wherein processing the received run-time text string with the text analysis module to generate the run-time sequence of phonetic-context descriptors comprises:
generating a run-time transcription sequence of phonetic speech units that corresponds to the phonetic transcription of the run-time text string; and
determining a respective number of consecutive phonetic-context descriptors to generate for each of the phonetic speech units of the run-time transcription sequence.
Dependent claims: 20, 21, 22, 23, 24, 25, 26
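The training operation shared by the independent claims, mapping encoded phonetic-context descriptors to predicted feature vectors, can be sketched with a toy network. This is an illustration under heavy assumptions and not the architecture claimed: a single tanh hidden layer stands in for a deep network, the inputs are one-hot encodings of three hypothetical units, the 2-dimensional targets are invented numbers rather than real acoustic features, and plain stochastic gradient descent on squared error is the assumed training rule.

```python
import math
import random
from typing import List

Vector = List[float]


class TinyNetwork:
    """One tanh hidden layer and a linear output layer (toy stand-in for a deep network)."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def _hidden(self, x: Vector) -> Vector:
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def _output(self, h: Vector) -> Vector:
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(self.w2, self.b2)]

    def predict(self, x: Vector) -> Vector:
        """Map one encoded descriptor to one predicted feature vector."""
        return self._output(self._hidden(x))

    def train_step(self, x: Vector, target: Vector, lr: float) -> None:
        """One SGD update on 0.5 * squared error for a single example."""
        h = self._hidden(x)
        err = [yi - ti for yi, ti in zip(self._output(h), target)]
        # Back-propagate through w2 before w2 is modified.
        grad_h = [sum(err[k] * self.w2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
                  for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, g in enumerate(grad_h):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * g * xi
            self.b1[j] -= lr * g


def mean_squared_error(net: TinyNetwork, inputs: List[Vector], targets: List[Vector]) -> float:
    total, count = 0.0, 0
    for x, t in zip(inputs, targets):
        for yi, ti in zip(net.predict(x), t):
            total += (yi - ti) ** 2
            count += 1
    return total / count


# Toy training set: one-hot descriptor encodings and invented feature vectors.
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
targets = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]

net = TinyNetwork(n_in=3, n_hidden=4, n_out=2)
loss_before = mean_squared_error(net, inputs, targets)
for _ in range(500):
    for x, t in zip(inputs, targets):
        net.train_step(x, t, lr=0.1)
loss_after = mean_squared_error(net, inputs, targets)
```

At run time, `predict` would be called once per descriptor in the run-time sequence, and the resulting feature vectors would be handed to a signal generation module, which is outside the scope of this sketch.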
Specification