Multilingual prosody generation
First Claim
1. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
accessing, by the one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
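The four operations recited in the claim can be read as a small pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `toy_prosody_model`, `toy_vocoder`, and `synthesize`, the character-code text representation, the language codes, and the (duration, pitch) form of the prosody output are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: the prosody output is one (duration_seconds, pitch_hz)
# pair per unit of the text representation.
Prosody = List[Tuple[float, float]]
ProsodyModel = Callable[[Sequence[int], str], Prosody]


def toy_prosody_model(text_repr: Sequence[int], language_id: str) -> Prosody:
    """Stand-in for the trained multilingual network (assumption: no real model)."""
    # The language identifier shifts the output, mimicking language-specific prosody.
    base_pitch = {"en-US": 120.0, "es-ES": 130.0}.get(language_id, 125.0)
    return [(0.08, base_pitch + (code % 10)) for code in text_repr]


def toy_vocoder(prosody: Prosody, sample_rate: int = 16000) -> List[float]:
    """Stand-in for audio generation: one sine segment per prosody entry."""
    samples: List[float] = []
    for duration, pitch in prosody:
        n = round(duration * sample_rate)
        samples.extend(
            math.sin(2 * math.pi * pitch * i / sample_rate) for i in range(n)
        )
    return samples


def synthesize(model: ProsodyModel, text: str, language_id: str) -> List[float]:
    # Accessing the network is represented here by receiving `model`.
    text_repr = [ord(ch) for ch in text]      # (i) representation of the text
    prosody = model(text_repr, language_id)   # (ii) language identifier as input
    audio = toy_vocoder(prosody)              # audio based on the prosody output
    return audio                              # provide the audio data


audio = synthesize(toy_prosody_model, "hola", "es-ES")
```

In this sketch the language identifier is passed alongside the text representation on every call, so one model object serves every supported language.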
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network.
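One way to picture the network input described in the abstract is a single vector that concatenates the linguistic features with an encoding of the language. The sketch below is illustrative only: the one-hot language encoding, the `LANGUAGES` list, the `TinyProsodyNet` class, the layer sizes, and the two-value output are assumptions, and the weights are random rather than trained on speech.

```python
import math
import random
from typing import Callable, List, Sequence

LANGUAGES = ["en", "es", "fr"]  # hypothetical set of supported languages


def one_hot_language(language: str) -> List[float]:
    """Encode the language of the text as a one-hot vector (an assumed encoding)."""
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return [1.0 if language == lang else 0.0 for lang in LANGUAGES]


def dense(inputs: Sequence[float], weights: List[List[float]],
          biases: List[float], activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


class TinyProsodyNet:
    """Untrained toy network whose weights are shared across all languages."""

    def __init__(self, n_features: int, n_hidden: int = 8,
                 n_outputs: int = 2, seed: int = 0) -> None:
        rng = random.Random(seed)
        n_in = n_features + len(LANGUAGES)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
        self.b2 = [0.0] * n_outputs

    def forward(self, linguistic_features: Sequence[float], language: str) -> List[float]:
        # Input = linguistic features followed by the language encoding.
        x = list(linguistic_features) + one_hot_language(language)
        hidden = dense(x, self.w1, self.b1, math.tanh)
        # Linear output layer; the two values stand in for prosody information.
        return dense(hidden, self.w2, self.b2, lambda v: v)


net = TinyProsodyNet(n_features=4)
features = [1.0, 0.0, 0.5, 0.0]  # hypothetical linguistic features for one unit
prosody_en = net.forward(features, "en")
prosody_es = net.forward(features, "es")
```

Because the same weights are used for every language, only the language portion of the input differs between the two calls, which is how identical linguistic features can yield different prosody outputs.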
52 Citations
20 Claims
1. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
accessing, by the one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
accessing, by the one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
Dependent claims: 13, 14, 15, 16
17. A computer-implemented method comprising:
accessing, by one or more computers, a neural network that has been trained, using speech in each of multiple languages, to be able to provide prosody information for each of the multiple languages;
providing, by the one or more computers, input to the neural network that includes (i) a representation of a text in a first language and (ii) a language identifier for the first language;
generating, by the one or more computers, audio data for a synthesized utterance of the text in the first language based on prosody information for the text that is output by the neural network in response to receiving the representation of the text and the language identifier for the first language; and
providing, by the one or more computers, the audio data for the synthesized utterance of the text in the first language.
Dependent claims: 18, 19, 20
Specification