Text-to-speech synthesis using an autoencoder
First Claim
1. A method performed by one or more computers of a text-to-speech system, the method comprising:
obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.
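The arrangement recited in this claim, with two encoders feeding one shared decoder, can be illustrated with a minimal data-flow sketch. It is not the patented implementation: the `linear` helper, the fixed weights, and the feature sizes (3 linguistic features, 4 acoustic features, a 2-dimensional representation) are hypothetical stand-ins for trained neural networks, and the claim does not specify any of them.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def linear(weights: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], Vector]:
    """Return a function computing weights @ x (a stand-in for a trained network)."""
    def apply(x: Sequence[float]) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return apply


# Assumed toy weights; a real system would learn these during training.
# The "encoder" of the claim: linguistic features -> speech unit representation.
linguistic_encoder = linear([[1.0, 0.0, 0.5],
                             [0.0, 1.0, -0.5]])
# The "second encoder": acoustic features -> speech unit representation.
acoustic_encoder = linear([[0.5, 0.5, 0.0, 0.0],
                           [0.0, 0.0, 0.5, 0.5]])
# The shared decoder: speech unit representation -> acoustic features.
decoder = linear([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])


def decode_from_linguistic(features: Sequence[float]) -> Vector:
    """Path 1: linguistic encoder followed by the shared decoder."""
    return decoder(linguistic_encoder(features))


def decode_from_acoustic(features: Sequence[float]) -> Vector:
    """Path 2: acoustic (second) encoder followed by the same decoder."""
    return decoder(acoustic_encoder(features))
```

Because both paths end in the same decoder, the two encoders have to produce representations in a common space. That shared space is what allows a representation computed from text alone to be compared against units described by their acoustics.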
Abstract
Methods, systems, and computer-readable media for text-to-speech synthesis using an autoencoder. In some implementations, data indicating a text for text-to-speech synthesis is obtained. Data indicating a linguistic unit of the text is provided as input to an encoder. The encoder is configured to output speech unit representations indicative of acoustic characteristics based on linguistic information. A speech unit representation that the encoder outputs is received. A speech unit is selected to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder. Audio data for a synthesized utterance of the text that includes the selected speech unit is provided.
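The steps in the abstract (encode each linguistic unit, select a stored speech unit from the representation, output audio) can be sketched as follows. This is a sketch under stated assumptions: the abstract says only that selection is "based on" the encoder output, so the nearest-neighbour rule with Euclidean distance, the plain concatenation of audio samples, and the names `select_unit`, `synthesize`, `unit_embeddings`, and `unit_audio` are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_unit(target: Sequence[float],
                unit_embeddings: Dict[str, Sequence[float]]) -> str:
    """Pick the stored unit whose representation is closest to the target.

    Assumption: nearest neighbour in representation space; the source does
    not name a specific selection rule.
    """
    return min(unit_embeddings,
               key=lambda unit_id: euclidean(target, unit_embeddings[unit_id]))


def synthesize(linguistic_units: Sequence[str],
               encode: Callable[[str], Sequence[float]],
               unit_embeddings: Dict[str, Sequence[float]],
               unit_audio: Dict[str, List[float]]) -> List[float]:
    """Encode each linguistic unit, select a speech unit, and join the audio."""
    samples: List[float] = []
    for unit in linguistic_units:
        target = encode(unit)                       # encoder output
        chosen = select_unit(target, unit_embeddings)
        samples.extend(unit_audio[chosen])          # assumed: simple concatenation
    return samples
```

Here `encode` stands in for the trained encoder, and `unit_embeddings` would hold one precomputed representation per unit in the collection.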
17 Claims
1. A method performed by one or more computers of a text-to-speech system, the method comprising:
obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system comprising:
one or more computers; and
one or more data storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising:
obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.
Dependent claims: 11, 12, 13
14. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
obtaining, by the one or more computers, data indicating a text for text-to-speech synthesis;
providing, by the one or more computers, data indicating a linguistic unit of the text as input to an encoder, the encoder being configured to output speech unit representations indicative of acoustic characteristics based on linguistic information, wherein the encoder is configured to provide speech unit representations learned through machine learning training, wherein the encoder comprises a neural network that was trained as part of an autoencoder network that includes the encoder, a second encoder, and a decoder, wherein:
the encoder is arranged to produce speech unit representations in response to receiving data indicating linguistic units;
the second encoder is arranged to produce speech unit representations in response to receiving data indicating acoustic features of speech units; and
the decoder is arranged to generate output indicating acoustic features of speech units in response to receiving speech unit representations for the speech units from either of the encoder and the second encoder;
receiving, by the one or more computers, a speech unit representation that the encoder outputs in response to receiving the data indicating the linguistic unit as input to the encoder;
selecting, by the one or more computers, a speech unit to represent the linguistic unit, the speech unit being selected from among a collection of speech units based on the speech unit representation output by the encoder; and
providing, by the one or more computers and as output of the text-to-speech system, audio data for a synthesized utterance of the text that includes the selected speech unit.
Dependent claims: 15, 16, 17
Specification