Deep networks for unit selection speech synthesis
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing a representation based on structured data in resources. The methods, systems, and apparatus include actions of receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features. Additional actions include determining a distance between the target acoustic features and acoustic features of a stored acoustic sample. Further actions include selecting the acoustic sample to be used in speech synthesis based at least on the determined distance and synthesizing speech based on the selected acoustic sample.
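The distance and selection actions described in the abstract can be illustrated with a minimal sketch. The text only says "a distance", so the Euclidean metric used here is an assumption, and the function names, sample identifiers, and feature values are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_sample(target_features, stored_samples):
    """Return (sample_id, distance) for the stored sample closest to the target.

    stored_samples maps a sample id to that sample's acoustic feature vector.
    """
    best_id, best_dist = None, math.inf
    for sample_id, features in stored_samples.items():
        dist = euclidean_distance(target_features, features)
        if dist < best_dist:
            best_id, best_dist = sample_id, dist
    return best_id, best_dist


# Hypothetical target features, standing in for the neural network's output.
target = [1.0, 2.0, 3.0]

# Hypothetical inventory of stored acoustic samples and their features.
inventory = {
    "unit_a": [4.0, 6.0, 3.0],
    "unit_b": [1.0, 2.0, 5.0],
}

chosen, dist = select_sample(target, inventory)
```

With these made-up values, `unit_b` lies closer to the target than `unit_a`, so it is the sample that would be passed on to synthesis.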
17 Claims
1. A method comprising:
obtaining a set of phones that is associated with text that is to be synthesized into speech;
accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
providing a particular set of phones for input to the neural network;
receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
providing the speech for output.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
obtaining a set of phones that is associated with text that is to be synthesized into speech;
accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
providing a particular set of phones for input to the neural network;
receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
providing the speech for output.
Dependent claims: 9, 10, 11, 12
13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
obtaining a set of phones that is associated with text that is to be synthesized into speech;
accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
providing a particular set of phones for input to the neural network;
receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
providing the speech for output.
Dependent claims: 14, 15, 16, 17
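The first operations recited in the independent claims (providing phones to a neural network and receiving target acoustic features from it) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the three-phone inventory, the one-hot input encoding, the single tanh hidden layer, the per-phone (rather than per-set) prediction, and the hand-written weights, which a trained network would learn from data instead.

```python
import math

# Hypothetical phone inventory; a real system would use a full phone set.
PHONES = ["k", "ae", "t"]


def one_hot(phone):
    """Encode a phone as a one-hot vector over the hypothetical inventory."""
    return [1.0 if p == phone else 0.0 for p in PHONES]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def predict_target_features(phone, params):
    """Forward pass: one-hot input, tanh hidden layer, linear output layer."""
    hidden = [math.tanh(v) for v in dense(one_hot(phone), params["w1"], params["b1"])]
    return dense(hidden, params["w2"], params["b2"])


# Made-up parameters for illustration only (not learned from any data).
PARAMS = {
    "w1": [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]],   # 2 hidden units x 3 inputs
    "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # 3 output features x 2 hidden
    "b2": [0.0, 0.0, 0.0],
}

# One target acoustic feature vector per phone in the input sequence.
targets = [predict_target_features(p, PARAMS) for p in ["k", "ae", "t"]]
```

Each vector in `targets` plays the role of the "particular set of target acoustic features" that is then compared against the features of stored acoustic samples.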
Specification