Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
Abstract
Systems and processes for performing unit-selection text-to-speech synthesis are provided. In one example process, a sequence of target units can represent a spoken pronunciation of text. A set of predicted acoustic model parameters of a second target unit can be determined using a set of acoustic features of a first candidate speech segment of a first target unit and a set of linguistic features of the second target unit. A likelihood score of the second candidate speech segment with respect to the first candidate speech segment can be determined using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment of the second target unit. The second candidate speech segment can be selected for speech synthesis based on the determined likelihood score. Speech corresponding to the received text can be generated using the selected second candidate speech segment.
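The scoring step summarized above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that the "predicted acoustic model parameters" are the mean and log-variance of a diagonal Gaussian and that the likelihood score is a Gaussian log-likelihood; the source does not specify either. A single linear layer stands in for the neural network, and the names `predict_parameters`, `log_likelihood`, `select_candidate`, `W`, and `b` are hypothetical.

```python
import numpy as np


def predict_parameters(prev_acoustic, target_linguistic, W, b):
    """Hypothetical stand-in for the network: one linear layer that maps the
    first candidate's acoustic features and the second target unit's
    linguistic features to a mean and a log-variance vector (assumption)."""
    x = np.concatenate([prev_acoustic, target_linguistic])
    out = W @ x + b
    dim = out.shape[0] // 2
    return out[:dim], out[dim:]  # (mean, log_var)


def log_likelihood(acoustic, mean, log_var):
    """Diagonal-Gaussian log-likelihood of a candidate's acoustic features
    under the predicted parameters (the Gaussian form is an assumption)."""
    var = np.exp(log_var)
    terms = np.log(2.0 * np.pi) + log_var + (acoustic - mean) ** 2 / var
    return float(-0.5 * np.sum(terms))


def select_candidate(prev_acoustic, target_linguistic, candidates, W, b):
    """Score every candidate for the second target unit against the chosen
    first candidate and return the index of the highest-scoring one."""
    mean, log_var = predict_parameters(prev_acoustic, target_linguistic, W, b)
    scores = [log_likelihood(c, mean, log_var) for c in candidates]
    return int(np.argmax(scores)), scores


# Toy data: 2 acoustic features, 1 linguistic feature (all values invented).
prev_acoustic = np.array([1.0, 2.0])    # first candidate speech segment
target_linguistic = np.array([0.0])     # second target unit
W = np.zeros((4, 3))
W[0, 0] = 1.0                           # predicted mean copies the previous
W[1, 1] = 1.0                           # segment's acoustic features
b = np.zeros(4)                         # log-variance stays 0 (variance 1)

candidates = [np.array([1.1, 2.0]), np.array([3.0, 0.0])]
best, scores = select_candidate(prev_acoustic, target_linguistic,
                                candidates, W, b)
print(best, scores)
```

In this toy setup the first candidate is selected because its acoustic features lie closer to the predicted mean, which is the sense in which the score is sensitive to the concatenation with the preceding segment.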
27 Claims
1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by one or more processors of an electronic device, cause the electronic device to:
receive text to be converted to speech;
generate a sequence of target units representing a spoken pronunciation of the text;
select, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units;
determine, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit;
determine, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment;
select the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and
generate speech corresponding to the received text using the second candidate speech segment.
Dependent claims: 2-25.
26. A method for performing unit-selection text-to-speech synthesis, comprising:
at an electronic device having a processor and memory:
receiving text to be converted to speech;
generating a sequence of target units representing a spoken pronunciation of the text;
selecting, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units;
determining, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit;
determining, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment;
selecting the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and
generating speech corresponding to the received text using the second candidate speech segment.
27. A system for performing unit-selection text-to-speech synthesis, the system comprising:
one or more processors; and
memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to:
receive text to be converted to speech;
generate a sequence of target units representing a spoken pronunciation of the text;
select, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units;
determine, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit;
determine, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment;
select the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and
generate speech corresponding to the received text using the second candidate speech segment.
Specification