UNIT-SELECTION TEXT-TO-SPEECH SYNTHESIS BASED ON PREDICTED CONCATENATION PARAMETERS
First Claim
1. A system for unit-selection text-to-speech synthesis, the system comprising:
one or more processors; and
memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to:
receive text to be converted to speech;
generate a sequence of target units representing a spoken pronunciation of the text;
determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
for each candidate speech segment of the plurality of candidate speech segments:
determine a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
generate speech corresponding to the received text using the subset of candidate speech segments.
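The two cost terms recited in the claim can be illustrated with a minimal sketch. The claim does not say what form the predicted statistical parameters take or which acoustic features are used; the sketch assumes a one-dimensional Gaussian (mean and variance) per feature, pitch as the first acoustic feature, and the change in a spectral value across the join as the second. The names `Gaussian`, `target_cost`, and `concatenation_cost` are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Predicted statistical parameters for one acoustic feature (assumed form)."""
    mean: float
    var: float


def neg_log_likelihood(x: float, g: Gaussian) -> float:
    """Negative log-density of x under a 1-D Gaussian; lower means a better match."""
    return 0.5 * (math.log(2.0 * math.pi * g.var) + (x - g.mean) ** 2 / g.var)


def target_cost(candidate_f0: float, predicted_f0: Gaussian) -> float:
    # Assumption: the first acoustic feature is pitch (f0). The candidate's own
    # value is scored against the parameters predicted for its target unit.
    return neg_log_likelihood(candidate_f0, predicted_f0)


def concatenation_cost(left_end: float, right_start: float,
                       predicted_delta: Gaussian) -> float:
    # Assumption: the second acoustic feature is the change in a spectral value
    # across the join, scored against the change predicted for the target unit.
    return neg_log_likelihood(right_start - left_end, predicted_delta)
```

Under these assumptions, a candidate whose pitch sits at the predicted mean receives the lowest target cost, and a join whose discontinuity matches the predicted change receives the lowest concatenation cost.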
Abstract
Systems and processes for performing unit-selection text-to-speech synthesis are provided. In an example process, text to be converted to speech is received. The text is represented as a sequence of target units. A plurality of candidate speech segments corresponding to the sequence of target units are selected. Predicted statistical parameters of acoustic features associated with the sequence of target units are determined. The predicted statistical parameters of acoustic features are used to determine target costs and concatenation costs associated with the plurality of candidate speech segments. Based on a combined cost determined from the target costs and concatenation costs, a subset of candidate speech segments is selected from the plurality of candidate speech segments. Speech corresponding to the received text is generated using the subset of candidate speech segments.
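The selection step in the abstract, choosing the subset of candidates with the lowest combined cost, is commonly implemented as a dynamic-programming (Viterbi-style) search. The abstract does not name a search algorithm, so the following is a sketch under that assumption; it also assumes the combined cost is a plain sum of target and concatenation costs and that every target unit has at least one candidate. The function `select_units` and its argument layout are hypothetical.

```python
from typing import Callable, List, Sequence


def select_units(
    target_costs: Sequence[Sequence[float]],
    concat_cost: Callable[[int, int, int], float],
) -> List[int]:
    """Return one candidate index per target unit, minimizing the summed cost.

    target_costs[t][i] is the target cost of candidate i for target unit t.
    concat_cost(t, i, j) is the cost of joining candidate i at position t
    to candidate j at position t + 1.
    """
    if not target_costs:
        return []
    best = list(target_costs[0])   # best[i]: cheapest path cost ending in candidate i
    back: List[List[int]] = []     # back-pointers for positions 1..T-1
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for j, tc in enumerate(target_costs[t]):
            # Cheapest predecessor for candidate j at position t.
            prev = min(range(len(best)),
                       key=lambda i: best[i] + concat_cost(t - 1, i, j))
            new_best.append(best[prev] + concat_cost(t - 1, prev, j) + tc)
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return path
```

For example, with `target_costs = [[1.0, 0.0], [0.0, 3.0]]` and a join cost of 10 whenever the candidate index changes, picking the cheapest candidate per unit independently would give `[1, 0]` at a total cost of 10, while the search returns `[0, 0]` at a total cost of 1.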
25 Claims
1. A system for unit-selection text-to-speech synthesis, the system comprising:
one or more processors; and
memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to:
receive text to be converted to speech;
generate a sequence of target units representing a spoken pronunciation of the text;
determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
for each candidate speech segment of the plurality of candidate speech segments:
determine a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
generate speech corresponding to the received text using the subset of candidate speech segments.
Dependent claims: 2-19.
20. A method for unit-selection text-to-speech synthesis, comprising:
at an electronic device having a processor and memory:
receiving text to be converted to speech;
generating a sequence of target units representing a spoken pronunciation of the text;
determining, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
selecting, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
for each candidate speech segment of the plurality of candidate speech segments:
determining a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
determining a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
selecting from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
generating speech corresponding to the received text using the subset of candidate speech segments.
Dependent claims: 21-24.
25. A non-transitory computer-readable storage medium comprising computer-readable instructions which, when executed by one or more processors, cause the one or more processors to:
receive text to be converted to speech;
generate a sequence of target units representing a spoken pronunciation of the text;
determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
for each candidate speech segment of the plurality of candidate speech segments:
determine a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
generate speech corresponding to the received text using the subset of candidate speech segments.
Specification