Linguistic prosodic model-based text to speech
Abstract
An arrangement is provided for text to speech processing based on linguistic prosodic models. Linguistic prosodic models are established to characterize different linguistic prosodic characteristics. When an input text is received, a target unit sequence is generated with a linguistic target that annotates target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties. A unit sequence is selected in accordance with the target unit sequence and the linguistic target based on joint cost information evaluated using established linguistic prosodic models. The selected unit sequence is used to produce synthesized speech corresponding to the input text.
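The selection criterion described in the abstract can be written compactly. The following is only a sketch under stated assumptions: the claims say the four costs are "combined" but do not say how, so the weighted sum and the weights $w_p, w_c, w_m, w_j$ are illustrative, as are all symbol names.

```latex
% T = target unit sequence with its linguistic target, U = a candidate unit sequence.
% The weighted-sum form and the weights are assumptions, not taken from the claims.
\[
C_{\text{joint}}(U \mid T)
  = w_p\, C_{\text{prosody}}(U \mid T)
  + w_c\, C_{\text{context}}(U \mid T)
  + w_m\, C_{\text{mismatch}}(U \mid T)
  + w_j\, C_{\text{concat}}(U)
\]
\[
U^{*} = \arg\min_{U}\; C_{\text{joint}}(U \mid T)
\]
```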
47 Claims
1. A method, comprising:
generating at least one linguistic prosodic model, each of the at least one linguistic prosodic model characterizing a corresponding linguistic prosody and being used to facilitate unit selection during text to speech processing, wherein the at least one linguistic prosodic model is generated from the recorded speech of a target speaker;
receiving an input text for text to speech processing;
generating, according to the input text, a target unit sequence and a linguistic target which annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties; and
producing synthesized speech using a selected unit sequence determined in accordance with the target unit sequence and the linguistic target based on an estimated joint cost;
wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model;
computing a context cost based on at least one context cost function;
computing a mismatch cost based on a syllable position mismatch matrix with elements defining costs associated with different types of syllable position mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
computing a concatenation cost; and
combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost.
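As an illustration of the mismatch and joint cost steps recited in claim 1, here is a minimal Python sketch. Everything concrete in it is an assumption made for the example: the label sets, the matrix values, the `Unit` record, the function names, and the weighted-sum combination are hypothetical, since the claims only require that the matrix elements define costs for the different mismatch types and that the four costs be combined.

```python
from dataclasses import dataclass

# Hypothetical label sets; the claims do not enumerate them.
SYLLABLE_POS = ["initial", "medial", "final"]
PHRASE_POS = ["initial", "medial", "final"]
ACCENT = ["none", "stressed", "pitch_accent"]


def make_matrix(labels, off_diagonal):
    """Build a mismatch matrix keyed by (target_label, candidate_label).

    Matching labels cost 0; every mismatch gets the same illustrative cost.
    A real matrix could assign a different cost to each mismatch type.
    """
    return {(t, c): (0.0 if t == c else off_diagonal)
            for t in labels for c in labels}


# Illustrative cost values only (assumption).
SYLLABLE_MISMATCH = make_matrix(SYLLABLE_POS, 1.0)
PHRASE_MISMATCH = make_matrix(PHRASE_POS, 2.0)
ACCENT_MISMATCH = make_matrix(ACCENT, 3.0)


@dataclass(frozen=True)
class Unit:
    """Hypothetical annotation carried by a target or candidate unit."""
    syllable_pos: str
    phrase_pos: str
    accent: str


def mismatch_cost(target, candidate):
    """Sum the three matrix look-ups for one target/candidate pair."""
    return (SYLLABLE_MISMATCH[(target.syllable_pos, candidate.syllable_pos)]
            + PHRASE_MISMATCH[(target.phrase_pos, candidate.phrase_pos)]
            + ACCENT_MISMATCH[(target.accent, candidate.accent)])


def joint_cost(prosody, context, mismatch, concatenation,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four costs; the weighted sum is an assumption."""
    w_p, w_c, w_m, w_j = weights
    return w_p * prosody + w_c * context + w_m * mismatch + w_j * concatenation


target = Unit("initial", "final", "stressed")
candidate = Unit("medial", "final", "none")
m = mismatch_cost(target, candidate)   # syllable and accent mismatch
total = joint_cost(0.5, 0.25, m, 1.0)  # other three costs are placeholders
```

Storing each matrix as a dictionary keyed by label pairs keeps asymmetric costs possible, for example penalising a final-for-initial substitution differently from the reverse.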
16. A method for unit selection using at least one linguistic prosodic model, comprising:
receiving a target unit sequence with a linguistic target, wherein the linguistic target annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties;
identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
estimating a joint cost associated with each of the candidate unit sequences, wherein said estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with the different types of stress/pitch accent mismatch;
computing a concatenation cost;
combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
selecting one of the candidate unit sequences to be a selected unit sequence that has a minimum joint cost.
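The final step of claim 16, picking the candidate unit sequence with minimum joint cost, can be sketched as a dynamic-programming search. This is a common technique in unit selection but is not named in the claims; it is swapped in here on the assumption that the joint cost decomposes into a per-unit part (`target_cost`, standing in for the prosody, context, and mismatch costs) and a per-join part (`join_cost`, standing in for the concatenation cost). Both callables and the function name `select_units` are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    """Return (best_sequence, total_cost) over all candidate sequences.

    candidates[i] is a non-empty list of candidate units for target position i.
    target_cost(i, unit) and join_cost(prev_unit, unit) are caller-supplied.
    """
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j],
    #               index of the predecessor on that path)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][j][0] + join_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(i, u), prev_j))
        best.append(row)

    # Pick the cheapest final unit, then follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example with made-up costs.
unit_costs = {"a1": 1, "a2": 0, "b1": 0, "b2": 2}
join_costs = {("a1", "b1"): 0, ("a1", "b2"): 0,
              ("a2", "b1"): 5, ("a2", "b2"): 0}
path, total = select_units(
    [["a1", "a2"], ["b1", "b2"]],
    target_cost=lambda i, u: unit_costs[u],
    join_cost=lambda p, u: join_costs[(p, u)],
)
```

In the toy example the locally cheapest first unit (`a2`) is not on the best path, because its join with `b1` is expensive; this is the effect a joint cost is meant to capture.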
20. A unit selection based text to speech system, comprising:
a linguistic prosodic model generation mechanism;
a text-to-speech front end capable of generating, according to an input text, a target unit sequence and a linguistic target that annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target sequence and the linguistic target has certain desired prosodic properties;
a unit selection mechanism capable of selecting a unit sequence in accordance with the target unit sequence and the linguistic target based on an estimated joint cost wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
computing a concatenation cost;
combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
a speech synthesis mechanism capable of synthesizing speech using the selected unit sequence.
26. A unit selection mechanism, comprising:
a unit search mechanism capable of identifying one or more candidate unit sequences in accordance with a target unit sequence and a linguistic target, wherein the linguistic target annotates the target unit sequence with a plurality of linguistic prosodic characteristics so that speech synthesized based on the target unit sequence and the linguistic target has certain desired prosodic properties;
a cost estimation mechanism capable of estimating a joint cost, for each of the candidate unit sequences, using at least one linguistic prosodic model generated to characterize at least one linguistic prosody;
wherein the cost estimation mechanism comprises a linguistic prosody cost estimator capable of computing a linguistic prosody cost associated with a candidate unit sequence based on at least some of the linguistic prosodic models, a mismatch cost estimator capable of computing a mismatch cost of the candidate unit sequence based on a syllable mismatch matrix with elements defining costs associated with syllable mismatches, a phrase position mismatch matrix with elements defining costs associated with phrase position mismatches, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
a context cost estimator capable of computing a context cost of the candidate unit sequence based on context cost functions;
a concatenation cost estimator capable of computing a concatenation cost of the candidate unit sequence;
a joint cost computation mechanism capable of combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost associated with the candidate unit sequence; and
a unit sequence selection mechanism capable of determining a selected unit sequence from the candidate unit sequences that best matches with the target unit sequence and the linguistic target based on the joint cost.
27. An article comprising a storage medium having stored thereon instructions that, when executed by a machine, result in the following:
generating at least one linguistic prosodic model, each of the at least one linguistic prosodic model characterizing a corresponding linguistic prosody and being used to facilitate unit selection during text to speech processing, wherein the at least one linguistic prosodic model is generated from the speech from a target speaker;
receiving an input text for text to speech processing;
generating, according to the input text, a target unit sequence and a linguistic target which annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties; and
producing synthesized speech using a selected unit sequence determined in accordance with the target unit sequence and the linguistic target based on an estimated joint cost wherein estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model, computing a context cost based on at least one context cost function, computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch, computing a concatenation cost; and
combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost.
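The claims state that each linguistic prosodic model is generated from a target speaker's speech and is then used to compute a linguistic prosody cost, but they do not fix the form of the model. Purely as an illustration, the sketch below assumes a one-dimensional Gaussian per prosodic label, fitted to a single acoustic measurement (for example a pitch value), and uses the negative log-likelihood as the cost. The model form, the function names, and the sample data are all assumptions.

```python
import math
from collections import defaultdict


def fit_prosodic_model(samples, min_variance=1e-6):
    """Fit one Gaussian (mean, variance) per prosodic label.

    samples: iterable of (label, value) pairs taken from annotated
    recordings of the target speaker (hypothetical data layout).
    """
    grouped = defaultdict(list)
    for label, value in samples:
        grouped[label].append(value)
    model = {}
    for label, values in grouped.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        model[label] = (mean, max(var, min_variance))  # floor avoids log(0)
    return model


def prosody_cost(model, label, value):
    """Negative log-likelihood of a candidate's value under the label's Gaussian."""
    mean, var = model[label]
    return 0.5 * math.log(2 * math.pi * var) + (value - mean) ** 2 / (2 * var)


# Made-up measurements for two labels.
samples = [("accented", 180.0), ("accented", 220.0),
           ("unaccented", 120.0), ("unaccented", 140.0)]
model = fit_prosodic_model(samples)
typical = prosody_cost(model, "accented", 200.0)
atypical = prosody_cost(model, "accented", 260.0)
```

A candidate unit whose measurement sits near the speaker's typical value for the requested label receives a lower cost than one far from it, which is the behaviour a prosody cost term needs.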
40. An article comprising a storage medium having stored thereon instructions for unit selection using at least one linguistic prosodic model that, when executed by a machine, result in the following:
receiving a target unit sequence with a linguistic target, wherein the linguistic target annotates the target units in the target unit sequence with a plurality of linguistic prosodic characteristics so that the speech synthesized in accordance with the target unit sequence and the linguistic target has certain desired prosodic properties;
identifying one or more candidate unit sequences, each of which comprises a plurality of units selected in accordance with the target unit sequence and the linguistic target;
estimating a joint cost associated with each of the candidate unit sequences wherein said estimating the joint cost comprises computing a linguistic prosody cost based on the at least one linguistic prosodic model;
computing a context cost based on at least one context cost function;
computing a mismatch cost based on a syllable mismatch matrix with elements defining costs associated with different types of syllable mismatch, a phrase position mismatch matrix with elements defining costs associated with different types of phrase position mismatch, and a stress/pitch accent mismatch matrix with elements defining costs associated with different types of stress/pitch accent mismatch;
computing a concatenation cost; and
combining the linguistic prosody cost, the context cost, the mismatch cost, and the concatenation cost to generate the joint cost; and
selecting one of the candidate unit sequences to be a selected unit sequence that has a minimum joint cost.
Specification