Rich context modeling for text-to-speech engines
Abstract
Embodiments of rich context modeling for speech synthesis are disclosed. In operation, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.
23 Claims
1. A computer readable medium storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
obtaining trained decision tree-tied hidden Markov Models (HMMs) for a speech corpus;
estimating mean parameters of a plurality of rich context models based on the trained decision tree-tied HMMs by performing a single pass re-estimation;
setting variance parameters of the plurality of rich context models equal to the variance parameters of the trained decision tree-tied HMMs to produce a plurality of refined rich context models; and
generating synthesized speech for an input text based at least on some of the plurality of refined rich context models.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
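The refinement recited in claim 1 can be illustrated with a minimal sketch. It assumes, beyond what the claim states, that each training frame already carries a state-occupancy weight computed under the trained tied HMMs, a rich context label, and the label of the tied model it maps to. The function name, the tuple layout, and the label strings are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def refine_rich_context_models(frames, tied_variances):
    """Single pass over the data: occupancy-weighted means per rich context,
    variances copied from the corresponding tied model.

    frames: iterable of (rich_label, tied_label, occupancy, observation),
            where observation is a list of floats (hypothetical layout).
    tied_variances: dict mapping tied_label -> list of per-dimension variances.
    """
    weighted_sum = {}                  # rich_label -> per-dimension sum of gamma * o
    total_weight = defaultdict(float)  # rich_label -> sum of gamma
    tied_of = {}                       # rich_label -> tied model it belongs to

    for rich, tied, gamma, obs in frames:
        acc = weighted_sum.setdefault(rich, [0.0] * len(obs))
        for d, value in enumerate(obs):
            acc[d] += gamma * value
        total_weight[rich] += gamma
        tied_of[rich] = tied

    models = {}
    for rich, acc in weighted_sum.items():
        weight = total_weight[rich]
        if weight <= 0.0:
            continue  # no usable occupancy for this context; skip it
        models[rich] = {
            "mean": [s / weight for s in acc],
            # Variance is not re-estimated; it is set equal to the tied variance.
            "var": list(tied_variances[tied_of[rich]]),
        }
    return models
```

Copying the variance rather than re-estimating it means a rich context seen only a few times still gets a usable variance, while its mean reflects its own data.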
12. A computer implemented method, comprising:
under control of one or more computing systems configured with executable instructions, refining a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models;
performing pre-selection to compose a rich context model candidate sausage for an input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
selecting one of the plurality of refined rich context model sequences that has a least divergence from a guiding sequence that is obtained from the decision tree-tied HMMs; and
generating output speech for the input text based at least on a rich context model sequence that is selected from the plurality of refined rich context model sequences.
Dependent claims: 13, 14, 15, 16
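The selection step of claim 12 can be sketched as follows. The claim does not name a divergence measure; this sketch assumes a symmetric Kullback-Leibler divergence between diagonal-covariance Gaussians and models the candidate sausage as a list of slots, each holding alternative models. All names and the data layout are hypothetical.

```python
def sym_kld_diag(mean_p, var_p, mean_q, var_q):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), for diagonal Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        diff_sq = (mp - mq) ** 2
        total += 0.5 * (vp / vq + vq / vp - 2.0 + diff_sq * (1.0 / vp + 1.0 / vq))
    return total


def select_sequence(sausage, guide):
    """Pick the candidate sequence with the least total divergence from the guide.

    sausage: list of slots; each slot is a list of (name, mean, var) candidates.
    guide:   list of (mean, var) pairs from the tied HMMs, one per slot.
    Returns (list of chosen names, total divergence).
    """
    chosen, total = [], 0.0
    for candidates, (g_mean, g_var) in zip(sausage, guide):
        # The total is a plain sum over slots with no cost between neighbours,
        # so the best sequence is simply the best candidate in every slot.
        name, cost = min(
            ((c_name, sym_kld_diag(c_mean, c_var, g_mean, g_var))
             for c_name, c_mean, c_var in candidates),
            key=lambda pair: pair[1],
        )
        chosen.append(name)
        total += cost
    return chosen, total
```

If a cost between neighbouring slots were added, the per-slot minimum would no longer be sufficient and a dynamic-programming search over the sausage would be needed instead.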
17. A system, comprising:
one or more processors;
a memory that includes a plurality of computer-executable components, the plurality of computer-executable components comprising:
a training module to refine a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models;
a pre-selection module to perform pre-selection to compose a rich context model candidate sausage for an input text, the candidate sausage including a plurality of refined rich context model sequences, each sequence including at least some refined rich context models from the plurality of refined rich context models;
a unit pruning module to implement unit pruning along the candidate sausage to select one or more rich context model sequences with less than a predetermined amount of distortion from a guiding sequence, the guiding sequence obtained from the decision tree-tied HMMs;
a cross correlation search module to conduct a normalized cross correlation-based search to derive a minimal concatenation cost rich context model sequence from the one or more rich context model sequences;
a waveform concatenation module to concatenate waveform units of an input text along a path of the minimal concatenation cost rich context model sequence to generate a waveform sequence; and
a synthesis module to generate synthesized speech for the input text based at least on the waveform sequence.
Dependent claims: 18, 19, 20, 21, 22, 23
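The normalized cross correlation-based search of claim 17 can be illustrated with a small dynamic-programming sketch. It makes several assumptions the claim does not state: the concatenation cost is taken as one minus the zero-lag normalized cross correlation between the tail of one waveform unit and the head of the next, the overlap length is fixed, and the pruned sausage is given as a list of slots of candidate waveform units. A fuller search might also scan over time lags. All names are hypothetical.

```python
import math


def ncc(x, y):
    """Zero-lag normalized cross correlation of two equal-length sample lists."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator if denominator > 0.0 else 0.0


def concat_cost(left, right, overlap):
    """Cost of joining two units: low when the boundary samples correlate well."""
    return 1.0 - ncc(left[-overlap:], right[:overlap])


def min_cost_path(slots, overlap):
    """Viterbi-style search for the unit sequence with minimal total join cost.

    slots: list of slots; each slot is a list of candidate units (sample lists).
    Returns (list of chosen candidate indices, total concatenation cost).
    """
    # best[i] = (cost, path) of the cheapest path ending in candidate i
    best = [(0.0, [i]) for i in range(len(slots[0]))]
    for t in range(1, len(slots)):
        new_best = []
        for j, unit in enumerate(slots[t]):
            cost, path = min(
                ((best[i][0] + concat_cost(prev, unit, overlap), best[i][1])
                 for i, prev in enumerate(slots[t - 1])),
                key=lambda entry: entry[0],
            )
            new_best.append((cost, path + [j]))
        best = new_best
    cost, path = min(best, key=lambda entry: entry[0])
    return path, cost
```

The waveform units along the returned path would then be concatenated in order to form the output waveform sequence.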
Specification