Generation and synthesis of prosody templates
Abstract
A method of separating high-level prosodic behavior from purely articulatory constraints so that timing information can be extracted from human speech is presented. The extracted timing information is used to construct duration templates that are employed for speech synthesis. The duration templates are constructed so that words exhibiting the same stress pattern will be assigned the same duration template. Initially, the words of the input text are segmented into phonemes and syllables, and the associated stress pattern is assigned. The stress-assigned words are then assigned grouping features by a text grouping module. A phoneme cluster module groups the phonemes into phoneme pairs and single phonemes. A static duration associated with each phoneme pair and single phoneme is retrieved from a global static table. A normalization module generates a normalized syllable duration value based upon the retrieved static durations associated with the phonemes that comprise the syllable. The normalized syllable duration value is stored in a duration template based upon the grouping features associated with that syllable. To produce natural, human-sounding prosody in synthesized speech, the duration information is then extracted from the selected template, de-normalized and applied to the phonemic information.
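The normalization and de-normalization described in the abstract and recited in the claims can be written compactly. The symbols here are introduced only for illustration and do not appear in the source: d_s is the measured duration of syllable s, U(s) is the set of phoneme pairs and single phonemes that make up s, t_u is the static duration stored for unit u in the global static table, n_s is the normalized value kept in the template, and s' is a target syllable that selects the same template.

```latex
n_s = \frac{d_s}{\sum_{u \in U(s)} t_u},
\qquad
\hat{d}_{s'} = n_s \cdot \sum_{u \in U(s')} t_u
```

Dividing by the combined static duration removes the articulatory part of the timing, so the stored value reflects only the prosodic stretch or compression; multiplying by the static durations of the target syllable restores it.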
18 Claims
1. A template generation system for generating a duration template from a plurality of input words, comprising:
a phonetic processor operable to segment each of said input words into input phonemes and group said input phonemes into constituent syllables, each of said constituent syllables having an associated syllable duration;
a phoneme clustering module to cluster said input phonemes comprising a constituent syllable into input phoneme pairs and input single phonemes;
a global static table containing a plurality of stored phonemes comprising stored phoneme pairs and stored single phonemes, each of said stored phonemes having associated static duration information;
a normalization module to generate a normalized duration value for each of said constituent syllables, wherein said normalized duration value is generated by dividing the syllable duration by the combined static duration of the corresponding stored phonemes that comprise said constituent syllable;
the duration template for storing the normalized duration value, said template being specified by text grouping feature, such that the normalized duration value for each constituent syllable having a specific grouping feature is contained in the associated duration template.
a) “L” or “R” or “Y” or “W” followed by a vowel,
b) a vowel followed by “L” or “R” or “N” or “M” or “NG”,
c) a vowel and “R” followed by “L”,
d) a vowel and “L” followed by “R”,
e) “L” followed by “M” or “N”, and
f) two successive vowels.
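The targeted combinations a) through f) can be sketched as a small matcher. This is only an illustration under stated assumptions: the ARPAbet-style vowel set, the function name `match_targeted`, and the reading of c) and d) as three-phoneme sequences are not taken from the claims.

```python
# Assumed vowel inventory (ARPAbet-style symbols); not specified in the claims.
VOWELS = frozenset({
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
})


def is_vowel(phoneme):
    return phoneme in VOWELS


def match_targeted(phonemes, i=0):
    """Return (rule, length) for a targeted combination starting at index i, else None."""
    window = list(phonemes[i:i + 3]) + [None, None, None]
    a, b, c = window[0], window[1], window[2]
    if a is None or b is None:
        return None
    # Assumption: c) and d) span three phonemes, so they are tested before b).
    if is_vowel(a) and b == "R" and c == "L":
        return ("c", 3)
    if is_vowel(a) and b == "L" and c == "R":
        return ("d", 3)
    if a in {"L", "R", "Y", "W"} and is_vowel(b):
        return ("a", 2)
    if is_vowel(a) and b in {"L", "R", "N", "M", "NG"}:
        return ("b", 2)
    if a == "L" and b in {"M", "N"}:
        return ("e", 2)
    if is_vowel(a) and is_vowel(b):
        return ("f", 2)
    return None
```

For instance, `match_targeted(["L", "AA"])` matches rule a), while `match_targeted(["K", "T"])` returns `None` because neither phoneme takes part in a targeted combination.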
11. A method of generating a duration template from a plurality of input words, the method comprising the steps of:
segmenting each of said input words into input phonemes;
grouping the input phonemes into constituent syllables having an associated syllable duration;
clustering the input phonemes into input phoneme pairs and input single phonemes;
retrieving static duration information associated with stored phonemes in a global static table, wherein the stored phonemes correspond to the input phonemes that constitute the constituent syllable;
generating a normalized duration value by dividing the syllable duration by the combined static duration of the stored phonemes corresponding to the input phonemes that constitute the constituent syllable; and
storing the normalized duration value in the duration template.
assigning a grouping feature to each of said constituent syllables; and
specifying each of said duration templates by grouping feature, such that the normalized duration value for each constituent syllable having a specific grouping feature is contained in the associated duration template.
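The normalizing and storing steps recited above can be sketched as follows. This is a minimal illustration, not the patented implementation: the table contents, the millisecond units, the grouping feature label, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical static durations in milliseconds. Tuples stand for stored
# phoneme pairs, strings for stored single phonemes. Values are made up.
STATIC_TABLE = {
    "K": 60.0,
    ("L", "AA"): 150.0,
    "T": 50.0,
}


def combined_static_duration(units, table):
    """Sum the static durations of the clustered units in one syllable."""
    return sum(table[unit] for unit in units)


def normalize(syllable_duration, units, table):
    """Divide the measured syllable duration by the combined static duration."""
    return syllable_duration / combined_static_duration(units, table)


# One duration template per grouping feature, each holding normalized values.
templates = defaultdict(list)

# A syllable clustered as K + (L, AA) + T, with an assumed measured duration of 312 ms.
units = ["K", ("L", "AA"), "T"]
value = normalize(312.0, units, STATIC_TABLE)
templates["stressed-initial"].append(value)  # hypothetical grouping feature label
```

Keying the templates by grouping feature means every syllable that shares a feature contributes to, and later reads from, the same template.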
13. The method of claim 11 further comprising the steps of:
assigning grouping features to the constituent syllables; and
storing the input words and constituent syllables with associated grouping features in a word database.
14. The method of claim 11 wherein the step of clustering the input phonemes into input phoneme pairs and input single phonemes further comprises the steps of:
searching the constituent syllable from left to right;
selecting the input phonemes in the constituent syllable that equate to a targeted combination; and
clustering the selected input phonemes into an input phoneme pair.
15. The method of claim 14 further including the steps of:
searching the constituent syllable from right to left;
selecting the input phonemes in the constituent syllable that equate to the targeted combination; and
clustering the selected input phonemes into an input phoneme pair.
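The directional searches in claims 14 and 15 can be illustrated with a greedy, non-overlapping scan. The claims do not say how the two passes are combined, so this sketch runs one direction per call; the `cluster` function, the toy pair test, and the sample syllable are assumptions for illustration.

```python
def cluster(phonemes, is_pair, direction="ltr"):
    """Greedily group adjacent phonemes into pairs, scanning in one direction.

    Pairs are returned as tuples and leftover phonemes as plain strings.
    """
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    units = []
    if direction == "ltr":
        i = 0
        while i < len(phonemes):
            if i + 1 < len(phonemes) and is_pair(phonemes[i], phonemes[i + 1]):
                units.append((phonemes[i], phonemes[i + 1]))
                i += 2
            else:
                units.append(phonemes[i])
                i += 1
    else:
        i = len(phonemes) - 1
        while i >= 0:
            if i >= 1 and is_pair(phonemes[i - 1], phonemes[i]):
                units.append((phonemes[i - 1], phonemes[i]))
                i -= 2
            else:
                units.append(phonemes[i])
                i -= 1
        units.reverse()  # restore left-to-right order for the caller
    return units


# Toy pair test covering only two of the targeted combinations.
TOY_VOWELS = {"AA", "IH"}


def toy_is_pair(a, b):
    glide_then_vowel = a in {"L", "R", "Y", "W"} and b in TOY_VOWELS
    vowel_then_sonorant = a in TOY_VOWELS and b in {"L", "R", "N", "M", "NG"}
    return glide_then_vowel or vowel_then_sonorant


syllable = ["K", "L", "AA", "N"]
left_to_right = cluster(syllable, toy_is_pair, "ltr")
right_to_left = cluster(syllable, toy_is_pair, "rtl")
```

On this sample the left-to-right scan pairs L with AA, while the right-to-left scan pairs AA with N, which shows why the search direction affects the resulting clusters.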
16. A method of de-normalizing duration data contained in a duration template, the method comprising the steps of:
providing a target word to be synthesized by a text-to-speech system;
segmenting each of said input words into input phonemes;
grouping the input phonemes into constituent syllables having an associated syllable duration;
clustering the input phonemes into input phoneme pairs and input single phonemes;
retrieving static duration information associated with stored phonemes in a global static table, wherein the stored phonemes correspond to the input phonemes that constitute each of the constituent syllables;
retrieving a normalized duration value for each of the constituent syllables from an associated duration template; and
generating a de-normalized syllable duration by multiplying the normalized duration value for each constituent syllable by the combined static duration of the stored phonemes corresponding to the input phonemes that constitute that constituent syllable.
sending the de-normalized syllable duration to a prosody module so that synthesized speech having natural sounding prosody will be transmitted.
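The de-normalizing step of claim 16 can be sketched in a few lines. The table values, the template contents, the grouping feature label, and the assumption that a template holds one normalized value per syllable position are all hypothetical and used only for illustration.

```python
# Hypothetical static durations in milliseconds (pairs as tuples, singles as strings).
STATIC_TABLE = {
    ("L", "AA"): 150.0,
    "T": 50.0,
    ("AH", "N"): 120.0,
}

# Hypothetical template: one normalized value per syllable position (an assumption).
TEMPLATES = {"two-syllable-stress-first": [1.25, 0.8]}


def denormalize_word(syllables, template, table):
    """Multiply each normalized value by the syllable's combined static duration."""
    if len(syllables) != len(template):
        raise ValueError("template and word must have the same syllable count")
    return [
        normalized * sum(table[unit] for unit in units)
        for units, normalized in zip(syllables, template)
    ]


# Target word already segmented, grouped into syllables, and clustered.
target = [[("L", "AA")], ["T", ("AH", "N")]]
durations = denormalize_word(
    target, TEMPLATES["two-syllable-stress-first"], STATIC_TABLE
)
```

The resulting list of syllable durations is what would then be handed to the prosody module.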
18. The method of claim 16 further comprising the step of:
retrieving grouping features associated with the target word from a word dictionary.
Specification