Speech synthesis employing prosody templates
Abstract
Prosody templates, constructed during system design, store intonation (F0) and duration information based on syllabic stress patterns for the target word. The prosody templates are constructed so that words exhibiting the same stress pattern will be assigned the same prosody template. The prosody template information is preferably stored in a normalized form to reduce noise level in the statistical measures. The synthesizer uses a word dictionary that specifies the stress patterns associated with each stored word. These stress patterns are used to access the prosody template database. F0 and duration information is then extracted from the selected template, de-normalized and applied to the phonemic information to produce a natural human-sounding prosody in the synthesized output.
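The lookup and de-normalization flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the stress encoding (one digit per syllable), the sample words, the template values, and the names `WORD_STRESS`, `TEMPLATES`, `lookup_template`, and `denormalize_f0` are all hypothetical and do not come from the patent. De-normalization is modeled as a simple vertical shift of the template, following the claim language about shifting the template to a height that fits the frame sentence pitch contour.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProsodyTemplate:
    # Normalized F0 values in Hz, expressed relative to a common
    # reference point (0.0 means "at the reference height").
    f0: Tuple[float, ...]


# Hypothetical word dictionary: one digit per syllable,
# 1 = stressed, 0 = unstressed (an assumed encoding).
WORD_STRESS: Dict[str, str] = {"table": "10", "window": "10", "guitar": "01"}

# Hypothetical template database keyed by (stress pattern, syllable count).
TEMPLATES: Dict[Tuple[str, int], ProsodyTemplate] = {
    ("10", 2): ProsodyTemplate(f0=(12.0, 5.0, -3.0, -10.0)),
    ("01", 2): ProsodyTemplate(f0=(-4.0, 2.0, 14.0, 0.0)),
}


def lookup_template(word: str) -> ProsodyTemplate:
    """Find the template for a word via its stress pattern and syllable count."""
    stress = WORD_STRESS[word]
    return TEMPLATES[(stress, len(stress))]


def denormalize_f0(template: ProsodyTemplate, sentence_height_hz: float) -> List[float]:
    """Shift the normalized contour to the height of the sentence pitch contour."""
    return [sentence_height_hz + value for value in template.f0]
```

In this sketch, "table" and "window" share the stress pattern "10", so both resolve to the same template, which is the property the abstract emphasizes.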
12 Claims
1. An apparatus for generating synthesized speech from a text of input words, comprising:
a word dictionary containing information about a plurality of stored words, wherein said information identifies a stress pattern associated with each of said stored words;
a text processor that generates phonemic representations of said input words using said word dictionary to identify the stress pattern of said input words;
a prosody module having a database of standardized templates containing prosody information accessed via a stress pattern and a number of syllables, wherein said prosody information is normalized and parameterized;
a sound generation module that denormalizes and converts said standardized templates for applying to said phonemic representation; and
denormalizing said template via a sound generation module, said denormalizing shifts said template to a height that fits said frame sentence pitch contour.
2. A method for training a prosody template using human speech, comprising:
segmenting words of a sentence into phonemes associated with syllables of said words;
assigning stress levels to said syllables;
grouping said words according to said stress levels thereby forming stress pattern groups;
adjusting intonation data associated with each one of said stress pattern groups thereby providing normalized data;
adjusting a pitch shift of said normalized data thereby providing transformed data; and
storing said transformed data in a prosody database as a template.
forming an elevation point for said target word, said elevation point based on linear regression of said transformed data and a word end-boundary.
7. The method of claim 4 wherein said elevation point is adjusted as a common reference point.
8. The method of claim 7 producing a constant representing said denormalizing based on the regression-line coefficient of said frame sentence pitch contour.
9. The method of claim 7 further comprises the step of:
accessing a duration template operably permitting denormalization of said duration information thereby associating a time with each of said syllables.
10. The method of claim 8 further comprises the step of:
transforming log-domain values of said duration template into linear values.
11. The method of claim 9 further comprises the step of:
resampling each of said syllable segments of the template for a fixed duration such that the total duration of (each) corresponds to the denormalized time values, whereby the intonation contour is associated with a physical timeline.
12. The method of claim 10 further comprises the steps of:
storing duration information as ratios of phoneme values to globally determined duration values, said globally determined duration values are based on mean values across the entire training corpus;
per-syllable values based on a sum of the observed phoneme; and
said prosody template populated with said per-syllable versus global ratios operable permitting computation of an actual duration of said each syllable.
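The training steps recited in claim 2 (grouping words by stress pattern, normalizing the intonation data, and storing the result as a template) might be sketched as below. Several simplifications are assumptions of this sketch and not the patented procedure: each training word is taken to arrive with an F0 contour already sampled to the same length as the others in its group, normalization is reduced to subtracting the F0 value at the word end-boundary (standing in for the regression-based elevation point), and the template is the per-sample mean of the group. The function name `train_templates` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Sequence, Tuple

TemplateKey = Tuple[str, int]  # (stress pattern, number of syllables)


def train_templates(
    samples: Iterable[Tuple[str, Sequence[float]]]
) -> Dict[TemplateKey, List[float]]:
    """Build one averaged, normalized F0 template per stress pattern group.

    samples: (stress_pattern, f0_contour_hz) pairs, one per training word.
    Contours within a group are assumed to have equal length.
    """
    groups: Dict[TemplateKey, List[List[float]]] = defaultdict(list)
    for stress, f0 in samples:
        # Assumed normalization: use the word-end F0 value as the
        # common reference point and express the contour relative to it.
        reference = f0[-1]
        groups[(stress, len(stress))].append([v - reference for v in f0])

    # Average the normalized contours sample by sample within each group.
    return {
        key: [fmean(column) for column in zip(*contours)]
        for key, contours in groups.items()
    }


# Hypothetical training data: two words with pattern "10", one with "01".
templates = train_templates([
    ("10", [130, 120, 110]),
    ("10", [150, 138, 130]),
    ("01", [100, 115, 105]),
])
```

Averaging after normalization is what lets words spoken at different pitch heights contribute to the same template.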
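The duration handling in claims 9 through 12 (per-syllable values stored as ratios against globally determined means, kept in the log domain, then converted back to linear values) can be illustrated with a small round-trip. The phoneme symbols, the mean durations in `GLOBAL_MEAN_MS`, and both function names are hypothetical, and storing the ratio as a natural logarithm is an assumption; the claims only say that log-domain values are transformed into linear values.

```python
import math
from typing import Dict, Sequence

# Hypothetical global mean phoneme durations (ms), standing in for
# means computed across an entire training corpus.
GLOBAL_MEAN_MS: Dict[str, float] = {
    "t": 80.0, "ey": 140.0, "b": 70.0, "ax": 60.0, "l": 75.0,
}


def syllable_log_ratio(phonemes: Sequence[str], observed_ms: Sequence[float]) -> float:
    """Normalize: log of (summed observed duration / summed global means)."""
    global_sum = sum(GLOBAL_MEAN_MS[p] for p in phonemes)
    return math.log(sum(observed_ms) / global_sum)


def denormalize_duration(phonemes: Sequence[str], log_ratio: float) -> float:
    """De-normalize: convert the log ratio to linear and rescale by global means."""
    global_sum = sum(GLOBAL_MEAN_MS[p] for p in phonemes)
    return math.exp(log_ratio) * global_sum


# A syllable spoken 10% slower than the corpus average.
ratio = syllable_log_ratio(["t", "ey"], [88.0, 154.0])
duration_ms = denormalize_duration(["t", "ey"], ratio)
```

Because the stored value is a ratio and not an absolute time, the same template entry yields different actual durations depending on which phonemes make up the syllable being synthesized.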
Specification