Voice font speaker and prosody interpolation
Abstract
Multi-voice font interpolation is provided. A multi-voice font interpolation engine allows the production of computer-generated speech with a wide variety of speaker characteristics and/or prosody by interpolating speaker characteristics and prosody from existing fonts. Using prediction models from multiple voice fonts, the multi-voice font interpolation engine predicts values for the parameters that influence speaker characteristics and/or prosody for the phoneme sequence obtained from the text to be spoken. For each parameter, additional parameter values are generated by a weighted interpolation from the predicted values. Modifying an existing voice font with the interpolated parameters changes the style and/or emotion of the speech while retaining the base sound qualities of the original voice. The multi-voice font interpolation engine allows the speaker characteristics and/or prosody to be transplanted from one voice font to another or entirely new speaker characteristics and/or prosody to be generated for an existing voice font.
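The weighted interpolation described in the abstract can be written out as a short formula. The abstract says only that values are generated "by a weighted interpolation from the predicted values", so the linear form, the normalization of the weights, and all symbols below are assumptions made for illustration. Here x_{i,p}(t) stands for the value of parameter p (for example duration, f0, or a spectral coefficient) predicted by source voice font i at position t, and w_{i,p} is the weight given to that font for that parameter.

```latex
\hat{x}_p(t) = \sum_{i=1}^{N} w_{i,p}\, x_{i,p}(t),
\qquad w_{i,p} \ge 0,
\qquad \sum_{i=1}^{N} w_{i,p} = 1
% Worked example (assumed numbers): two fonts predict f0 values of
% 100 Hz and 200 Hz, with f0 weights 0.75 and 0.25:
% \hat{x}_{f0} = 0.75 \cdot 100 + 0.25 \cdot 200 = 125 \text{ Hz}
```

Under this reading, a weight of 1 for one font and 0 for the others transplants that font's predicted value unchanged, while intermediate weights give a blend that none of the source fonts would produce on its own.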
20 Claims
1. A method allowing computer-generated speech to be rendered with a multi-voice font that is different than source voice fonts used to generate the multi-voice font, the method comprising the acts of:
loading the source voice fonts;
assigning weights to characteristics of each source voice font;
obtaining text to be rendered as the computer-generated speech;
predicting characteristic values for the text for each source voice font using at least one characteristic prediction model associated with each source voice font;
merging the predicted characteristic values with the corresponding weights to produce interpolated characteristic values; and
rendering the text as computer-generated speech having the interpolated characteristic values.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 18
11. A system for generating a multi-voice font from a plurality of source voice fonts, the system comprising:
a phoneme sequencer for parsing input text into a sequence of phonemes;
a predictor operable to predict values of voice font characteristics for the phonemes for each source voice font of the plurality of source voice fonts using at least one characteristic model associated with each source voice font;
a weight selector operable to assign a duration weight, a f0 weight, and a spectrum weight to each source voice font, the duration weight, the f0 weight, and the spectrum weight determining the relative contribution of the voice font characteristics predicted for the corresponding source voice font to the multi-voice font;
an interpolator operable to merge the predicted voice font characteristics with the weights to produce the multi-voice font having voice font characteristics derived from the source voice fonts; and
a voice encoder operable to render the input text as computer-generated speech using the multi-voice font, the computer-generated speech having the voice font characteristics derived from the source voice fonts.
Dependent claims: 12, 13, 14
15. A tangible computer storage medium containing computer executable instructions which, when executed by a computer, perform a method of generating a multi-voice font for rendering text as computer-generated speech, the method comprising the acts of:
obtaining the text to be rendered as the computer-generated speech;
loading the source voice fonts;
predicting duration values, voiced/unvoiced probability values, f0 values, and spectral trajectory values for the text for each source voice font using at least one characteristic prediction model associated with each source voice font;
assigning a duration weight, a f0 weight, a spectrum weight to each source voice font;
merging the duration values predicted with each source voice font with the duration weight to produce interpolated duration values, the duration weight for each source voice font representing the percentage that the source voice font contributes to the interpolated duration values;
merging the f0 values predicted with each source voice font with the f0 weight given to that source voice font to produce interpolated f0 values, the f0 weight for each source voice font representing the percentage that the source voice font contributes to the interpolated f0 values;
merging the voiced/unvoiced decision values and the spectral trajectory values predicted with each source voice font with the spectrum weight given to that source voice font to produce interpolated voiced/unvoiced probability values and interpolated spectral trajectory values, the spectrum weight for each source voice font representing the percentage that the source voice font contributes to the interpolated voiced/unvoiced probability values and interpolated spectral trajectory values; and
rendering the text as computer-generated speech having the interpolated duration values, interpolated f0 values, interpolated voiced/unvoiced probability values, and interpolated spectral trajectory values.
Dependent claims: 16, 17, 19, 20
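The merging acts recited in claim 15 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `Prediction`, `Weights`, and `interpolate` are hypothetical, the merge is assumed to be a linear weighted sum, and the weights are assumed to be normalized so that each acts as a fractional contribution (matching the claim's "percentage" wording). The sketch also assumes that the predictions from every source voice font are already aligned to the same number of phonemes and frames.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Prediction:
    """Hypothetical container for one source voice font's predicted values."""
    duration: List[float]        # one value per phoneme
    f0: List[float]              # one value per frame
    vuv: List[float]             # voiced/unvoiced probability per frame
    spectrum: List[List[float]]  # spectral trajectory: one vector per frame


@dataclass
class Weights:
    """Hypothetical per-font weights, one per characteristic group."""
    duration: float
    f0: float
    spectrum: float


def _blend(seqs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Element-wise weighted sum of equal-length sequences (assumed linear merge)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    norm = [w / total for w in weights]  # each weight becomes a fractional share
    return [sum(w * v for w, v in zip(norm, vals)) for vals in zip(*seqs)]


def interpolate(preds: Sequence[Prediction], weights: Sequence[Weights]) -> Prediction:
    """Merge per-font predictions using separate duration, f0, and spectrum weights."""
    if not preds or len(preds) != len(weights):
        raise ValueError("need exactly one Weights entry per Prediction")
    dur_w = [w.duration for w in weights]
    f0_w = [w.f0 for w in weights]
    spec_w = [w.spectrum for w in weights]
    return Prediction(
        duration=_blend([p.duration for p in preds], dur_w),
        f0=_blend([p.f0 for p in preds], f0_w),
        # The spectrum weight drives both the voiced/unvoiced values
        # and the spectral trajectory, as in the claim.
        vuv=_blend([p.vuv for p in preds], spec_w),
        spectrum=[_blend(list(frame), spec_w)
                  for frame in zip(*[p.spectrum for p in preds])],
    )
```

Keeping three independent weights per font means, for example, that timing can be taken mostly from one font while the spectrum comes entirely from another. A rendering step (the "voice encoder" of claim 11) would consume the returned values and is outside the scope of this sketch.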
Specification