Method and apparatus for controlling a speech synthesis system to provide multiple styles of speech
Abstract
A method and apparatus for synthesizing speech from text whereby the speech may be generated in a manner so as to effectively convey a particular, selectable style. Repeated patterns of one or more prosodic features—such as, for example, pitch, amplitude, spectral tilt, and/or duration—occurring at characteristic locations in the synthesized speech, are advantageously used to convey a particular chosen style. For example, one or more of such feature patterns may be used to define a particular speaking style, and an illustrative text-to-speech system then makes use of such a defined style to adjust the specified parameter or parameters of the synthesized speech in a non-uniform manner (i.e., in accordance with the defined feature pattern or patterns).
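As a rough illustration of adjusting a parameter "in a non-uniform manner", the sketch below scales a flat pitch track by a short pattern applied only at chosen locations, leaving all other frames untouched. The function name `apply_pattern`, the use of multiplicative scale factors, and the sample values are assumptions made for illustration; the abstract does not specify how a feature pattern is represented.

```python
def apply_pattern(pitch_hz, pattern, locations):
    """Return a copy of pitch_hz with `pattern` (multiplicative factors)
    applied starting at each index in `locations`.

    Frames not covered by any pattern instance are left unchanged, so the
    adjustment is non-uniform across the utterance.
    """
    out = list(pitch_hz)
    for start in locations:
        for offset, factor in enumerate(pattern):
            i = start + offset
            if 0 <= i < len(out):
                # Scale the original value so overlapping patterns do not compound.
                out[i] = pitch_hz[i] * factor
    return out


# Hypothetical data: a flat 100 Hz pitch track and a rising pattern.
flat = [100.0] * 8
rise = [1.0, 1.25, 1.5]
styled = apply_pattern(flat, rise, locations=[2, 5])
```

Here the same rising pattern recurs at two characteristic locations, which is the kind of repeated feature pattern the abstract associates with a speaking style.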
16 Claims
1. A method for synthesizing a voice signal based on a predetermined voice control information stream, the voice signal selectively synthesized to have a particular prosodic style, the method comprising the steps of:
analyzing said predetermined voice control information stream to identify one or more portions thereof for prosody control;
selecting one or more prosody control templates based on the particular prosodic style selected for said voice signal synthesis;
applying said one or more selected prosody control templates to said one or more identified portions of said predetermined voice control information stream, thereby generating a stylized voice control information stream; and
synthesizing said voice signal based on said stylized voice control information stream so that said synthesized voice signal has said particular prosodic style, wherein said one or more prosody control templates comprise tag templates which are selected from a tag template database and wherein said step of applying said selected prosody control templates to said identified portions of said predetermined voice control information stream comprises the steps of:
expanding each of said tag templates into one or more tags;
converting said one or more tags into a time series of prosodic features; and
generating said stylized voice control information stream based on said time series of prosodic features.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
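The analyzing, selecting, and applying steps of claim 1 can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the patented implementation: the stream is modeled as a list of frames with a `pitch` value and a hypothetical `mark` flag, the tag template database `TAG_TEMPLATES` and its `"emphatic"` style are invented, only pitch is modeled, and tags are converted to a time series by linear interpolation (the claim does not say how the conversion is done). The final synthesis step is omitted.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    position: int   # frame index within the identified portion
    value: float    # pitch scale factor targeted at that frame


# Hypothetical tag template database: style -> list of tag templates.
# Each template is (relative position in [0, 1], pitch scale factor).
TAG_TEMPLATES = {
    "emphatic": [(0.0, 1.0), (0.5, 1.5), (1.0, 1.0)],
}


def find_portions(stream):
    """Analyze the stream: return (start, end) runs of marked frames, end exclusive."""
    portions, start = [], None
    for i, frame in enumerate(stream):
        if frame.get("mark") and start is None:
            start = i
        elif not frame.get("mark") and start is not None:
            portions.append((start, i))
            start = None
    if start is not None:
        portions.append((start, len(stream)))
    return portions


def expand(template, n_frames):
    """Expand one tag template into concrete tags for a portion of n_frames."""
    rel_pos, value = template
    return [Tag(round(rel_pos * (n_frames - 1)), value)]


def tags_to_series(tags, n_frames):
    """Convert tags into a per-frame time series by linear interpolation."""
    pts = sorted((t.position, t.value) for t in tags)
    series = []
    for i in range(n_frames):
        if i <= pts[0][0]:
            series.append(pts[0][1])
        elif i >= pts[-1][0]:
            series.append(pts[-1][1])
        else:
            for (p0, v0), (p1, v1) in zip(pts, pts[1:]):
                if p0 <= i <= p1:
                    series.append(v0 + (v1 - v0) * (i - p0) / (p1 - p0))
                    break
    return series


def stylize(stream, style):
    """Select templates for `style` and apply them to each identified portion."""
    out = [dict(frame) for frame in stream]  # leave the input stream untouched
    for start, end in find_portions(stream):
        n = end - start
        tags = [t for tpl in TAG_TEMPLATES[style] for t in expand(tpl, n)]
        for k, scale in enumerate(tags_to_series(tags, n)):
            out[start + k]["pitch"] *= scale
    return out


# Hypothetical input: eight 100 Hz frames, with frames 2..6 marked for control.
stream = [{"pitch": 100.0, "mark": 2 <= i <= 6} for i in range(8)]
stylized = stylize(stream, "emphatic")
pitches = [frame["pitch"] for frame in stylized]
```

Keeping expansion (`expand`) separate from conversion (`tags_to_series`) mirrors the claim's split between tags and the time series of prosodic features, so either stage could be swapped out without touching the other.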
9. An apparatus for synthesizing a voice signal based on a predetermined voice control information stream, the voice signal selectively synthesized to have a particular prosodic style, the apparatus comprising:
means for analyzing said predetermined voice control information stream to identify one or more portions thereof for prosody control;
means for selecting one or more prosody control templates based on the particular prosodic style selected for said voice signal synthesis;
means for applying said one or more selected prosody control templates to said one or more identified portions of said predetermined voice control information stream, thereby generating a stylized voice control information stream; and
means for synthesizing said voice signal based on said stylized voice control information stream so that said synthesized voice signal has said particular prosodic style, wherein said one or more prosody control templates comprise tag templates which are selected from a tag template database and wherein said means for applying said selected prosody control templates to said identified portions of said predetermined voice control information stream comprises:
means for expanding each of said tag templates into one or more tags;
means for converting said one or more tags into a time series of prosodic features; and
means for generating said stylized voice control information stream based on said time series of prosodic features.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
Specification