Method and apparatus for converting voice characteristics of synthesized speech
First Claim
1. A text-to-speech synthesis system for producing audible synthesized human speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics from digital characters comprising:
text reader means adapted to be exposed to text material and responsive thereto for generating information signals indicative of the substantive content thereof;
converter means for receiving said information signals from said text reader means and generating digital character signals representative thereof;
means for receiving said digital character signals from said converter means;
memory means storing digital speech data including digital speech instructional rules and digital speech data representative of sound unit code signals;
data processing means for searching said digital speech data stored in said memory means to locate digital speech data representative of a sound unit code corresponding to said digital character signals received from said converter means;
speech memory means storing digital speech data representative of a plurality of sound units;
concatenating controller means operably coupled to said speech memory means for selectively combining digital speech data representative of a plurality of sound units in a serial sequence to provide concatenated digital speech data representative of a word;
speech synthesis controller means coupled to said data processing means and to said speech memory means for receiving digital speech signals representative of a sound unit code corresponding to said digital character signals and selectively accessing digital speech data representative of sound units corresponding to said sound unit code from said speech memory means;
speech synthesizer means operably coupled to said concatenating controller means and said speech synthesis controller means for receiving selectively accessed serial sequences of digital speech data from said concatenating controller means to provide audio signals corresponding thereto and representative of synthesized human speech;
voice characteristics conversion means interposed between said concatenating controller means and said speech synthesizer means and being coupled therebetween independently of the coupling between said concatenating controller means and said speech synthesizer means, said voice characteristics conversion means being operably coupled to said speech synthesis controller means and being responsive thereto to selectively modify the voice characteristics of said serially sequenced digital speech data output from said concatenating controller means, said voice characteristics conversion means including means for making a voice character selection of the synthesized speech to be derived from the digital speech data as selectively accessed from said speech memory means so as to simulate a voice sound differing in character with respect to the voice sound of the synthesized speech from the digital speech data of said speech memory means in the voice characteristics pertaining to the apparent age and/or sex of the speaker;
said digital speech data as selectively accessed from said speech memory means having a predetermined pitch period, a predetermined vocal tract model and a predetermined speech rate;
speech parameter control means for modifying the pitch period and speech rate in response to inputs from said voice character selection means to produce a modified pitch period and a modified speech rate, said speech parameter control means including sample rate control circuit means responsive to inputs from said voice character selection means for adjusting the sampling period of said digital speech data selectively accessed from said speech memory means in a manner altering the digital speech formants contained therein to a preselected degree and providing adjusted sampling period signals as an output;
speech data reconstructing means operably associated with said speech parameter control means for combining the modified pitch period and the modified speech rate with the predetermined vocal tract model into a synthesized speech data format of speech data modified with respect to the original speech data from said speech memory means;
said speech synthesizer means being coupled to said speech data reconstructing means and to the output of said sample rate control circuit means for receiving the modified speech data and the adjusted sampling period signals therefrom in providing said audio signals representative of human speech from the modified speech data; and
audio means coupled to said speech synthesizer means for converting said audio signals into audible synthesized human speech in any one of a plurality of voice sounds from said digital speech data stored in said speech memory means as determined by said voice characteristics conversion means.
Abstract
Method and apparatus for converting voice characteristics of synthesized speech from a single applied source of synthesized speech in a manner obtaining modified voice characteristics pertaining to the apparent age and/or sex of the speaker. The apparatus is capable of altering the voice characteristics of synthesized speech to obtain modified voice sounds simulating child-like, teenage, adult, aged and sexual preference characteristics by control of vocal tract parameters including pitch period, vocal tract model, and speech data rate. A source of synthesized speech having a predetermined pitch period, a predetermined vocal tract model, and a predetermined speech rate is separated into the respective speech parameters. The values of pitch, the speech data frame length, and the speech data rate are then varied in a preselected manner to modify the voice characteristics of the synthesized speech from the source thereof. Thereafter, the changed speech data parameters are re-combined into a modified synthesized speech data format having different voice characteristics with respect to the synthesized speech from the source, and an audio signal representative of human speech is generated from the modified synthesized speech data format from which audible synthesized speech may be generated.
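The separate, modify, and re-combine flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `SpeechFrame` and `VoiceProfile` types, the profile names, and every scale factor below are hypothetical values chosen only for illustration. Each frame is treated as three independent parameters, the pitch period and frame length are scaled, and the vocal tract model is passed through unchanged.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class SpeechFrame:
    """One frame of synthesized-speech data (hypothetical layout)."""
    pitch_period: int                 # in samples; 0 marks an unvoiced frame
    tract_coeffs: Tuple[float, ...]   # vocal tract model, left untouched
    frame_length: int                 # samples per frame; sets the speech rate


@dataclass(frozen=True)
class VoiceProfile:
    """Scale factors for one voice character (hypothetical values)."""
    pitch_scale: float   # multiplier applied to the pitch period
    rate_scale: float    # multiplier applied to the frame length


# Assumed example profiles; the source does not give numeric settings here.
PROFILES = {
    "adult": VoiceProfile(pitch_scale=1.0, rate_scale=1.0),
    "child": VoiceProfile(pitch_scale=0.5, rate_scale=0.9),
    "aged": VoiceProfile(pitch_scale=1.2, rate_scale=1.3),
}


def convert_frame(frame: SpeechFrame, profile: VoiceProfile) -> SpeechFrame:
    """Modify pitch period and frame length, then re-combine with the tract model."""
    pitch = frame.pitch_period
    if pitch > 0:  # unvoiced frames carry no pitch to modify
        pitch = max(1, round(pitch * profile.pitch_scale))
    length = max(1, round(frame.frame_length * profile.rate_scale))
    return replace(frame, pitch_period=pitch, frame_length=length)


def convert_voice(frames: List[SpeechFrame], voice: str) -> List[SpeechFrame]:
    """Apply one voice profile to every frame from the single speech source."""
    profile = PROFILES[voice]
    return [convert_frame(f, profile) for f in frames]
```

A shorter pitch period corresponds to a higher-pitched voice, so the assumed "child" profile halves the pitch period while leaving the vocal tract coefficients as stored.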
29 Claims
2. A method of converting voice characteristics of synthesized speech to obtain modified synthesized speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics from a single applied source of synthesized speech, said method comprising:
providing a source of synthesized speech in the form of digital speech data subject to speech synthesization using a predetermined sample period comprising a known number of task-accomplishing time increments;
adjusting the sampling period of the digital speech data from said source of synthesized speech in a manner altering the digital speech formants contained therein to a preselected degree;
producing modified digital speech data including the adjusted sampling period and having modified voice characteristics as compared to the synthesized speech from said source;
generating audio signals representative of human speech from the modified digital speech data; and
converting said audio signals into audible synthesized human speech having different voice characteristics from the synthesized human speech which would have been obtained from said source of synthesized speech.
Dependent claims: 3, 4, 5
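The effect of the sampling-period adjustment recited in claim 2 can be illustrated numerically. A digital filter places its resonances at fixed fractions of the sampling rate, so playing the same data with a shorter sampling period raises every formant in proportion and shortens the utterance. The sketch below assumes that simple proportional model; the function names and the example numbers are illustrative and do not come from the patent.

```python
from typing import List


def scaled_formants(formants_hz: List[float],
                    original_period_us: float,
                    adjusted_period_us: float) -> List[float]:
    """Formant frequencies after the same data is clocked at a new sampling period."""
    if original_period_us <= 0 or adjusted_period_us <= 0:
        raise ValueError("sampling periods must be positive")
    # Frequencies scale with the sampling rate, i.e. inversely with the period.
    ratio = original_period_us / adjusted_period_us
    return [f * ratio for f in formants_hz]


def duration_ms(num_samples: int, period_us: float) -> float:
    """Playback time of num_samples samples at the given sampling period."""
    return num_samples * period_us / 1000.0
```

For example, moving from an assumed 125 microsecond period (8 kHz) to 100 microseconds (10 kHz) raises a 500 Hz formant to 625 Hz, and the same 8000 samples play in 800 ms instead of 1000 ms.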
6. A method of converting voice characteristics of synthesized speech to obtain modified synthesized speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics from a single applied source of synthesized speech, said method comprising:
providing a source of synthesized speech as digital speech data including a predetermined pitch period, a predetermined vocal tract model, and a predetermined speech rate;
separating the pitch period, vocal tract model, and speech rate from each other to define said pitch period, vocal tract model, and speech rate as respective independent speech synthesis factors;
adjusting the sampling period associated with said digital speech data from said source of synthesized speech in a manner altering the digital speech formants contained therein to a preselected degree;
modifying the predetermined pitch period and the predetermined speech rate independently of each other and in respective response to the adjusted sampling period in a preselected manner to modify the voice characteristics of the synthesized speech from said source;
re-combining the modified pitch period, the modified speech rate, and the predetermined vocal tract model into a synthesized speech data format of digital speech data modified with respect to the synthesized speech from said source;
generating audio signals representative of human speech from the modified digital speech data in conjunction with the adjusted sampling period; and
converting said audio signals into audible synthesized human speech having different voice characteristics from the synthesized human speech which would have been obtained from said source of synthesized speech.
Dependent claims: 7, 8, 9, 21, 22, 23, 24
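One way to read the step of modifying pitch period and speech rate "in respective response to the adjusted sampling period" is as a compensation: a pitch period or frame length counted in samples lasts a different time once the sampling period changes, so the counts are rescaled to reach the intended result. The Python sketch below follows that reading, which is an interpretation and not a statement of the patented method; `compensated_count` and its parameters are hypothetical.

```python
def compensated_count(count_samples: int,
                      time_scale: float,
                      period_ratio: float) -> int:
    """Rescale a sample count (pitch period or frame length).

    count_samples: original value in samples at the original sampling period.
    time_scale:    desired change in real time (1.0 keeps the original duration).
    period_ratio:  adjusted sampling period divided by original sampling period.
    """
    if count_samples == 0:
        return 0  # e.g. an unvoiced frame with no pitch period
    if time_scale <= 0 or period_ratio <= 0:
        raise ValueError("scale factors must be positive")
    # Each sample now lasts period_ratio times as long, so divide it back out.
    return max(1, round(count_samples * time_scale / period_ratio))
```

With an assumed period ratio of 0.8, an 80-sample pitch period becomes 100 samples if the pitch is to stay where it was in real time, while the formants still shift with the sampling period. Pitch and rate can be given different `time_scale` values because the two counts are handled independently.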
10. Apparatus for converting voice characteristics of synthesized speech to obtain modified synthesized speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics from a single applied source of synthesized speech, said apparatus comprising:
voice character conversion controller means for receiving digital speech data from which synthesized speech may be derived from a source thereof, said digital speech data having a predetermined pitch period, a predetermined vocal tract model and a predetermined speech rate, said voice character conversion controller means having means for selecting digital speech data representative of at least a portion of a word, and means for making a voice character selection of the synthesized speech to be derived from the digital speech data received from said source simulating a voice sound differing in character with respect to the voice sound of the synthesized speech from said source in the voice characteristics pertaining to the apparent age and/or sex of the speaker;
speech parameter control means for modifying the pitch period and speech rate in response to inputs from said voice character conversion controller means as determined by said voice character selection means thereof to produce a modified pitch period and a modified speech rate;
speech data reconstructing means operably associated with said speech parameter control means for combining the modified pitch period and the modified speech rate with the predetermined vocal tract model into a synthesized speech data format of speech data modified with respect to the original speech data from said source;
speech synthesizer means coupled to said speech data reconstructing means for receiving the modified speech data therefrom and generating audio signals representative of human speech from the modified speech data; and
audio means coupled to said speech synthesizer means for converting said audio signals into synthesized human speech having different voice characteristics from the synthesized speech which would have been obtained from the source of synthesized speech.
Dependent claims: 11, 12, 13, 14, 15
16. A speech synthesis system comprising:
memory means having digital speech data stored therein from which synthesized speech having predetermined voice characteristics may be derived;
speech synthesizer means operably connected to said memory means for receiving digital speech data therefrom to generate audio signals from which audible synthesized human speech may be provided;
controller means operably associated with said memory means and said speech synthesizer means for selectively accessing digital speech data from said memory means to be input to said speech synthesizer means;
voice characteristics conversion means interconnected between said memory means and said speech synthesizer means for modifying voice characteristics of the digital speech data selectively accessed from said memory means in response to said controller means; and
audio means coupled to the output of said speech synthesizer means for converting said audio signals into audible synthesized human speech having different voice characteristics from the synthesized speech which would have been obtained from said digital speech data stored in said memory means.
Dependent claims: 17, 18, 26, 27
19. A text-to-speech synthesis system for producing audible synthesized human speech from digital characters comprising:
means for receiving the digital characters;
speech unit rule means for storing encoded speech parameter signals corresponding to the digital characters;
rules processor means for searching the speech unit rule means to provide encoded speech parameter signals corresponding to the digital characters; and
speech producing means connected to receive the encoded speech parameter signals and to produce audible synthesized human speech therefrom, said speech producing means including voice characteristics conversion means selectively operable to modify the voice characteristics of the encoded speech parameter signals corresponding to the digital characters such that said speech producing means is enabled to provide audible synthesized human speech of any one of a plurality of voice sounds.
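The rule search recited in claim 19, in which received characters are matched against stored rules to obtain speech unit codes, can be pictured with a small lookup sketch. The claim does not specify the rule format, so the greedy longest-match strategy, the `RULES` table, and the unit code names below are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical rule table mapping character sequences to sound unit codes.
RULES: Dict[str, str] = {
    "sh": "SH",
    "s": "S",
    "h": "HH",
    "i": "IH",
    "p": "P",
}


def to_unit_codes(text: str, rules: Dict[str, str] = RULES) -> List[str]:
    """Search the rule table with a greedy longest match at each position."""
    longest = max(len(key) for key in rules)
    codes: List[str] = []
    pos = 0
    lowered = text.lower()
    while pos < len(lowered):
        for size in range(min(longest, len(lowered) - pos), 0, -1):
            chunk = lowered[pos:pos + size]
            if chunk in rules:
                codes.append(rules[chunk])
                pos += size
                break
        else:
            pos += 1  # no rule applies; skip the character
    return codes
```

The resulting code sequence would then select stored speech data for each unit, which a downstream stage could concatenate and pass through voice characteristics conversion.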
20. A text-to-speech synthesis system for producing audible synthesized human speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics from digital characters comprising:
text reader means adapted to be exposed to text material and responsive thereto for generating information signals indicative of the substantive content thereof;
converter means for receiving said information signals from said text reader means and generating digital character signals representative thereof;
means for receiving said digital character signals from said converter means;
memory means storing digital speech data including digital speech instructional rules and digital speech data representative of sound unit code signals;
data processing means for searching said digital speech data stored in said memory means to locate digital speech data representative of a sound unit code corresponding to said digital character signals received from said converter means;
speech memory means storing digital speech data representative of a plurality of sound units;
concatenating controller means operably coupled to said speech memory means for selectively combining digital speech data representative of a plurality of sound units in a serial sequence to provide concatenated digital speech data representative of a word;
speech synthesis controller means coupled to said data processing means and to said speech memory means for receiving digital speech signals representative of a sound unit code corresponding to said digital character signals and selectively accessing digital speech data representative of sound units corresponding to said sound unit code from said speech memory means;
speech synthesizer means operably coupled to said concatenating controller means and said speech synthesis controller means for receiving selectively accessed serial sequences of digital speech data from said concatenating controller means to provide audio signals corresponding thereto and representative of synthesized human speech;
voice characteristics conversion means interposed between said concatenating controller means and said speech synthesizer means and being coupled therebetween independently of the coupling between said concatenating controller means and said speech synthesizer means, said voice characteristics conversion means being operably coupled to said speech synthesis controller means and being responsive thereto to selectively modify the voice characteristics of said serially sequenced digital speech data output from said concatenating controller means; and
audio means coupled to said speech synthesizer means for converting said audio signals into audible synthesized human speech in any one of a plurality of voice sounds from said digital speech data stored in said speech memory means as determined by said voice characteristics conversion means.
Dependent claims: 29
25. A speech synthesis system comprising:
memory means providing a source of synthesized speech as digital speech data stored therein from which synthesized speech having predetermined voice characteristics may be derived;
speech synthesizer means operably connected to said memory means for receiving digital speech data therefrom to generate audio signals from which audible synthesized human speech may be provided;
controller means operably associated with said memory means and said speech synthesizer means for selectively accessing digital speech data from said memory means to be input to said speech synthesizer means;
voice characteristics conversion means interconnected between said memory means and said speech synthesizer means for modifying voice characteristics of the digital speech data selectively accessed from said memory means in response to said controller means, said voice characteristics conversion means comprising means for making a voice character selection of the synthesized speech to be derived from the digital speech data received from said memory means as selectively accessed in response to said controller means to simulate a voice sound differing in character with respect to the voice sound of the synthesized speech from the digital speech data as selectively accessed from said memory means in the voice characteristics pertaining to the apparent age and/or sex of the speaker;
said digital speech data as accessed from said memory means having a predetermined pitch period, a predetermined vocal tract model and a predetermined speech rate;
speech parameter control means for modifying the pitch period and speech rate in response to inputs from said voice character selection means to produce a modified pitch period and a modified speech rate, said speech parameter control means including sample rate control circuit means responsive to inputs from said voice character selection means for adjusting the sampling period of said digital speech data as selectively accessed from said memory means in a manner altering the digital speech formants contained therein to a preselected degree and providing adjusted sampling period signals as an output;
speech data reconstructing means operably associated with said speech parameter control means for combining the modified pitch period and the modified speech rate with the predetermined vocal tract model into a synthesized speech data format of speech data modified with respect to the original speech data as selectively accessed from said memory means;
said speech synthesizer means being coupled to the output of said sample rate control circuit means for receiving said adjusted sampling period signals therefrom as the modified speech data from said speech data reconstructing means is being input thereto in generating said audio signals representative of human speech from the modified speech data; and
audio means coupled to the output of said speech synthesizer means for converting said audio signals into audible synthesized human speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics and having different voice characteristics from the synthesized speech which would have been obtained from said digital speech data stored in said memory means.
28. A text-to-speech synthesis system for producing audible synthesized human speech from digital characters comprising:
means for receiving the digital characters;
speech unit rule means for storing encoded speech parameter signals corresponding to the digital characters;
rules processor means for searching the speech unit rule means to provide encoded speech parameter signals corresponding to the digital characters and in the form of digital speech data from which synthesized speech having predetermined voice characteristics may be derived;
voice characteristics conversion means selectively operable to modify the voice characteristics of the encoded speech parameter signals corresponding to the digital characters and comprising means for making a voice character selection of the synthesized speech to be derived from the digital speech data as received from said rules processor means simulating a voice sound differing in character with respect to the voice sound of the synthesized speech from the digital speech data in the voice characteristics pertaining to the apparent age and/or sex of the speaker;
said digital speech data having a predetermined pitch period, a predetermined vocal tract model and a predetermined speech rate;
speech parameter control means for modifying the pitch period and speech rate in response to inputs from said voice character selection means to produce a modified pitch period and a modified speech rate, said speech parameter control means including sample rate control circuit means responsive to inputs from said voice character selection means for adjusting the sampling period of said digital speech data in a manner altering the digital speech formants contained therein to a preselected degree and providing adjusted sampling period signals as an output;
speech data reconstructing means operably associated with said speech parameter control means for combining the modified pitch period and the modified speech rate with the predetermined vocal tract model into a synthesized speech data format of speech data modified with respect to the original speech data as derived from said encoded speech parameter signals; and
speech producing means coupled to said speech data reconstructing means for receiving the modified speech data therefrom and to produce audible synthesized human speech from the modified speech data as synthesized human speech of any one of a plurality of voice sounds simulating child-like, adult, aged and sexual characteristics and having different voice characteristics from the synthesized speech which would have been obtained from said encoded speech parameter signals as a source of synthesized speech.
Specification