Distributed synthetic speech generation
Abstract
Text that is to be synthesized into human speech is first converted into an intermediate form representation that describes the acoustic-prosodic resolution of the spoken version of the text. The intermediate form can be generated manually, or by an intermediate form generation program at a server computer, and later downloaded to client computers at their request. The client computers synthesize the intermediate form representation to audio for their users using a relatively simple speech rendering program.
30 Claims
-
1. A method of synthesizing speech comprising:
-
receiving an intermediate form representation of a text file, the intermediate form representation containing a pronunciation-resolved re-representation of the text file, the intermediate form representation including acoustic units that represent individual vocal sounds sequences and prosodic modifiers that specify modifications of the vocal sounds represented by the acoustic units;
rendering the intermediate form representation into an audio signal based on the acoustic units and prosodic modifiers; and
transmitting the audio signal to a speaker.
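The rendering step of claim 1 can be illustrated with a minimal Python sketch. It is not the patent's renderer: the tuple layout `(symbol, duration_ms, pitch_hz, gain)`, the 16 kHz sample rate, and the use of a plain sine oscillator in place of real per-unit waveforms are all assumptions made for illustration. It only shows how prosodic modifiers (duration, pitch, gain) attached to each acoustic unit can drive the generated signal.

```python
import math
from typing import List, Tuple

SAMPLE_RATE = 16000  # Hz; an assumed value, not taken from the claims

# Hypothetical unit layout: (symbol, duration_ms, pitch_hz, gain in 0..1)
Unit = Tuple[str, float, float, float]


def render(units: List[Unit]) -> List[float]:
    """Render units into mono samples in the range [-1.0, 1.0].

    The symbol is ignored here; a real renderer would use it to select
    the waveform for that sound. A sine tone stands in as a placeholder.
    """
    samples: List[float] = []
    phase = 0.0  # carried across units so the tone has no phase jumps
    for _symbol, duration_ms, pitch_hz, gain in units:
        count = int(SAMPLE_RATE * duration_ms / 1000)    # duration modifier
        step = 2.0 * math.pi * pitch_hz / SAMPLE_RATE    # pitch modifier
        for _ in range(count):
            samples.append(gain * math.sin(phase))       # gain modifier
            phase += step
    return samples
```

Calling `render([("HH", 100, 120.0, 0.5), ("AY", 50, 100.0, 0.25)])` yields 150 ms of audio samples that could then be handed to an audio output device.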
-
-
5. A data structure stored on a computer readable medium and that describes a pronunciation-resolved representation of a text file, the data structure comprising:
-
a plurality of acoustic units, each acoustic unit representing a vocally produced sound sequence;
duration modification descriptor units, each of the duration modification descriptor units corresponding to or contained within at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
fundamental pitch modification descriptor units, each of the fundamental pitch modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a target frequency of a fundamental pitch used to produce the acoustic unit.
-
6. The data structure of claim 5, further comprising:
acoustic gain modification descriptor units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
-
-
7. The data structure of claim 6, further comprising:
spectral tilt modification descriptor units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
-
8. The data structure of claim 5, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
-
9. The data structure of claim 5, wherein the acoustic units are phonemes.
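As an illustration of the data structure in claims 5 through 9, the following Python sketch models one acoustic unit with its descriptor units and serializes a sequence of them as XML tags. The class name, field names, tag names (`utterance`, `unit`), and attribute names (`dur`, `f0`, `gain`, `tilt`) are hypothetical; the claims do not specify a concrete schema or units of measure.

```python
from dataclasses import dataclass
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class AcousticUnit:
    """One acoustic unit (here a phoneme) with its descriptor units."""
    symbol: str                                  # phoneme label, e.g. "HH"
    duration_ms: float                           # duration descriptor
    pitch_hz: Optional[float] = None             # target fundamental pitch
    gain_db: Optional[float] = None              # amplitude gain
    tilt_db_per_octave: Optional[float] = None   # target spectral slope


# Maps dataclass fields to (hypothetical) XML attribute names.
_OPTIONAL_ATTRS = {"pitch_hz": "f0", "gain_db": "gain",
                   "tilt_db_per_octave": "tilt"}


def to_xml(units: List[AcousticUnit]) -> str:
    """Serialize units as a series of XML tags."""
    root = ET.Element("utterance")
    for unit in units:
        attrs = {"dur": f"{unit.duration_ms:g}"}
        for field, attr in _OPTIONAL_ATTRS.items():
            value = getattr(unit, field)
            if value is not None:        # omit descriptors that are unset
                attrs[attr] = f"{value:g}"
        ET.SubElement(root, "unit", attrs).text = unit.symbol
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> List[AcousticUnit]:
    """Parse the XML produced by to_xml back into AcousticUnit objects."""
    units = []
    for el in ET.fromstring(text).iter("unit"):
        optional = {field: float(el.attrib[attr])
                    for field, attr in _OPTIONAL_ATTRS.items()
                    if attr in el.attrib}
        units.append(AcousticUnit(el.text or "", float(el.attrib["dur"]),
                                  **optional))
    return units
```

Keeping the optional descriptors as attributes that may be absent mirrors the claim structure, where duration is required by claim 5's dependents on XML while gain and spectral tilt are added only by further dependent claims.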
-
10. A method of enabling a remote client device to synthesize speech, the method comprising:
-
receiving a text file;
separating the text file into a series of acoustic units that represent individual vocal sounds;
associating duration modification descriptor units with the acoustic units, each of the duration modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
transmitting the acoustic units and the associated duration modification descriptor units to the remote client device.
-
11. The method of claim 10, further comprising:
associating fundamental pitch modification units with the acoustic units, each of the fundamental pitch modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a target frequency of a fundamental pitch used to produce the acoustic unit.
-
-
12. The method of claim 10, further comprising:
associating acoustic gain modification descriptor units with the acoustic units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
-
13. The method of claim 10, further comprising:
associating spectral tilt modification descriptor units with the acoustic units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
-
14. The method of claim 10, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
-
15. The method of claim 10, wherein the acoustic units are phonemes.
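The server-side method of claims 10 through 15 can be sketched as follows. This is a toy under stated assumptions, not the patent's intermediate form generation program: the two-word `LEXICON`, the fixed per-class durations, and the JSON payload format are all hypothetical, and the actual network transport is left out.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping words to phoneme labels.
LEXICON: Dict[str, List[str]] = {
    "hi": ["HH", "AY"],
    "there": ["DH", "EH", "R"],
}
VOWELS = {"AY", "EH"}

# Assumed default durations in milliseconds.
VOWEL_MS = 160
CONSONANT_MS = 70


def text_to_units(text: str) -> List[Tuple[str, int]]:
    """Separate text into phonemes and attach a duration to each one.

    Words missing from the toy lexicon are silently skipped.
    """
    units: List[Tuple[str, int]] = []
    for word in text.lower().split():
        for phoneme in LEXICON.get(word.strip(".,!?"), []):
            duration = VOWEL_MS if phoneme in VOWELS else CONSONANT_MS
            units.append((phoneme, duration))
    return units


def encode_for_client(units: List[Tuple[str, int]]) -> bytes:
    """Build the payload a server could send to the remote client."""
    records = [{"unit": p, "dur_ms": d} for p, d in units]
    return json.dumps(records).encode("utf-8")
```

For example, `encode_for_client(text_to_units("Hi there!"))` produces a byte string describing five phonemes with their durations, which a client-side renderer could decode and play.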
-
16. A computer readable medium containing computer instructions that when executed by a processor cause the processor to synthesize speech, the speech synthesis comprising:
-
receiving an intermediate form representation of a text file, the intermediate form representation containing a pronunciation-resolved re-representation of the text file, the intermediate form representation including acoustic units that represent individual vocal sounds and prosodic modifiers that specify modifications of the vocal sounds represented by the acoustic units;
rendering the intermediate form representation into an audio signal based on the acoustic units and prosodic modifiers; and
transmitting the audio signal to a speaker.
-
-
20. A computer readable medium containing computer instructions that when executed by a processor cause the processor to perform acts enabling a remote client device to synthesize speech, comprising:
-
receiving a text file;
separating the text file into a series of acoustic units that represent individual vocal sounds;
associating duration modification descriptor units with the acoustic units, each of the duration modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
transmitting the acoustic units and the associated duration modification descriptor units to the remote client device.
-
21. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
associating fundamental pitch modification units with the acoustic units, each of the fundamental pitch modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a target frequency of a fundamental pitch used to produce the acoustic unit.
-
-
22. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
associating acoustic gain modification descriptor units with the acoustic units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
-
23. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
associating spectral tilt modification descriptor units with the acoustic units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
-
24. The computer readable medium of claim 20, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
-
25. The computer readable medium of claim 20, wherein the acoustic units are phonemes.
-
26. A computing device comprising:
-
a processor;
a computer memory coupled to the processor, the computer memory including a speech rendering program, the speech rendering program configured to receive a pronunciation resolved intermediate form representation of a text file that is to be converted into speech, the speech rendering program converting the pronunciation resolved intermediate representation into a digital audio file; and
a speaker coupled to the computer memory, the speaker receiving and playing the audio file.
-
27. The computing device of claim 26, wherein the intermediate form representation includes:
a plurality of acoustic units, each acoustic unit representing a vocally produced sound; and
duration modification descriptor units, each of the duration modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit.
-
-
28. The computing device of claim 27, wherein the intermediate form representation includes:
acoustic gain modification descriptor units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
-
29. The computing device of claim 28, wherein the intermediate form representation includes:
acoustic gain modification descriptor units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
-
30. The computing device of claim 29, wherein the intermediate form representation includes:
spectral tilt modification descriptor units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
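Claim 26 has the speech rendering program produce a digital audio file. As a minimal sketch of that last step, the Python function below packs already-rendered samples into an in-memory WAV file using only the standard library. The choice of WAV, 16-bit mono PCM, and a 16 kHz default rate are assumptions; the claims do not name a file format.

```python
import io
import struct
import wave
from typing import List


def samples_to_wav(samples: List[float], sample_rate: int = 16000) -> bytes:
    """Pack float samples in [-1.0, 1.0] into a 16-bit mono WAV file.

    Out-of-range samples are clipped rather than wrapped.
    """
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

The returned bytes can be written to disk or passed to whatever playback facility the client device provides.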
Specification