Distributed synthetic speech generation

US 6,510,413 B1
Filed: 06/29/2000
Issued: 01/21/2003
Est. Priority Date: 06/29/2000
Status: Active Grant

Abstract

Text that is to be synthesized into human speech is first converted into an intermediate form representation that describes the acoustic-prosodic resolution of the spoken version of the text. The intermediate form can be generated manually, or by an intermediate form generation program at a server computer, and later downloaded to client computers at their request. The client computers synthesize the intermediate form representation to audio for their users using a relatively simple speech rendering program.

30 Claims

1. A method of synthesizing speech comprising:
- receiving an intermediate form representation of a text file, the intermediate form representation containing a pronunciation-resolved re-representation of the text file, the intermediate form representation including acoustic units that represent individual vocal sound sequences and prosodic modifiers that specify modifications of the vocal sounds represented by the acoustic units;
  
  rendering the intermediate form representation into an audio signal based on the acoustic units and prosodic modifiers; and
  
  transmitting the audio signal to a speaker.
2. The method of claim 1, wherein the intermediate form representation of the text file is received as a streaming file and rendering the intermediate form representation into the audio signal is started before the intermediate form representation has been fully received.
3. The method of claim 1, wherein the intermediate form representation is contained in a series of extensible markup language (XML) tags.
4. The method of claim 1, wherein the prosodic modifiers include at least one of fundamental pitch modification descriptor units, duration modification descriptor units, acoustic gain modification descriptor units, and spectral tilt modification descriptor units.
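
The receiving and rendering steps of claims 1 to 3 can be pictured with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the claims do not fix a schema, so the `<unit>` tag and its `ph`, `dur_ms`, `f0_hz`, and `gain` attributes are hypothetical, and a plain sine tone stands in for a real speech synthesizer. The point it shows is that an incremental XML parser lets rendering begin before the whole intermediate form has arrived.

```python
import math
import struct
from xml.etree.ElementTree import XMLPullParser

SAMPLE_RATE = 8000  # assumed output sample rate in Hz


def render_unit(dur_ms, f0_hz, gain):
    """Render one acoustic unit as a sine tone (placeholder for a real synthesizer)."""
    n = int(SAMPLE_RATE * dur_ms / 1000)
    return [gain * math.sin(2 * math.pi * f0_hz * i / SAMPLE_RATE) for i in range(n)]


def render_stream(chunks):
    """Yield (label, samples) for each unit as soon as its tag has been received."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "unit":  # hypothetical tag name
                yield elem.get("ph"), render_unit(
                    float(elem.get("dur_ms")),
                    float(elem.get("f0_hz")),
                    float(elem.get("gain", "1.0")),
                )


def to_pcm16(samples):
    """Pack float samples in [-1, 1] as little-endian 16-bit PCM for an audio device."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
```

Because `render_stream` is a generator, a caller can hand each unit's PCM bytes to the audio output while later chunks are still in transit, which is the behavior claim 2 describes.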

5. A data structure stored on a computer readable medium and that describes a pronunciation-resolved representation of a text file, the data structure comprising:
- a plurality of acoustic units, each acoustic unit representing a vocally produced sound sequence;
  
  duration modification descriptor units, each of the duration modification descriptor units corresponding to or contained within at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
  
  fundamental pitch modification descriptor units, each of the fundamental pitch modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a target frequency of a fundamental pitch used to produce the acoustic unit.
6. The data structure of claim 5, further comprising:
7. The data structure of claim 6, further comprising:
- spectral tilt modification descriptor units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
8. The data structure of claim 5, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
9. The data structure of claim 5, wherein the acoustic units are phonemes.
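
As a rough illustration of the data structure in claims 5 to 9, the sketch below models an acoustic unit with its duration and fundamental pitch descriptors and serializes a sequence of units as XML tags. The class name, field names, tag name, and attribute names are all hypothetical, since the claims do not specify them. The optional fields mirror the gain and spectral tilt modifiers named elsewhere in the claims, and the unit of the tilt value is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class AcousticUnit:
    symbol: str                            # label of the sound, e.g. a phoneme
    duration_ms: float                     # duration modification descriptor
    pitch_hz: float                        # target fundamental pitch frequency
    gain: Optional[float] = None           # amplitude gain, if specified
    spectral_tilt: Optional[float] = None  # envelope slope; unit is an assumption


def to_xml(units):
    """Serialize units as a series of XML tags (hypothetical schema)."""
    root = ET.Element("speech")
    for u in units:
        attrs = {"ph": u.symbol, "dur_ms": str(u.duration_ms), "f0_hz": str(u.pitch_hz)}
        if u.gain is not None:
            attrs["gain"] = str(u.gain)
        if u.spectral_tilt is not None:
            attrs["tilt"] = str(u.spectral_tilt)
        ET.SubElement(root, "unit", attrs)
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Parse the same hypothetical schema back into AcousticUnit objects."""
    units = []
    for el in ET.fromstring(text).iter("unit"):
        gain = el.get("gain")
        tilt = el.get("tilt")
        units.append(AcousticUnit(
            symbol=el.get("ph"),
            duration_ms=float(el.get("dur_ms")),
            pitch_hz=float(el.get("f0_hz")),
            gain=None if gain is None else float(gain),
            spectral_tilt=None if tilt is None else float(tilt),
        ))
    return units
```

Keeping the optional modifiers as absent attributes rather than default values lets a renderer distinguish "not specified" from "specified as neutral".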

10. A method of enabling a remote client device to synthesize speech, the method comprising:
- receiving a text file;
  
  separating the text file into a series of acoustic units that represent individual vocal sounds;
  
  associating duration modification descriptor units with the acoustic units, each of the duration modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
  
  transmitting the acoustic units and the associated duration modification descriptor units to the remote client device.
11. The method of claim 10, further comprising:
12. The method of claim 10, further comprising:
- associating acoustic gain modification descriptor units with the acoustic units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
13. The method of claim 10, further comprising:
- associating spectral tilt modification descriptor units with the acoustic units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
14. The method of claim 10, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
15. The method of claim 10, wherein the acoustic units are phonemes.
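
The server-side method of claims 10, 14, and 15 might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the two-word lexicon, the fixed vowel and consonant durations, the `<unit>` tag, and the `send` callback are all hypothetical, and a real generator would use a full pronunciation dictionary and a prosody model.

```python
from xml.sax.saxutils import quoteattr

# Toy pronunciation lexicon (illustrative entries only).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
VOWELS = {"AH", "OW", "ER"}


def text_to_units(text):
    """Separate text into (phoneme, duration_ms) pairs."""
    units = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        for ph in LEXICON.get(word, []):  # unknown words are skipped in this sketch
            # Assumed rule: vowels last 120 ms, consonants 60 ms.
            units.append((ph, 120 if ph in VOWELS else 60))
    return units


def transmit(units, send):
    """Serialize the units as XML tags and pass each piece to send."""
    send("<speech>")
    for ph, dur in units:
        send("<unit ph=%s dur_ms=%s/>" % (quoteattr(ph), quoteattr(str(dur))))
    send("</speech>")
```

Here `send` could be any callable that forwards a string to the client, for example a wrapper around a socket write. Sending one tag at a time keeps the output usable as a stream.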

16. A computer readable medium containing computer instructions that when executed by a processor cause the processor to synthesize speech, the speech synthesis comprising:
- receiving an intermediate form representation of a text file, the intermediate form representation containing a pronunciation-resolved re-representation of the text file, the intermediate form representation including acoustic units that represent individual vocal sounds and prosodic modifiers that specify modifications of the vocal sounds represented by the acoustic units;
  
  rendering the intermediate form representation into an audio signal based on the acoustic units and prosodic modifiers; and
  
  transmitting the audio signal to a speaker.
17. The computer readable medium of claim 16, wherein the intermediate form representation of the text file is received as a streaming file and rendering the intermediate form representation into the audio signal is started before the intermediate form representation has been fully received.
18. The computer readable medium of claim 16, wherein the intermediate form representation is contained in a series of extensible markup language (XML) tags.
19. The computer readable medium of claim 16, wherein the prosodic modifiers include at least one of fundamental pitch modification descriptor units, duration modification descriptor units, acoustic gain modification descriptor units, and spectral tilt modification descriptor units.
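
To make the four prosodic modifiers more concrete, here is a small Python sketch of how a renderer could apply them to one voiced unit. It is a simplification and not the patent's rendering method: the unit is modeled as a sum of harmonics of the fundamental, and spectral tilt is treated as a straight-line envelope slope in dB per octave, which is an assumed unit. The function and parameter names are hypothetical.

```python
import math


def harmonic_amplitudes(tilt_db_per_octave, n_harmonics):
    """Linear amplitude of harmonics 1..n under a straight-line envelope slope."""
    return [
        10 ** (tilt_db_per_octave * math.log2(k) / 20)
        for k in range(1, n_harmonics + 1)
    ]


def voiced_frame(f0_hz, dur_ms, gain, tilt_db_per_octave,
                 sample_rate=8000, n_harmonics=8):
    """Synthesize one unit from pitch, duration, gain, and spectral tilt."""
    amps = harmonic_amplitudes(tilt_db_per_octave, n_harmonics)
    n = int(sample_rate * dur_ms / 1000)  # duration sets the sample count
    frame = []
    for i in range(n):
        t = i / sample_rate
        # Harmonic k + 1 sits at (k + 1) * f0 and is scaled by the tilted envelope.
        s = sum(a * math.sin(2 * math.pi * f0_hz * (k + 1) * t)
                for k, a in enumerate(amps))
        frame.append(gain * s)            # gain scales the whole unit
    return frame
```

With a tilt of -6 dB per octave, each doubling of frequency roughly halves the harmonic amplitude. The highest harmonic should stay below half the sample rate to avoid aliasing.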

20. A computer readable medium containing computer instructions that when executed by a processor cause the processor to perform acts enabling a remote client device to synthesize speech, comprising:
- receiving a text file;
  
  separating the text file into a series of acoustic units that represent individual vocal sounds;
  
  associating duration modification descriptor units with the acoustic units, each of the duration modification descriptor units corresponding to at least one of the plurality of acoustic units and specifying a time duration of the corresponding acoustic unit; and
  
  transmitting the acoustic units and the associated duration modification descriptor units to the remote client device.
21. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
22. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
- associating acoustic gain modification descriptor units with the acoustic units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
23. The computer readable medium of claim 20, further including instructions that cause the processor to perform acts comprising:
- associating spectral tilt modification descriptor units with the acoustic units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.
24. The computer readable medium of claim 20, wherein the acoustic units and the duration modification units are contained in a series of extensible markup language (XML) tags.
25. The computer readable medium of claim 20, wherein the acoustic units are phonemes.

26. A computing device comprising:
- a processor;
  
  a computer memory coupled to the processor, the computer memory including a speech rendering program, the speech rendering program configured to receive a pronunciation resolved intermediate form representation of a text file that is to be converted into speech, the speech rendering program converting the pronunciation resolved intermediate representation into a digital audio file; and
  
  a speaker coupled to the computer memory, the speaker receiving and playing the audio file.
27. The computing device of claim 26, wherein the intermediate form representation includes:
28. The computing device of claim 27, wherein the intermediate form representation includes:
- acoustic gain modification descriptor units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
29. The computing device of claim 28, wherein the intermediate form representation includes:
- acoustic gain modification descriptor units, each of the acoustic gain modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying an amplitude gain that is to be applied to the corresponding acoustic unit.
30. The computing device of claim 29, wherein the intermediate form representation includes:
- spectral tilt modification descriptor units, each of the spectral tilt modification descriptor units corresponding to the at least one of the plurality of acoustic units and specifying a target slope of a formant spectral envelope of the corresponding acoustic unit.

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Walker, Mark R.
Primary Examiner(s): MCFADDEN, SUSAN IRIS
Application Number: US09/605,885
Time in Patent Office: 936 Days
Field of Search: 704/258, 704/260, 704/270, 704/272
US Class Current: 704/258
CPC Class Codes:
G10L 13/02   Methods for producing synth...
G10L 15/30   Distributed recognition, e....
H04M 2201/60   Medium conversion
H04M 3/487   Arrangements for providing ...
H04M 3/4938   comprising a voice browser ...
