Method and apparatus for speech synthesis using paralinguistic variation

US 8,103,505 B1
Filed: 11/19/2003
Issued: 01/24/2012
Est. Priority Date: 11/19/2003
Status: Active Grant

First Claim

1. A method for producing synthetic speech comprising:

processing received text using a prosody model to produce prosodic features representative of the linguistic meaning of the received text;

generating an acoustic sequence of speech signals that represents the synthesized speech, the acoustic sequence having the prosodic features representative of the processed text;

determining a prior paralinguistic variation that has been applied to the acoustic sequence before a current paralinguistic variation; and

applying the current paralinguistic variation which includes a mathematical transformation to the acoustic sequence overall, wherein the current paralinguistic variation is determined based on the prior paralinguistic variation, wherein the mathematical transformation does not alter the prosodic features representative of the linguistic meaning of the received text, wherein the current paralinguistic variation is applied to change the sound of the generated acoustic sequence of the speech signals.

Abstract

A method and apparatus for speech synthesis in a computer-user interface using random paralinguistic variation is described herein. According to one aspect of the present invention, a method for synthesizing speech comprises generating synthesized speech having certain prosodic features. The synthesized speech is further processed by applying a random paralinguistic variation to the acoustic sequence representing the synthesized speech without altering the linguistic prosodic features. According to one aspect of the present invention, the application of the paralinguistic variation is correlated with a previously applied paralinguistic variation to reflect a gradual change in the computer voice, while still maintaining a random quality.
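
The "gradual change ... while still maintaining a random quality" described in the abstract can be pictured as a random draw that is pulled toward the previously applied value. The sketch below uses a first-order autoregressive update for that purpose. This is only one possible reading: the function names and the `rho` and `sigma` parameters are illustrative assumptions, not terms or values taken from the patent.

```python
import random


def next_variation(prior, rho=0.9, sigma=0.02, rng=random):
    """Draw a new variation amount that stays close to the prior one.

    rho (0..1) sets how strongly the new value follows the prior value;
    sigma is the size of the fresh random component. Both are assumed
    tuning parameters, not values from the patent.
    """
    return rho * prior + rng.gauss(0.0, sigma)


def variation_track(n, start=0.0, rho=0.9, sigma=0.02, seed=None):
    """Return n successive variation amounts, for example one per utterance."""
    rng = random.Random(seed)
    track, prior = [], start
    for _ in range(n):
        prior = next_variation(prior, rho, sigma, rng)
        track.append(prior)
    return track
```

With `rho` near 1 the voice drifts slowly from one utterance to the next; with `rho = 0` each draw is independent of the prior variation.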

62 Claims

1. A method for producing synthetic speech comprising:
- processing received text using a prosody model to produce prosodic features representative of the linguistic meaning of the received text;
  
  generating an acoustic sequence of speech signals that represents the synthesized speech, the acoustic sequence having the prosodic features representative of the processed text;
  
  determining a prior paralinguistic variation that has been applied to the acoustic sequence before a current paralinguistic variation; and
  
  applying the current paralinguistic variation which includes a mathematical transformation to the acoustic sequence overall, wherein the current paralinguistic variation is determined based on the prior paralinguistic variation, wherein the mathematical transformation does not alter the prosodic features representative of the linguistic meaning of the received text, wherein the current paralinguistic variation is applied to change the sound of the generated acoustic sequence of the speech signals.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- - 2. The method of claim 1, further comprising selecting at least one of the plurality of paralinguistic variations; and
    - applying the selected paralinguistic variation to the generated speech signals without altering the prosodic features representative of the linguistic meaning of the received text.
  - 3. The method of claim 2, wherein the selected paralinguistic variation comprises a variation in an overall pitch range of the generated acoustic sequence of the speech signals.
  - 4. The method of claim 3, wherein the prosodic features representative of the received text comprise a relative pitch value of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall pitch range of the generated acoustic sequence of the speech signals does not alter the relative pitch values.
  - 5. The method of claim 4, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 6. The method of claim 2, wherein the selected paralinguistic variation comprises a variation in an overall speaking rate of the generated acoustic sequence of the speech signals.
  - 7. The method of claim 6, wherein the prosodic features representative of the received text comprise a relative duration of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall speaking rate of the generated acoustic sequence of the speech signals does not alter the relative durations.
  - 8. The method of claim 7, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 9. The method of claim 2, wherein the selection of the at least one of the plurality of paralinguistic variations is random.
  - 10. The method of claim 2, wherein the selection of the at least one of the plurality of paralinguistic variations is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the generated acoustic sequence of the speech signals.
  - 11. The method of claim 2, wherein a degree of the selected paralinguistic variation is altered before each application.
  - 12. The method of claim 11, wherein the alteration of the degree of the selected paralinguistic variation is random.
  - 13. The method of claim 11, wherein the alteration of the degree of the selected paralinguistic variation is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the generated acoustic sequence of the speech signals.
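
Claims 3 to 5 describe varying the overall pitch range while leaving the relative pitch value of each speech segment untouched. A minimal sketch of that idea follows, assuming the variation is a single multiplicative factor applied to every segment and that "relative pitch" means each segment's ratio to the first segment. Both assumptions, and all names in the code, are illustrative and are not drawn from the specification.

```python
import math


def vary_pitch_range(pitches_hz, factor):
    """Scale every segment pitch by one overall factor (must be > 0)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [p * factor for p in pitches_hz]


def relative_pitch(pitches_hz):
    """Express each segment pitch as a ratio to the first segment."""
    ref = pitches_hz[0]
    return [p / ref for p in pitches_hz]


def preserves_relative_pitch(before_hz, after_hz):
    """True when the segment-to-segment pitch ratios are unchanged."""
    pairs = zip(relative_pitch(before_hz), relative_pitch(after_hz))
    return all(math.isclose(a, b) for a, b in pairs)


# Hypothetical per-syllable pitch targets in Hz.
contour = [110.0, 132.0, 99.0, 121.0]
varied = vary_pitch_range(contour, 1.08)
```

Because every segment is multiplied by the same factor, the ratios between segments cancel out unchanged, so `preserves_relative_pitch(contour, varied)` holds for any positive factor.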

14. An apparatus for producing synthetic speech comprising:
- means for receiving text into a circuit;
  
  means for processing the received text using a prosody model to produce prosodic features representative of the linguistic meaning of the received text;
  
  means for generating an acoustic sequence of speech signals representing the synthesized speech, the acoustic sequence having the prosodic features representative of the processed text;
  
  means for determining a prior paralinguistic variation that has been applied to the acoustic sequence before a current paralinguistic variation; and
  
  means for applying the current paralinguistic variation which includes a mathematical transformation to the acoustic sequence overall, wherein the current paralinguistic variation is determined based on the prior paralinguistic variation, wherein the mathematical transformation does not alter the prosodic features representative of the linguistic meaning of the received text, wherein the current paralinguistic variation is applied to change the sound of the generated acoustic sequence of the speech signals.
- Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
- - 15. The apparatus of claim 14, further comprising means for selecting at least one of the plurality of paralinguistic variations; and
    - means for applying the selected paralinguistic variation to the generated acoustic sequence of the speech signals without altering the prosodic features representative of the linguistic meaning of the received text.
  - 16. The apparatus of claim 15, wherein the selected paralinguistic variation comprises a variation in an overall pitch range of the generated acoustic sequence of the speech signals.
  - 17. The apparatus of claim 16, wherein the prosodic features representative of the received text comprise a relative pitch value of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall pitch range of the generated acoustic sequence of the speech signals does not alter the relative pitch values.
  - 18. The apparatus of claim 17, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 19. The apparatus of claim 15, wherein the selected paralinguistic variation comprises a variation in an overall speaking rate of the generated acoustic sequence of the speech signals.
  - 20. The apparatus of claim 19, wherein the prosodic features representative of the received text comprise a relative duration of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall speaking rate of the generated acoustic sequence of the speech signals does not alter the relative durations.
  - 21. The apparatus of claim 20, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 22. The apparatus of claim 15, wherein the selection of the at least one of the plurality of paralinguistic variations is random.
  - 23. The apparatus of claim 15, further comprising means for correlating the at least one of the plurality of paralinguistic variations with the prior paralinguistic variation to reflect a gradual change in the sound of the generated acoustic sequence of the speech signals.
  - 24. The apparatus of claim 15, further comprising means for altering a degree of the selected paralinguistic variation before each application.
  - 25. The apparatus of claim 24, wherein the alteration of the degree of the selected paralinguistic variation is random.
  - 26. The apparatus of claim 24, further comprising means for correlating the degree of alteration of the selected paralinguistic variation with the prior paralinguistic variation to reflect a gradual change in the sound of the generated acoustic sequence of the speech signals.
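
Claims 19 to 21 describe varying the overall speaking rate without altering the relative duration of each speech segment. The sketch below assumes the variation is one rate factor applied uniformly and that "relative duration" means each segment's share of the total utterance length. The names and example durations are hypothetical.

```python
import math


def vary_speaking_rate(durations_ms, rate):
    """Divide every segment duration by one overall rate (must be > 0)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [d / rate for d in durations_ms]


def relative_durations(durations_ms):
    """Express each segment duration as a share of the total length."""
    total = sum(durations_ms)
    return [d / total for d in durations_ms]


# Hypothetical phoneme durations in milliseconds.
phonemes = [80.0, 120.0, 60.0, 140.0]
faster = vary_speaking_rate(phonemes, 1.25)  # 25% faster overall
```

The utterance gets shorter overall, but each phoneme keeps the same share of it, so lengthening that signals stress or phrase boundaries is still present in the faster version.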

27. An apparatus comprising:
- a machine-accessible non-transitory medium storing executable instructions which, when executed in a machine, cause the machine to perform a method for synthesizing speech comprising:
  
  processing received text using a prosody model to produce prosodic features representative of the linguistic meaning of the received text;
  
  generating an acoustic sequence of speech signals representing the synthesized speech, the acoustic sequence having the prosodic features representative of the processed text;
  
  determining a prior paralinguistic variation that has been applied to the acoustic sequence before a current paralinguistic variation; and
  
  applying the current paralinguistic variation which includes a mathematical transformation to the acoustic sequence overall, wherein the current paralinguistic variation is determined based on the prior paralinguistic variation, wherein the mathematical transformation does not alter the prosodic features representative of the linguistic meaning of the received text, wherein the current paralinguistic variation is applied to change the sound of the generated acoustic sequence of the speech signals.
- Dependent Claims (28, 29, 30, 31, 32, 33, 34)
- - 28. The apparatus of claim 27, further comprising selecting at least one of the plurality of paralinguistic variations; and
    - applying the selected paralinguistic variation to the generated acoustic sequence of the speech signals without altering the prosodic features representative of the linguistic meaning of the received text.
  - 29. The apparatus of claim 28, wherein the selected paralinguistic variation comprises a variation in an overall pitch range of the generated acoustic sequence of the speech signals.
  - 30. The apparatus of claim 29, wherein the prosodic features representative of the received text comprise a relative pitch value of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall pitch range of the generated acoustic sequence of the speech signals does not alter the relative pitch values.
  - 31. The apparatus of claim 28, wherein the selected paralinguistic variation comprises a variation in an overall speaking rate of the generated acoustic sequence of the speech signals.
  - 32. The apparatus of claim 31, wherein the prosodic features representative of the received text comprise a relative duration of each of the speech segments of the generated acoustic sequence of the speech signals, and wherein the application of the variation in the overall speaking rate of the generated acoustic sequence of the speech signals does not alter the relative durations.
  - 33. The apparatus of claim 28, wherein the selection of the at least one of the plurality of paralinguistic variations is random.
  - 34. The apparatus of claim 28, wherein the selection of the at least one of the plurality of paralinguistic variations is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the generated acoustic sequence of the speech signals.

35. An apparatus for speech synthesis comprising:
- an input for receiving text signals; and
  
  a circuit coupled to the input, the circuit configured to synthesize an acoustic sequence representing a synthesized speech, the acoustic sequence having one or more of a plurality of prosodic features representative of the linguistic meaning of the received text signals, to determine a prior paralinguistic variation that has been previously applied to the acoustic sequence; and
  
  to paralinguistically vary the synthesized acoustic sequence overall without altering the plurality of prosodic features that include relative pitch values of speech segments in the generated acoustic sequence, wherein paralinguistically varying the synthesized acoustic sequence comprises selecting at least one current paralinguistic variation from a plurality of paralinguistic variations based on the prior paralinguistic variation; and
  
  applying the selected current paralinguistic variation which includes a mathematical transformation to the synthesized acoustic sequence overall, wherein the mathematical transformation does not alter the plurality of prosodic features representative of the linguistic meaning of the received text signals associated with individual phonemes in the acoustic sequence.
- Dependent Claims (36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47)
- - 36. The apparatus of claim 35, wherein the selected paralinguistic variation comprises a variation in an overall pitch range of the synthesized acoustic sequence.
  - 37. The apparatus of claim 36, wherein the prosodic features representative of the received text signal comprise a relative pitch value of each of the speech segments of the synthesized acoustic sequence, and wherein the application of the variation in the overall pitch range of the synthesized acoustic sequence does not alter the relative pitch values.
  - 38. The apparatus of claim 37, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 39. The apparatus of claim 35, wherein the selected paralinguistic variation comprises a variation in an overall speaking rate of the synthesized acoustic sequence.
  - 40. The apparatus of claim 39, wherein the prosodic features representative of the received text signal comprise a relative duration of each of the speech segments of the synthesized acoustic sequence, and wherein the application of the variation in the overall speaking rate of the synthesized acoustic sequence does not alter the relative durations.
  - 41. The apparatus of claim 40, wherein the speech segments comprise one of phonemes, syllables, and words.
  - 42. The apparatus of claim 35, wherein the selection of the at least one of the plurality of paralinguistic variations is random.
  - 43. The apparatus of claim 35, wherein the selection of the at least one of the plurality of paralinguistic variations is correlated with the prior paralinguistic variation applied to the acoustic sequence to reflect a gradual change in the sound of the synthesized acoustic sequence.
  - 44. The apparatus of claim 35, wherein a degree of the selected paralinguistic variation is altered before each application.
  - 45. The apparatus of claim 44, wherein the alteration of the degree of the selected paralinguistic variation is random.
  - 46. The apparatus of claim 44, wherein the alteration of the degree of the selected paralinguistic variation is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the synthesized acoustic sequence.
  - 47. The apparatus of claim 35, wherein the circuit comprises a processing device.
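
Claims 42 to 46 cover selecting a variation from a plurality of variations, and altering its degree before each application, either at random or in a way that is correlated with the prior variation. One way to picture both steps together is a "sticky" choice of variation kind plus a small bounded step in degree. Everything in this sketch is an assumption made for illustration: the variation names, the `stay` probability, and the `step` and `limit` bounds do not come from the patent.

```python
import random

# Hypothetical names for the plurality of paralinguistic variations.
VARIATIONS = ("pitch_range", "speaking_rate")


def select_variation(prior_kind, prior_degree, rng,
                     stay=0.7, step=0.03, limit=0.15):
    """Pick the current (kind, degree) based on the prior pair."""
    # Keep the prior kind with probability `stay`; otherwise choose at random.
    if prior_kind in VARIATIONS and rng.random() < stay:
        kind = prior_kind
    else:
        kind = rng.choice(VARIATIONS)
    # Nudge the degree by a small random step, clamped to +/- limit.
    degree = prior_degree + rng.uniform(-step, step)
    degree = max(-limit, min(limit, degree))
    return kind, degree
```

Setting `stay=0.0` and a large `step` approximates a purely random selection; a high `stay` and a small `step` give the gradual change that the correlated claims describe.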

48. A speech synthesis process implemented in a machine comprising:
- generating an acoustic speech output representing a synthesized speech in response to an input text, wherein the acoustic speech output comprises one or more of a plurality of prosodic features representative of the linguistic meaning of the input text; and
  
  varying the generated acoustic speech output without altering the plurality of prosodic features that include relative pitch values of speech segments in the generated acoustic sequence, wherein varying the generated acoustic speech output comprises determining a prior paralinguistic variation that has been previously applied to the acoustic sequence;
  
  selecting at least one current paralinguistic variation from a plurality of paralinguistic variations based on the prior paralinguistic variation; and
  
  applying the selected current paralinguistic variation which includes a mathematical transformation to the generated acoustic speech output overall, wherein the mathematical transformation does not alter the plurality of prosodic features representative of the linguistic meaning of the input text.
- Dependent Claims (49, 50, 51, 52, 53, 54, 55, 56, 57)
- - 49. The process of claim 48, wherein the selected paralinguistic variation comprises a variation in an overall pitch range of the generated speech output.
  - 50. The process of claim 49, wherein the prosodic features representative of the input text comprise a relative pitch value of each of the speech segments of the generated speech output, and wherein the application of the variation in the overall pitch range of the generated speech output does not alter the relative pitch values.
  - 51. The process of claim 48, wherein the selected paralinguistic variation comprises a variation in an overall speaking rate of the generated speech output.
  - 52. The process of claim 51, wherein the prosodic features representative of the input text comprise a relative duration of each of the speech segments of the generated speech output, and wherein the application of the variation in the overall speaking rate of the generated speech output does not alter the relative durations.
  - 53. The process of claim 48, wherein the selection of the at least one of the plurality of paralinguistic variations is random.
  - 54. The process of claim 48, wherein the selection of the at least one of the plurality of paralinguistic variations is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the generated speech output.
  - 55. The process of claim 48, wherein a degree of the selected paralinguistic variation is altered before each application.
  - 56. The process of claim 55, wherein the alteration of the degree of the selected paralinguistic variation is random.
  - 57. The process of claim 55, wherein the alteration of the degree of the selected paralinguistic variation is correlated with the prior paralinguistic variation to reflect a gradual change in the sound of the generated speech output.

58. A method for generating a paralinguistic model for use in a speech synthesis system, the method comprising:
- developing, by a processor, one or more of a plurality of paralinguistic variations which include a mathematical transformation that, when applied to a synthesized acoustic sequence of the speech signals representing a synthesized speech, the synthesized acoustic sequence having prosodic features representative of a received text, change the sound of the synthesized acoustic sequence while preserving the prosodic features representative of the linguistic meaning of the received text, wherein the developing includes determining, by the processor, a prior paralinguistic variation that has been previously applied to the synthesized acoustic sequence, wherein at least one of the plurality of paralinguistic variations is developed based on the prior paralinguistic variation.
- Dependent Claims (59)
- - 59. The method of claim 58, wherein the plurality of paralinguistic variations includes one of a variation of an overall pitch range and a variation of an overall speaking rate of the synthesized speech.

60. A speech synthesis system comprising:
- a voice generation device including a processor for outputting an acoustic phoneme sequence having prosodic features representative of a text;
  
  a duration modeling device that provides relative phoneme durations using a phoneme duration model to the voice generation device;
  
  a pitch modeling device coupled to said duration modeling device that, using a pitch model, provides a relative phoneme pitch value for the at least one phoneme to the voice generation device; and
  
  a variation modeling device coupled to the voice generation device that receives the acoustic sequence of synthesized speech signals having the prosodic features including the relative phoneme durations and the relative pitch values from the voice generation device;
  
  determines a prior paralinguistic variation that has been previously applied to the acoustic sequence; and
  
  using a paralinguistic variation model selected based on the prior paralinguistic variation, varies an overall speaking rate and an overall pitch range of the acoustic sequence of synthesized speech signals by applying a mathematical transformation to the acoustic sequence of synthesized speech signals having the prosodic features overall, wherein the mathematical transformation varies the overall speaking rate and the overall pitch range without altering the prosodic features.
- Dependent Claims (61, 62)
- - 61. The system of claim 60, wherein the variation modeling device varies the overall speaking rate by applying a linear transformation to the acoustic sequence of synthesized speech signals.
  - 62. The system of claim 60, wherein the variation modeling device varies the overall pitch range by applying a logarithmic transformation to the acoustic sequence of synthesized speech signals.
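
Claims 61 and 62 name a linear transformation for the speaking rate and a logarithmic transformation for the pitch range, without giving formulas in the claim text. The sketch below shows one plausible pair: a linear rescaling of the time axis, and an affine map in log-frequency that widens or narrows the range around a reference pitch. The specific formulas, the `ref_hz` and `range_scale` parameters, and the notion of "position within the log-frequency range" are assumptions made for illustration.

```python
import math


def scale_time_linear(times_s, rate):
    """Linear map on the time axis: t' = t / rate."""
    return [t / rate for t in times_s]


def scale_pitch_log(f0_hz, range_scale, ref_hz):
    """Affine map in log-frequency: log f' = log ref + k * (log f - log ref)."""
    log_ref = math.log(ref_hz)
    return [math.exp(log_ref + range_scale * (math.log(f) - log_ref))
            for f in f0_hz]


def log_position(f0_hz):
    """Position of each value within the contour's log-frequency range (0..1).

    Assumes the contour is not flat, so the range is non-zero.
    """
    logs = [math.log(f) for f in f0_hz]
    lo, hi = min(logs), max(logs)
    return [(x - lo) / (hi - lo) for x in logs]


# Hypothetical pitch contour in Hz, widened by 20% in log-frequency.
f0 = [100.0, 150.0, 120.0, 200.0]
wide = scale_pitch_log(f0, 1.2, ref_hz=100.0)
```

Under this map the span of the contour in octaves is multiplied by `range_scale`, while each segment keeps its position between the lowest and highest pitch. Note that with `range_scale` different from 1 the absolute intervals between segments do change, so whether this counts as leaving the relative pitch values unaltered depends on how the specification defines them.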

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Silverman, Kim; Lindsay, Donald
Primary Examiner(s): Godbold, Douglas
Application Number: US10/718,140
Time in Patent Office: 2,988 Days
Field of Search: 704/258, 704/260, 704/268
US Class Current: 704/260
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 13/10 Prosody rules derived from ...