Systems and methods for text-to-speech synthesis using spoken example

US 20050071163A1
Filed: 09/26/2003
Published: 03/31/2005
Est. Priority Date: 09/26/2003
Status: Active Grant

Abstract

Systems and methods for speech synthesis and, in particular, text-to-speech systems and methods for converting a text input to a synthetic waveform by processing prosodic and phonetic content of a spoken example of the text input to accurately mimic the input speech style and pronunciation. Systems and methods provide an interface to a TTS system to allow a user to input a text string and a spoken utterance of the text string, extract prosodic parameters from the spoken input, and process the prosodic parameters to derive corresponding markup for the text input to enable a more natural sounding synthesized speech.

24 Claims

1. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for speech synthesis, the method steps comprising:
- determining prosodic parameters of a spoken utterance;
- automatically generating a marked-up text corresponding to the spoken utterance using the prosodic parameters; and
- generating a synthetic waveform using the marked-up text.

2. The program storage device of claim 1, wherein the instructions for determining prosodic parameters comprise instructions for determining pitch contour, duration contour or energy contour information of the spoken utterance, or any combination thereof.

3. The program storage device of claim 1, further comprising instructions for aligning the spoken utterance with a corresponding text string.

4. The program storage device of claim 3, wherein the instructions for aligning comprise instructions for extracting acoustic feature data from the spoken utterance and time-aligning the spoken input to the corresponding text string using the acoustic feature data.

5. The program storage device of claim 3, wherein the alignment is performed using Viterbi alignment process.

6. The program storage device of claim 3, wherein the alignment is performed on a phoneme level.

7. The program storage device of claim 1, wherein the instructions for automatically generating a marked-up text comprise instruction for directly specifying the prosodic parameters as attribute values for mark-up elements.

8. The program storage device of claim 1, wherein the instructions for automatically generating a marked-up text comprise instructions for assigning abstract labels to the prosodic parameters to generate a high-level markup.

9. The program storage device of claim 1, wherein the marked-up text is generated using SSML (speech synthesis markup language).

10. The program storage device of claim 1, further comprising instruction for processing phonetic content of the spoken utterance to generate the synthetic waveform having a desired pronunciation.
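
Claims 7 to 9 describe two ways of turning prosodic parameters into markup: writing the measured values directly as attribute values, or assigning abstract labels to produce a high-level markup, in either case using SSML. The following Python sketch illustrates both modes with the SSML `prosody` element. It is an illustration only, not the patented implementation: the `WordProsody` record, the `to_ssml` and `pitch_label` helpers, and the 120 Hz / 220 Hz thresholds are all hypothetical, and the `speak` root omits the version and namespace attributes a conforming document would carry.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class WordProsody:
    """Hypothetical per-word measurements taken from the spoken example."""
    text: str
    mean_pitch_hz: float
    duration_ms: float


def pitch_label(hz, low=120.0, high=220.0):
    """Map a pitch value to one of SSML's predefined pitch labels.

    The thresholds are arbitrary assumptions chosen for illustration.
    """
    if hz < low:
        return "low"
    if hz > high:
        return "high"
    return "medium"


def to_ssml(words, high_level=False):
    """Wrap each word in a <prosody> element.

    high_level=False: measured values become attribute values (claim 7 style).
    high_level=True:  values are replaced by abstract labels (claim 8 style).
    """
    parts = []
    for w in words:
        if high_level:
            attrs = f'pitch="{pitch_label(w.mean_pitch_hz)}"'
        else:
            attrs = (f'pitch="{w.mean_pitch_hz:.0f}Hz" '
                     f'duration="{w.duration_ms:.0f}ms"')
        parts.append(f"<prosody {attrs}>{escape(w.text)}</prosody>")
    return "<speak>" + " ".join(parts) + "</speak>"


if __name__ == "__main__":
    example = [WordProsody("hello", 182.4, 310.0),
               WordProsody("world", 101.0, 420.0)]
    print(to_ssml(example))
    print(to_ssml(example, high_level=True))
```

The direct mode preserves the speaker's exact values, while the label mode gives a coarser markup that a synthesizer can interpret relative to its own voice.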

11. A method for speech synthesis, comprising the steps of:
- determining prosodic parameters of a spoken utterance;
- automatically generating a marked-up text corresponding to the spoken utterance using the prosodic parameters; and
- generating a synthetic waveform using the marked-up text.

12. The method of claim 11, wherein the determining prosodic parameters comprises determining pitch contour, duration contour or energy contour information of the spoken utterance, or any combination thereof.

13. The method of claim 11, further comprising aligning the spoken utterance with a corresponding text string.

14. The method of claim 13, wherein aligning comprises extracting acoustic feature data from the spoken utterance and time-aligning the spoken input to the corresponding text string using the acoustic feature data.

15. The method of claim 13, wherein aligning is performed using Viterbi alignment process.

16. The method of claim 13, wherein aligning is performed on a phoneme level.

17. The method of claim 11, wherein automatically generating a marked-up text comprises directly specifying the prosodic parameters as attribute values for mark-up elements.

18. The method of claim 11, wherein automatically generating a marked-up text comprises assigning abstract labels to the prosodic parameters to generate a high-level markup.

19. The method of claim 11, wherein the marked-up text is generated using SSML (speech synthesis markup language).

20. The method of claim 11, further comprising processing phonetic content of the spoken utterance to generate the synthetic waveform having a desired pronunciation.
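
Claims 13 to 16 call for time-aligning the utterance to its text at the phoneme level using a Viterbi alignment process, without spelling out the procedure. As a rough sketch of what such an alignment can look like, the Python code below runs a left-to-right Viterbi pass over a matrix of per-frame log-likelihoods and converts the resulting path into phoneme durations. The names `viterbi_align` and `durations_ms`, the 10 ms frame step, and the assumption that an acoustic model has already produced `loglik[t][n]` are all illustrative choices, not details from the patent.

```python
import math


def viterbi_align(loglik):
    """Return the best phoneme index for every frame.

    loglik[t][n] is the log-likelihood of frame t under phoneme n of the
    known phoneme sequence. The path must start in phoneme 0, end in the
    last phoneme, and may only stay or advance by one phoneme per frame.
    """
    n_frames, n_phones = len(loglik), len(loglik[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")

    neg_inf = -math.inf
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = loglik[0][0]

    for t in range(1, n_frames):
        for n in range(n_phones):
            stay = score[t - 1][n]
            move = score[t - 1][n - 1] if n > 0 else neg_inf
            if move > stay:
                score[t][n] = move + loglik[t][n]
                back[t][n] = n - 1
            else:
                score[t][n] = stay + loglik[t][n]
                back[t][n] = n

    # Trace back from the final phoneme at the final frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def durations_ms(path, n_phones, frame_ms=10.0):
    """Convert a frame-level path into one duration per phoneme."""
    counts = [0] * n_phones
    for n in path:
        counts[n] += 1
    return [c * frame_ms for c in counts]


if __name__ == "__main__":
    # Two phonemes, five frames: the first two frames favour phoneme 0.
    ll = [[0, -5], [0, -5], [-5, 0], [-5, 0], [-5, 0]]
    path = viterbi_align(ll)
    print(path, durations_ms(path, 2))
```

The per-phoneme durations obtained this way are one possible source for the duration contour named in claim 12.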

21. A text-to-speech (TTS) system, comprising:
- a prosody analyzer for determining prosodic parameters of a spoken utterance and automatically generating a marked-up text corresponding to the spoken utterance using the prosodic parameters; and
- a TTS system for generating a synthetic waveform using the marked-up text.

22. The system of claim 21, further comprising a user interface that enables a user to input the spoken utterance and input a text string corresponding to the spoken utterance.

23. The system of claim 21, wherein the prosody analyzer processes phonetic content of the spoken utterance to generate the synthetic waveform having a desired pronunciation.

24. The system of claim 21, wherein the prosody analyzer comprises:
- a pitch contour extraction module for determining pitch contour information for the spoken utterance;
- an alignment module for aligning the input text string with the spoken utterance to determine duration contour information of elements comprising the input text string; and
- a conversion module for including markup in the input text string in accordance with the duration and pitch contour information to generate the marked up text.
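
Claim 24 names a pitch contour extraction module but does not say how the contour is computed. One common technique, swapped in here purely for illustration, is to pick the autocorrelation peak of each short frame. The Python sketch below assumes a mono signal given as a list of floats; the function names, the 60 to 400 Hz search range, and the frame and hop sizes are hypothetical choices and are not taken from the patent.

```python
import math


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Searches lags corresponding to fmin..fmax for the largest positive
    autocorrelation value. Returns 0.0 when no positive peak is found
    (for example, for a silent frame).
    """
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


def pitch_contour(samples, sample_rate, frame_len=400, hop=160):
    """Return one F0 estimate per frame, stepping through the signal by hop."""
    return [
        estimate_f0(samples[start:start + frame_len], sample_rate)
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


if __name__ == "__main__":
    sr = 8000
    # Synthetic 100 Hz tone, 0.1 s long, standing in for recorded speech.
    tone = [math.sin(2 * math.pi * 100 * i / sr) for i in range(800)]
    print(pitch_contour(tone, sr))
```

A production extractor would typically add voicing decisions and smoothing; this sketch omits both to keep the idea visible.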

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Hamza, Wael M.; Aaron, Andy; Eide, Ellen M.; Bakis, Raimo
Granted Patent: US 8,886,538 B2
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
