Generating paralinguistic phenomena via markup in text-to-speech synthesis

US 7,472,065 B2
Filed: 06/04/2004
Issued: 12/30/2008
Est. Priority Date: 06/04/2004
Status: Active Grant

Abstract

Converting marked-up text into a synthesized stream includes providing marked-up text to a processor-based system, converting the marked-up text into a text stream including vocabulary items, retrieving audio segments corresponding to the vocabulary items, concatenating the audio segments to form a synthesized stream, and audibly outputting the synthesized stream. The marked-up text includes a normal text and a paralinguistic text, and the normal text is differentiated from the paralinguistic text by using a grammar constraint. The paralinguistic text is associated with more than one audio segment, and retrieving the audio segments includes selecting one audio segment associated with the paralinguistic text.
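The steps in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `<name/>` tag syntax standing in for the grammar constraint, the `SEGMENTS` inventory, and the use of short lists of floats in place of real audio. Audible output is left out, since it depends on the playback device.

```python
import random
import re

# Hypothetical inventory: each vocabulary item maps to one or more audio
# segments. Segments are short lists of floats standing in for real audio.
SEGMENTS = {
    "hello": [[0.1, 0.2]],
    "there": [[0.3]],
    "cough": [[0.9, 0.8], [0.7]],  # paralinguistic item with two recordings
}

# Assumed markup convention: <name/> marks paralinguistic text,
# bare words are normal text.
TOKEN = re.compile(r"<(\w+)/>|(\w+)")


def to_vocabulary_items(marked_up):
    """Convert marked-up text into (item, is_paralinguistic) pairs."""
    items = []
    for match in TOKEN.finditer(marked_up):
        tag, word = match.group(1), match.group(2)
        if tag is not None:
            items.append((tag.lower(), True))
        else:
            items.append((word.lower(), False))
    return items


def synthesize(marked_up, rng=None):
    """Retrieve one segment per vocabulary item and concatenate them."""
    rng = rng or random.Random()
    stream = []
    for item, _is_paralinguistic in to_vocabulary_items(marked_up):
        candidates = SEGMENTS[item]           # raises KeyError if unknown
        stream.extend(rng.choice(candidates))  # select one of the candidates
    return stream
```

Calling `synthesize("Hello <cough/> there")` returns the samples for "hello", then one of the two cough recordings, then the samples for "there".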

25 Claims

1. A method of converting marked-up text into a synthesized stream, comprising:
- providing marked-up text to a processor-based system;
- converting the marked-up text into a text stream comprising a plurality of vocabulary items;
- retrieving a plurality of audio segments corresponding to the plurality of vocabulary items;
- concatenating the plurality of audio segments to form a synthesized stream; and
- audibly outputting the synthesized stream;
- wherein the marked-up text comprises a normal text and a paralinguistic text;
- wherein the normal text is differentiated from the paralinguistic text by using a grammar constraint; and
- wherein the paralinguistic text is associated with more than one audio segment, wherein the retrieving of the plurality of audio segments comprises selecting one audio segment associated with the paralinguistic text.
  - 2. The method of claim 1, wherein the paralinguistic text comprises non-speech sounds.
  - 3. The method of claim 2, wherein the non-speech sounds comprise at least one of a breath, a cough, a sigh, a filled pause, and a hesitation.
  - 4. The method of claim 1, wherein the normal text comprises speech sounds.
  - 5. The method of claim 4, wherein the speech sounds comprise sounds with a word equivalent.
  - 6. The method of claim 1, further comprising determining an emotional context of the marked-up text.
  - 7. The method of claim 6, wherein the step of retrieving further comprises choosing the plurality of audio segments corresponding to the emotional context of the marked-up text, wherein the selected one audio segment associated with the paralinguistic text is selected according to the emotional context.
  - 8. The method of claim 6, wherein the step of concatenating further comprises concatenating the plurality of audio segments based on the emotional context of the marked-up text.
  - 9. The method of claim 8, wherein concatenating the plurality of audio segments based on the emotional context of the marked-up text comprises setting the prosody of the synthesized stream based on the emotional context of the marked-up text.
  - 10. The method of claim 6, wherein the step of audibly outputting the synthesized stream comprises audibly outputting the synthesized stream based on the emotional context of the marked-up text, wherein the selected one audio segment associated with the paralinguistic text is selected randomly.
  - 11. The method of claim 10, wherein the step of audibly outputting the synthesized stream based on the emotional context of the marked-up text comprises audibly outputting the synthesized stream at a prosody based on the emotional context of the marked-up text.
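Claims 6 through 11 describe choosing among several audio segments for the same paralinguistic text, either according to an emotional context or at random. The sketch below shows one way such a selection could look. The `Segment` type, the emotion labels, and the fallback to the full candidate pool when nothing matches are all assumptions for illustration. The claims do not say how the emotional context is determined, so it is simply passed in as a parameter.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str     # hypothetical label for one recording
    emotion: str  # emotional context the recording was tagged with


# Hypothetical candidates for a single paralinguistic item ("sigh").
SIGHS = [
    Segment("sigh_sad_1", "sad"),
    Segment("sigh_sad_2", "sad"),
    Segment("sigh_relief_1", "happy"),
]


def select_segment(candidates, emotion=None, rng=None):
    """Pick one segment, preferring those tagged with the given emotion.

    If no candidate matches (or no emotion is given), the choice falls
    back to a random pick over all candidates. That fallback is an
    assumption of this sketch.
    """
    rng = rng or random.Random()
    matching = [s for s in candidates if s.emotion == emotion]
    pool = matching or list(candidates)
    return rng.choice(pool)
```

With these candidates, `select_segment(SIGHS, "happy")` can only return `sigh_relief_1`, while `select_segment(SIGHS, "sad")` returns one of the two sad recordings at random.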

12. A method of converting paralinguistic text into a synthesized stream, comprising:
- providing paralinguistic text to a processor-based system;
- converting the paralinguistic text into a text stream comprising a plurality of vocabulary items;
- retrieving a plurality of audio examples corresponding to the plurality of vocabulary items;
- concatenating the plurality of audio examples to form a synthesized stream; and
- audibly outputting the synthesized stream;
- wherein the paralinguistic text comprises non-speech sounds indicating an emotional state underlying the paralinguistic text; and
- wherein the paralinguistic text is associated with more than one audio segment, wherein the retrieving of the plurality of audio segments comprises selecting one audio segment associated with the paralinguistic text.
  - 13. The method of claim 12, wherein the non-speech sounds comprise at least one of a breath, a cough, a sigh, a filled pause, and a hesitation.
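Claim 13 names a small set of non-speech sounds. As a rough illustration, that set can be modelled as a closed vocabulary, with purely paralinguistic input converted into a stream of its members. The `<name/>` tag syntax and the identifier spellings (for example `filled_pause`) are assumptions of this sketch, not notation from the patent.

```python
import re
from enum import Enum


class NonSpeechSound(Enum):
    """The non-speech sounds listed in claim 13."""
    BREATH = "breath"
    COUGH = "cough"
    SIGH = "sigh"
    FILLED_PAUSE = "filled_pause"
    HESITATION = "hesitation"


# Assumed markup convention: each sound is written as <name/>.
TAG = re.compile(r"<(\w+)/>")


def parse_paralinguistic(text):
    """Convert paralinguistic text into a list of NonSpeechSound items.

    Raises ValueError for a tag that is not in the closed vocabulary.
    """
    return [NonSpeechSound(name.lower()) for name in TAG.findall(text)]
```

For example, `parse_paralinguistic("<sigh/> <filled_pause/>")` yields `[NonSpeechSound.SIGH, NonSpeechSound.FILLED_PAUSE]`, and an unknown tag such as `<laugh/>` is rejected with a `ValueError`.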

14. A system of converting marked-up text into a synthesized stream, comprising:
- means for providing marked-up text to a processor-based system;
- means for converting the marked-up text into a text stream comprising a plurality of vocabulary items;
- means for retrieving a plurality of audio examples corresponding to the plurality of vocabulary items;
- means for concatenating the plurality of audio examples to form a synthesized stream; and
- means for audibly outputting the synthesized stream;
- wherein the marked-up text comprises a normal text and a paralinguistic text; and
- wherein the normal text is differentiated from the paralinguistic text by using a grammar constraint; and
- wherein the paralinguistic text is associated with more than one audio segment, wherein the retrieving of the plurality of audio segments comprises selecting one audio segment associated with the paralinguistic text.
  - 15. The system of claim 14, wherein the normal text comprises speech sounds and the paralinguistic text comprises non-speech sounds.
  - 16. The system of claim 15, wherein the non-speech sounds comprise at least one of a breath, a cough, a sigh, a filled pause, and a hesitation.
  - 17. The system of claim 16, wherein the plurality of audio examples are prerecorded.
  - 18. The system of claim 17, wherein the plurality of audio examples are prerecorded using one speaker.
  - 19. The system of claim 17, wherein the plurality of audio examples are prerecorded using a plurality of speakers.
  - 20. The system of claim 14, wherein the plurality of audio examples corresponding to the plurality of vocabulary items comprises at least one audio example corresponding to each of the plurality of vocabulary items.
  - 21. The system of claim 14, wherein each of the plurality of vocabulary items comprises a phoneme.
  - 22. The system of claim 14, wherein the grammar constraint comprises markup.
  - 23. The system of claim 14, further comprising a database for storing the plurality of audio examples.
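Claims 17 through 23 mention prerecorded audio examples from one or several speakers, at least one example per vocabulary item, and a database for storing them. A minimal sketch of such a store, using an in-memory SQLite database, might look like the following. The table layout, column names, and helper functions are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_store():
    """Create an in-memory database with a hypothetical example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE audio_example ("
        " item TEXT NOT NULL,"      # vocabulary item, e.g. a phoneme or 'cough'
        " speaker TEXT NOT NULL,"   # who recorded the example
        " samples BLOB NOT NULL)"   # raw audio bytes
    )
    return conn


def add_example(conn, item, speaker, samples):
    """Store one prerecorded example for a vocabulary item."""
    conn.execute(
        "INSERT INTO audio_example (item, speaker, samples) VALUES (?, ?, ?)",
        (item, speaker, samples),
    )


def examples_for(conn, item, speaker=None):
    """Return all stored examples for an item, optionally for one speaker."""
    if speaker is None:
        rows = conn.execute(
            "SELECT samples FROM audio_example WHERE item = ? ORDER BY rowid",
            (item,),
        )
    else:
        rows = conn.execute(
            "SELECT samples FROM audio_example"
            " WHERE item = ? AND speaker = ? ORDER BY rowid",
            (item, speaker),
        )
    return [row[0] for row in rows]


def missing_items(conn, vocabulary):
    """List vocabulary items that have no stored example at all."""
    return [item for item in vocabulary if not examples_for(conn, item)]
```

The `missing_items` helper reflects the condition in claim 20 that every vocabulary item has at least one audio example. A store that passes this check returns an empty list.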

24. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for converting marked-up text into a synthesized stream, the method steps comprising:
- providing marked-up text to a processor-based system;
- converting the marked-up text into a text stream comprising a plurality of vocabulary items;
- retrieving a plurality of audio segments corresponding to the plurality of vocabulary items;
- concatenating the plurality of audio segments to form a synthesized stream; and
- audibly outputting the synthesized stream;
- wherein the marked-up text comprises a normal text and a paralinguistic text;
- wherein the normal text is differentiated from the paralinguistic text by using a grammar constraint; and
- wherein the paralinguistic text is associated with more than one audio segment, wherein the retrieving of the plurality of audio segments comprises selecting one audio segment associated with the paralinguistic text.

25. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for converting paralinguistic text into a synthesized stream, the method steps comprising:
- providing paralinguistic text to a processor-based system;
- converting the paralinguistic text into a text stream comprising a plurality of vocabulary items;
- retrieving a plurality of audio examples corresponding to the plurality of vocabulary items;
- concatenating the plurality of audio examples to form a synthesized stream; and
- audibly outputting the synthesized stream;
- wherein the paralinguistic text comprises non-speech sounds indicating an emotional state underlying the paralinguistic text; and
- wherein the paralinguistic text is associated with more than one audio segment, wherein the retrieving of the plurality of audio segments comprises selecting one audio segment associated with the paralinguistic text.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Eide, Ellen M.; Aaron, Andrew S.; Hamza, Wael; Bakis, Raimo
Primary Examiner(s): Lerner, Martin
Application Number: US10/861,055
Publication Number: US 20050273338A1
Time in Patent Office: 1,670 Days
Field of Search: 704/258, 704/260, 704/261, 704/266, 704/267, 704/269, 379/88.16
US Class Current: 704/258
CPC Class Codes: G10L 13/08 Text analysis or generation...
