Method and apparatus for providing speech output for speech-enabled applications

US 8,949,128 B2
Filed: 02/12/2010
Issued: 02/03/2015
Est. Priority Date: 02/12/2010
Status: Active Grant

7 Assignments

0 Petitions

Abstract

Techniques for providing speech output for speech-enabled applications. A synthesis system receives from a speech-enabled application a text input including a text transcription of a desired speech output. The synthesis system selects one or more audio recordings corresponding to one or more portions of the text input. In one aspect, the synthesis system selects from audio recordings provided by a developer of the speech-enabled application. In another aspect, the synthesis system selects an audio recording of a speaker speaking a plurality of words. The synthesis system forms a speech output including the one or more selected audio recordings and provides the speech output for the speech-enabled application.
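
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the greedy longest-match strategy, the `Segment` type, the `normalize` helper, and the `fake_tts` placeholder are all assumptions introduced here for illustration. The sketch covers a text input with developer-provided recordings where a match exists, falls back to a stand-in TTS segment elsewhere, and concatenates the result.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    text: str     # portion of the text input this segment covers
    source: str   # "recording" or "tts"
    audio: bytes  # stand-in for real audio samples


def normalize(text: str) -> str:
    """Lowercase and drop punctuation (a simplistic, assumed normalization)."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def fake_tts(text: str) -> bytes:
    """Hypothetical placeholder for a real TTS engine."""
    return f"<tts:{text}>".encode()


def synthesize(text_input: str, recordings: dict) -> list:
    """Cover the text input with developer recordings, using TTS for the gaps."""
    library = {normalize(k): v for k, v in recordings.items()}
    words = normalize(text_input).split()
    segments = []
    pending = []  # words with no matching recording, waiting for TTS

    def flush_pending():
        if pending:
            gap = " ".join(pending)
            segments.append(Segment(gap, "tts", fake_tts(gap)))
            pending.clear()

    i = 0
    while i < len(words):
        # Greedy longest match starting at word i (an assumption of this sketch).
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in library:
                flush_pending()
                segments.append(Segment(phrase, "recording", library[phrase]))
                i = j
                break
        else:
            # No recording starts here; leave the word for TTS.
            pending.append(words[i])
            i += 1
    flush_pending()
    return segments


def speech_output(segments: list) -> bytes:
    """Concatenate the segment audio into a single output."""
    return b"".join(s.audio for s in segments)


# Usage: two phrases have developer recordings, the date does not.
developer_recordings = {"Your order": b"A", "will arrive on": b"B"}
result = synthesize("Your order will arrive on Tuesday.", developer_recordings)
output = speech_output(result)
```

In this example the first two portions are served from recordings and only "tuesday" goes to the placeholder TTS, so the concatenated output mixes both sources.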

66 Citations

30 Claims

1. A method for providing, from a synthesis system, a speech output for a speech-enabled application, the method comprising:
- receiving from the speech-enabled application, at the synthesis system, a text input comprising a text transcription of a desired speech output;
- selecting, using at least one computer system implementing the synthesis system, at least one audio recording provided by a developer of the speech-enabled application who is not a developer of the synthesis system, the at least one audio recording corresponding to at least a first portion of the text input; and
- providing for the speech-enabled application, from the synthesis system, a speech output comprising the at least one audio recording.
- Dependent claims (2 to 16):
  - 2. The method of claim 1, further comprising concatenating the at least one audio recording and at least one additional audio segment to produce the speech output.
  - 3. The method of claim 2, wherein the at least one additional audio segment is selected from the group consisting of at least one additional audio recording, at least one concatenative text to speech (TTS) synthesis segment, at least one formant synthesis segment and at least one articulatory synthesis segment.
  - 4. The method of claim 1, further comprising:
    - in response to determining that no audio recording corresponding to a second portion of the text input has been provided by the developer of the speech-enabled application, creating, using text to speech (TTS) synthesis, at least one additional audio segment corresponding to the second portion of the text input; and
    - concatenating at least the at least one audio recording and the at least one additional audio segment to produce the speech output.
  - 5. The method of claim 1, wherein the at least one audio recording is selected based at least in part on a normalized orthography of the at least the first portion of the text input.
  - 6. The method of claim 1, wherein the at least one audio recording is selected based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording.
  - 7. The method of claim 6, wherein the metadata is provided by the developer of the speech-enabled application.
  - 8. The method of claim 1, wherein the at least one audio recording is selected from a plurality of audio recordings corresponding to the at least the first portion of the text input, the at least one audio recording being selected based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording, the metadata being provided by the developer of the speech-enabled application.
  - 9. The method of claim 1, wherein the at least one audio recording is selected based at least in part on an indication of contrastive stress in the text input.
  - 10. The method of claim 1, further comprising playing the speech output via the speech-enabled application.
  - 11. The method of claim 1, further comprising providing at least one interface allowing the developer of the speech-enabled application to provide the at least one audio recording.
  - 12. The method of claim 11, wherein the at least one interface further allows the developer of the speech-enabled application to provide metadata associated with the at least one audio recording.
  - 13. The method of claim 11, wherein the at least one interface further allows the developer of the speech-enabled application to provide templates for text inputs to be created by the speech-enabled application.
  - 14. The method of claim 1, wherein the speech-enabled application is an interactive voice response (IVR) application.
  - 15. The method of claim 1, wherein providing the speech output comprises storing the speech output in at least one audio file.
  - 16. The method of claim 1, wherein providing the speech output comprises streaming data encoding the speech output to the speech-enabled application.

17. Apparatus comprising at least one processor configured to:
- receive from a speech-enabled application, at a synthesis system, a text input comprising a text transcription of a desired speech output;
- select, via the synthesis system, at least one audio recording provided by a developer of the speech-enabled application who is not a developer of the synthesis system, the at least one audio recording corresponding to at least a first portion of the text input; and
- provide for the speech-enabled application, from the synthesis system, a speech output comprising the at least one audio recording.
- Dependent claims (18 to 23):
  - 18. The apparatus of claim 17, wherein the at least one processor is further configured to concatenate the at least one audio recording and at least one additional audio segment to produce the speech output.
  - 19. The apparatus of claim 17, wherein the at least one processor is further configured to:
    - in response to determining that no audio recording corresponding to a second portion of the text input has been provided by the developer of the speech-enabled application, create, using text to speech (TTS) synthesis, at least one additional audio segment corresponding to the second portion of the text input; and
    - concatenate at least the at least one audio recording and the at least one additional audio segment to produce the speech output.
  - 20. The apparatus of claim 17, wherein the at least one processor is configured to select the at least one audio recording based at least in part on a normalized orthography of the at least the first portion of the text input.
  - 21. The apparatus of claim 17, wherein the at least one processor is configured to select the at least one audio recording based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording, wherein the metadata is provided by the developer of the speech-enabled application.
  - 22. The apparatus of claim 17, wherein the at least one processor is configured to select the at least one audio recording from a plurality of audio recordings corresponding to the at least the first portion of the text input, the at least one audio recording being selected based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording, the metadata being provided by the developer of the speech-enabled application.
  - 23. The apparatus of claim 17, wherein the at least one processor is configured to select the at least one audio recording based at least in part on an indication of contrastive stress in the text input.

24. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for providing a speech output for a speech-enabled application from a synthesis system, the method comprising:
- receiving from the speech-enabled application, at the synthesis system, a text input comprising a text transcription of a desired speech output;
- selecting, via the synthesis system, at least one audio recording provided by a developer of the speech-enabled application who is not a developer of the synthesis system, the at least one audio recording corresponding to at least a first portion of the text input; and
- providing for the speech-enabled application, from the synthesis system, a speech output comprising the at least one audio recording.
- Dependent claims (25 to 30):
  - 25. The at least one non-transitory computer-readable storage medium of claim 24, wherein the method further comprises concatenating the at least one audio recording and at least one additional audio segment to produce the speech output.
  - 26. The at least one non-transitory computer-readable storage medium of claim 24, wherein the method further comprises:
    - in response to determining that no audio recording corresponding to a second portion of the text input has been provided by the developer of the speech-enabled application, creating, using text to speech (TTS) synthesis, at least one additional audio segment corresponding to the second portion of the text input; and
    - concatenating at least the at least one audio recording and the at least one additional audio segment to produce the speech output.
  - 27. The at least one non-transitory computer-readable storage medium of claim 24, wherein the at least one audio recording is selected based at least in part on a normalized orthography of the at least the first portion of the text input.
  - 28. The at least one non-transitory computer-readable storage medium of claim 24, wherein the at least one audio recording is selected based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording, wherein the metadata is provided by the developer of the speech-enabled application.
  - 29. The at least one non-transitory computer-readable storage medium of claim 24, wherein the at least one audio recording is selected from a plurality of audio recordings corresponding to the at least the first portion of the text input, the at least one audio recording being selected based at least in part on at least one constraint indicated by metadata associated with the at least one audio recording, the metadata being provided by the developer of the speech-enabled application.
  - 30. The at least one non-transitory computer-readable storage medium of claim 24, wherein the at least one audio recording is selected based at least in part on an indication of contrastive stress in the text input.
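
Several dependent claims (5, 6, 8 and 9, with their apparatus and storage-medium counterparts) describe selecting a recording based on a normalized orthography, on constraints indicated by developer-provided metadata, and on an indication of contrastive stress. The claims do not define a data format for any of these, so the following Python sketch is only one possible reading: the `Recording` type, the constraint keys `position` and `stress`, the file names, and the "most specific match wins" rule are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Recording:
    text: str        # transcription supplied with the recording
    audio_file: str  # hypothetical file name
    # Developer-provided metadata: conditions under which this recording fits.
    constraints: dict = field(default_factory=dict)


def normalized_orthography(text: str) -> str:
    """Case- and punctuation-insensitive key (an assumed normalization)."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def select_recording(portion: str, context: dict, recordings: list):
    """Return the best recording for a text portion, or None if none applies."""
    key = normalized_orthography(portion)
    candidates = [r for r in recordings if normalized_orthography(r.text) == key]
    # Keep candidates whose every constraint is satisfied by the context.
    eligible = [
        r for r in candidates
        if all(context.get(name) == value for name, value in r.constraints.items())
    ]
    if not eligible:
        return None
    # Prefer the recording with the most satisfied constraints (assumed rule).
    return max(eligible, key=lambda r: len(r.constraints))


# Usage: three recordings of the same word for different contexts.
library = [
    Recording("Tuesday", "tuesday_neutral.wav"),
    Recording("Tuesday", "tuesday_final.wav", {"position": "sentence_final"}),
    Recording("Tuesday", "tuesday_stressed.wav", {"stress": "contrastive"}),
]
chosen = select_recording("tuesday.", {"position": "sentence_final"}, library)
```

When `select_recording` returns `None`, a caller could fall back to TTS synthesis for that portion, in the spirit of claims 4, 19 and 26.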

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Meyer, Darren C.; Bos-Plachez, Corinne; Staessen, Martine Marguerite
Primary Examiner(s): COLUCCI, MICHAEL C
Application Number: US12/704,859
Publication Number: US 20110202344A1
Time in Patent Office: 1,817 Days
Field of Search: 704/260, 704/270.1, 704/275, 704/271, 704/270, 704/258, 704/234, 704/235, 704/209, 434/236, 434/178, 379/88.16
US Class Current: 704/260
CPC Class Codes:
- G10L 13/02   Methods for producing synth...
- G10L 13/04   Details of speech synthesis...
- G10L 13/08   Text analysis or generation...
