Method and apparatus for generating synthetic speech with contrastive stress
Abstract
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
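The second aspect above, in which a software module identifies audio recordings for portions that do and do not carry contrastive stress, might be sketched as follows. This is only an illustration under stated assumptions: the `Token` type, the `RECORDINGS` inventory, and the file names are hypothetical and do not come from the patent, which does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    text: str
    stressed: bool  # True if this portion should carry contrastive stress


# Hypothetical inventory of pre-recorded prompts, keyed by (word, stressed?).
# Both the keys and the file names are assumptions made for this sketch.
RECORDINGS: Dict[Tuple[str, bool], str] = {
    ("gate", False): "gate_neutral.wav",
    ("gate", True): "gate_stressed.wav",
    ("twelve", False): "twelve_neutral.wav",
    ("twelve", True): "twelve_stressed.wav",
}


def identify_recordings(tokens: List[Token]) -> List[str]:
    """Return one recording per token, choosing the stressed variant
    only for tokens marked as carrying contrastive stress."""
    return [RECORDINGS[(t.text.lower(), t.stressed)] for t in tokens]


# The application would then play or concatenate the identified files in order.
playlist = identify_recordings([Token("gate", False), Token("twelve", True)])
```

Here the module returns identifiers rather than audio, leaving playback or concatenation to the speech-enabled application, which mirrors the division of labor the abstract describes.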
15 Claims
1. A method for use with a speech-enabled application, the method comprising:
receiving, from the speech-enabled application, input comprising a plurality of text strings;
identifying a first portion of a first text string of the plurality of text strings that differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string that does not differ from a corresponding second portion of the second text string;
assigning contrastive stress to the identified first portion of the first text string, but not to the identified second portion of the first text string;
generating, using at least one computer system, speech synthesis output corresponding to the plurality of text strings, the speech synthesis output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string; and
providing the speech synthesis output for the speech-enabled application.
Dependent claims: 2, 3, 4
5. Apparatus for use with a speech-enabled application, the apparatus comprising:
a memory storing a plurality of processor-executable instructions; and
at least one processor, operatively coupled to the memory, configured to execute the instructions to:
receive, from the speech-enabled application, input comprising a plurality of text strings;
identify a first portion of a first text string of the plurality of text strings that differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string that does not differ from a corresponding second portion of the second text string;
assign contrastive stress to the identified first portion of the first text string, but not to the identified second portion of the first text string;
generate speech synthesis output corresponding to the plurality of text strings, the speech synthesis output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string; and
provide the speech synthesis output for the speech-enabled application.
Dependent claims: 6, 7, 8
9. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for use with a speech-enabled application, the method comprising:
receiving, from the speech-enabled application, input comprising a plurality of text strings;
identifying a first portion of a first text string of the plurality of text strings that differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string that does not differ from a corresponding second portion of the second text string;
assigning contrastive stress to the identified first portion of the first text string, but not to the identified second portion of the first text string;
generating speech synthesis output corresponding to the plurality of text strings, the speech synthesis output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string; and
providing the speech synthesis output for the speech-enabled application.
Dependent claims: 10, 11, 12
13. A method for generating speech output via a speech-enabled application, the method comprising:
generating, using at least one computer system executing the speech-enabled application, a plurality of text strings, each of the plurality of text strings corresponding to a portion of a desired speech output, wherein a first portion of a first text string of the plurality of text strings differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string does not differ from a corresponding second portion of the second text string;
inputting the plurality of text strings to at least one software module for rendering contrastive stress;
receiving output from the at least one software module, the output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string, and at least one other of the plurality of audio recordings being selected to render the second portion of the first text string as speech not carrying contrastive stress; and
generating, using the plurality of audio recordings, an audio speech output corresponding to the desired speech output.
14. Apparatus for generating speech output via a speech-enabled application, the apparatus comprising:
a memory storing a plurality of processor-executable instructions; and
at least one processor, operatively coupled to the memory, configured to execute the instructions to:
generate a plurality of text strings, each of the plurality of text strings corresponding to a portion of a desired speech output, wherein a first portion of a first text string of the plurality of text strings differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string does not differ from a corresponding second portion of the second text string;
input the plurality of text strings to at least one software module for rendering contrastive stress;
receive output from the at least one software module, the output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string, and at least one other of the plurality of audio recordings being selected to render the second portion of the first text string as speech not carrying contrastive stress; and
generate, using the plurality of audio recordings, an audio speech output corresponding to the desired speech output.
15. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for generating speech output via a speech-enabled application, the method comprising:
generating a plurality of text strings, each of the plurality of text strings corresponding to a portion of a desired speech output, wherein a first portion of a first text string of the plurality of text strings differs from a corresponding first portion of a second text string of the plurality of text strings, and a second portion of the first text string does not differ from a corresponding second portion of the second text string;
inputting the plurality of text strings to at least one software module for rendering contrastive stress;
receiving output from the at least one software module, the output identifying a plurality of audio recordings to render the plurality of text strings as speech, at least one of the plurality of audio recordings being selected to render the first portion of the first text string as speech carrying contrastive stress, to contrast with the rendering of the second text string, and at least one other of the plurality of audio recordings being selected to render the second portion of the first text string as speech not carrying contrastive stress; and
generating, using the plurality of audio recordings, an audio speech output corresponding to the desired speech output.
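The identifying and assigning steps recited in the independent claims can be illustrated with a word-level comparison of two text strings. The sketch below is an assumption-laden example, not the claimed implementation: the function name `assign_contrastive_stress` is hypothetical, whitespace tokenization is a simplification, and the standard-library `difflib.SequenceMatcher` is swapped in as one possible way to find corresponding portions that differ.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def assign_contrastive_stress(first: str, second: str) -> List[Tuple[str, bool]]:
    """Pair each word of `first` with a flag that is True when the word
    differs from the corresponding word(s) of `second`."""
    a = first.split()
    b = second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    marked: List[Tuple[str, bool]] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Assumption: only "replace" spans have a corresponding portion in
        # `second` that differs, so only those words receive stress.
        stressed = tag == "replace"
        marked.extend((word, stressed) for word in a[i1:i2])
    return marked


marked = assign_contrastive_stress(
    "Flight 152 departs at 9:45 AM",
    "Flight 152 departs at 10:45 AM",
)
# Only "9:45" is flagged; the shared words are left without contrastive stress.
```

A production system would likely compare normalized tokens or phrase-level fields rather than raw whitespace-separated words, but the split into a differing portion and a non-differing portion is the same.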
Specification