Method and apparatus for generating synthetic speech with contrastive stress
First Claim
1. A method for providing speech output for a speech-enabled application, the method comprising:
receiving from the speech-enabled application a text input comprising a text transcription of a desired speech output;
generating, using at least one computer system, an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
providing the audio speech output for the speech-enabled application;
wherein the generating comprises:
identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
wherein the assigning comprises:
identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
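The assigning step recited above can be illustrated with a minimal sketch. It assumes that the portions of a token are its individual characters and that corresponding portions are those at the same position; the claim does not fix either choice. The `Token` class, the `norm_type` labels, and the function names are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    norm_type: str  # hypothetical label for the text normalization type


def split_portions(text):
    """Split a token into portions. Assumption: one portion per character."""
    return list(text)


def assign_contrastive_stress(tokens, target_index):
    """Return (portion, stressed) pairs for tokens[target_index].

    A portion is marked as stressed when it differs from the portion at the
    same position in at least one other token of the same normalization
    type. Portions shared with every such token are left unstressed.
    """
    target = tokens[target_index]
    others = [
        split_portions(t.text)
        for i, t in enumerate(tokens)
        if i != target_index and t.norm_type == target.norm_type
    ]
    marked = []
    for pos, part in enumerate(split_portions(target.text)):
        differs = any(
            pos >= len(other) or other[pos] != part for other in others
        )
        marked.append((part, differs))
    return marked


# Hypothetical input: two flight numbers and one unrelated token.
tokens = [
    Token("1401", "flight_number"),
    Token("1407", "flight_number"),
    Token("Boston", "city"),
]
stress_plan = assign_contrastive_stress(tokens, 1)
```

With this input, only the final digit of the second flight number is marked for stress, because it is the only portion that differs from the first flight number. A real system would then map the marked portions to prosodic targets in the audio output, which is outside the scope of this sketch.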
Abstract
Techniques for generating synthetic speech with contrastive stress. In one aspect, a speech-enabled application generates a text input including a text transcription of a desired speech output, and inputs the text input to a speech synthesis system. The synthesis system generates an audio speech output corresponding to at least a portion of the text input, with at least one portion carrying contrastive stress, and provides the audio speech output for the speech-enabled application. In another aspect, a speech-enabled application inputs a plurality of text strings, each corresponding to a portion of a desired speech output, to a software module for rendering contrastive stress. The software module identifies a plurality of audio recordings that render at least one portion of at least one of the text strings as speech carrying contrastive stress. The speech-enabled application generates an audio speech output corresponding to the desired speech output using the audio recordings.
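The second aspect, in which a software module identifies audio recordings for the application's text strings, might be sketched as follows. The abstract does not say how recordings are indexed or how the strings to be stressed are chosen, so the `(text, stressed)` lookup key, the `stressed_indices` argument, and the file names below are all assumptions made for illustration.

```python
def select_recordings(text_strings, stressed_indices, library):
    """Pick one recording per text string.

    `library` maps (text, stressed) to a recording name. A stressed rendition
    is used where requested and available; otherwise the neutral rendition
    is used.
    """
    chosen = []
    for i, text in enumerate(text_strings):
        key = (text, i in stressed_indices)
        if key not in library:
            key = (text, False)  # fall back to the neutral recording
        chosen.append(library[key])
    return chosen


# Hypothetical recording library and request.
library = {
    ("flight", False): "flight.wav",
    ("1407", False): "1407.wav",
    ("1407", True): "1407_stressed.wav",
}
playlist = select_recordings(["flight", "1407"], {1}, library)
```

The application would then concatenate or play the returned recordings in order to produce the audio speech output.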
51 Claims
1. A method for providing speech output for a speech-enabled application, the method comprising:
receiving from the speech-enabled application a text input comprising a text transcription of a desired speech output;
generating, using at least one computer system, an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
providing the audio speech output for the speech-enabled application;
wherein the generating comprises:
identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
wherein the assigning comprises:
identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. Apparatus for providing speech output for a speech-enabled application, the apparatus comprising:
a memory storing a plurality of processor-executable instructions; and
at least one processor, operatively coupled to the memory, that executes the instructions to:
receive from the speech-enabled application a text input comprising a text transcription of a desired speech output;
generate an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
provide the audio speech output for the speech-enabled application;
wherein the at least one processor executes the instructions to generate the audio speech output at least in part by:
identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
wherein the at least one processor executes the instructions to perform the assigning at least in part by:
identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for providing speech output for a speech-enabled application, the method comprising:
receiving from the speech-enabled application a text input comprising a text transcription of a desired speech output;
generating an audio speech output corresponding to at least a portion of the text input, the audio speech output comprising at least one portion carrying contrastive stress to contrast with at least one other portion of the audio speech output; and
providing the audio speech output for the speech-enabled application;
wherein the generating comprises:
identifying a plurality of tokens of the text input of a same text normalization type for which a contrastive stress pattern is to be applied;
identifying at least one token of the plurality of tokens to be rendered with contrastive stress; and
assigning contrastive stress to be carried by at least one portion of the audio speech output corresponding to at least one portion of the at least one token of the text input;
wherein the assigning comprises:
identifying at least one first portion of the at least one token of the plurality of tokens that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion of the at least one token that does not differ from at least one corresponding second portion of the at least one other token; and
assigning contrastive stress to be carried by at least one first portion of the audio speech output corresponding to the identified at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the identified at least one second portion of the at least one token.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
28. A method for providing speech output via a speech-enabled application, the method comprising:
generating, using at least one computer system executing the speech-enabled application, a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
inputting the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
receiving the audio speech output from the at least one speech synthesis engine; and
providing the audio speech output to at least one user of the speech-enabled application.
Dependent claims: 29, 30, 31, 32, 33, 34, 35
36. Apparatus for providing speech output via a speech-enabled application, the apparatus comprising:
a memory storing a plurality of processor-executable instructions; and
at least one processor, operatively coupled to the memory, that executes the instructions to:
generate a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
input the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
receive the audio speech output from the at least one speech synthesis engine; and
provide the audio speech output to at least one user of the speech-enabled application.
Dependent claims: 37, 38, 39, 40, 41, 42, 43
44. At least one non-transitory computer-readable storage medium encoded with a plurality of computer-executable instructions that, when executed, perform a method for providing speech output via a speech-enabled application, the method comprising:
generating a text input comprising a text transcription of a desired speech output, the text input comprising a plurality of tokens of a same text normalization type for which a contrastive stress pattern is to be applied, at least one token of the plurality of tokens comprising at least one first portion that differs from at least one corresponding first portion of at least one other token of the plurality of tokens, and at least one second portion that does not differ from at least one corresponding second portion of the at least one other token;
inputting the text input to at least one speech synthesis engine configured to assign contrastive stress to be carried by at least one first portion of an audio speech output corresponding to the at least one first portion of the at least one token, but not to at least one second portion of the audio speech output corresponding to the at least one second portion of the at least one token;
receiving the audio speech output from the at least one speech synthesis engine; and
providing the audio speech output to at least one user of the speech-enabled application.
Dependent claims: 45, 46, 47, 48, 49, 50, 51
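The application-side flow recited in claims 28, 36 and 44 (generate a text input, input it to a synthesis engine, receive the audio, provide it to the user) could look roughly like the sketch below. The `SynthesisEngine` interface, the `StubEngine` stand-in, the prompt wording, and the `play` callback are hypothetical; a real engine would return audio with contrastive stress on the differing portions, whereas the stub simply echoes the text as bytes so that the example runs on its own.

```python
from typing import Callable, Protocol


class SynthesisEngine(Protocol):
    """Hypothetical interface for a speech synthesis engine."""

    def synthesize(self, text: str) -> bytes: ...


class StubEngine:
    """Stand-in engine that returns the text as bytes instead of audio."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


def build_text_input(heard: str, alternative: str) -> str:
    """Build a prompt containing two tokens of the same normalization type.

    The two flight numbers share some portions and differ in others, which
    is the situation the claims describe.
    """
    return f"Did you say flight {heard} or flight {alternative}?"


def speak_confirmation(
    engine: SynthesisEngine,
    heard: str,
    alternative: str,
    play: Callable[[bytes], None],
) -> str:
    """Generate the text input, synthesize it, and hand the audio to `play`."""
    text_input = build_text_input(heard, alternative)
    audio = engine.synthesize(text_input)  # engine decides where stress goes
    play(audio)                            # provide the output to the user
    return text_input


played = []
prompt = speak_confirmation(StubEngine(), "1401", "1407", played.append)
```

In this arrangement the application only supplies plain text; deciding which portions carry stress is left to the engine.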
Specification