Text-to-speech processing with emphasized output audio
Abstract
Systems and methods for generating output audio with emphasized portions are described. Spoken audio is obtained and undergoes speech processing (e.g., ASR and optionally NLU) to create text. The system may determine that a portion of the resulting text (e.g., an interjection) should be emphasized, using one or more of: knowledge of the application running on the device that captured the spoken audio, prosodic analysis, or linguistic analysis. The portion of text to be emphasized may be tagged (e.g., using a Speech Synthesis Markup Language (SSML) tag). TTS processing is then performed on the tagged text to create output audio including an emphasized portion corresponding to the tagged portion of the text.
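The tagging step described in the abstract can be illustrated with a minimal Python sketch. It assumes the text has already been split into words and that some earlier analysis has chosen which word positions to emphasize. The function name `tag_emphasis` and its parameters are hypothetical and do not come from the patent. The `<speak>` and `<emphasis level="...">` elements are standard SSML.

```python
from xml.sax.saxutils import escape


def tag_emphasis(words, emphasized_indices, level="strong"):
    """Build an SSML string, wrapping the selected words in <emphasis> tags.

    words: list of recognized words (hypothetical ASR output).
    emphasized_indices: set of positions in `words` to emphasize.
    level: SSML emphasis level ("strong", "moderate", "reduced", "none").
    """
    parts = []
    for i, word in enumerate(words):
        text = escape(word)  # escape &, <, > so the SSML stays well-formed
        if i in emphasized_indices:
            text = f'<emphasis level="{level}">{text}</emphasis>'
        parts.append(text)
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = tag_emphasis(["wow", "that", "is", "great"], {0})
print(ssml)
# <speak><emphasis level="strong">wow</emphasis> that is great</speak>
```

The resulting string would then be passed to whatever TTS engine is in use. How a given engine renders the emphasis level is engine-specific and is not shown here.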
20 Claims
1. A computer implemented method comprising:
receiving, from a first speech-controlled device, first input audio data corresponding to a command to receive audio data;
performing automatic speech recognition on the audio data to generate first text;
determining a duration corresponding to how long at least one word is pronounced in the first input audio data;
determining, based on the duration, a first portion of the audio data corresponding to a first word of the first text has a volume greater than a second portion of the audio data corresponding to other words in the first text;
associating a first speech synthesis markup language (SSML) tag with the first word, the SSML tag indicating the first word is to be emphasized;
performing text-to-speech (TTS) processing on the first text, using the first SSML tag, to create output audio data, the output audio data including emphasized speech corresponding to the first word; and
sending, to a second speech-controlled device, the output speech audio data.
Dependent claims: 2, 3, 4
5. A system comprising:
at least one processor; and
a memory including instructions operable to be executed by the at least one processor to perform a set of actions to configure the at least one processor to:
receive, from a first device, input audio data;
perform automatic speech recognition on the input audio data to create text including at least one word;
determine a duration corresponding to how long the at least one word is pronounced in the input audio data;
determine, based on the duration, that the at least one word is to be emphasized relative to other words in the text; and
perform text-to-speech processing on the text to create output speech audio data, the output speech audio data including emphasized speech corresponding to the at least one word.
Dependent claims: 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method comprising:
receiving, from a first device, input audio data;
performing automatic speech recognition on the input audio data to create text including at least one word;
determining a duration corresponding to how long the at least one word is pronounced in the input audio data;
determining, based on the duration, that the at least one word is to be emphasized relative to other words in the text; and
performing text-to-speech processing on the text to create output speech audio data, the output speech audio data including emphasized speech corresponding to the at least one word.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
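The duration-based determination recited in claims 5 and 13 might be sketched as follows. The claims do not say how duration maps to emphasis, so everything here is an assumption made for illustration: the `TimedWord` type, the per-character normalization, the median baseline, and the `ratio` threshold of 1.5 are all hypothetical. The sketch also assumes the ASR step supplies start and end times for each word.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the utterance
    end: float    # seconds from the start of the utterance


def emphasized_words(words, ratio=1.5):
    """Return indices of words pronounced unusually slowly.

    A word is flagged when its duration per character exceeds `ratio`
    times the median duration per character across the utterance.
    This heuristic is an illustrative assumption, not the patented method.
    """
    if not words:
        return []
    # Normalize by word length so long words are not flagged just for being long.
    rates = [(w.end - w.start) / max(len(w.text), 1) for w in words]
    baseline = median(rates)
    if baseline <= 0:
        return []
    return [i for i, rate in enumerate(rates) if rate > ratio * baseline]


utterance = [
    TimedWord("I", 0.0, 0.1),
    TimedWord("really", 0.1, 1.3),  # drawn out by the speaker
    TimedWord("like", 1.3, 1.5),
    TimedWord("it", 1.5, 1.65),
]
print(emphasized_words(utterance))  # [1]
```

The returned indices could then select which words to tag for emphasis before text-to-speech processing. A production system would likely compare against expected per-phoneme durations rather than a simple per-character median.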
Specification