Text-to-speech processing with emphasized output audio

US 10,319,365 B1
Filed: 06/27/2016
Issued: 06/11/2019
Est. Priority Date: 06/27/2016
Status: Active Grant

Abstract

Systems and methods for generating output audio with emphasized portions are described. Spoken audio is obtained and undergoes speech processing (e.g., ASR and optionally NLU) to create text. It may be determined that the resulting text includes a portion that should be emphasized (e.g., an interjection) using at least one of knowledge of an application run on a device that captured the spoken audio, prosodic analysis, and/or linguistic analysis. The portion of text to be emphasized may be tagged (e.g., using a Speech Synthesis Markup Language (SSML) tag). TTS processing is then performed on the tagged text to create output audio including an emphasized portion corresponding to the tagged portion of the text.
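
The tagging step described in the abstract can be illustrated with a minimal sketch. It assumes the words to emphasize have already been identified by index, and it uses the standard SSML `<emphasis>` element. The function name `tag_emphasis` and its parameters are hypothetical and are not taken from the patent.

```python
from xml.sax.saxutils import escape


def tag_emphasis(words, emphasized_indices, level="strong"):
    """Wrap the selected words in SSML <emphasis> elements.

    `words` is a list of recognized words and `emphasized_indices` is a set
    of positions to emphasize. Both names are illustrative assumptions.
    """
    parts = []
    for i, word in enumerate(words):
        safe = escape(word)  # escape &, <, > so the SSML stays well-formed
        if i in emphasized_indices:
            parts.append(f'<emphasis level="{level}">{safe}</emphasis>')
        else:
            parts.append(safe)
    return "<speak>" + " ".join(parts) + "</speak>"


ssml = tag_emphasis(["wow", "that", "is", "great"], {0})
print(ssml)
# <speak><emphasis level="strong">wow</emphasis> that is great</speak>
```

The resulting string could then be handed to any TTS engine that accepts SSML input.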

32 Citations

20 Claims

1. A computer implemented method comprising:
- receiving, from a first speech-controlled device, first input audio data corresponding to a command to receive audio data;
  
  performing automatic speech recognition on the audio data to generate first text;
  
  determining a duration corresponding to how long at least one word is pronounced in the first input audio data;
  
  determining, based on the duration, a first portion of the audio data corresponding to a first word of the first text has a volume greater than a second portion of the audio data corresponding to other words in the first text;
  
  associating a first speech synthesis markup language (SSML) tag with the first word, the SSML tag indicating the first word is to be emphasized;
  
  performing text-to-speech (TTS) processing on the first text, using the first SSML tag, to create output audio data, the output audio data including emphasized speech corresponding to the first word; and
  
  sending, to a second speech-controlled device, the output speech audio data.
  - 2. The computer-implemented method of claim 1, further comprising:
    - determining the first text includes a second word;
      
      accessing a user profile associated with the first speech-controlled device, the user profile including a table of words to be emphasized;
      
      determining the second word is in the table;
      
      associating the second word with a second SSML tag, the second SSML tag indicating the second word is to be emphasized; and
      
      performing further TTS processing using the second SSML tag and the second word to create further output audio data, the further output audio data including emphasized speech corresponding to the second word.
  - 3. The computer-implemented method of claim 1, wherein performing the text-to-speech processes comprises:
    - selecting, from a first database of pre-stored emphasized speech units, a first pre-stored emphasized speech unit corresponding to the first word;
      
      selecting, from a second database of pre-stored non-emphasized speech units, a second pre-stored non-emphasized speech unit corresponding to a third word adjacent to the first word in the first text; and
      
      combining the first pre-stored emphasized speech unit and the second pre-stored non-emphasized speech unit to create a first portion of the output speech audio data, the first portion corresponding to the first and third words.
  - 4. The computer-implemented method of claim 1, wherein the first word has a first non-emphasized portion, a middle emphasized portion, and a second non-emphasized portion, and wherein performing the text-to-speech processing comprises:
    - selecting a first non-emphasized speech unit from a first database corresponding to the first portion;
      
      selecting a first emphasized speech unit from a second database corresponding to the middle portion;
      
      selecting a second non-emphasized speech unit from the first database corresponding to the second portion; and
      
      combining the first non-emphasized, middle emphasized, and second non-emphasized speech units to create a first portion of output speech audio data, the first portion corresponding to the first word.
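
Claim 3 describes selecting units from separate databases of emphasized and non-emphasized speech and combining them. The following is only a rough sketch of that idea, assuming each database is a dictionary from a word to a list of audio samples. The names `synthesize`, `emphasized_db`, and `plain_db` are hypothetical, and the fallback to the non-emphasized unit when no emphasized unit exists is an added assumption rather than something the claim states.

```python
from typing import Dict, List, Set

Unit = List[float]  # a pre-stored speech unit, modeled here as raw samples


def synthesize(words: List[str],
               emphasized: Set[int],
               emphasized_db: Dict[str, Unit],
               plain_db: Dict[str, Unit]) -> Unit:
    """Concatenate one pre-stored unit per word into output audio."""
    output: Unit = []
    for i, word in enumerate(words):
        unit = emphasized_db.get(word) if i in emphasized else None
        if unit is None:
            # Assumption: fall back to the non-emphasized unit.
            unit = plain_db[word]
        output.extend(unit)
    return output


emphasized_db = {"wow": [0.9, 0.9, 0.9]}
plain_db = {"wow": [0.3], "nice": [0.2, 0.2]}

audio = synthesize(["wow", "nice"], {0}, emphasized_db, plain_db)
print(audio)  # [0.9, 0.9, 0.9, 0.2, 0.2]
```

A real unit-selection engine would work with sub-word units and smooth the joins between them. This sketch only shows the per-word choice between the two databases.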

5. A system comprising:
- at least one processor; and
  
  a memory including instructions operable to be executed by the at least one processor to perform a set of actions to configure the at least one processor to:
  
  receive, from a first device, input audio data;
  
  perform automatic speech recognition on the input audio data to create text including at least one word;
  
  determine a duration corresponding to how long the at least one word is pronounced in the input audio data;
  
  determine, based on the duration, that the at least one word is to be emphasized relative to other words in the text; and
  
  perform text-to-speech processing on the text to create output speech audio data, the output speech audio data including emphasized speech corresponding to the at least one word.
  - 6. The system of claim 5, wherein the instructions further configure the at least one processor to:
    - receive second text from an application running on the first device;
      
      access a table of words to be emphasized associated with the application; and
      
      identify the at least one word within the table.
  - 7. The system of claim 5, further comprising:
    - determining a volume associated with the at least one word.
  - 8. The system of claim 5, wherein the instructions further configure the at least one processor to:
    - perform natural language understanding (NLU) on the text to determine NLU results; and
      
      determine at least one word within the NLU results is typically emphasized in communications.
  - 9. The system of claim 5, wherein the instructions further configure the at least one processor to:
    - determine the at least one word in the input audio data is pronounced for a duration of time that exceeds a threshold duration of time; and
      
      determine the at least one word is to be emphasized further based on the duration of time that exceeds the threshold.
  - 10. The system of claim 5, wherein the instructions further configure the at least one processor to:
    - determine an operating application corresponding to the first device;
      
      send, to a server associated with the operating application, the text; and
      
      receive, from the server, a tag indicating a word to be emphasized in text-to-speech output.
  - 11. The system of claim 5, wherein determining the at least one word comprises:
    - determining a punctuation indicator proximate to the at least one word.
  - 12. The system of claim 5, wherein the instructions further configure the at least one processor to:
    - determine the at least one word is associated with emphasis alternatives;
      
      determine an example pronunciation of the at least one word; and
      
      determine an emphasis for the at least one word by comparing acoustic properties of the portion of the input audio data corresponding to the at least one word to the example pronunciation.
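
The duration test in claims 5 and 9 can be sketched as follows, assuming the recognizer returns a start and end time for each word. The `TimedWord` type, the `words_to_emphasize` function, and the 0.6 second threshold are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the utterance
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def words_to_emphasize(words: List[TimedWord],
                       threshold_s: float = 0.6) -> List[int]:
    """Return indices of words pronounced for longer than the threshold."""
    return [i for i, w in enumerate(words) if w.duration > threshold_s]


utterance = [
    TimedWord("that", 0.0, 0.2),
    TimedWord("is", 0.2, 0.35),
    TimedWord("sooo", 0.35, 1.25),  # drawn-out word, about 0.9 s
    TimedWord("good", 1.25, 1.6),
]
print(words_to_emphasize(utterance))  # [2]
```

A fixed threshold is the simplest choice. Normalizing by word length or by the speaker's average rate would be a natural refinement, but it goes beyond what the claims state.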

13. A computer-implemented method comprising:
- receiving, from a first device, input audio data;
  
  performing automatic speech recognition on the input audio data to create text including at least one word;
  
  determining a duration corresponding to how long the at least one word is pronounced in the input audio data;
  
  determining, based on the duration, that the at least one word is to be emphasized relative to other words in the text; and
  
  performing text-to-speech processing on the text to create output speech audio data, the output speech audio data including emphasized speech corresponding to the at least one word.
  - 14. The computer-implemented method of claim 13, further comprising:
    - receiving second text from an application running on the first device;
      
      accessing a table of words to be emphasized associated with the application; and
      
      identifying the at least one word within the table.
  - 15. The computer-implemented method of claim 13, further comprising:
    - determining a volume associated with the at least one word.
  - 16. The computer-implemented method of claim 13, further comprising:
    - performing natural language understanding (NLU) on the text to determine NLU results; and
      
      determining at least one word within the NLU results is typically emphasized in communications.
  - 17. The computer-implemented method of claim 13, wherein the method further comprises:
    - determining the at least one word in the input audio data is pronounced for a duration of time that exceeds a threshold duration of time; and
      
      determining the at least one word is to be emphasized based on the duration of time that exceeds the threshold.
  - 18. The computer-implemented method of claim 13, further comprising:
    - determining an operating application corresponding to the first device;
      
      sending, to a server associated with the operating application, the text; and
      
      receiving, from the server, a tag indicating a word to be emphasized in text-to-speech output.
  - 19. The computer-implemented method of claim 13, wherein determining the at least one word comprises:
    - determining a punctuation indicator proximate to the at least one word.
  - 20. The computer-implemented method of claim 13, further comprising:
    - determining the at least one word is associated with emphasis alternatives;
      
      determining an example pronunciation of the at least one word; and
      
      determining an emphasis for the at least one word by comparing acoustic properties of a portion of the input audio data corresponding to the at least one word to the example pronunciation.
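
Claims 7 and 15 mention determining a volume for a word, and claim 1 compares the volume of one word against the others. One possible way to make that comparison, shown here purely as an assumption, is to compute the root-mean-square (RMS) level of each word's samples and flag words that are louder than the average of the remaining words by some ratio. The names `rms` and `louder_words` and the ratio of 1.5 are hypothetical.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of one word's audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def louder_words(segments: List[Sequence[float]],
                 ratio: float = 1.5) -> List[int]:
    """Return indices of words whose RMS exceeds `ratio` times the mean
    RMS of all other words in the utterance."""
    levels = [rms(seg) for seg in segments]
    flagged = []
    for i, level in enumerate(levels):
        others = levels[:i] + levels[i + 1:]
        if others and level > ratio * (sum(others) / len(others)):
            flagged.append(i)
    return flagged


segments = [
    [0.1, -0.1, 0.1],   # quiet word
    [0.5, -0.5, 0.5],   # loud word
    [0.1, 0.1, -0.1],   # quiet word
]
print(louder_words(segments))  # [1]
```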

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Nicolis, Marco; Nadolski, Adam Franciszek
Primary Examiner(s): Shah, Bharatkumar S
Application Number: US15/193,437
Time in Patent Office: 1,079 Days
Field of Search: 704235
US Class Current
CPC Class Codes:
G10L 13/06   Elementary speech units use...
G10L 13/10   Prosody rules derived from ...
G10L 15/26   Speech to text systems G10L...
G10L 2013/105   Duration
