Dynamic text-to-speech output

US 10,276,149 B1
Filed: 12/21/2016
Issued: 04/30/2019
Est. Priority Date: 12/21/2016
Status: Active Grant



Abstract

Systems, methods, and devices for dynamically outputting TTS content are disclosed. A speech-controlled device captures a spoken command, and sends audio data corresponding thereto to a server(s). The server(s) determines output content responsive to the spoken command. The server(s) may also determine a user that spoke the command and determine an average speech characteristic (e.g., tone, pitch, speed, number of words, etc.) used by the user when speaking commands. The server(s) may also determine a speech characteristic of the presently spoken command, as well as determine a difference between the speech characteristic of the presently spoken command and the average speech characteristic of the user. The server(s) may then cause the speech-controlled device to output audio based on the difference.
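
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the speech characteristic is speaking rate in words per second, that the profile stores one rate per past command, and that the output speed is nudged in proportion to the relative difference. The names `UserProfile`, `playback_multiplier`, and the `gain` parameter are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class UserProfile:
    # Hypothetical history: speaking rate (words per second) of each past command.
    past_rates: List[float]

    def average_rate(self) -> float:
        """Average speech characteristic for this user."""
        return mean(self.past_rates)


def playback_multiplier(profile: UserProfile, current_rate: float, gain: float = 0.5) -> float:
    """Map the difference between the current and average rate to an output speed.

    1.0 means the default speed. The linear mapping and the gain value are
    assumptions made for illustration only.
    """
    average = profile.average_rate()
    difference = current_rate - average
    return 1.0 + gain * difference / average
```

With an average of 3.0 words per second, a command spoken at 4.5 words per second yields a multiplier above 1.0, so the response would be spoken faster than the default.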

56 Citations


20 Claims

1. A computer-implemented method, comprising:
- receiving, from a first device, first audio data representing a first utterance;
  
  determining a first number of words represented in the first audio data;
  
  determining user profile data associated with the first utterance;
  
  determining, in the user profile data, an average number of words of a user command;
  
  determining the first number of words deviates from the average number of words;
  
  determining template text data based at least in part on the first number of words deviating from the average number of words;
  
  receiving, from a second device, first text data responsive to the first utterance;
  
  populating the template text data with the first text data to generate second text data;
  
  performing text-to-speech (TTS) processing on the second text data to generate second audio data, the second audio data corresponding to a non-default characteristic of speech based at least in part on the first number of words deviating from the average number of words; and
  
  causing the first device to emit audio corresponding to the second audio data.
  - 2. The computer-implemented method of claim 1, further comprising:
    - receiving, from the first device, third audio data representing a second utterance;
      
      determining third text data responsive to the second utterance;
      
      determining an electronic calendar associated with the user profile data;
      
      determining the electronic calendar represents a first user will be busy within a threshold amount of time of receipt of the third audio data; and
      
      performing TTS processing on the third text data to generate third audio data, the third audio data having a first rate of speech based at least in part on the electronic calendar representing the first user will be busy within the threshold amount of time, the first rate of speech being greater than a default rate of speech.
  - 3. The computer-implemented method of claim 1, further comprising:
    - receiving third audio data representing a second utterance;
      
      determining the user profile data is associated with the second utterance;
      
      based at least in part on the user profile data being associated with the first utterance and the second utterance, determining an amount of time between receipt of the first audio data and receipt of the third audio data;
      
      determining third text data responsive to the second utterance; and
      
      performing, based at least in part on the amount of time, TTS processing on the third text data.
  - 4. The computer-implemented method of claim 1, further comprising:
    - receiving third audio data representing a second utterance;
      
      performing automatic speech recognition (ASR) processing on the third audio data to generate third text data;
      
      performing natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determining the NLU results data represents a first speed that content is to be output at;
      
      determining first content responsive to the second utterance; and
      
      causing the first content to be output at the first speed.
  - 5. The computer-implemented method of claim 1, further comprising:
    - receiving third audio data representing a second utterance;
      
      performing automatic speech recognition (ASR) processing on the third audio data to generate third text data;
      
      performing natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determining the NLU results data represents at least a first portion to be omitted from content to be output;
      
      determining first content responsive to the second utterance; and
      
      causing the first content to be output without the at least a first portion.
  - 6. The computer-implemented method of claim 1, further comprising:
    - receiving third audio data representing a second utterance;
      
      determining third text data responsive to the second utterance;
      
      determining the user profile data is associated with the third audio data;
      
      determining a geographic location represented in the user profile data; and
      
      performing, based at least in part on the geographic location, TTS processing on the third text data.
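
The word-count steps of claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it counts words in an ASR transcript rather than in the audio itself, treats a command as deviating when it differs from the average by more than a fractional `tolerance`, and uses made-up templates and rate multipliers. `TEMPLATES`, `SpeechPlan`, and `plan_response` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SpeechPlan:
    text: str    # template populated with the responsive text
    rate: float  # TTS rate multiplier; 1.0 stands for the default characteristic


# Hypothetical templates; the claim does not specify their wording.
TEMPLATES = {
    "terse": "{answer}",
    "default": "Here is what I found: {answer}",
    "verbose": "Sure. Here is what I found: {answer}",
}


def count_words(transcript: str) -> int:
    """Count whitespace-separated words in a transcript of the utterance."""
    return len(transcript.split())


def plan_response(transcript: str, average_words: float, answer: str,
                  tolerance: float = 0.5) -> SpeechPlan:
    """Pick a template and a speech rate from the word-count deviation."""
    n = count_words(transcript)
    deviates = abs(n - average_words) > tolerance * average_words
    if deviates and n < average_words:
        # Much shorter than usual: assume a brief, faster reply is wanted.
        return SpeechPlan(TEMPLATES["terse"].format(answer=answer), rate=1.25)
    if deviates:
        # Much longer than usual: assume a fuller, slower reply is acceptable.
        return SpeechPlan(TEMPLATES["verbose"].format(answer=answer), rate=0.9)
    return SpeechPlan(TEMPLATES["default"].format(answer=answer), rate=1.0)
```

The returned `SpeechPlan` would then be handed to a TTS engine, which is outside the scope of this sketch.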

7. A system, comprising:
- at least one processor; and
  
  at least one memory including instructions that, when executed by the at least one processor, cause the system to:
  
  receive first audio data representing a first utterance, the first audio data being associated with user profile data;
  
  determine first text data responsive to the first utterance;
  
  perform text-to-speech (TTS) processing on the first text data to generate second audio data;
  
  cause a first device to output audio corresponding to the second audio data;
  
  receive third audio data representing a second utterance, the third audio data being associated with the user profile data;
  
  based at least in part on the user profile data being associated with the first audio data and the third audio data, determine an amount of time between receipt of the first audio data and receipt of the third audio data;
  
  determine, based at least in part on the amount of time, second text data responsive to the second utterance; and
  
  perform TTS processing on the second text data.
  - 8. The system of claim 7, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive fourth audio data representing a third utterance, the fourth audio data being associated with the user profile data;
      
      determine third text data responsive to the third utterance;
      
      determine an electronic calendar associated with the user profile data;
      
      determine the electronic calendar indicates a first user will be busy within a threshold amount of time from when the fourth audio data was received; and
      
      perform, based at least in part on the electronic calendar indicating the first user will be busy, TTS processing on the third text data.
  - 9. The system of claim 8, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - determine the electronic calendar includes an entry with a starting time prior to a first time at which an entirety of second audio, corresponding to the third text data, would be output using default characteristics.
  - 10. The system of claim 7, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive fourth audio data representing a third utterance;
      
      perform automatic speech recognition (ASR) processing on the fourth audio data to generate third text data;
      
      perform natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determine the NLU results data represents a first speed that content is to be output at;
      
      determine first content responsive to the third utterance; and
      
      cause the first content to be output at the first speed.
  - 11. The system of claim 7, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive fourth audio data representing a third utterance;
      
      perform automatic speech recognition (ASR) processing on the fourth audio data to generate third text data;
      
      perform natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determine the NLU results data represents at least a first portion to be omitted from content to be output;
      
      determine first content responsive to the third utterance; and
      
      cause the first content to be output without the at least a first portion.
  - 12. The system of claim 7, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive fourth audio data representing a third utterance, the fourth audio data being associated with the user profile data;
      
      determine third text data responsive to the third utterance;
      
      determine a geographic location represented in the user profile data; and
      
      perform, based at least in part on the geographic location, TTS processing on the third text data.
  - 13. The system of claim 7, wherein the at least one memory further includes instructions that, when executed by the at least one processor, further cause the system to:
    - receive fourth audio data representing a third utterance;
      
      determine third text data responsive to the third utterance;
      
      perform TTS processing on the third text data to generate fifth audio data;
      
      cause the first device to output second audio corresponding to the fifth audio data;
      
      after the first device has begun outputting the second audio, but prior to the first device completing outputting the second audio, receive sixth audio data representing a fourth utterance;
      
      determine the fourth utterance corresponds to a request to change an output speed of the second audio; and
      
      cause the output speed to be changed.
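
The timing step of claim 7 can be sketched as below: record when audio data is received for each user profile, measure the gap to the next utterance from the same profile, and let that gap influence the response text. The policy that a quick follow-up gets a shorter response, the ten-second threshold, and the names `UtteranceLog` and `choose_text` are assumptions for illustration; the claim only requires that the text be determined based at least in part on the amount of time.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


class UtteranceLog:
    """Tracks, per user profile, when audio data was last received."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, datetime] = {}

    def gap_since_last(self, profile_id: str, received_at: datetime) -> Optional[timedelta]:
        """Return the time since this profile's previous utterance, or None if there is none."""
        previous = self._last_seen.get(profile_id)
        self._last_seen[profile_id] = received_at
        return None if previous is None else received_at - previous


def choose_text(full_text: str, short_text: str, gap: Optional[timedelta],
                quick: timedelta = timedelta(seconds=10)) -> str:
    """Assumed policy: a follow-up arriving within `quick` gets the shorter text."""
    if gap is not None and gap < quick:
        return short_text
    return full_text
```

Keying the log on a profile identifier mirrors the claim's condition that both utterances be associated with the same user profile data.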

14. A computer-implemented method, comprising:
- receiving first audio data representing a first utterance, the first audio data being associated with user profile data;
  
  determining first text data responsive to the first utterance;
  
  performing text-to-speech (TTS) processing on the first text data to generate second audio data;
  
  causing a first device to output audio corresponding to the second audio data;
  
  receiving third audio data representing a second utterance, the third audio data being associated with the user profile data;
  
  based at least in part on the user profile data being associated with the first audio data and the third audio data, determining an amount of time between receipt of the first audio data and receipt of the third audio data;
  
  determining second text data responsive to the second utterance; and
  
  performing, based at least in part on the amount of time, TTS processing on the second text data.
  - 15. The computer-implemented method of claim 14, further comprising:
    - receiving fourth audio data representing a third utterance;
      
      determining the third utterance corresponds to a request to change an output speed of second audio corresponding to the second text data; and
      
      causing the output speed to be changed.
  - 16. The computer-implemented method of claim 14, further comprising:
    - receiving fourth audio data representing a third utterance;
      
      performing automatic speech recognition (ASR) processing on the fourth audio data to generate third text data;
      
      performing natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determining the NLU results data represents a first speed that content is to be output at;
      
      determining first content responsive to the third utterance; and
      
      causing the first content to be output at the first speed.
  - 17. The computer-implemented method of claim 14, further comprising:
    - receiving fourth audio data representing a third utterance;
      
      performing automatic speech recognition (ASR) processing on the fourth audio data to generate third text data;
      
      performing natural language understanding (NLU) processing on the third text data to generate NLU results data;
      
      determining the NLU results data represents at least a first portion to be omitted from content to be output;
      
      determining first content responsive to the third utterance; and
      
      causing the first content to be output without the at least a first portion.
  - 18. The computer-implemented method of claim 14, further comprising:
    - receiving fourth audio data representing a third utterance, the fourth audio data being associated with the user profile data;
      
      determining third text data responsive to the third utterance;
      
      determining a geographic location represented in the user profile data; and
      
      performing, based at least in part on the geographic location, TTS processing on the third text data.
  - 19. The computer-implemented method of claim 14, further comprising:
    - receiving fourth audio data representing a third utterance, the fourth audio data being associated with the user profile data;
      
      determining third text data responsive to the third utterance;
      
      determining an electronic calendar associated with the user profile data;
      
      determining the electronic calendar indicates a first user will be busy within a threshold amount of time from when the fourth audio data was received; and
      
      performing, based at least in part on the electronic calendar indicating the first user will be busy, TTS processing on the third text data.
  - 20. The computer-implemented method of claim 19, further comprising:
    - determining the electronic calendar includes an entry with a starting time prior to a first time at which an entirety of second audio, corresponding to the third text data, would be output using default characteristics.
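
The calendar check in claims 19 and 20 can be sketched as follows: estimate when the response would finish at the default rate, look for a calendar entry that starts before that time, and if one exists, speed up the output. The default rate of 2.5 words per second, the cap on the speed-up, the restriction to entries that start after the current time, and the function names are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable

DEFAULT_WORDS_PER_SECOND = 2.5  # assumed default TTS speaking rate


def default_finish_time(text: str, start: datetime) -> datetime:
    """Time at which the whole text would have been spoken at the default rate."""
    seconds = len(text.split()) / DEFAULT_WORDS_PER_SECOND
    return start + timedelta(seconds=seconds)


def rate_multiplier(text: str, now: datetime, entry_starts: Iterable[datetime],
                    max_multiplier: float = 2.0) -> float:
    """Return 1.0, or a faster rate if a calendar entry starts before the default finish time."""
    finish = default_finish_time(text, now)
    # Only entries that begin after `now` and before the default finish time matter here.
    upcoming = [t for t in entry_starts if now < t < finish]
    if not upcoming:
        return 1.0
    available = (min(upcoming) - now).total_seconds()
    needed = (finish - now).total_seconds() / available
    return min(needed, max_multiplier)
```

For a 50-word response that would take 20 seconds at the default rate, an entry starting 16 seconds from now gives a multiplier of 1.25, which is greater than the default rate, as claim 2 describes for a busy user.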

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Liang, Nancy Yi; Barnet, Aaron Takayanagi
Primary Examiner(s): Godbold, Douglas
Application Number: US15/386,333
Time in Patent Office: 860 Days
CPC Class Codes

G10L 13/033   Voice editing, e.g. manipul...

G10L 13/06   Elementary speech units use...

G10L 15/26   Speech to text systems G10L...
