Dynamic text-to-speech output
First Claim
1. A computer-implemented method, comprising:
receiving, from a first device, first audio data representing a first utterance;
determining a first number of words represented in the first audio data;
determining user profile data associated with the first utterance;
determining, in the user profile data, an average number of words of a user command;
determining the first number of words deviates from the average number of words;
determining template text data based at least in part on the first number of words deviating from the average number of words;
receiving, from a second device, first text data responsive to the first utterance;
populating the template text data with the first text data to generate second text data;
performing text-to-speech (TTS) processing on the second text data to generate second audio data, the second audio data corresponding to a non-default characteristic of speech based at least in part on the first number of words deviating from the average number of words; and
causing the first device to emit audio corresponding to the second audio data.
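The steps recited above can be sketched in a few lines of Python. This is only an illustrative reading of the claim, not the patented implementation: the names `UserProfile`, `TEMPLATES`, `select_template`, and `build_response`, the tolerance of two words, and the speaking-rate values are all assumptions. The sketch also assumes the utterance has already been transcribed, so words are counted in a transcript rather than in the audio data itself.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    # Hypothetical profile field: average number of words in this user's commands.
    average_command_words: float


# Hypothetical template text data, keyed by how the command length deviates.
TEMPLATES = {
    "terse": "{content}",
    "default": "Here is what I found: {content}",
    "verbose": "Sure, happy to help. Here is what I found: {content}",
}

# Hypothetical speaking-rate multipliers handed to a TTS engine (1.0 = default).
SPEAKING_RATE = {"terse": 1.15, "default": 1.0, "verbose": 0.95}


def count_words(transcript: str) -> int:
    """Count whitespace-separated words in the transcribed utterance."""
    return len(transcript.split())


def select_template(word_count: int, profile: UserProfile, tolerance: float = 2.0) -> str:
    """Pick a template key based on deviation from the user's average."""
    deviation = word_count - profile.average_command_words
    if deviation <= -tolerance:
        return "terse"      # user was much shorter than usual
    if deviation >= tolerance:
        return "verbose"    # user was much longer than usual
    return "default"


def build_response(transcript: str, profile: UserProfile, content: str):
    """Populate the chosen template and return (text, speaking_rate)."""
    key = select_template(count_words(transcript), profile)
    text = TEMPLATES[key].format(content=content)
    return text, SPEAKING_RATE[key]
```

For a user whose commands average six words, a one-word command such as "weather" selects the terse template and a faster rate, while a six-word command falls back to the default template and rate.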
Abstract
Systems, methods, and devices for dynamically outputting TTS content are disclosed. A speech-controlled device captures a spoken command, and sends audio data corresponding thereto to a server(s). The server(s) determines output content responsive to the spoken command. The server(s) may also determine a user that spoke the command and determine an average speech characteristic (e.g., tone, pitch, speed, number of words, etc.) used by the user when speaking commands. The server(s) may also determine a speech characteristic of the presently spoken command, as well as determine a difference between the speech characteristic of the presently spoken command and the average speech characteristic of the user. The server(s) may then cause the speech-controlled device to output audio based on the difference.
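One way to maintain the per-user average described in the abstract is an incremental mean that is updated after each command. The sketch below is an assumption about how such a value could be stored; the abstract does not say how the average is computed, and the class name `RunningAverage` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningAverage:
    """Incremental mean of one speech characteristic (e.g., words per command)."""
    count: int = 0
    mean: float = 0.0

    def difference(self, value: float) -> float:
        """Signed difference between the current command and the stored average."""
        return value - self.mean

    def update(self, value: float) -> None:
        """Fold a new observation into the average without storing history."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
```

In this sketch the difference would be computed before calling `update`, so the current command is compared against the user's history rather than against an average that already includes it.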
20 Claims
1. A computer-implemented method, comprising:
receiving, from a first device, first audio data representing a first utterance;
determining a first number of words represented in the first audio data;
determining user profile data associated with the first utterance;
determining, in the user profile data, an average number of words of a user command;
determining the first number of words deviates from the average number of words;
determining template text data based at least in part on the first number of words deviating from the average number of words;
receiving, from a second device, first text data responsive to the first utterance;
populating the template text data with the first text data to generate second text data;
performing text-to-speech (TTS) processing on the second text data to generate second audio data, the second audio data corresponding to a non-default characteristic of speech based at least in part on the first number of words deviating from the average number of words; and
causing the first device to emit audio corresponding to the second audio data.
Dependent claims: 2, 3, 4, 5, 6
7. A system, comprising:
at least one processor; and
at least one memory including instructions that, when executed by the at least one processor, cause the system to:
receive first audio data representing a first utterance, the first audio data being associated with user profile data;
determine first text data responsive to the first utterance;
perform text-to-speech (TTS) processing on the first text data to generate second audio data;
cause a first device to output audio corresponding to the second audio data;
receive third audio data representing a second utterance, the third audio data being associated with the user profile data;
based at least in part on the user profile data being associated with the first audio data and the third audio data, determine an amount of time between receipt of the first audio data and receipt of the third audio data;
determine, based at least in part on the amount of time, second text data responsive to the second utterance; and
perform TTS processing on the second text data.
Dependent claims: 8, 9, 10, 11, 12, 13
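The timing logic in claim 7 can be illustrated as follows. This is a minimal sketch under stated assumptions: the `Utterance` record, the function names, and the ten-second threshold are hypothetical, and the claim does not specify how the elapsed time changes the response text. Here a quick follow-up from the same user profile simply gets a shorter reply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    profile_id: str      # hypothetical identifier for the associated user profile
    received_at: float   # receipt time in seconds (e.g., a Unix timestamp)


def gap_between(first: Utterance, second: Utterance) -> Optional[float]:
    """Elapsed time between receipts, only when both share a user profile."""
    if first.profile_id != second.profile_id:
        return None
    return second.received_at - first.received_at


def choose_response_text(content: str, gap_seconds: Optional[float],
                         rapid_threshold: float = 10.0) -> str:
    """Return bare content for rapid follow-ups, a fuller sentence otherwise."""
    if gap_seconds is not None and gap_seconds < rapid_threshold:
        return content
    return f"Here is what I found: {content}"
```

Returning `None` when the profiles differ keeps the time-based behavior limited to utterances tied to the same user profile data, which matches the condition recited in the claim.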
14. A computer-implemented method, comprising:
receiving first audio data representing a first utterance, the first audio data being associated with user profile data;
determining first text data responsive to the first utterance;
performing text-to-speech (TTS) processing on the first text data to generate second audio data;
causing a first device to output audio corresponding to the second audio data;
receiving third audio data representing a second utterance, the third audio data being associated with the user profile data;
based at least in part on the user profile data being associated with the first audio data and the third audio data, determining an amount of time between receipt of the first audio data and receipt of the third audio data;
determining second text data responsive to the second utterance; and
performing, based at least in part on the amount of time, TTS processing on the second text data.
Dependent claims: 15, 16, 17, 18, 19, 20
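In claim 14 the elapsed time influences the TTS processing itself rather than the response text. One possible way to express that, assuming the TTS engine accepts SSML, is to wrap the text in a `prosody` element whose `rate` depends on the gap. The function names and the ten-second threshold below are hypothetical, and the claim does not name SSML or any particular speech characteristic.

```python
from xml.sax.saxutils import escape


def prosody_rate_for_gap(gap_seconds: float, rapid_threshold: float = 10.0) -> str:
    """Map the time between utterances to an SSML prosody rate label."""
    return "fast" if gap_seconds < rapid_threshold else "medium"


def to_ssml(text: str, gap_seconds: float) -> str:
    """Wrap response text in SSML so the TTS engine adjusts its speaking rate."""
    rate = prosody_rate_for_gap(gap_seconds)
    # escape() protects &, <, and > so the text cannot break the markup.
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody></speak>'
```

The response text is unchanged in this sketch; only the markup passed to the TTS stage varies with the amount of time.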
Specification