Text-to-speech processing using previously speech processed data
First Claim
1. A computer-implemented method, comprising:
receiving first input audio data corresponding to an utterance;
performing automatic speech recognition processing on the first input audio data to create first reference text data;
in a database associated with a first user profile, storing a first association between the first input audio data and the first reference text data;
receiving, from a first device associated with the first user profile, a message intended for a second device, the message including first text data;
determining the first text data corresponds to the first reference text data;
identifying the first input audio data in the database based at least in part on the first association;
causing the first device to output a visual indication representing that the first text data corresponds to the first reference text data;
generating, after causing the first device to output the visual indication, output audio data including the first input audio data; and
sending, to the second device, the output audio data.
Abstract
Systems, methods, and devices for generating text-to-speech output using previously captured speech are described. Spoken audio is obtained and undergoes speech processing to create text. The resulting text is stored with the spoken audio, with both the text and the spoken audio being associated with the individual that spoke the audio. Various spoken audio and corresponding text are stored over time to create a library of speech units. When the individual sends a text message to a recipient, the text message is processed to determine portions of text, and the portions of text are compared to the library of text associated with the individual. When text in the library is identified, the system selects the spoken audio units associated with the identified stored text. The selected spoken audio units are then used to generate output audio data corresponding to the original text message, with the output audio data being sent to a device of the message recipient.
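The lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `SpeechLibrary` type, the function names, the greedy longest-phrase matching, and the use of raw byte concatenation to stand in for audio joining are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-user library: recognized text -> recorded audio bytes.
SpeechLibrary = Dict[str, bytes]


def add_utterance(library: SpeechLibrary, recognized_text: str, audio: bytes) -> None:
    """Store spoken audio under the normalized text produced by speech recognition."""
    library[recognized_text.lower().strip()] = audio


def match_units(library: SpeechLibrary, message: str) -> List[Tuple[str, Optional[bytes]]]:
    """Split a message into portions, preferring the longest stored phrase at each position.

    Portions with no stored audio are returned as single words paired with None.
    """
    words = message.lower().split()
    units: List[Tuple[str, Optional[bytes]]] = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in library:
                units.append((phrase, library[phrase]))
                i = j
                break
        else:
            # No stored phrase starts at this word.
            units.append((words[i], None))
            i += 1
    return units


def build_output_audio(units: List[Tuple[str, Optional[bytes]]]) -> bytes:
    """Concatenate the audio of every matched unit (assumed stand-in for real audio joining)."""
    return b"".join(audio for _, audio in units if audio is not None)
```

In this sketch, a message such as "Good morning see you soon" against a library holding "good morning" and "see you" yields two matched units and one unmatched word; how unmatched text is handled is left open here.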
57 Citations
18 Claims
1. A computer-implemented method, comprising:
receiving first input audio data corresponding to an utterance;
performing automatic speech recognition processing on the first input audio data to create first reference text data;
in a database associated with a first user profile, storing a first association between the first input audio data and the first reference text data;
receiving, from a first device associated with the first user profile, a message intended for a second device, the message including first text data;
determining the first text data corresponds to the first reference text data;
identifying the first input audio data in the database based at least in part on the first association;
causing the first device to output a visual indication representing that the first text data corresponds to the first reference text data;
generating, after causing the first device to output the visual indication, output audio data including the first input audio data; and
sending, to the second device, the output audio data.
Dependent claims: 2, 3
4. A system, comprising:
at least one processor; and
at least one memory including instructions that, when executed by the at least one processor, cause the system to:
receive first input audio data corresponding to at least one first utterance associated with user profile data;
perform automatic speech recognition processing on the first input audio data to create first text data;
associate the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data, the at least a portion of the first input audio data having a first prosodic characteristic;
receive second text data;
determine the second text data is associated with the user profile data;
determine the first text data corresponds to at least a portion of the second text data;
determine the at least a portion of the second text data is associated with third text data, the third text data being associated with second input audio data having a second prosodic characteristic;
perform prosodic analysis processing on the second text data to determine a third prosodic characteristic;
determine the third prosodic characteristic at least substantially matches the first prosodic characteristic;
generate, after determining the third prosodic characteristic at least substantially matches the first prosodic characteristic, output audio data using the at least a portion of the first input audio data; and
send the output audio data to a first device.
Dependent claims: 5, 6, 7, 8, 9, 10
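The prosody-based choice between two stored recordings of the same text in claim 4 might be pictured as below. This is only a sketch under stated assumptions: the claim does not say how a prosodic characteristic is represented or what "substantially matches" means numerically, so the `StoredUnit` fields (mean pitch, speaking rate), the 15% tolerance, and all function names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class StoredUnit(NamedTuple):
    """Hypothetical library entry: recognized text, its audio, and two assumed prosodic features."""
    text: str
    audio: bytes
    pitch_hz: float   # assumed feature: mean pitch
    rate_wps: float   # assumed feature: speaking rate in words per second


def substantially_matches(a: float, b: float, tolerance: float = 0.15) -> bool:
    """Treat two values as matching when they differ by at most `tolerance` of the larger one."""
    return abs(a - b) <= tolerance * max(abs(a), abs(b))


def select_unit(
    units: List[StoredUnit],
    text: str,
    target_pitch_hz: float,
    target_rate_wps: float,
) -> Optional[StoredUnit]:
    """Return the first stored unit for `text` whose prosody substantially matches the target."""
    for unit in units:
        if unit.text != text.lower():
            continue
        if (substantially_matches(unit.pitch_hz, target_pitch_hz)
                and substantially_matches(unit.rate_wps, target_rate_wps)):
            return unit
    return None
```

Here the target values play the role of the prosodic characteristic derived from the incoming text; a real system would presumably use a richer feature set than two numbers.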
11. A computer-implemented method, comprising:
receiving first input audio data corresponding to at least one first utterance associated with user profile data;
performing automatic speech recognition processing on the first input audio data to create first text data;
associating the first text data with the user profile data, the first text data being associated with at least a portion of the first input audio data;
receiving second text data;
determining the second text data is associated with the user profile data;
determining the first text data corresponds to a first portion of the second text data;
causing a first device to request further input audio data corresponding to at least one second utterance corresponding to a second portion of the second text data;
receiving, from the first device and after causing the first device to request further input audio data, second input audio data;
associating the second portion of the second text data with the second input audio data and the user profile data;
generating output audio data using the at least a portion of the first input audio data and the second input audio data; and
sending the output audio data to a second device.
Dependent claims: 12, 13, 14, 15, 16, 17
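The request-and-fill flow of claim 11 can be sketched along these lines. The word-level granularity, the dictionary-based library, and the `request_recording` callback (standing in for the first device prompting the user and returning audio) are all assumptions made for illustration, not details taken from the claim.

```python
from typing import Callable, Dict, List


def missing_portions(library: Dict[str, bytes], message: str) -> List[str]:
    """List each word of the message that has no stored audio, once, in order of appearance."""
    seen = set()
    missing: List[str] = []
    for word in message.lower().split():
        if word not in library and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


def synthesize_with_prompts(
    library: Dict[str, bytes],
    message: str,
    request_recording: Callable[[str], bytes],
) -> bytes:
    """Ask for a recording of every missing word, store it, then join audio for the whole message."""
    for word in missing_portions(library, message):
        # The new recording is associated with the text so later messages can reuse it.
        library[word] = request_recording(word)
    return b"".join(library[word] for word in message.lower().split())
```

A caller would pass a function that prompts the user on their device and returns the captured audio; in a test, a stub that fabricates bytes is enough to exercise the flow.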
18. A system, comprising:
at least one processor; and
at least one memory including instructions that, when executed by the at least one processor, cause the system to:
receive input audio data corresponding to at least one utterance associated with user profile data;
perform automatic speech recognition processing on the input audio data to create first text data;
associate the first text data with the user profile data and at least a portion of the input audio data;
receive second text data representing at least a first word;
determine the second text data is associated with the user profile data;
determine the first word does not correspond to the first text data;
perform natural language understanding processing on the second text data;
determine a second word having a similar meaning as the first word;
determine the second word corresponds to the first text data;
send, to a first device, first data representing the second word;
receive, from the first device, an indication representing the second word is to be used;
generate output audio data using the second word; and
send the output audio data to a second device.
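The synonym fallback in claim 18 could be approximated as follows. This is a simplified sketch: a hand-written `synonyms` table stands in for natural language understanding processing, and the `confirm` callback stands in for sending the suggestion to the first device and receiving the user's indication. Both, along with the function names, are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def suggest_substitute(
    word: str,
    library: Dict[str, bytes],
    synonyms: Dict[str, List[str]],
) -> Optional[str]:
    """Return a similar-meaning word that has stored audio, or None if the word needs no
    substitute or none is available."""
    key = word.lower()
    if key in library:
        return None
    for candidate in synonyms.get(key, []):
        if candidate in library:
            return candidate
    return None


def audio_for_word(
    word: str,
    library: Dict[str, bytes],
    synonyms: Dict[str, List[str]],
    confirm: Callable[[str, str], bool],
) -> Optional[bytes]:
    """Use stored audio for the word, or for a confirmed substitute; otherwise return None."""
    key = word.lower()
    if key in library:
        return library[key]
    candidate = suggest_substitute(key, library, synonyms)
    # The substitute is used only after the sender approves it.
    if candidate is not None and confirm(key, candidate):
        return library[candidate]
    return None
```

For example, with audio stored only for "happy" and a table listing "happy" as a synonym of "glad", a message containing "glad" would produce a suggestion to use "happy", and audio is generated only if the sender accepts it.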
Specification