Dynamic text-to-speech provisioning
First Claim
1. A computer-implemented method comprising:
receiving, using one or more microphones, an audio signal from a user associated with a user device;
in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
determining, by one or more processors and based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device after determining that the application is configured to provide a text-to-speech response;
obtaining, by the one or more processors, data to be audibly output using a computer-synthesized voice;
selecting, by the one or more processors, a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
generating, by the one or more processors, an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
providing, by the one or more processors, the generated audio signal for output by one or more speakers.
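The ordering of the steps in this claim (check that the application supports text-to-speech, then determine proximity, then select tone and volume) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the claim does not say how the proximity indicator is computed from the audio signal, so the RMS-based proxy, the tone labels, the base volumes, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class SpeechSettings:
    tone: str      # tone of the synthesized voice
    volume: float  # 0.0 (silent) to 1.0 (maximum)


def rms(samples):
    """Root-mean-square level of a sequence of samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def proximity_indicator(samples, near_rms=0.5):
    """Hypothetical distance proxy: quieter input is treated as farther away.

    Returns a value in [0.0, 1.0], where 0.0 means "near" and 1.0 means "far".
    """
    level = min(rms(samples), near_rms)
    return 1.0 - level / near_rms


def select_settings(user_tone, proximity):
    """Match the user's tone, then raise the volume as distance grows."""
    # Assumed base volumes per tone label; not taken from the claim.
    base = {"whisper": 0.2, "normal": 0.5, "loud": 0.7}.get(user_tone, 0.5)
    volume = min(1.0, base + 0.3 * proximity)
    return SpeechSettings(tone=user_tone, volume=volume)


def respond(samples, user_tone, app_supports_tts):
    """Follow the claimed order: TTS check first, then proximity, then selection."""
    if not app_supports_tts:
        return None  # no text-to-speech response is produced
    return select_settings(user_tone, proximity_indicator(samples))
```

In this sketch the volume depends on both inputs named in the claim: the user's likely tone sets a base level and the proximity indicator adjusts it. A real system would then pass the selected settings and the data to be output to a speech synthesizer.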
Abstract
A dynamic text-to-speech (TTS) process and system are described. In response to receiving a command to provide information to a user, a device retrieves information and determines user and environment attributes including: (i) a distance between the device and the user when the user uttered the query; and (ii) voice features of the user. Based on the user and environment attributes, the device determines a likely mood of the user and a likely environment in which the user and user device are located. An audio output template matching the likely mood and voice features of the user is selected. The audio output template is also compatible with the environment in which the user and device are located. The retrieved information is converted into an audio signal using the selected audio output template and output by the device.
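The template selection described in the abstract can be sketched as a filter followed by a score: environment compatibility is treated as a hard requirement, and mood and voice features are matched as closely as possible. This is only one plausible reading. The `AudioTemplate` fields, the catalogue entries, and the scoring rule are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioTemplate:
    name: str
    mood: str                # mood the template is meant to match
    tone: str                # tone of the synthesized voice
    environments: frozenset  # environments the template is compatible with


# Hypothetical template catalogue; names and values are illustrative only.
TEMPLATES = [
    AudioTemplate("quiet", "calm", "whisper", frozenset({"library", "home"})),
    AudioTemplate("neutral", "calm", "normal", frozenset({"home", "office"})),
    AudioTemplate("upbeat", "happy", "normal", frozenset({"home", "party"})),
    AudioTemplate("projected", "happy", "loud", frozenset({"party", "street"})),
]


def select_template(mood, tone, environment, templates=TEMPLATES):
    """Pick the compatible template that best matches mood and tone.

    Environment compatibility is a hard filter; mood and tone are scored.
    Returns None when no template is compatible with the environment.
    """
    compatible = [t for t in templates if environment in t.environments]
    if not compatible:
        return None
    # One point for a mood match and one for a tone match; ties keep catalogue order.
    return max(compatible, key=lambda t: (t.mood == mood) + (t.tone == tone))
```

For example, `select_template("happy", "loud", "party")` returns the `projected` template, because it is compatible with the environment and matches both mood and tone.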
29 Citations
18 Claims
1. A computer-implemented method comprising:
receiving, using one or more microphones, an audio signal from a user associated with a user device;
in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
determining, by one or more processors and based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device after determining that the application is configured to provide a text-to-speech response;
obtaining, by the one or more processors, data to be audibly output using a computer-synthesized voice;
selecting, by the one or more processors, a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
generating, by the one or more processors, an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
providing, by the one or more processors, the generated audio signal for output by one or more speakers.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. One or more non-transitory computer-readable storage media comprising instructions, which, when executed by one or more processors, cause the one or more processors to perform operations comprising:
receiving, using one or more microphones, an audio signal from a user associated with a user device;
in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
determining, by one or more processors and based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device after determining that the application is configured to provide a text-to-speech response;
obtaining, by the one or more processors, data to be audibly output using a computer-synthesized voice;
selecting, by the one or more processors, a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
generating, by the one or more processors, an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
providing, by the one or more processors, the generated audio signal for output by one or more speakers.
Dependent claims: 10, 11, 12, 13, 14
15. A system comprising:
one or more processors and one or more storage devices storing instructions which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
receiving, using one or more microphones, an audio signal from a user associated with a user device;
in response to receiving the audio signal from the user, determining whether an application, from among a plurality of applications in the user device, is configured to provide a text-to-speech response;
training one or more classifiers using training data to identify one or more likely voice features of the user in the audio signal, the one or more likely voice features of the user including a likely tone of the voice of the user;
determining, based on the audio signal received using the one or more microphones, a proximity indicator indicative of a distance between the user and the user device, after determining that the application is configured to provide a text-to-speech response;
obtaining data to be audibly output using a computer-synthesized voice;
selecting a tone of voice of the computer-synthesized voice that corresponds to the likely tone of voice of the user identified by the training of the one or more classifiers, and a volume level of the computer-synthesized voice based on the likely tone of voice of the user and the distance between the user and the user device indicated by the proximity indicator;
generating an audio signal based on (i) the data to be audibly output, (ii) the selected tone of voice that corresponds to the likely tone of voice of the user, and (iii) the selected volume level of the computer-synthesized voice; and
providing the generated audio signal for output by one or more speakers.
Dependent claims: 16, 17, 18
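The classifier-training step that appears in each independent claim can be illustrated with a very small example. The claims do not name a classifier type or a feature set, so the nearest-centroid method, the two features (scaled mean pitch and RMS energy), the training data, and all function names below are assumptions chosen to keep the sketch short.

```python
from collections import defaultdict


def train_centroids(examples):
    """Train a nearest-centroid classifier.

    `examples` is a list of (feature_vector, tone_label) pairs. Returns a dict
    mapping each label to the mean feature vector of its examples.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [a + b for a, b in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [v / counts[label] for v in total]
        for label, total in sums.items()
    }


def likely_tone(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist(centroids[label]))


# Hypothetical training data: [mean pitch in Hz / 100, RMS energy] -> tone label.
TRAINING = [
    ([1.0, 0.05], "whisper"), ([1.1, 0.08], "whisper"),
    ([1.8, 0.30], "normal"), ([2.0, 0.35], "normal"),
    ([2.6, 0.70], "loud"), ([2.8, 0.80], "loud"),
]
```

After `centroids = train_centroids(TRAINING)`, a call such as `likely_tone(centroids, [2.7, 0.75])` yields the label used as the "likely tone of the voice of the user" in the later selection step.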
Specification