Altering audio to improve automatic speech recognition
First Claim
1. An apparatus comprising:
at least one speaker;
at least one microphone;
one or more processors; and
computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to:
cause the at least one speaker to output first content;
receive a first input audio signal generated by the at least one microphone based at least in part on sound captured by the at least one microphone;
detect a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
alter, based at least in part on detecting the predefined utterance, output of the first content by the at least one speaker for a first period of time;
receive a second input audio signal generated by the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
send, to one or more remote computing resources, the second input audio signal for identifying the voice command in the second input audio signal; and
cause the at least one speaker to at least one of:
output the first content for a second period of time that is after the first period of time;
or output second content for the second period of time, wherein the second content is different from the first content.
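A minimal Python sketch of the control flow recited in claim 1 follows. It rests on several assumptions that do not come from the patent: microphone input arrives as discrete frames (plain strings stand in for audio chunks), `is_wake_word` is a hypothetical detector for the predefined utterance, `send_to_remote` is a hypothetical hook for the remote computing resources, attenuation stands in for "alter", and the first period of time is counted in frames.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Speaker:
    """Hypothetical speaker stand-in that records each playback command."""
    log: List[str] = field(default_factory=list)

    def play(self, content: str, volume: float) -> None:
        self.log.append(f"{content}@{volume:.1f}")


def run_session(
    frames: List[str],
    speaker: Speaker,
    content: str,
    is_wake_word: Callable[[str], bool],
    send_to_remote: Callable[[List[str]], None],
    alter_frames: int = 3,
    altered_volume: float = 0.2,
) -> None:
    """Sketch of the claim 1 flow: output, detect, alter, capture, send, restore."""
    speaker.play(content, 1.0)            # output first content at full volume
    remaining = 0                         # frames left in the first period of time
    captured: List[str] = []              # stands in for the second input audio signal
    for frame in frames:                  # each frame is one microphone chunk
        if remaining == 0:
            if is_wake_word(frame):       # predefined utterance detected
                speaker.play(content, altered_volume)  # alter output (here: attenuate)
                remaining = alter_frames
                captured = []
            continue
        captured.append(frame)            # capture sound during the first period
        remaining -= 1
        if remaining == 0:
            send_to_remote(captured)      # hand off for remote speech recognition
            speaker.play(content, 1.0)    # second period: resume the first content
```

For example, with `is_wake_word=lambda f: f == "WAKE"` and `send_to_remote=sent.append`, the frames `["music", "WAKE", "play", "jazz", "please", "music"]` cause the three frames after the utterance to be sent, while the speaker log shows full volume, then attenuated volume, then full volume again. This sketch always resumes the first content; the claim equally allows switching to different content.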
Abstract
Techniques for altering audio being output by a voice-controlled device, or another device, to enable more accurate automatic speech recognition (ASR) by the voice-controlled device. For instance, a voice-controlled device may output audio within an environment using a speaker of the device. While outputting the audio, a microphone of the device may capture sound within the environment and may generate an audio signal based on the captured sound. The device may then analyze the audio signal to identify speech of a user within the signal, with the speech indicating that the user is going to provide a subsequent command to the device. Thereafter, the device may alter the output of the audio (e.g., attenuate the audio, pause the audio, switch from stereo to mono, etc.) to facilitate speech recognition of the user's subsequent command.
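The three example alterations named in the abstract can be sketched on a buffer of stereo samples. This is an illustration only: the `(left, right)` float representation, the function names, and the default of -20 dB are assumptions, not details taken from the patent.

```python
from typing import List, Tuple

Frame = Tuple[float, float]  # one (left, right) stereo sample pair


def attenuate(samples: List[Frame], gain_db: float) -> List[Frame]:
    """Scale both channels by a gain in decibels (negative values make it quieter)."""
    g = 10 ** (gain_db / 20)
    return [(left * g, right * g) for left, right in samples]


def pause(samples: List[Frame]) -> List[Frame]:
    """Replace the buffer with silence of the same length."""
    return [(0.0, 0.0)] * len(samples)


def stereo_to_mono(samples: List[Frame]) -> List[Frame]:
    """Average the two channels and send the same signal to both outputs."""
    out: List[Frame] = []
    for left, right in samples:
        mid = (left + right) / 2
        out.append((mid, mid))
    return out


def alter(samples: List[Frame], mode: str, gain_db: float = -20.0) -> List[Frame]:
    """Apply one of the alterations listed in the abstract (hypothetical dispatcher)."""
    if mode == "attenuate":
        return attenuate(samples, gain_db)
    if mode == "pause":
        return pause(samples)
    if mode == "mono":
        return stereo_to_mono(samples)
    raise ValueError(f"unknown mode: {mode}")
```

A gain of -20 dB corresponds to an amplitude factor of 10^(-20/20) = 0.1, so a sample of `(1.0, -0.5)` becomes approximately `(0.1, -0.05)`.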
20 Claims
8. A method implemented at least in part by an apparatus comprising at least one speaker and at least one microphone, the method comprising:
causing the at least one speaker to output first content;
generating a first input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone;
detecting a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
determining a type of the first content;
altering, based at least in part on the detecting of the predefined utterance and the type of the first content, output of the first content by the at least one speaker for a first period of time;
generating a second input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
sending, to one or more remote computing resources and based at least in part on the detecting of the predefined utterance, the input audio signal for identifying the voice command in the input audio signal; and
causing the at least one speaker to at least one of:
output the first content for a second period of time that is after the first period of time;
or output second content for the second period of time, wherein the second content is different from the first content.
Dependent claims: 9, 10, 11, 12, 13, 14.
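Claim 8 makes the alteration depend on the type of the first content, and claim 15 spells this out as a first manner for a first content type and a second manner for a second content type. The mapping below is a hypothetical example of such a policy: the two types, the metadata field `kind`, the set of spoken-word values, and the choice to pause spoken word but attenuate music are all assumptions made for illustration. The claims only require that the manner of alteration follow from the detected type.

```python
from enum import Enum
from typing import Dict


class ContentType(Enum):
    MUSIC = "music"
    SPOKEN_WORD = "spoken_word"


class Alteration(Enum):
    ATTENUATE = "attenuate"
    PAUSE = "pause"


# Hypothetical policy: one manner of alteration per content type.
POLICY: Dict[ContentType, Alteration] = {
    ContentType.MUSIC: Alteration.ATTENUATE,
    ContentType.SPOKEN_WORD: Alteration.PAUSE,
}

# Hypothetical metadata values treated as spoken-word content.
SPOKEN_KINDS = {"audiobook", "podcast", "news"}


def determine_type(metadata: Dict[str, str]) -> ContentType:
    """Classify content from a hypothetical 'kind' metadata field; default to music."""
    if metadata.get("kind") in SPOKEN_KINDS:
        return ContentType.SPOKEN_WORD
    return ContentType.MUSIC


def choose_alteration(metadata: Dict[str, str]) -> Alteration:
    """Map the detected content type to the manner in which output is altered."""
    return POLICY[determine_type(metadata)]
```

Keeping the policy in a table rather than in branching logic makes it straightforward to add further content types without changing the selection code.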
15. Computer-readable media storing computer-executable instructions that, when executed on one or more processors of an apparatus, cause the one or more processors to:
cause at least one speaker of a device to output first content, the device comprising the at least one speaker and at least one microphone;
generate a first input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone;
detect a predefined utterance within the first input audio signal, the predefined utterance indicating that a voice command is going to be provided;
determine that the first content comprises one of a first content type or a second content type;
alter, based at least in part on the detecting of the predefined utterance and based at least in part on the first content comprising the first content type, output of the first content by the at least one speaker in a first manner for a first period of time;
alter, based at least in part on the detecting of the predefined utterance and based at least in part on the first content comprising the second content type, output of the first content by the at least one speaker in a second manner for the first period of time;
generate a second input audio signal using the at least one microphone based at least in part on sound captured by the at least one microphone during at least a portion of the first period of time;
send, to one or more remote computing resources and based at least in part on the detecting of the predefined utterance, the input audio signal for identifying the voice command in the input audio signal; and
cause the at least one speaker to at least one of:
output the first content for a second period of time that is after the first period of time;
or output second content for the second period of time, wherein the second content is different from the first content.
Dependent claims: 16, 17, 18, 19, 20.
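Each independent claim ends with the same either/or step: after the first period, the speaker either resumes the first content or outputs different, second content. One way to picture that choice is to let the result returned by the remote computing resources decide. The `RemoteResult` type, its `intent` and `content` fields, and the rule that a "play" request with new content triggers the switch are hypothetical; the patent text here does not say how the choice is made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteResult:
    """Hypothetical response from the remote computing resources."""
    intent: str                    # e.g. "play" or "set_timer"
    content: Optional[str] = None  # new content to output, if the command requested any


def second_period_output(first_content: str, result: RemoteResult) -> str:
    """Choose what the speaker outputs once the first period of time ends."""
    if result.intent == "play" and result.content and result.content != first_content:
        return result.content      # output second content, different from the first
    return first_content           # otherwise resume the first content
```

Under this rule, a command such as "play something else" switches the output, while an unrelated command (for instance setting a timer) lets the original content continue.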
Specification