Acoustic echo cancellation using visual cues
Abstract
Techniques for enhancing an acoustic echo canceller based on visual cues are described herein. The techniques include changing adaptation of a filter of the acoustic echo canceller, calibrating the filter, or reducing background noise from an audio signal processed by the acoustic echo canceller. The changing, calibrating, and reducing are responsive to visual cues that describe acoustic characteristics of a location of a device that includes the acoustic echo canceller. Such visual cues may indicate that no human being is present at the location, that some subject(s) are engaged in speaking or sound generating activities, or that motion associated with an echo path change has occurred at the location.
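As a rough illustration of how the visual cues named in the abstract could steer the canceller, the Python sketch below maps each cue to one or more responses. The cue names, the response names, and in particular the pairing of cues with responses are assumptions made for this example; the abstract lists the cues and the responses but does not say which cue triggers which response.

```python
from enum import Enum, auto


class VisualCue(Enum):
    NO_PERSON_PRESENT = auto()   # no human being is present at the location
    SUBJECT_SPEAKING = auto()    # a subject is speaking or generating sound
    ECHO_PATH_MOTION = auto()    # motion associated with an echo path change


class Response(Enum):
    CALIBRATE_FILTER = auto()
    CHANGE_ADAPTATION = auto()
    REDUCE_BACKGROUND_NOISE = auto()


# Hypothetical pairing, chosen only for illustration.
CUE_TO_RESPONSES = {
    VisualCue.NO_PERSON_PRESENT: [Response.CALIBRATE_FILTER],
    VisualCue.SUBJECT_SPEAKING: [
        Response.CHANGE_ADAPTATION,
        Response.REDUCE_BACKGROUND_NOISE,
    ],
    VisualCue.ECHO_PATH_MOTION: [Response.CHANGE_ADAPTATION],
}


def responses_for(cues):
    """Return the de-duplicated responses for the observed cues, in first-seen order."""
    ordered = []
    for cue in cues:
        for response in CUE_TO_RESPONSES[cue]:
            if response not in ordered:
                ordered.append(response)
    return ordered
```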
24 Claims
1. A computer-implemented method comprising:
ascertaining, from one or more images of a location, that a person at the location is speaking;
detecting, by at least one of a double-talk detector of an acoustic echo processor or by a voice activity detector of the acoustic echo processor, that an audio signal associated with a voice is generated by a microphone at the location;
determining, by the acoustic echo processor, a confidence score indicating a likelihood that the audio signal is associated with the person at the location;
adjusting, by the acoustic echo processor, the confidence score based at least in part on the one or more images depicting the person at the location engaged in speaking;
determining that the confidence score exceeds a threshold;
changing, by the acoustic echo processor, adaptation of a filter of the acoustic echo processor based at least in part on the confidence score exceeding the threshold and the detecting the audio signal associated with the voice; and
removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 24
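The confidence and adaptation steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the additive `boost`, the `threshold` value, the normalized-LMS (NLMS) update, and the choice to interpret "changing adaptation" as freezing the filter while the person is judged to be speaking are all hypothetical. The noise-removal step is not covered.

```python
from dataclasses import dataclass


def adjust_confidence(audio_score, person_seen_speaking, boost=0.3):
    """Raise the audio-only score when the images show the person speaking.

    The additive boost is an assumption; the claim only says the score is adjusted.
    """
    if person_seen_speaking:
        return min(1.0, audio_score + boost)
    return audio_score


@dataclass
class EchoFilter:
    """Tiny NLMS echo filter whose adaptation can be switched on or off."""
    taps: list
    step: float = 0.5
    adapting: bool = True

    def process(self, far_end_window, mic_sample):
        # Estimate the echo from recent loudspeaker samples and subtract it.
        echo_estimate = sum(w * x for w, x in zip(self.taps, far_end_window))
        error = mic_sample - echo_estimate
        if self.adapting:
            # NLMS update, normalized by the energy of the far-end window.
            energy = sum(x * x for x in far_end_window) + 1e-8
            gain = self.step * error / energy
            self.taps = [w + gain * x for w, x in zip(self.taps, far_end_window)]
        return error


def update_adaptation(filt, voice_detected, audio_score, person_seen_speaking,
                      threshold=0.6):
    """Freeze adaptation when a voice is detected and the adjusted score exceeds the threshold."""
    score = adjust_confidence(audio_score, person_seen_speaking)
    # Assumption: near-end speech would corrupt the echo estimate, so stop adapting.
    filt.adapting = not (voice_detected and score > threshold)
    return score
```

With an audio-only score of 0.5, the filter keeps adapting unless the images also show the person speaking, in which case the adjusted score crosses the 0.6 threshold and the taps are held fixed.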
10. One or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program a computing device to perform operations comprising:
ascertaining, from one or more images of a location, that a person at the location is speaking;
capturing an audio signal by a microphone at the location;
detecting that the audio signal is associated with a human voice;
determining a first confidence score based at least in part on a first indication that the human voice is associated with the person at the location;
determining a second confidence score based at least in part on a second indication that the one or more images depict the person at the location engaged in speaking, the second confidence score being greater than the first confidence score;
changing adaptation of a filter of an acoustic echo processor based at least in part on at least one of the first confidence score or the second confidence score and the audio signal; and
removing, at least in part, background noise from the audio signal based at least in part on the at least one of the first confidence score or the second confidence score, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
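Claim 10 differs from claim 1 in that it recites two confidence scores, the second being greater than the first. One way this could look is sketched below in Python. The function names, the `visual_gain` parameter, the rule that moves the first score part of the way toward 1.0, and the comparison against a threshold are assumptions for illustration; the claim states only that the second score is greater and that at least one of the scores drives the change in adaptation.

```python
def confidence_scores(audio_score, visual_gain=0.5):
    """Return (first, second) confidence scores.

    first:  audio-only likelihood that the voice belongs to the person.
    second: first raised by a visual indication that the person is speaking.
    The second score is strictly greater whenever first < 1.0 and visual_gain > 0.
    """
    first = max(0.0, min(1.0, audio_score))
    # Hypothetical rule: close part of the remaining gap to full confidence.
    second = first + (1.0 - first) * visual_gain
    return first, second


def should_change_adaptation(first, second, threshold=0.6):
    """Act if at least one of the two scores exceeds the (assumed) threshold."""
    return max(first, second) > threshold
```

For example, an audio-only score of 0.4 does not exceed 0.6 by itself, but the visually informed second score does, so adaptation would be changed.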
19. A system comprising:
one or more processors;
a camera to capture one or more images of a location;
a speaker to output audio in the location;
a microphone to capture audio in the location; and
one or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program the one or more processors to perform operations comprising:
ascertaining, from the one or more images of the location captured by the camera, that a person at the location is speaking;
capturing an audio signal by the microphone in the location;
detecting that the audio signal is associated with a voice;
determining a confidence score that the audio signal represents the voice that is associated with the person at the location;
adjusting the confidence score based at least in part on the one or more images captured by the camera depicting the person at the location engaged in speaking;
determining that the confidence score exceeds a threshold;
changing adaptation of a filter of an acoustic echo processor based at least in part on the confidence score exceeding the threshold and the audio signal; and
removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
Dependent claims: 20, 21, 22, 23
Specification