Acoustic echo cancellation using visual cues

US 9,767,828 B1
Filed: 06/27/2012
Issued: 09/19/2017
Est. Priority Date: 06/27/2012
Status: Active Grant

Abstract

Techniques for enhancing an acoustic echo canceller based on visual cues are described herein. The techniques include changing adaptation of a filter of the acoustic echo canceller, calibrating the filter, or reducing background noise from an audio signal processed by the acoustic echo canceller. The changing, calibrating, and reducing are responsive to visual cues that describe acoustic characteristics of a location of a device that includes the acoustic echo canceller. Such visual cues may indicate that no human being is present at the location, that some subject(s) are engaged in speaking or sound generating activities, or that motion associated with an echo path change has occurred at the location.
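
As a rough illustration of the abstract, the sketch below maps the three kinds of visual cue to the three kinds of response. The pairing follows the dependent claims (no people present leads to calibration and background-noise learning, a speaking person leads to slowed or halted adaptation, echo-path motion leads to accelerated adaptation). The names `VisualCues`, `Action`, and `plan_actions` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Action(Enum):
    """Responses named in the abstract and dependent claims (labels are illustrative)."""
    CALIBRATE_FILTER = auto()
    LEARN_BACKGROUND_NOISE = auto()
    SLOW_OR_HALT_ADAPTATION = auto()
    ACCELERATE_ADAPTATION = auto()


@dataclass(frozen=True)
class VisualCues:
    """Hypothetical summary of what an image analysis stage reports."""
    people_present: bool
    someone_speaking: bool
    echo_path_motion: bool


def plan_actions(cues: VisualCues) -> List[Action]:
    """Return the echo-canceller responses suggested by the visual cues."""
    actions: List[Action] = []
    if not cues.people_present:
        # An empty room is a good moment to calibrate and to learn the noise floor.
        actions += [Action.CALIBRATE_FILTER, Action.LEARN_BACKGROUND_NOISE]
    if cues.someone_speaking:
        # Near-end speech would disturb the filter update, so slow or halt it.
        actions.append(Action.SLOW_OR_HALT_ADAPTATION)
    if cues.echo_path_motion:
        # A moved object changes the echo path, so re-converge faster.
        actions.append(Action.ACCELERATE_ADAPTATION)
    return actions
```

How a real system would arbitrate between conflicting cues (for example, speech and motion at once) is not specified here; the sketch simply lists every applicable response.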

24 Claims

1. A computer-implemented method comprising:
- ascertaining, from one or more images of a location, that a person at the location is speaking;
  
  detecting, by at least one of a double-talk detector of an acoustic echo processor or by a voice activity detector of the acoustic echo processor, that an audio signal associated with a voice is generated by a microphone at the location;
  
  determining, by the acoustic echo processor, a confidence score indicating a likelihood that the audio signal is associated with the person at the location;
  
  adjusting, by the acoustic echo processor, the confidence score based at least in part on the one or more images depicting the person at the location engaged in speaking;
  
  determining that the confidence score exceeds a threshold;
  
  changing, by the acoustic echo processor, adaptation of a filter of the acoustic echo processor based at least in part on the confidence score exceeding the threshold and the detecting the audio signal associated with the voice; and
  
  removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
  - 2. The method of claim 1, wherein the detecting that the audio signal associated with the voice into the microphone is performed by the double-talk detector and the method further comprises:
    - receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      adjusting the confidence score based on the indication.
  - 3. The method of claim 1, wherein the detecting that the audio signal associated with the voice into the microphone is performed by the voice activity detector and the method further comprises:
    - receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      adjusting the confidence score based on the indication.
  - 4. The method of claim 1, wherein the method further comprises:
    - determining, based at least in part on the one or more images, that an item at the location has changed position;
      
      determining that the change in position of the item is associated with a corresponding change in the known echo path; and
      
      accelerating the adaptation of the filter based at least in part on determining that the change in position of the item is associated with the corresponding change in the known echo path.
  - 5. The method of claim 1, wherein the ascertaining comprises determining if the one or more images of the location show movement of lips of the person in a specified time period.
  - 6. The method of claim 1, wherein the method further comprises:
    - determining that the location does not include people;
      
      capturing audio at the location with the microphone based at least in part on determining that the location does not include people; and
      
      determining the known echo path based at least in part on the audio captured at the location.
  - 7. The method of claim 1, wherein the changing comprises halting or slowing the adaptation of the filter based at least in part on a determination that the confidence score exceeds the threshold.
  - 8. The method of claim 1, wherein the method further comprises:
    - determining that the audio signal associated with the voice is no longer detected by the microphone; and
      
      resuming the adaptation of the filter.
  - 9. The method of claim 1, wherein the method further comprises removing, at least in part, an acoustic echo from the audio signal, wherein an amount of the acoustic echo removed from the audio signal is based at least in part on the known echo path.
  - 24. The method of claim 1, wherein determining the confidence score indicating the likelihood that the audio signal is associated with the person at the location further comprises:
    - accessing at least one stored speech characteristic associated with a voice profile corresponding to the person;
      
      determining a comparison between the at least one stored speech characteristic and a characteristic of the audio signal; and
      
      determining the confidence score based at least in part on the comparison.
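
The flow of claims 1, 7, and 8 can be sketched as follows: an audio-derived confidence score is raised when the images show the person speaking, and the adaptive filter stops (or slows) updating while the score exceeds a threshold. This is a minimal sketch under stated assumptions, not the patented implementation. The patent does not name a filter algorithm; normalized LMS (NLMS) is swapped in here as a common choice. All names, the `boost` value, and the `threshold` value are hypothetical, and the noise-removal step of claim 1 is omitted.

```python
from typing import List


def adjusted_confidence(audio_score: float, lips_moving: bool, boost: float = 0.3) -> float:
    """Raise the audio-only score when images show speaking; clamp to [0, 1]."""
    score = audio_score + (boost if lips_moving else 0.0)
    return max(0.0, min(1.0, score))


def adaptation_scale(confidence: float, threshold: float = 0.6, slowed: float = 0.0) -> float:
    """Return 1.0 for normal adaptation, or `slowed` (0.0 halts) above the threshold."""
    return slowed if confidence > threshold else 1.0


class NlmsEchoFilter:
    """Adaptive FIR echo estimator using NLMS (an assumed algorithm choice)."""

    def __init__(self, taps: int, step: float = 0.5) -> None:
        self.w: List[float] = [0.0] * taps        # current echo-path estimate
        self.history: List[float] = [0.0] * taps  # far-end samples, newest first
        self.step = step

    def process(self, far_end: float, mic: float, adapt_scale: float = 1.0) -> float:
        """Subtract the estimated echo from one mic sample and optionally adapt."""
        self.history = [far_end] + self.history[:-1]
        echo_estimate = sum(w * x for w, x in zip(self.w, self.history))
        error = mic - echo_estimate  # echo-reduced output sample
        if adapt_scale > 0.0:
            energy = sum(x * x for x in self.history) + 1e-8
            gain = adapt_scale * self.step * error / energy
            self.w = [w + gain * x for w, x in zip(self.w, self.history)]
        return error
```

In use, each frame would compute `adaptation_scale(adjusted_confidence(score, lips_moving))` and pass the result to `process`. Once the voice is no longer detected, the scale returns to 1.0 and adaptation resumes, as in claim 8.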

10. One or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program a computing device to perform operations comprising:
- ascertaining, from one or more images of a location, that a person at the location is speaking;
  
  capturing an audio signal by a microphone at the location;
  
  detecting that the audio signal is associated with a human voice;
  
  determining a first confidence score based at least in part on a first indication that the human voice is associated with the person at the location;
  
  determining a second confidence score based at least in part on a second indication that the one or more images depict the person at the location engaged in speaking, the second confidence score being greater than the first confidence score;
  
  changing adaptation of a filter of an acoustic echo processor based at least in part on at least one of the first confidence score or the second confidence score and the audio signal; and
  
  removing, at least in part, background noise from the audio signal based at least in part on the at least one of the first confidence score or the second confidence score, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
  - 11. The non-transitory computer-readable media of claim 10, wherein the detecting that the audio signal is the human voice is performed by at least one of a double-talk detector of the acoustic echo processor or by a voice activity detector of the acoustic echo processor.
  - 12. The non-transitory computer-readable media of claim 11, wherein the detecting that the audio signal is the human voice is performed by the double-talk detector and the operations further comprise:
    - receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      determining the second confidence score based at least in part on the indication.
  - 13. The non-transitory computer-readable media of claim 11, wherein the detecting that the audio signal is the human voice is performed by the voice activity detector and the operations further comprise:
    - receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      adjusting based on the indication.
  - 14. The non-transitory computer-readable media of claim 10, wherein the operations further comprise receiving an indication that the person at the location is engaged in speaking, the receiving performed by the acoustic echo processor.
  - 15. The non-transitory computer-readable media of claim 10, wherein the ascertaining comprises determining whether the one or more images of the location show movement of lips of the person in a specified time period.
  - 16. The non-transitory computer-readable media of claim 10, wherein the changing comprises halting or slowing the adaptation of the filter based at least in part on a determination that the first confidence score or the second confidence score exceeds a threshold.
  - 17. The non-transitory computer-readable media of claim 10, the operations further comprising determining that a subsequent audio signal is not associated with a human voice and resuming the adaptation of the filter.
  - 18. The non-transitory computer-readable media of claim 10, the operations further comprising performing acoustic echo processing on the audio signal to remove, at least in part, an acoustic echo.
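
Claims 10, 15, and 16 describe two confidence scores, with the image-based score greater than the audio-based one, plus a lip-movement check over a specified time period. The sketch below shows one way those pieces could fit together. Everything here is an assumption made for illustration: the `mouth_opening` measure, the window length, the `visual_gain` formula, and the threshold are hypothetical and are not taken from the patent.

```python
from typing import Optional, Sequence, Tuple


def lips_moved(samples: Sequence[Tuple[float, float]], now: float,
               window_s: float = 1.0, min_change: float = 0.05) -> bool:
    """Check for lip movement within the last `window_s` seconds.

    `samples` holds (timestamp_seconds, mouth_opening) pairs, where
    mouth_opening is a hypothetical normalized measure from a face tracker.
    """
    recent = [opening for t, opening in samples if now - window_s <= t <= now]
    return len(recent) >= 2 and (max(recent) - min(recent)) >= min_change


def confidence_scores(voice_match: float,
                      images_show_speaking: bool,
                      visual_gain: float = 0.5) -> Tuple[float, Optional[float]]:
    """Return (first, second) scores; second is None without a visual indication.

    The second score moves the first toward 1.0, so it is strictly greater
    whenever the first score is below 1.0 and visual_gain is positive.
    """
    first = max(0.0, min(1.0, voice_match))
    if not images_show_speaking:
        return first, None
    second = first + visual_gain * (1.0 - first)
    return first, second


def should_slow_adaptation(first: float, second: Optional[float],
                           threshold: float = 0.6) -> bool:
    """Slow or halt adaptation when either score exceeds the threshold."""
    best = first if second is None else max(first, second)
    return best > threshold
```

For example, an audio-only score of 0.4 stays below the assumed threshold, while the same score combined with observed lip movement rises above it and would trigger the slowdown.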

19. A system comprising:
- one or more processors;
  
  a camera to capture one or more images of a location;
  
  a speaker to output audio in the location;
  
  a microphone to capture audio in the location; and
  
  one or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program the one or more processors to perform operations comprising:
  
  ascertaining, from the one or more images of the location captured by the camera, that a person at the location is speaking;
  
  capturing an audio signal by the microphone in the location;
  
  detecting that the audio signal is associated with a voice;
  
  determining a confidence score that the audio signal represents the voice that is associated with the person at the location;
  
  adjusting the confidence score based at least in part on the one or more images captured by the camera depicting the person at the location engaged in speaking;
  
  determining that the confidence score exceeds a threshold;
  
  changing adaptation of a filter of an acoustic echo processor based at least in part on the confidence score exceeding the threshold and the audio signal; and
  
  removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
  - 20. The system of claim 19, wherein the detecting that the audio signal is associated with the voice is performed by a double-talk detector of the acoustic echo processor or by a voice activity detector of the acoustic echo processor.
  - 21. The system of claim 20, wherein the detecting that the audio signal is associated with the voice is performed by the double-talk detector and the operations further comprise:
    - receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      adjusting the confidence score based on the indication.
  - 22. The system of claim 20, wherein the detecting that the audio signal is associated with the voice is performed by the voice activity detector and the operations further comprise:
    - receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and
      
      adjusting the confidence score based on the indication.
  - 23. The system of claim 19, wherein the operations further comprise:
    - determining, from one or more subsequent images of the location captured by the camera, that the location does not include people;
      
      playing a calibration sound by the speaker;
      
      determining one or more echo paths of the location based on audio captured by the microphone at the location while the calibration sound is playing;
      
      calibrating the acoustic echo canceller filter based at least in part on the one or more echo paths; and
      
      learning background noise characteristics in the audio captured by the microphone at the location based at least in part on no voice activity being detected.
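
The calibration steps of claim 23 (and the echo-path determination of claim 6) can be sketched as below: with the room empty, a known calibration sound is played, the microphone capture is compared against it to estimate the echo path, and the noise floor is measured from a capture with no voice activity. The patent does not specify an estimation method. This sketch assumes a pseudo-random ±1 calibration sequence and a cross-correlation estimate, and all function names are hypothetical. `simulate_room` stands in for the physical speaker-to-microphone path.

```python
import random
from typing import List, Sequence


def make_calibration_sound(n: int, seed: int = 0) -> List[float]:
    """Pseudo-random +/-1 sequence (an assumed choice of calibration signal)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def simulate_room(played: Sequence[float], echo_path: Sequence[float]) -> List[float]:
    """Stand-in for the real room: convolve the played sound with an echo path."""
    return [
        sum(h * played[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
        for i in range(len(played))
    ]


def estimate_echo_path(played: Sequence[float], captured: Sequence[float],
                       taps: int) -> List[float]:
    """Estimate echo-path taps by cross-correlating capture with the played sound.

    For a white calibration signal, the correlation at lag k divided by the
    signal power approximates tap k of the echo path.
    """
    n = len(played)
    power = sum(x * x for x in played) / n
    path = []
    for k in range(taps):
        corr = sum(captured[i] * played[i - k] for i in range(k, n)) / (n - k)
        path.append(corr / power)
    return path


def noise_floor(captured_without_voice: Sequence[float]) -> float:
    """Mean-square level of a capture taken while no voice activity is detected."""
    return sum(x * x for x in captured_without_voice) / len(captured_without_voice)
```

The estimated taps could then seed the echo canceller filter, and the noise floor could inform how much background noise to remove. Both uses are only outlined in the claims and are not implemented here.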

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.), Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Velusamy, Kavitha; Chu, Wai C.; Gopalan, Ramya; Chhetri, Amit S.
Primary Examiner(s): Kelley, Christopher S
Assistant Examiner(s): Beasley, Deirdre L
Application Number: US13/535,232
Time in Patent Office: 1,910 Days
Field of Search: 348143
CPC Class Codes

G06V 40/10   Human or animal bodies, e.g...

G06V 40/168   Feature extraction; Face re...

G06V 40/20   Movements or behaviour, e.g...

G10L 17/00   Speaker identification or v...

G10L 2021/02082   the noise being echo, rever...

G10L 21/0208   Noise filtering

G10L 25/78   Detection of presence or ab...

H04B 3/23   using a replica of transmit...

H04B 3/231   Echo cancellers using reado...

H04N 21/44218   Detecting physical presence...
