Robust short-time Fourier transform acoustic echo cancellation during audio playback

US 10,446,165 B2
Filed: 09/27/2017
Issued: 10/15/2019
Est. Priority Date: 09/27/2017
Status: Active Grant



Abstract

Example techniques involve noise-robust acoustic echo cancellation. An example implementation may involve causing one or more speakers of the playback device to play back audio content and, while the audio content is playing back, capturing, via the one or more microphones, audio within an acoustic environment that includes the audio playback. The example implementation may involve determining measured and reference signals in the STFT domain. During each n-th iteration of an acoustic echo canceller (AEC), the implementation may involve determining a frame of an output signal by generating a frame of a model signal by passing a frame of the reference signal through an instance of an adaptive filter and then redacting the n-th frame of the model signal from an n-th frame of the measured signal. The implementation may further involve determining an instance of the adaptive filter for a next iteration of the AEC.
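
The per-frame echo subtraction described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a Hann-windowed STFT and an adaptive filter with a single complex tap per frequency bin, and the names `stft_frames`, `cancel_frame`, `frame_len`, and `hop` are hypothetical.

```python
import numpy as np

def stft_frames(x, frame_len=256, hop=128):
    """Split x into overlapping Hann-windowed frames and return one rFFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def cancel_frame(measured_frame, reference_frame, filt):
    """Pass the reference frame through the filter, then subtract the model from the measurement.

    Assumption: one complex filter tap per frequency bin (no cross-band terms).
    """
    model_frame = filt * reference_frame
    output_frame = measured_frame - model_frame
    return output_frame, model_frame

# Toy example: the microphone hears only a half-amplitude copy of the playback.
rng = np.random.default_rng(0)
playback = rng.standard_normal(2048)
reference = stft_frames(playback)        # reference signal in the STFT domain
measured = stft_frames(0.5 * playback)   # measured signal in the STFT domain
ideal_filter = np.full(reference.shape[1], 0.5, dtype=complex)
output, _ = cancel_frame(measured[0], reference[0], ideal_filter)
```

With the ideal filter the output frame is numerically zero. In practice the filter is unknown and is adapted frame by frame, which is the subject of the update steps in the claims.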

20 Claims

1. A system comprising:
- an audio stage comprising an audio processor and an audio amplifier;
  
  one or more speakers;
  
  one or more microphones;
  
  one or more processors;
  
  data storage storing instructions executable by the one or more processors that cause the system to perform operations comprising:
  
  causing, via the audio stage, the one or more speakers to play back audio content;
  
  while the audio content is playing back via the one or more speakers, capturing, via the one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
  
  receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
  
  transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
  
  transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
  
  during each n-th iteration of an acoustic echo canceller (AEC):
  
  determining an n-th frame of an output signal, wherein determining the n-th frame of the output signal comprises:
  
  generating an n-th frame of a model signal by passing an n-th frame of the reference signal through an n-th instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
  
  generating the n-th frame of the output signal by redacting the n-th frame of the model signal from an n-th frame of the measured signal;
  
  determining an (n+1)-th instance of the adaptive filter for a next iteration of the AEC, wherein determining the (n+1)-th instance of the adaptive filter for the next iteration of the AEC comprises:
  
  determining an n-th frame of an error signal, the n-th frame of the error signal representing a difference between the n-th frame of the model signal and the n-th frame of the reference signal less audio signals representing sound from sources other than an n-th frame of the audio signals representing sound produced by the one or more speakers in playing back the n-th frame of the reference signal;
  
  determining a normalized least mean square (NMLS) of the n-th frame of the error signal;
  
  determining a sparse NMLS of the n-th frame of the error signal by applying to the NMLS of the n-th frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
  
  converting the sparse NMLS of the n-th frame of the error signal to an n-th update filter; and
  
  generating the (n+1)-th instance of the adaptive filter for the next iteration of the AEC by summing the n-th instance of the adaptive filter with the n-th update filter; and
  
  sending the output signal as a voice input to one or more voice services for processing of the voice input.
  - 2. The system of claim 1, wherein the data storage further includes instructions that cause the system to perform operations comprising:
    - before determining the NMLS of the n-th frame of the error signal, applying an error recovery non-linearity function to the error signal to limit the error signal to a threshold magnitude, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises determining the NMLS of the n-th frame of the limited error signal.
  - 3. The system of claim 2, wherein the error recovery non-linearity function comprises a non-linear clipping function that limits portions of the error signal that are above the threshold magnitude to the threshold magnitude.
  - 4. The system of claim 1, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises:
    - applying a frequency-dependent regularization parameter to adapt an NMLS learning rate of change between AEC iterations according to a magnitude of the measured signal.
  - 5. The system of claim 1, wherein converting the sparse NMLS of the n-th frame of the error signal to the n-th update filter comprises:
    - converting the sparse NMLS of the n-th frame to a matrix of filter coefficients; and
      
      cross-band filtering the matrix of filter coefficients to generate the n-th update filter.
  - 6. The system of claim 1, excluding a double-talk detector that disables the AEC when a double-talk condition is detected, wherein capturing audio within the acoustic environment comprises capturing audio signals representing sound produced by two or more voices.
  - 7. The system of claim 1, further comprising:
    - a playback device comprising a first network interface and the one or more speakers; and
      
      a networked-microphone device comprising a second network interface, the one or more microphones, the one or more processors, and the data storage storing instructions executable by the one or more processors, wherein the first network interface and the second network interface are configured to communicatively couple the playback device and the networked-microphone device.
  - 8. The system of claim 1, further comprising:
    - a playback device comprising a housing configured to house the one or more speakers and the one or more microphones.
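
The filter update recited in claims 1 to 3 (limit the error, form a normalized update, zero out weak bands, sum with the current filter) can be sketched as below. This is one possible reading under stated assumptions, not the patented implementation: the filter has one complex tap per bin, the error frame is taken to be the measured frame minus the model frame, and "energy" is interpreted as the squared magnitude of the per-bin update term. The names `clip_magnitude`, `sparse_nlms_update`, `mu`, `reg`, `limit`, and `energy_floor` are hypothetical.

```python
import numpy as np

def clip_magnitude(err, limit):
    """Clipping non-linearity: cap |err| at `limit` per bin while preserving phase."""
    mag = np.abs(err)
    scale = np.minimum(1.0, limit / np.maximum(mag, 1e-12))
    return err * scale

def sparse_nlms_update(filt, ref, err, mu=0.5, reg=1e-6, limit=1.0,
                       energy_floor=1e-8):
    """Return the next filter instance from the current one.

    Assumptions: single tap per bin; `err` is measured minus model.
    """
    err = clip_magnitude(err, limit)                            # limit the error
    step = mu * np.conj(ref) * err / (np.abs(ref) ** 2 + reg)   # normalized update term
    # Sparse partition criterion (assumed form): drop bins whose update energy is too low.
    step = np.where(np.abs(step) ** 2 >= energy_floor, step, 0.0)
    return filt + step                                          # sum filter and update

# Toy convergence check against a known echo path with no near-end sound.
rng = np.random.default_rng(1)
bins = 16
true_path = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
filt = np.zeros(bins, dtype=complex)
for _ in range(200):
    ref = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
    err = true_path * ref - filt * ref
    filt = sparse_nlms_update(filt, ref, err)
```

Because the clipping only shrinks the error and the normalization bounds the step, each update in this sketch moves the filter toward the true path without overshooting. The energy floor stops adaptation once the remaining update is negligible.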

9. A method to be performed by a system comprising a playback device, the method comprising:
- causing, via an audio stage of the playback device, one or more speakers of the playback device to play back audio content, wherein the audio stage comprises an audio processor and an audio amplifier;
  
  while the audio content is playing back via the one or more speakers, capturing, via one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
  
  receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
  
  transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
  
  transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
  
  during each n-th iteration of an acoustic echo canceller (AEC):
  
  determining an n-th frame of an output signal, wherein determining the n-th frame of the output signal comprises:
  
  generating an n-th frame of a model signal by passing an n-th frame of the reference signal through an n-th instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
  
  generating the n-th frame of the output signal by redacting the n-th frame of the model signal from an n-th frame of the measured signal;
  
  determining an (n+1)-th instance of the adaptive filter for a next iteration of the AEC, wherein determining the (n+1)-th instance of the adaptive filter for the next iteration of the AEC comprises:
  
  determining an n-th frame of an error signal, the n-th frame of the error signal representing a difference between the n-th frame of the model signal and the n-th frame of the reference signal less audio signals representing sound from sources other than an n-th frame of the audio signals representing sound produced by the one or more speakers in playing back the n-th frame of the reference signal;
  
  determining a normalized least mean square (NMLS) of the n-th frame of the error signal;
  
  determining a sparse NMLS of the n-th frame of the error signal by applying to the NMLS of the n-th frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
  
  converting the sparse NMLS of the n-th frame of the error signal to an n-th update filter; and
  
  generating the (n+1)-th instance of the adaptive filter for the next iteration of the AEC by summing the n-th instance of the adaptive filter with the n-th update filter; and
  
  sending the output signal as a voice input to one or more voice services for processing of the voice input.
  - 10. The method of claim 9, further comprising:
    - before determining the NMLS of the n-th frame of the error signal, applying an error recovery non-linearity function to the error signal to limit the error signal to a threshold magnitude, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises determining the NMLS of the n-th frame of the limited error signal.
  - 11. The method of claim 10, wherein the error recovery non-linearity function comprises a non-linear clipping function that limits portions of the error signal that are above the threshold magnitude to the threshold magnitude.
  - 12. The method of claim 9, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises:
    - applying a frequency-dependent regularization parameter to adapt an NMLS learning rate of change between AEC iterations according to a magnitude of the measured signal.
  - 13. The method of claim 9, wherein converting the sparse NMLS of the n-th frame of the error signal to the n-th update filter comprises:
    - converting the sparse NMLS of the n-th frame to a matrix of filter coefficients; and
      
      cross-band filtering the matrix of filter coefficients to generate the n-th update filter.
  - 14. The method of claim 9, wherein the system excludes a double-talk detector that disables the AEC when a double-talk condition is detected, wherein capturing audio within the acoustic environment comprises capturing audio signals representing sound produced by two or more voices.
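
The claims that convert the update to a matrix of filter coefficients and apply cross-band filtering do not spell out the matrix layout. A hedged sketch of one plausible arrangement follows: each output bin is modeled from the reference bins within a few bins of it, so the filter is a banded matrix. The band structure, the row-wise normalization, and the names `band_mask`, `crossband_model`, `crossband_update`, and `half_width` are all assumptions made for illustration.

```python
import numpy as np

def band_mask(num_bins, half_width):
    """Boolean mask that is True within `half_width` bins of the main diagonal."""
    idx = np.arange(num_bins)
    return np.abs(idx[:, None] - idx[None, :]) <= half_width

def crossband_model(filt_matrix, ref_frame):
    """Model frame: each bin is a weighted sum of neighbouring reference bins."""
    return filt_matrix @ ref_frame

def crossband_update(err_frame, ref_frame, half_width, mu=0.5, reg=1e-6):
    """Banded matrix of update coefficients, normalized row by row (assumed form)."""
    mask = band_mask(len(ref_frame), half_width)
    # Reference energy seen by each row, restricted to the bins inside its band.
    energy = mask.astype(float) @ (np.abs(ref_frame) ** 2) + reg
    update = mu * np.outer(err_frame / energy, np.conj(ref_frame))
    return np.where(mask, update, 0.0)

# One update step on random data: the residual should shrink by about (1 - mu).
rng = np.random.default_rng(2)
bins = 8
ref = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
target = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
filt_matrix = np.zeros((bins, bins), dtype=complex)
err = target - crossband_model(filt_matrix, ref)
filt_next = filt_matrix + crossband_update(err, ref, half_width=1)
new_err = target - crossband_model(filt_next, ref)
```

Setting `half_width=0` reduces this to a diagonal matrix, that is, the single-tap-per-bin case.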

15. A tangible, non-transitory, computer-readable media having stored therein instructions executable by one or more processors to cause a system to perform operations comprising:
- causing, via an audio stage of a playback device, one or more speakers of the playback device to play back audio content, wherein the audio stage comprises an audio processor and an audio amplifier;
  
  while the audio content is playing back via the one or more speakers, capturing, via one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
  
  receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
  
  transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
  
  transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
  
  during each n-th iteration of an acoustic echo canceller (AEC):
  
  determining an n-th frame of an output signal, wherein determining the n-th frame of the output signal comprises:
  
  generating an n-th frame of a model signal by passing an n-th frame of the reference signal through an n-th instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
  
  generating the n-th frame of the output signal by redacting the n-th frame of the model signal from an n-th frame of the measured signal;
  
  determining an (n+1)-th instance of the adaptive filter for a next iteration of the AEC, wherein determining the (n+1)-th instance of the adaptive filter for the next iteration of the AEC comprises:
  
  determining an n-th frame of an error signal, the n-th frame of the error signal representing a difference between the n-th frame of the model signal and the n-th frame of the reference signal less audio signals representing sound from sources other than an n-th frame of the audio signals representing sound produced by the one or more speakers in playing back the n-th frame of the reference signal;
  
  determining a normalized least mean square (NMLS) of the n-th frame of the error signal;
  
  determining a sparse NMLS of the n-th frame of the error signal by applying to the NMLS of the n-th frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
  
  converting the sparse NMLS of the n-th frame of the error signal to an n-th update filter; and
  
  generating the (n+1)-th instance of the adaptive filter for the next iteration of the AEC by summing the n-th instance of the adaptive filter with the n-th update filter; and
  
  sending the output signal as a voice input to one or more voice services for processing of the voice input.
  - 16. The tangible, non-transitory, computer-readable media of claim 15, wherein the computer-readable media further includes instructions executable by the one or more processors to perform operations comprising:
    - before determining the NMLS of the n-th frame of the error signal, applying an error recovery non-linearity function to the error signal to limit the error signal to a threshold magnitude, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises determining the NMLS of the n-th frame of the limited error signal.
  - 17. The tangible, non-transitory, computer-readable media of claim 16, wherein the error recovery non-linearity function comprises a non-linear clipping function that limits portions of the error signal that are above the threshold magnitude to the threshold magnitude.
  - 18. The tangible, non-transitory, computer-readable media of claim 15, wherein determining the normalized least mean square (NMLS) of the n-th frame of the error signal comprises:
    - applying a frequency-dependent regularization parameter to adapt an NMLS learning rate of change between AEC iterations according to a magnitude of the measured signal.
  - 19. The tangible, non-transitory, computer-readable media of claim 15, wherein converting the sparse NMLS of the n-th frame of the error signal to the n-th update filter comprises:
    - converting the sparse NMLS of the n-th frame to a matrix of filter coefficients; and
      
      cross-band filtering the matrix of filter coefficients to generate the n-th update filter.
  - 20. The tangible, non-transitory, computer-readable media of claim 15, wherein the system excludes a double-talk detector that disables the AEC when a double-talk condition is detected, wherein capturing audio within the acoustic environment comprises capturing audio signals representing sound produced by two or more voices.
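
The frequency-dependent regularization recited in the dependent claims ties the learning rate to the magnitude of the measured signal, but the claims do not give a formula. The sketch below assumes one simple choice: the regularization term in the normalized update grows with the measured energy in each bin, so adaptation slows in bins where the microphone signal is loud. The functional form and the names `regularized_step`, `base_reg`, and `reg_gain` are hypothetical.

```python
import numpy as np

def regularized_step(ref, err, measured, mu=0.5, base_reg=1e-6, reg_gain=1.0):
    """Normalized update term with a per-bin regularizer (assumed form).

    A larger |measured| in a bin gives a larger regularizer and a smaller step.
    """
    reg = base_reg + reg_gain * np.abs(measured) ** 2
    return mu * np.conj(ref) * err / (np.abs(ref) ** 2 + reg)

# Same reference and error, quiet versus loud measured signal.
ref = np.ones(4, dtype=complex)
err = np.ones(4, dtype=complex)
step_quiet = regularized_step(ref, err, measured=np.full(4, 0.1))
step_loud = regularized_step(ref, err, measured=np.full(4, 10.0))
```

In this example the loud case yields a step roughly one hundred times smaller than the quiet case. That behaviour is one way a canceller could keep adapting without a separate double-talk detector.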


Current Assignee: Sonos, Inc.
Original Assignee: Sonos, Inc.
Inventors: Giacobello, Daniele
Primary Examiner(s): Zhang, Leshui
Application Number: US15/717,621
Publication Number: US 20190096419A1
Time in Patent Office: 748 Days
Field of Search

381/17, 381/18, 381/19, 381/20, 381/21, 381/300, 381/301, 381/302, 381/303, 381/307, 381/119, 381/66, 381/27, 381/71.1-71.6, 381/71.9, 381/71.11, 381/71.12, 381/26, 381/318, 381/86, 381/92, 381/94.1, 381/93, 381/95, 381/96, 381/122, 381/123, 379/406.01-406.16, 704/E19.014, 704/E21.002, 704/E21.004, 704/E21.007, 455/569.1, 455/570
CPC Class Codes

G10K 11/178   by electro-acoustically reg...

G10K 2210/3012   Algorithms

G10K 2210/3028   Filtering, e.g. Kalman filt...

G10K 2210/505   Echo cancellation, e.g. mul...

G10L 2021/02087   the noise being separate sp...

G10L 21/02   Speech enhancement, e.g. no...

G10L 21/0208   Noise filtering

G10L 21/0232   Processing in the frequency...

H04M 9/082   using echo cancellers echo ...

H04R 2227/003   Digital PA systems using, e...

H04R 2227/005   Audio distribution systems ...

H04R 2420/03   Connection circuits to sele...

H04R 2420/07   Applications of wireless lo...

H04R 2430/23   Direction finding using a s...

H04R 27/00   Public address systems circ...

H04R 29/007   for public address systems ...

H04R 3/005   for combining the signals o...

H04R 3/12   for distributing signals to...
