Robust short-time Fourier transform acoustic echo cancellation during audio playback
First Claim
1. A system comprising:
an audio stage comprising an audio processor and an audio amplifier;
one or more speakers;
one or more microphones;
one or more processors;
data storage storing instructions executable by the one or more processors that cause the system to perform operations comprising:
causing, via the audio stage, the one or more speakers to play back audio content;
while the audio content is playing back via the one or more speakers, capturing, via the one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
during each nth iteration of an acoustic echo canceller (AEC):
determining an nth frame of an output signal, wherein determining the nth frame of the output signal comprises:
generating an nth frame of a model signal by passing an nth frame of the reference signal through an nth instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
generating the nth frame of the output signal by redacting the nth frame of the model signal from an nth frame of the measured signal;
determining a n+1th instance of the adaptive filter for a next iteration of the AEC, wherein determining the n+1th instance of the adaptive filter for the next iteration of the AEC comprises:
determining an nth frame of an error signal, the nth frame of the error signal representing a difference between the nth frame of the model signal and the nth frame of the reference signal less audio signals representing sound from sources other than an nth frame of the audio signals representing sound produced by the one or more speakers in playing back the nth frame of the reference signal;
determining a normalized least mean square (NMLS) of the nth frame of the error signal;
determining a sparse NMLS of the nth frame of the error signal by applying to the NMLS of the nth frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
converting the sparse NMLS of the nth frame of the error signal to an nth update filter; and
generating the n+1th instance of the adaptive filter for the next iteration of the AEC by summing the nth instance of the adaptive filter with the nth update filter; and
sending the output signal as a voice input to one or more voice services for processing of the voice input.
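The filter-update steps recited in this claim can be sketched for a single STFT frame as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the filter is simplified to one complex tap per frequency bin, the error is taken as the conventional measured-minus-model residual, and the function name `sparse_nlms_step`, the step size `mu`, the regularizer `eps`, and the value of `energy_threshold` are all hypothetical choices made for the example.

```python
import numpy as np

def sparse_nlms_step(h, x, y, mu=0.5, eps=1e-8, energy_threshold=1e-4):
    """One AEC iteration on a single STFT frame (illustrative sketch).

    h: nth instance of the adaptive filter, one complex tap per bin (assumed form)
    x: nth frame of the reference signal (complex STFT bins)
    y: nth frame of the measured signal (complex STFT bins)
    Returns (nth output frame, n+1th filter instance).
    """
    model = h * x            # nth frame of the model signal
    out = y - model          # output frame: model removed from the measured frame
    err = out                # assumption: the residual serves as the error signal
    # Per-bin NLMS term: error times conjugate reference, normalized by reference power.
    nlms = mu * err * np.conj(x) / (np.abs(x) ** 2 + eps)
    # Assumed sparse partition criterion: zero bins whose update energy is below threshold.
    sparse = np.where(np.abs(nlms) ** 2 >= energy_threshold, nlms, 0.0)
    # The sparse term is the update filter; sum it with the current filter instance.
    return out, h + sparse
```

With this thresholding, adaptation in a bin stops once its update energy falls below `energy_threshold`, so a small residual mismatch remains by design while low-energy (likely noise-driven) updates are ignored.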
Abstract
Example techniques involve noise-robust acoustic echo cancellation. An example implementation may involve causing one or more speakers of a playback device to play back audio content and, while the audio content is playing back, capturing, via one or more microphones, audio within an acoustic environment that includes the audio playback. The example implementation may involve determining measured and reference signals in the STFT domain. During each nth iteration of an acoustic echo canceller (AEC), the implementation may involve determining an nth frame of an output signal by generating an nth frame of a model signal, which is done by passing an nth frame of the reference signal through an nth instance of an adaptive filter, and then redacting the nth frame of the model signal from an nth frame of the measured signal. The implementation may further involve determining an instance of the adaptive filter for a next iteration of the AEC.
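The per-frame loop described in the abstract can be sketched as below. This is an illustration only: the helper names `stft_frames` and `run_aec`, the frame length, hop size, Hann window, all-zero initial filter, and one-tap-per-bin filter are assumptions made for the example, and the filter update shown is a plain NLMS step used as a stand-in for whatever update the implementation actually applies.

```python
import numpy as np

def stft_frames(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return their one-sided FFTs."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (n_frames, frame_len // 2 + 1)

def run_aec(measured, reference, mu=0.5, eps=1e-8):
    """Per-frame loop: model the echo, remove it from the measured frame, adapt."""
    Y = stft_frames(measured)    # measured signal in the STFT domain
    X = stft_frames(reference)   # reference signal in the STFT domain
    h = np.zeros(X.shape[1], dtype=complex)   # initial filter (assumed all zeros)
    out = np.empty_like(Y)
    for n in range(len(X)):
        model = h * X[n]         # nth frame of the model signal
        out[n] = Y[n] - model    # nth frame of the output signal
        # Stand-in update (plain NLMS) producing the filter for the next iteration.
        h = h + mu * out[n] * np.conj(X[n]) / (np.abs(X[n]) ** 2 + eps)
    return out
```

For example, if `measured` is simply a scaled copy of `reference` (a pure echo with no other sound sources), the energy of the later output frames should fall toward zero as the filter converges.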
20 Claims
1. A system comprising:
an audio stage comprising an audio processor and an audio amplifier;
one or more speakers;
one or more microphones;
one or more processors;
data storage storing instructions executable by the one or more processors that cause the system to perform operations comprising:
causing, via the audio stage, the one or more speakers to play back audio content;
while the audio content is playing back via the one or more speakers, capturing, via the one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
during each nth iteration of an acoustic echo canceller (AEC):
determining an nth frame of an output signal, wherein determining the nth frame of the output signal comprises:
generating an nth frame of a model signal by passing an nth frame of the reference signal through an nth instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
generating the nth frame of the output signal by redacting the nth frame of the model signal from an nth frame of the measured signal;
determining a n+1th instance of the adaptive filter for a next iteration of the AEC, wherein determining the n+1th instance of the adaptive filter for the next iteration of the AEC comprises:
determining an nth frame of an error signal, the nth frame of the error signal representing a difference between the nth frame of the model signal and the nth frame of the reference signal less audio signals representing sound from sources other than an nth frame of the audio signals representing sound produced by the one or more speakers in playing back the nth frame of the reference signal;
determining a normalized least mean square (NMLS) of the nth frame of the error signal;
determining a sparse NMLS of the nth frame of the error signal by applying to the NMLS of the nth frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
converting the sparse NMLS of the nth frame of the error signal to an nth update filter; and
generating the n+1th instance of the adaptive filter for the next iteration of the AEC by summing the nth instance of the adaptive filter with the nth update filter; and
sending the output signal as a voice input to one or more voice services for processing of the voice input.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
9. A method to be performed by a system comprising a playback device, the method comprising:
causing, via an audio stage of the playback device, one or more speakers of the playback device to play back audio content, wherein the audio stage comprises an audio processor and an audio amplifier;
while the audio content is playing back via the one or more speakers, capturing, via one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
during each nth iteration of an acoustic echo canceller (AEC):
determining an nth frame of an output signal, wherein determining the nth frame of the output signal comprises:
generating an nth frame of a model signal by passing an nth frame of the reference signal through an nth instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
generating the nth frame of the output signal by redacting the nth frame of the model signal from an nth frame of the measured signal;
determining a n+1th instance of the adaptive filter for a next iteration of the AEC, wherein determining the n+1th instance of the adaptive filter for the next iteration of the AEC comprises:
determining an nth frame of an error signal, the nth frame of the error signal representing a difference between the nth frame of the model signal and the nth frame of the reference signal less audio signals representing sound from sources other than an nth frame of the audio signals representing sound produced by the one or more speakers in playing back the nth frame of the reference signal;
determining a normalized least mean square (NMLS) of the nth frame of the error signal;
determining a sparse NMLS of the nth frame of the error signal by applying to the NMLS of the nth frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
converting the sparse NMLS of the nth frame of the error signal to an nth update filter; and
generating the n+1th instance of the adaptive filter for the next iteration of the AEC by summing the nth instance of the adaptive filter with the nth update filter; and
sending the output signal as a voice input to one or more voice services for processing of the voice input.
View Dependent Claims (10, 11, 12, 13, 14)
15. A tangible, non-transitory, computer-readable media having stored therein instructions executable by one or more processors to cause a system to perform operations comprising:
causing, via an audio stage of a playback device, one or more speakers of the playback device to play back audio content, wherein the audio stage comprises an audio processor and an audio amplifier;
while the audio content is playing back via the one or more speakers, capturing, via one or more microphones, audio within an acoustic environment, wherein the captured audio comprises audio signals representing sound produced by the one or more speakers in playing back the audio content;
receiving a playback signal from the audio stage representing the audio content being played back by the one or more speakers;
transforming into a short time Fourier transform (STFT) domain the captured audio within the acoustic environment to generate a measured signal in the STFT domain comprising a series of frames representing the captured audio within the acoustic environment;
transforming into the STFT domain the received output signal from the audio stage to generate a reference signal in the STFT domain comprising a series of frames representing the audio content being played back via the one or more speakers;
during each nth iteration of an acoustic echo canceller (AEC):
determining an nth frame of an output signal, wherein determining the nth frame of the output signal comprises:
generating an nth frame of a model signal by passing an nth frame of the reference signal through an nth instance of an adaptive filter, wherein the first instance of the adaptive filter is an initial filter; and
generating the nth frame of the output signal by redacting the nth frame of the model signal from an nth frame of the measured signal;
determining a n+1th instance of the adaptive filter for a next iteration of the AEC, wherein determining the n+1th instance of the adaptive filter for the next iteration of the AEC comprises:
determining an nth frame of an error signal, the nth frame of the error signal representing a difference between the nth frame of the model signal and the nth frame of the reference signal less audio signals representing sound from sources other than an nth frame of the audio signals representing sound produced by the one or more speakers in playing back the nth frame of the reference signal;
determining a normalized least mean square (NMLS) of the nth frame of the error signal;
determining a sparse NMLS of the nth frame of the error signal by applying to the NMLS of the nth frame of the error signal, a sparse partition criterion that zeroes out frequency bands of the NMLS having less than a threshold energy;
converting the sparse NMLS of the nth frame of the error signal to an nth update filter; and
generating the n+1th instance of the adaptive filter for the next iteration of the AEC by summing the nth instance of the adaptive filter with the nth update filter; and
sending the output signal as a voice input to one or more voice services for processing of the voice input.
View Dependent Claims (16, 17, 18, 19, 20)
Specification