Online dereverberation algorithm based on weighted prediction error for noisy time-varying environments

Abstract
Systems and methods for processing multichannel audio signals include receiving a multichannel time-domain audio input, transforming the input signal to a plurality of multichannel frequency domain, k-spaced undersampled subband signals, buffering and delaying each channel, saving a subset of spectral frames for prediction filter estimation at each of the spectral frames, estimating a variance of the frequency domain signal at each of the spectral frames, adaptively estimating the prediction filter in an online manner using a recursive least squares (RLS) algorithm, linearly filtering each channel using the estimated prediction filter, nonlinearly filtering the linearly filtered output signal using the estimated variances to reduce residual reverberation, producing a nonlinearly filtered output signal, and synthesizing the nonlinearly filtered output signal to reconstruct a dereverberated time-domain multichannel audio signal.
11 Citations
DEREVERBERATION APPARATUS, DEREVERBERATION METHOD, DEREVERBERATION PROGRAM, AND RECORDING MEDIUM  
Patent #
US 20110002473A1
Filed 02/27/2009

Current Assignee
Nippon Telegraph and Telephone Corporation

Sponsoring Entity
Nippon Telegraph and Telephone Corporation

METHOD AND SYSTEM FOR REDUCING ACOUSTICAL REVERBERATIONS IN AN AT LEAST PARTIALLY ENCLOSED SPACE  
Patent #
US 20110129096A1
Filed 11/30/2009

Current Assignee
Emmet Raftery

Sponsoring Entity
Emmet Raftery

HEARING AID SYSTEM WITH FEEDBACK ARRANGEMENT TO PREDICT AND CANCEL ACOUSTIC FEEDBACK, METHOD AND USE  
Patent #
US 20100254555A1
Filed 09/30/2008

Current Assignee
Oticon AS

Sponsoring Entity
Oticon AS

Noise Eliminating Apparatus  
Patent #
US 20090214054A1
Filed 03/07/2006

Current Assignee
TOA Corporation

Sponsoring Entity
TOA Corporation

CONTROL SYSTEM  
Patent #
US 20090271005A1
Filed 04/24/2009

Current Assignee
Music Group IP Limited

Sponsoring Entity
Tannoy Limited

Multi-input channel and multi-output channel echo cancellation  
Patent #
US 20060002546A1
Filed 06/10/2005

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Corporation

Microphone array signal enhancement  
Patent #
US 20030206640A1
Filed 05/02/2002

Current Assignee
Microsoft Technology Licensing LLC

Sponsoring Entity
Microsoft Corporation

SYSTEM FOR MODIFYING AN ACOUSTIC SPACE WITH AUDIO SOURCE CONTENT  
Patent #
US 20120275613A1
Filed 07/09/2012

Current Assignee
Harman International Industries Incorporated

Sponsoring Entity
Harman International Industries Incorporated

DEREVERBERATION PARAMETER ESTIMATION DEVICE AND METHOD, DEREVERBERATION/ECHO-CANCELLATION PARAMETER ESTIMATION DEVICE, DEREVERBERATION DEVICE, DEREVERBERATION/ECHO-CANCELLATION DEVICE, AND ONLINE CONFERENCING SYSTEM  
Patent #
US 20150016622A1
Filed 02/15/2013

Current Assignee
Hitachi America Limited

Sponsoring Entity
Hitachi America Limited

ACTIVE NOISE REDUCTION DEVICE AND ACTIVE NOISE REDUCTION METHOD  
Patent #
US 20150063581A1
Filed 06/25/2013

Current Assignee
Panasonic Intellectual Property Management Co. Ltd.

Sponsoring Entity
Panasonic Intellectual Property Management Co. Ltd.

Selective Audio Source Enhancement  
Patent #
US 20150117649A1
Filed 10/06/2014

Current Assignee
Synaptics Incorporated

Sponsoring Entity
Synaptics Incorporated

20 Claims
1. A method for processing multichannel audio signals comprising:
receiving an input signal comprising a time-domain, multichannel audio signal;
transforming the input signal to a frequency domain input signal comprising a plurality of multichannel frequency domain, k-spaced undersampled subband signals;
buffering and delaying each channel of the frequency domain input signal;
saving a subset of spectral frames for prediction filter estimation at each of the spectral frames;
estimating a variance of the frequency domain input signal at each of the spectral frames;
adaptively estimating a prediction filter in an online manner by using a recursive least squares (RLS) algorithm and a cost function based at least in part on the estimated variance;
linearly filtering each channel of the frequency domain input signal to reduce reverberation using the estimated prediction filter to produce a linearly filtered output signal;
nonlinearly filtering the linearly filtered output signal to reduce residual reverberation using the estimated variances, producing a nonlinearly filtered output signal; and
synthesizing the nonlinearly filtered output signal to reconstruct a dereverberated time-domain, multichannel audio signal, wherein a number of output channels is equal to a number of input channels.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
12. An audio processing system comprising:
an audio input operable to receive a time-domain, multichannel audio signal;
a subband decomposition module operable to transform the input signal to a frequency domain input signal comprising a plurality of multichannel frequency domain, k-spaced undersampled subband signals;
a buffer operable to buffer and delay each channel of the frequency domain input signal, saving a subset of spectral frames for prediction filter estimation at each of the spectral frames;
a variance estimator operable to estimate a variance of the frequency domain input signal at each of the spectral frames;
a prediction filter estimator operable to adaptively estimate the prediction filter in an online manner by using a recursive least squares (RLS) algorithm having a cost function based at least in part on the estimated variance;
a linear filter operable to linearly filter each channel of the frequency domain input signal to reduce reverberation using the estimated prediction filter to produce a linearly filtered output signal;
a nonlinear filter operable to nonlinearly filter the linearly filtered output signal to reduce residual reverberation using the estimated variances, producing a nonlinearly filtered output signal; and
a synthesizer operable to synthesize the nonlinearly filtered output signal to reconstruct a dereverberated time-domain, multichannel audio signal, wherein a number of output channels is equal to a number of input channels.
View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
Specification
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 62/438,860 filed Dec. 23, 2016, and entitled “ONLINE DEREVERBERATION ALGORITHM BASED ON WEIGHTED PREDICTION ERROR FOR NOISY TIME-VARYING ENVIRONMENTS,” which is incorporated herein by reference in its entirety.
The present application relates generally to audio processing, and more specifically to dereverberation of multichannel audio signals.
Reverberation reduction solutions are known in the field of audio signal processing. Many conventional approaches are not suitable for use in real-time applications. For example, a reverberation reduction solution may require a long buffer of data to compensate for the effect of reverberation or to estimate an inverse filter of the Room Impulse Responses (RIR). Approaches that are suitable for real-time applications do not perform reasonably well in highly reverberant and, especially, highly non-stationary environments. In addition, such solutions require a large amount of memory and are not computationally efficient for many low-power devices.
One conventional solution is based on weighted prediction error (WPE), which assumes an autoregressive model of the reverberation process, i.e., it is assumed that the reverberant component at a certain time can be predicted from previous samples of the reverberant microphone signals. The desired signal can be estimated as the prediction error of the model. A fixed delay is introduced to avoid distortion of the short-time correlation of the speech signal. This algorithm is not suitable for real-time processing and does not explicitly model the input signal in noisy conditions. Also, the WPE method has high complexity and is not an online multiple-input multiple-output (MIMO) solution. The WPE approach has been extended for MIMO and generalized for use in noisy conditions. However, such modifications are not suitable for time-varying environments. Further modifications for time-varying environments have been proposed, which include both WPE for linear filtering and an optimal combination of beamforming and Wiener-filtering-based nonlinear filtering. However, such proposals are still not real-time and are not suitable for use in low-power devices because of their high complexity.
Generally, conventional methods have limitations in complexity and practicality for use in online and real-time applications. Unlike batch processing, real-time or online processing is used in industry for many practical applications. There is therefore a need for improved systems and methods for online and real-time dereverberation.
Systems and methods including embodiments for online dereverberation based on weighted prediction error for noisy time-varying environments are disclosed. In various embodiments, a method for processing multichannel audio signals includes receiving an input signal comprising a time-domain, multichannel audio signal, transforming the input signal to a frequency domain input signal comprising a plurality of multichannel frequency domain, k-spaced undersampled subband signals, buffering and delaying each channel of the frequency domain input signal, saving a subset of spectral frames for prediction filter estimation at each of the spectral frames, estimating a variance of the frequency domain input signal at each of the spectral frames, and adaptively estimating the prediction filter in an online manner by using a recursive least squares (RLS) algorithm. The method further includes linearly filtering each channel of the frequency domain input signal using the estimated prediction filter to produce a linearly filtered output signal, nonlinearly filtering the linearly filtered output signal using the estimated variances to reduce residual reverberation, producing a nonlinearly filtered output signal, and synthesizing the nonlinearly filtered output signal to reconstruct a dereverberated time-domain, multichannel audio signal, wherein a number of output channels is equal to a number of input channels.
In various embodiments, estimating the variance of the frequency domain input signal may further comprise estimating a clean speech variance, estimating a noise variance, and/or estimating a residual speech variance. In various embodiments, the method may further include using an adaptive RLS algorithm to estimate the prediction filter at each frame independently for each frequency bin of the frequency domain input signal by imposing sparsity on a correlation matrix.
In various embodiments, the input signal comprises at least one target signal, and the nonlinear filtering computes an enhanced speech signal for each target signal to reduce residual reverberation and background noise. The variance estimation process may include estimating a new clean speech variance based on a previously estimated prediction filter, estimating a new residual reverberation variance using a fixed exponentially decaying weighting function with a tuning parameter to customize an audio solution, and estimating a noise variance using a single-microphone noise variance estimation method to estimate the noise variance for each channel and then compute an average. The method may also detect sudden changes, for example due to speaker movement, and reset the prediction filter and correlation matrix.
In various embodiments, an audio processing system includes an audio input, a subband decomposition module, a buffer, a variance estimator, a prediction filter estimator, a linear filter, a nonlinear filter and a synthesizer. The audio input is operable to receive a time-domain, multichannel audio signal. The subband decomposition module is operable to transform the input signal to a frequency domain input signal comprising a plurality of multichannel frequency domain, k-spaced undersampled subband signals. The buffer is operable to buffer and delay each channel of the frequency domain input signal, saving a subset of spectral frames for prediction filter estimation at each of the spectral frames.
In various embodiments, the variance estimator is operable to estimate a variance of the frequency domain input signal at each of the spectral frames. The variance estimator may be further operable to estimate a clean speech variance, a noise variance, and/or a residual speech variance. The variance estimator may be further operable to estimate a new clean speech variance based on a previously estimated prediction filter, estimate a new residual reverberation variance using a fixed exponentially decaying weighting function with a tuning parameter to customize an audio solution, and estimate a noise variance using a single-microphone noise variance estimation method to estimate the noise variance for each channel and then compute an average. The variance estimator may be further operable to detect changes due to speaker movement and to reset the prediction filter and the correlation matrix.
In one or more embodiments, the prediction filter estimator is operable to adaptively estimate the prediction filter in an online manner by using a recursive least squares (RLS) algorithm. The prediction filter estimator may be further operable to use an adaptive RLS algorithm to estimate the prediction filter at each frame independently for each frequency bin of the frequency domain input signal by imposing sparsity on a correlation matrix.
In various embodiments, the linear filter is operable to linearly filter each channel of the frequency domain input signal using the estimated prediction filter to produce a linearly filtered output signal. The nonlinear filter is operable to nonlinearly filter the linearly filtered output signal using the estimated variances to reduce residual reverberation, producing a nonlinearly filtered output signal. In one embodiment, the time-domain, multichannel audio signal comprises at least one target signal, and the nonlinear filter is further operable to compute an enhanced speech signal for each target signal and to reduce residual reverberation and background noise. The synthesizer is operable to synthesize the nonlinearly filtered output signal to reconstruct a dereverberated time-domain, multichannel audio signal, wherein a number of output channels is equal to a number of input channels.
The scope of the invention is defined by the claims, which are incorporated into this section by reference. A more complete understanding of embodiments of the invention will be afforded to those skilled in the art, as well as a realization of additional advantages thereof, by a consideration of the following detailed description of one or more embodiments. Reference will be made to the appended sheets of drawings that will first be described briefly.
Aspects of the disclosure and their advantages can be better understood with reference to the following drawings and the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures, wherein showings therein are for purposes of illustrating embodiments of the present disclosure and not for purposes of limiting the same. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure.
In accordance with various embodiments of the present disclosure, systems and methods for dereverberation of multichannel audio signals are provided.
Generally, conventional methods have limitations in complexity and practicality for use in online and real-time applications. Unlike batch processing, real-time or online processing has been used in industry for many practical applications. Online adaptive algorithms have been developed for these applications, such as a Recursive Least Squares (RLS) method used to develop an adaptive WPE approach, or a Kalman filter approach in which a multi-microphone algorithm simultaneously estimates the clean speech signal and the time-varying acoustic system. A recursive expectation-maximization scheme is employed to obtain both the clean speech signal and the acoustic system in an online manner. However, in both the RLS-based and Kalman-filter-based algorithms, the methods do not perform well in highly non-stationary conditions. In addition, the computational complexity and memory usage of both the Kalman and RLS algorithms are unreasonably high for many applications. Further, despite their fast convergence to a stable solution, the algorithms may be too sensitive to sudden changes and may require a change detector to reset the correlation matrices and filters to their initial values.
Online multiple-input multiple-output (MIMO) embodiments for dereverberation using subband-domain processing are disclosed herein. In various embodiments, multichannel linear prediction filters adapted to blindly shorten the Room Impulse Responses (RIRs) between an unknown number of sources and the microphones are estimated online. In one embodiment, an RLS algorithm is used for fast convergence. However, some approaches using RLS may be characterized by high computational complexity. In various environments, low computational complexity and low memory consumption may be desired. In various embodiments of the systems and methods disclosed herein, the memory usage and the computational complexity are reduced by imposing sparsity on a correlation matrix. In one embodiment, a new method is proposed for identifying the movement of a speaker or audio source in time-varying environments, including reinitializing the prediction filters and improving the convergence speed in time-varying environments.
In various real-world environments, a speech source may be mixed with environmental noise. A recorded speech signal typically includes unwanted noise, which can degrade speech intelligibility for voice applications, such as Voice over IP (VoIP) communications, and can decrease the speech recognition performance of devices such as phones, laptops and voice-controlled appliances. One approach to addressing the problem of noise interference is to use a microphone array and beamforming algorithms, which can exploit the spatial diversity of noise sources to detect or extract desired source signals and to suppress unwanted interference. Beamforming represents a class of such multichannel signal processing algorithms and provides spatial filtering that points a beam of increased sensitivity toward desired source locations while suppressing signals originating from other locations.
In indoor environments, noise suppression approaches may be more effective when the signal source is close to the microphones, which may be referred to as a near-field scenario. However, noise suppression may be more complicated when the distance between the source and the microphones is increased.
Referring to the accompanying figures, the performance of many microphone array processing techniques, such as sound source localization, beamforming and Automatic Speech Recognition (ASR), may be noticeably degraded in reverberant environments such as the illustrated environment.
Conventional methods for addressing reverberation have limitations that make them unsuitable for many applications. For example, computational complexity may render an algorithm impractical for many real-world cases that require real-time, online processing. Such algorithms may also have high memory consumption that is not suitable for embedded devices, which may require memory-efficient algorithms. In a real environment, the reverberant speech signals are usually contaminated with non-stationary additive background noise, which can greatly deteriorate the performance of dereverberation algorithms that do not explicitly address the non-stationary noise in their model. Many dereverberation methods use batch approaches that require a large amount of input data to achieve good performance. However, in applications such as VoIP and hearing aids, I/O latency is undesirable.
Many conventional dereverberation methods produce fewer dereverberated signals than there are microphones in the input microphone array, and do not conserve the time differences of arrival (TDOAs) at the various microphone positions. In some applications, however, source localization algorithms may be explicitly or implicitly based on TDOAs at the microphone positions. Other drawbacks of conventional dereverberation methods include algorithms that require knowledge of the number of sound sources, and methods that do not converge quickly, making the algorithm slow to respond to new changes.
The embodiments disclosed herein address limitations of conventional systems, providing solutions for use in different applications in industry. In one embodiment, an algorithm provides fast convergence and no latency, which makes it desirable for applications like VoIP. A blind method uses multichannel input signals for shortening a MIMO RIR between an unknown number of sources and the microphones. Subband-domain multichannel linear prediction filters are used, and the algorithm estimates the filter for each frequency band independently. One advantage of this method is that it can conserve the TDOAs at the microphone positions as well as the linear relationship between sources and microphones, which is beneficial if further processing for localization and reduction of noise and interference is required. In addition, the algorithm can yield as many dereverberated signals as microphones by estimating the prediction filter for each microphone separately. Additive background noise may also be considered in the model to adaptively estimate the prediction filter in an online manner using an adaptive algorithm. In this manner, the algorithm may adaptively estimate the Power Spectral Density (PSD) of the noise.
Embodiments of the present disclosure provide numerous advantages over conventional approaches. Various embodiments provide real-time dereverberation with no latency. A MIMO algorithm is disclosed, so it can be easily integrated with other multichannel signal processing blocks, e.g., for noise reduction or source localization. Embodiments disclosed herein are memory- and computation-efficient, requiring fewer MIPS. The solutions are robust to time-varying environments and are fast to converge. In various embodiments, the nonlinear filtering that further reduces the noise and the residual reverberation may be skipped, allowing the algorithm to provide purely linear processing, which may be critical for applications that require linearity. The solutions are robust to non-stationary noise and can perform well in highly reverberant conditions. The solutions can be both single-channel and multichannel, and can be extended for the case of more than one source.
Embodiments of the present disclosure will now be described with reference to the illustrated system.
Audio signals 202 received from an array of microphones are provided to the subband decomposition module 210, which performs a subband analysis to transform the time domain signals into subband frames. The buffer 220 stores the last L_k frames of subband signals for all the channels (the number of past frames is subband dependent). The variance estimation component 230 estimates the variance of the current frame to be used for prediction filter estimation and nonlinear filtering. The prediction filter estimation component 240 uses an adaptive online approach that is fast to converge. The linear filtering component 250 reduces most of the reverberation. The nonlinear filtering component 260 reduces the residual reverberation and noise. The synthesizer 270 transforms the enhanced subband domain signals to the time domain.
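The processing chain above can be sketched as a frame-based analysis/synthesis loop. This is a minimal, hypothetical skeleton: it uses a plain FFT/overlap-add frame transform rather than the undersampled subband filter bank the text describes, and the filtering stages are placeholders.

```python
import numpy as np

def process_frames(x, frame_len=256):
    """Frame-based analysis -> (placeholder) processing -> synthesis.
    x: (channels, samples) time-domain multichannel signal.
    A real implementation would insert the buffer/delay, variance
    estimation, prediction-filter update, and linear/nonlinear
    filtering between analysis and synthesis."""
    M, N = x.shape
    hop = frame_len // 2
    win = np.hanning(frame_len)
    y = np.zeros_like(x, dtype=float)
    norm = np.zeros(N)
    for start in range(0, N - frame_len + 1, hop):
        frames = np.fft.rfft(x[:, start:start + frame_len] * win, axis=1)
        # ... dereverberation stages would operate on `frames` here ...
        out = np.fft.irfft(frames, n=frame_len, axis=1) * win
        y[:, start:start + frame_len] += out
        norm[start:start + frame_len] += win ** 2
    return y / np.maximum(norm, 1e-12)  # overlap-add normalization
```

With the placeholder stages left as identity, the loop reconstructs the input in the frame interior, which is a useful sanity check before inserting real processing.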
In operation, the microphone array 202 receives a plurality of input signals. Assume the input signal for the ith channel is denoted by x_i[n], where i=1 . . . M, with M being the number of microphones that sense a number of different audio sources, N_s. Then the input signal can be modeled as

x_i[n] = sum_{j=1..N_s} (h_ij * s_j)[n] + v_i[n]   (1)

where:
s[n] = [s_1[n] . . . s_{N_s}[n]]^T is a vector of all sources (clean speech);
h_i[n] = [h_i1[n] . . . h_iN_s[n]] is the Room Impulse Response (RIR) between the ith microphone and each source; and
v_i[n] is the background noise for the ith microphone.
The received signal in the Short-Time Fourier Transform (STFT) domain can be approximately modeled as

X_i(l,k) ≈ sum_{l′=0..L_i−1} h_i^T(l′,k) s(l−l′,k) + V_i(l,k)   (2)

where L_i is the length of the RIR in the STFT domain, l is the frame index, and k is the frequency-bin index. The ith received input signal can be separated into an early reflection part (the desired signal) and a late reverberation part as

X_i(l,k) = Y_i(l,k) + R_i(l,k) + V_i(l,k), with
Y_i(l,k) = sum_{l′=0..D−1} h_i^T(l′,k) s(l−l′,k) and
R_i(l,k) = sum_{l′=D..L_i−1} h_i^T(l′,k) s(l−l′,k)   (3)

where D is the tap-length of the early reflections. The goal is to extract the first term in (3) (Y_i(l,k)) by reducing the second, late reverberation term (R_i(l,k)) and the third term (V_i(l,k)) in noisy conditions.
In one or more embodiments, to estimate the late reverberation part, the late reflections of the RIR are estimated along with the source signal. In order to make this task easier, the dereverberation is performed by converting (3) into an easier multichannel autoregressive model as given below:

X_i(l,k) = Y_i(l,k) + sum_{l′=D..D+L_k−1} W_i^H(l′,k) X(l−l′,k) + V_i(l,k)   (4)

In (4) the only unknown parameter to be estimated is the prediction filter (W_i(l′,k)=[W_i1(l′,k), . . . , W_iM(l′,k)]^T, an M×1 vector, and X(l−l′,k)=[X_1(l−l′,k), . . . , X_M(l−l′,k)]^T, an M×1 vector).
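The linear filtering step implied by this autoregressive model can be sketched per frequency bin as follows. This is a hypothetical NumPy sketch, not the patent's code: `X_hist` stands for the delayed multichannel frames held by the buffer, and `W` for the current prediction filters.

```python
import numpy as np

def linear_dereverb(X_cur, X_hist, W):
    """Subtract the predicted late reverberation from the current frame.
    X_cur:  (M,) current multichannel subband sample for one bin
    X_hist: (L_k, M) delayed past frames (taps D .. D+L_k-1)
    W:      (L_k, M, M) prediction filters; W[lp, :, i] predicts channel i
    Returns the (M,) linearly filtered (dereverberated) frame."""
    M = X_cur.shape[0]
    Y = np.empty_like(X_cur)
    for i in range(M):
        # predicted late reverberation for channel i: sum_l' W_i^H(l') X(l-l')
        late = sum(np.vdot(W[lp, :, i], X_hist[lp])
                   for lp in range(X_hist.shape[0]))
        Y[i] = X_cur[i] - late
    return Y
```

Because one filter is estimated per output channel, the method yields as many dereverberated signals as microphones, consistent with the MIMO property described in the text.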
In one or more embodiments, to estimate the prediction filter, the Maximum Likelihood (ML) approach is used. In one embodiment, the prediction filter is based on the following assumptions: (1) the received speech signal has a Gaussian Probability Density Function (pdf) and the clean part of the received speech has zero mean with timevarying variance. Also, noise is assumed to have zero mean; (2) the frames of the input signal are independent random variables; and (3) the RIRs do not change or they change slowly.
Considering the above assumptions, the pdf of the input signal for T frames can be written as follows:

p(X(1,k), . . . , X(T,k)) = prod_{l=1..T} (1 / (π^M det Σ(l,k))) exp(−(X(l,k)−μ(l,k))^H Σ^{−1}(l,k) (X(l,k)−μ(l,k)))   (5)

where μ(l,k) is the mean and Σ(l,k) is the M×M spatial correlation matrix.
As mentioned above, the ML method is used to estimate the prediction filter, and so the log-likelihood obtained from the pdf in (5) is considered as the cost function to be maximized:

L(W) = −sum_{l=1..T} [ log det Σ(l,k) + (X(l,k)−μ(l,k))^H Σ^{−1}(l,k) (X(l,k)−μ(l,k)) ] + const   (6)
According to the above assumptions, the mean can be approximately obtained as

μ_i(l,k) ≈ sum_{l′=D..D+L_k−1} W_i^H(l′,k) X(l−l′,k)   (7)
In order to be able to practically estimate the prediction filter in an online manner, it is further assumed that the correlation matrix can be approximated by a scaled identity matrix as follows:

Σ(l,k) ≈ σ(l,k) I_M   (8)
Now the variance scale σ(l,k) can be obtained as

σ(l,k) = σ_s(l,k) + σ_reverb(l,k) + σ_noise(l,k)   (9)

where σ_s(l,k), σ_reverb(l,k), and σ_noise(l,k) are the variance of the source signal, the residual reverberation variance, and the noise variance, respectively.
Equation (6) for the single-channel case can be simplified using (8) into a weighted Mean Square Error (MSE) optimization problem:

W_1(k) = argmin sum_{l=1..T} |e(l,k)|^2 / σ(l,k),  e(l,k) = X_1(l,k) − sum_{l′=D..D+L_k−1} W_1^H(l′,k) X(l−l′,k)   (10)

where e(l,k) is the error signal.
In one or more embodiments, to estimate the prediction filter in an online manner, the MSE cost function is minimized by selecting the prediction filter W_1(l′,k) and updating the filter as new data arrives. In this embodiment, the Recursive Least Squares (RLS) filter is used to estimate the prediction filter. To do so, the cost function is revised using a forgetting factor (0<λ≤1) as

J(l,k) = sum_{τ=1..l} λ^{l−τ} |e(τ,k)|^2 / σ(τ,k)   (11)

One goal is to minimize the above cost function in an efficient way and reduce both the noise and the reverberation. The proposed system, shown in the illustrated embodiment, is described below.
As shown in the figures, in order to reduce the memory consumption and improve the performance of the system, a shorter filter length is used for higher frequency bins and a longer length for lower frequency bins.
After the subband decomposition 210, the input signal for each microphone is provided to the buffer with delay 220, an embodiment of which is illustrated in the figures.
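The subband-dependent buffering can be sketched as follows. This is a hypothetical illustration of keeping a longer history L_k for low-frequency bins and a shorter one for high-frequency bins; the class and parameter names are not from the patent.

```python
import numpy as np
from collections import deque

class SubbandBuffer:
    """Keeps the last L_k frames of M-channel subband data per bin,
    with L_k chosen per subband (longer history at low frequencies)."""
    def __init__(self, taps_per_bin):
        # taps_per_bin[k] = L_k, the number of past frames kept for bin k
        self.buffers = [deque(maxlen=L) for L in taps_per_bin]

    def push(self, frame):
        # frame[k] is an M-vector of complex subband samples for bin k
        for k, x in enumerate(frame):
            self.buffers[k].append(x)

    def history(self, k):
        # stacked retained frames for bin k, oldest first, shape (<=L_k, M)
        return np.array(self.buffers[k])

# Example: 4 bins, M=2 channels, shorter history at higher frequencies
buf = SubbandBuffer(taps_per_bin=[8, 8, 4, 4])
for t in range(10):
    buf.push(np.full((4, 2), t, dtype=complex))
```

Using a bounded deque per bin means memory grows with sum(L_k) rather than with the full signal length, matching the memory-saving motivation in the text.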
The final cost function for RLS filter update in (11) has a variance σ(l,k) which is estimated by the variance estimator 230. According to (9), the variance has three components.
Referring to the illustrated variance estimation process, in step 402 the clean speech variance is estimated from the prediction error of the linear filter, where for the late reverberation the current prediction filter is used.
In step 404, the variance of the residual reverberation is estimated. From (12), this variance may be estimated from the residual late reverberation weights W̃_i(l′,k) for the lth frame, which are unknown parameters. In one embodiment, the residual reverberation weights are estimated in an online manner, where β and w_0 are a forgetting factor (very close to one) and a number for residual weight initialization, and ε is a very small number to avoid division by zero. This approach provides good performance in different reverberant environments, but it has some drawbacks depending on the implementation. First, it adds complexity to the method, since the unknown residual reverberation weights must be estimated for the variance estimation. Second, additional memory may be required, which is not desirable for many low-memory devices (e.g., mobile phones). Third, it is suited to static environments, and the performance may decrease in fast time-varying environments.
To resolve these issues, an alternate approach uses a fixed residual reverberation weight having an exponentially decaying function, where b and η are the Rayleigh distribution parameter and a small number on the order of 0.01, respectively. Depending on the number of taps L_k, the residual reverberation weights may look like a Gaussian pdf. Experimental results showed this alternate approach is only marginally suboptimal compared to the online estimate, but has lower computational complexity and faster convergence in time-varying environments.
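A fixed Rayleigh-shaped weight curve of the kind described can be sketched as follows. This is a hypothetical construction: the exact functional form, the value of b, and the peak normalization are assumptions, not taken from the patent.

```python
import numpy as np

def residual_reverb_weights(L_k, b=4.0, eta=0.01):
    """Fixed residual-reverberation weights over L_k taps:
    a Rayleigh-pdf-shaped, exponentially decaying curve scaled so its
    peak equals the small constant eta (~0.01). b controls where the
    hump peaks; both are tuning parameters in this sketch."""
    taps = np.arange(1, L_k + 1, dtype=float)
    # Rayleigh pdf shape: (t / b^2) * exp(-t^2 / (2 b^2)), peak near t = b
    w = (taps / b**2) * np.exp(-(taps**2) / (2 * b**2))
    return eta * w / w.max()  # normalize the peak to eta

w = residual_reverb_weights(16)
```

Because the curve is fixed, nothing has to be adapted per frame, which is consistent with the lower complexity and faster convergence the text attributes to this variant.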
In step 406, the noise variance σ^υ(l,k) is estimated for each channel using an efficient real-time single-channel method, and the estimates are averaged over all the channels to obtain a single noise variance value.
Referring back to the illustrated system, the prediction filter estimator 240 operates as follows.
Rewriting the mean μ_i(l,k) in (7) in vector form provides:

W_i(k) = [w_1^i(0,k), . . . , w_1^i(L_k−1,k), . . . , w_M^i(0,k), . . . , w_M^i(L_k−1,k)]^T

μ_i(l,k) = W_i^H(k) X̄(l,k), with X̄(l,k) = [X_1(l−D,k), . . . , X_1(l−D−L_k+1,k), . . . , X_M(l−D,k), . . . , X_M(l−D−L_k+1,k)]^T the stacked buffer of delayed input frames

where W_i(k) is the prediction filter for frequency band k and the ith channel. Now the error in (11) can be rewritten as:

e_i(l,k) = X_i(l,k) − W_i^H(k) X̄(l,k)
In one embodiment, in order to estimate W_{i}^{1}(k) in an online manner for lth frame, the prediction filters, W_{i}(k), should be initialized by zero values for all the frequency and channels and then gradient of the cost function in (11) which is a vector of L_{k}*M numbers should be computed. The update rule using RLS algorithm can be summarized as follows:
initialize: w_{m}(0,k)=0 and Φ(0,k)=γI, where γ is a regularization factor and I is the identity matrix
where Φ(l,k) is a (L_{k}M×L_{k}M) correlation matrix.
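The per-frame update equations are not reproduced above; a minimal sketch of one standard variance-weighted RLS step for a single band and channel, as used in adaptive WPE, might look like the following (the names and the exact recursion are assumptions, since equation (19) itself is not shown):

```python
import numpy as np

def rls_wpe_step(W, P, psi, y, var, beta=0.995):
    """One RLS update of the WPE prediction filter for a single band/channel.

    W   : (L*M,) current prediction filter (complex)
    P   : (L*M, L*M) inverse correlation matrix Phi^{-1}
    psi : (L*M,) stacked buffered/delayed subband observations
    y   : complex scalar, current observation for this channel
    var : estimated signal variance lambda(l, k) for this frame
    """
    Pp = P @ psi
    k = Pp / (beta * var + np.vdot(psi, Pp).real)  # gain vector
    e = y - np.vdot(W, psi)                        # prediction error
    W = W + k * np.conj(e)                         # filter update
    P = (P - np.outer(k, np.conj(psi)) @ P) / beta # inverse-correlation update
    return W, P, e
```

Tracking the inverse correlation matrix P directly avoids an explicit matrix inversion at every frame, which is what gives RLS its fast convergence.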
In this embodiment, the RLS algorithm has a fast convergence rate and generally outperforms other adaptive algorithms, but it has two drawbacks depending on the application. First, the algorithm has both the prediction filters and the correlation matrix as unknown parameters. The correlation matrix is complex-valued and has K×(L_{k}M×L_{k}M) complex numbers for K frequency bands. This may require a relatively high amount of memory, so the RLS algorithm may not be suitable for certain applications requiring low memory. The computational complexity of the algorithm can also be unreasonable for such applications. Second, the RLS algorithm can efficiently converge towards the exact solution by taking advantage of the correlation matrix. However, in time-varying conditions this might cause performance issues, since the algorithm takes more time to track sudden changes. Below, embodiments providing solutions to both problems are disclosed.
In one embodiment, the complexity of the RLS algorithm is reduced. The correlation matrix given in (19) can be also rewritten as follows:
Computationally, the main part of the update for the correlation matrix in (20) is
In (21), it is noted that the most significant components of Φ(l,k) are the main diagonals of A_{L_k×L_k}, B_{L_k×L_k} and C_{L_k×L_k}; the other components have amplitudes close to zero. Retaining only these diagonals, which are real-valued for the matrices A_{L_k×L_k} and B_{L_k×L_k} and complex-valued for C_{L_k×L_k}, does not significantly affect the performance of the RLS algorithm. In one embodiment, the correlation matrix is made sparser by maintaining the values of these diagonals and zeroing the other components. For example, for the case of two channels (M=2), this method significantly decreases the number of stored components of Φ(l,k) for all frequencies.
Most of the remaining components are now real-valued, which not only decreases memory usage but also reduces the numerical complexity, since the matrix is sparser and the number of multiplications is reduced.
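The sparsification described above can be sketched as follows; this is an illustrative helper (not from the text) that keeps only the main diagonal of each (L_{k}×L_{k}) channel-pair block of Φ(l,k) and zeros everything else:

```python
import numpy as np

def sparsify_correlation(Phi, L):
    """Keep only the main diagonal of each (L x L) channel-pair block of the
    (L*M x L*M) correlation matrix Phi and zero the remaining components.

    For M = 2 the blocks are A = Phi[:L, :L] and B = Phi[L:, L:] (whose
    diagonals are real-valued) and C = Phi[:L, L:] (complex diagonal, with
    C^H in the lower-left block).
    """
    M = Phi.shape[0] // L
    sparse = np.zeros_like(Phi)
    for i in range(M):
        for j in range(M):
            block = Phi[i*L:(i+1)*L, j*L:(j+1)*L]
            sparse[i*L:(i+1)*L, j*L:(j+1)*L] = np.diag(np.diag(block))
    return sparse
```

In a practical implementation one would store only the M² diagonals (each of length L_{k}) rather than the full matrix, which is where the memory saving comes from.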
In another embodiment, the performance of the RLS algorithm in time-varying environments is improved. An online adaptive algorithm employing an RLS algorithm to develop the adaptive WPE approach is described in T. Yoshioka, H. Tachibana, T. Nakatani, M. Miyoshi, "Adaptive dereverberation of speech signals with speaker-position change detection," Proc. Int. Conf. Acoust., Speech, Signal Process. (2009), pp. 3733-3736, which is incorporated herein by reference. As shown in that paper, the RLS algorithm amplifies the signals after each sudden change. To improve the performance of the detection described in that paper, a binary buffer of length N_{f} for each channel is used, initialized with zeros. This buffer contains a binary decision for the last N_{f} frames, including the current frame. To update the buffer at each frame, the number of frequencies having a negative value for e_{i}(l,k) in (18), denoted F_{i} for each channel i=1, . . . , M, is counted. F_{i} is compared with a threshold τ_{1}. If F_{i}>τ_{1}, the buffer is updated with a one; otherwise it is updated with a zero. If the number of ones in the buffer for any channel exceeds a threshold τ_{2}, a sudden change is identified. After a detection occurs, the prediction filter and the correlation matrix of the RLS method are reset to their initial values, as discussed above.
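The binary-buffer detection logic can be sketched as follows; the threshold values are illustrative placeholders, not values taken from the text:

```python
import numpy as np
from collections import deque

class ChangeDetector:
    """Binary-buffer sudden-change detector, sketched from the description.

    A length-N_f buffer of binary decisions is kept per channel.  Each frame,
    F_i counts the frequencies with a negative error e_i(l, k); the frame is
    flagged when F_i exceeds tau1, and a sudden change is declared when the
    number of flagged frames in any channel's buffer exceeds tau2.
    """
    def __init__(self, num_channels, n_frames=10, tau1=50, tau2=6):
        self.tau1, self.tau2 = tau1, tau2
        self.buffers = [deque([0] * n_frames, maxlen=n_frames)
                        for _ in range(num_channels)]

    def update(self, errors):
        """errors: (M, K) array of e_i(l, k); returns True on sudden change."""
        for i, buf in enumerate(self.buffers):
            F_i = int(np.sum(errors[i] < 0))          # negative-error count
            buf.append(1 if F_i > self.tau1 else 0)
        return any(sum(buf) > self.tau2 for buf in self.buffers)
```

On a detection, the caller would reset the prediction filters and the correlation matrix to their initial values, as described above.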
After the prediction filter is estimated in 240, the input signal in each channel is filtered by linear filter 250. In one embodiment, the linearly filtered output is calculated as follows:
After the linear filtering, nonlinear filtering 260 is performed as
If it is desired to compute the enhanced speech signal Ŷ_{i}^{(j)}(l,k) for the j^{th} source using the nonlinear filtering, then Ŷ_{i}^{(j)}(l,k) is computed as
Where σ_{j}^{s}(l,k) is the corresponding variance for the j^{th} source, as given in (9); it can be computed using source separation methods as shown in M. Togami, Y. Kawaguchi, R. Takeda, Y. Obuchi, and N. Nukaga, "Optimized speech dereverberation from probabilistic perspective for time varying acoustic transfer function," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1369-1380, July 2013, which is incorporated herein by reference in its entirety.
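The two filtering stages can be sketched together for one subband sample of one channel; the Wiener-like gain used for the nonlinear stage is an assumption, since the exact gain function is not reproduced in the text:

```python
import numpy as np

def dereverb_filter(y, W, psi, sig_var, noise_var):
    """Linear prediction filtering followed by a nonlinear post-filter.

    Linear stage: subtract the late reverberation predicted from the
    buffered past frames, z = y - W^H psi.  Nonlinear stage: apply a
    Wiener-like gain built from the estimated speech and noise variances
    (an illustrative choice, not necessarily the gain of the embodiment).
    """
    z = y - np.vdot(W, psi)                         # linear stage
    gain = sig_var / (sig_var + noise_var + 1e-12)  # variance-based gain
    return gain * z                                 # nonlinear stage
```

With zero noise variance the gain approaches one and the output reduces to the linearly filtered signal, so the nonlinear stage only suppresses the residual components attributed to noise and residual reverberation.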
After applying the filtering, the enhanced speech spectrum for each band is transformed from the frequency domain to the time domain by applying an inverse short-time Fourier transform (ISTFT) with the overlap-add technique.
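The synthesis step can be sketched as a minimal ISTFT-style overlap-add; this sketch assumes a perfect-reconstruction window/hop pair and omits the window normalization a full implementation would include:

```python
import numpy as np

def overlap_add_synthesis(frames, hop, window):
    """Reconstruct a time-domain signal from enhanced spectral frames by
    inverse FFT of each frame, synthesis windowing, and overlap-add at the
    analysis hop size.
    """
    n = len(window)
    out = np.zeros(hop * (len(frames) - 1) + n)
    for t, spec in enumerate(frames):
        out[t*hop:t*hop + n] += np.fft.ifft(spec).real * window
    return out
```

The same synthesis is applied independently per output channel, so the number of output channels matches the number of input channels.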
The embodiments described herein are configured for operation within the memory and MIPS limitations of a digital signal processor or other small platforms for which known computational solutions are typically impracticable. As a result, the present disclosure provides a robust dereverberation solution suitable for use in speech control applications for the consumer electronics market and other related applications. For example, speech control of domestic appliances such as smart TVs using speech commands, voice control applications in the automotive industry, and other potential applications can be implemented with the systems described herein. Using the embodiments described herein, automated speech recognition may achieve high performance on an inexpensive device that is capable of suppressing non-stationary interfering noises when the target speaker is at a far distance from the microphones.
As shown in
In some embodiments, processor 540 may execute machine readable instructions (e.g., software, firmware, or other instructions) stored in memory 520. In this regard, processor 540 may perform any of the various operations, processes, and techniques described herein. In other embodiments, processor 540 may be replaced and/or supplemented with dedicated hardware components to perform any desired combination of the various techniques described herein. Memory 520 may be implemented as a machine readable medium storing various machine readable instructions and data. For example, in some embodiments, memory 520 may store an operating system, and one or more applications as machine readable instructions that may be read and executed by processor 540 to perform the various techniques described herein. In some embodiments, memory 520 may be implemented as non-volatile memory (e.g., flash memory, hard drive, solid state drive, or other non-transitory machine readable mediums), volatile memory, or combinations thereof.
In the illustrated embodiment, the modules 522-534 are controlled by the processor 540. The subband decomposition module 522 is operable to receive a plurality of audio signals including a target audio signal, and transform each of the received signals into the subband frequency domain. The buffer with delay 524 is operable to receive the plurality of subband frequency domain signals and generate a plurality of buffered outputs. The variance estimation module 526 is operable to estimate variance components for the cost function of the RLS filter as described herein. The prediction filter estimation module 528 is operable to use an adaptive online approach that has fast convergence, in accordance with the embodiments described herein. The linear filter module 530 is operable to reduce part of the reverberation, especially the late reverberation, which can be reduced by linear filtering. The nonlinear filter module 532 is operable to reduce the residual reverberation and noise in the multichannel audio signal. The synthesis module 534 is operable to transform the enhanced subband domain signal to the time domain.
There are several advantages to the solution represented by audio processing system 510. First, the solution is a general framework that can be adapted to multiple scenarios and customized to the specific hardware limitations of the computing environment in which it is implemented. The present solution has the ability to run with online processing while delivering performance comparable to more complex state-of-the-art offline solutions. For example, it is possible to separate highly reverberated sources even using only two microphones when the microphone-source distance is large. In some implementations, audio processing system 510 may be configured to selectively recognize a source of the target audio signal that is in motion relative to audio processing system 510.
The foregoing disclosure is not intended to limit the present invention to the precise forms or particular fields of use disclosed. As such, it is contemplated that various alternate embodiments and/or modifications to the present disclosure, whether explicitly described or implied herein, are possible in light of the disclosure. Having thus described embodiments of the present disclosure, persons of ordinary skill in the art will recognize that changes may be made in form and detail without departing from the scope of the present disclosure. Thus, the present disclosure is limited only by the claims.