SYSTEM AND METHOD FOR PERFORMING SPEECH ENHANCEMENT USING A DEEP NEURAL NETWORK-BASED SIGNAL

US 20180040333A1
Filed: 08/03/2016
Published: 02/08/2018
Est. Priority Date: 08/03/2016
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A method for performing speech enhancement using a Deep Neural Network (DNN)-based signal starts with training a DNN offline by exciting a microphone using a target training signal that includes a signal approximation of clean speech. A loudspeaker is driven with a reference signal and outputs a loudspeaker signal. The microphone then generates a microphone signal based on at least one of: a near-end speaker signal, an ambient noise signal, or the loudspeaker signal. An acoustic-echo-canceller (AEC) generates an AEC echo-cancelled signal based on the reference signal and the microphone signal. A loudspeaker signal estimator generates an estimated loudspeaker signal based on the microphone signal and the AEC echo-cancelled signal. The DNN receives the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal and generates a speech reference signal that includes signal statistics for residual echo or for noise. A noise suppressor generates a clean speech signal by suppressing noise or residual echo in the microphone signal based on the speech reference signal. Other embodiments are described.
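The signal flow in the abstract can be illustrated with a minimal sketch. The abstract does not say how the AEC or the loudspeaker signal estimator work internally, so two assumptions are made here purely for illustration: the AEC is a normalized LMS (NLMS) adaptive filter, and the estimated loudspeaker signal is the microphone signal minus the echo-cancelled signal. The function names `nlms_aec` and `estimate_loudspeaker`, the filter length, the step size, and the toy echo path are all hypothetical.

```python
import math
import random


def nlms_aec(reference, microphone, taps=8, step=0.5, eps=1e-8):
    """Assumed AEC: NLMS filter; returns e[n] = mic[n] - estimated echo[n]."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    cancelled = []
    for ref_sample, mic_sample in zip(reference, microphone):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate
        norm = sum(x * x for x in history) + eps
        weights = [w + step * error * x / norm for w, x in zip(weights, history)]
        cancelled.append(error)
    return cancelled


def estimate_loudspeaker(microphone, cancelled):
    """Assumed estimator: whatever the AEC removed from the microphone signal."""
    return [m - e for m, e in zip(microphone, cancelled)]


# Toy data: white-noise reference, 3-tap echo path, quiet near-end tone.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, 0.3, -0.1]
echo = [sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
        for n in range(len(reference))]
near_end = [0.05 * math.sin(2 * math.pi * 0.01 * n) for n in range(len(reference))]
microphone = [e + s for e, s in zip(echo, near_end)]

cancelled = nlms_aec(reference, microphone)
loudspeaker_estimate = estimate_loudspeaker(microphone, cancelled)

# The four signals the DNN receives, grouped per sample index.
dnn_inputs = list(zip(microphone, reference, cancelled, loudspeaker_estimate))
```

In this toy setup the echo-cancelled signal converges toward the near-end tone, and the four-signal tuples in `dnn_inputs` stand in for what the DNN would receive before any feature extraction.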

33 Citations

20 Claims

1. A system for performing speech enhancement using a Deep Neural Network (DNN)-based signal comprising:
- a loudspeaker to output a loudspeaker signal, wherein the loudspeaker is being driven by a reference signal;
- at least one microphone to receive at least one of: a near-end speaker signal, an ambient noise signal, or the loudspeaker signal and to generate a microphone signal;
- an acoustic-echo-canceller (AEC) to receive the reference signal and the microphone signal, and to generate an AEC echo-cancelled signal;
- a loudspeaker signal estimator to receive the microphone signal and the AEC echo-cancelled signal and to generate an estimated loudspeaker signal; and
- a deep neural network (DNN) to receive the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal, and to generate a clean speech signal, wherein the DNN is trained offline by exciting the at least one microphone using a target training signal that includes a signal approximation of clean speech.
- Dependent claims (2, 3, 5, 6, 7, 8, 9)
  - 2. The system of claim 1, wherein the DNN generating the clean speech signal includes:
    - the DNN generating at least one of: an estimate of non-linear echo in the microphone signal that is not cancelled by the AEC, an estimate of residual echo in the microphone signal, or an estimate of ambient noise power level in the microphone signal, and
    - the DNN generating the clean speech signal based on the estimate of non-linear echo in the microphone signal that is not cancelled by the AEC, the estimate of residual echo in the microphone signal, or the estimate of ambient noise power level.
  - 3. The system of claim 1, wherein the DNN is one of a deep feed-forward neural network, a deep recursive neural network, or a deep convolutional neural network.
  - 5. The system of claim 1, further comprising:
    - a time-frequency transformer to transform the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal from a time domain to a frequency domain, wherein the DNN receives and processes the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal in the frequency domain, and the DNN to generate the clean speech signal in the frequency domain; and
    - a frequency-time transformer to transform the clean speech signal in the frequency domain to a clean speech signal in the time domain.
  - 6. The system of claim 5, further comprising:
    - a plurality of feature processors, each feature processor to respectively extract and transmit features of the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal to the DNN.
  - 7. The system of claim 6, wherein each of the feature processors include:
    - a smoothed power spectral density (PSD) unit to calculate a smoothed PSD, and a feature extractor to extract one of the features of the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal,
    - a first normalization unit to normalize the smoothed PSD using a global mean and variance from the training data, and
    - a second normalization unit to normalize the extracted one of the features using a global mean and variance from the training data, and
    - wherein the system further includes: a plurality of feature buffers to receive the normalized smoothed PSD and the normalized extracted feature from each of the feature processors, respectively, and to respectively buffer the extracted features with a number of past or future frames.
  - 8. The system of claim 6, wherein the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal in the frequency domain are complex signals including a magnitude component and a phase component.
  - 9. The system of claim 8, wherein each of the feature processors include:
    - a smoothed power spectral density (PSD) unit to calculate a smoothed PSD, and a feature extractor to extract one of the features of the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal,
    - a first normalization unit to normalize the smoothed PSD using a global mean and variance from the training data, and
    - a second normalization unit to normalize the extracted one of the features using a global mean and variance from the training data, and
    - wherein the system further includes: a plurality of feature buffers to receive the normalized smoothed PSD and the normalized extracted feature from each of the feature processors, respectively, and to respectively buffer the extracted features with a number of past or future frames.
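The feature path described in claims 6 through 9 (smoothed PSD, normalization with global statistics, buffering with past or future frames) can be sketched as follows. The claims do not fix the smoothing rule, the extracted feature, or the buffer size, so this sketch assumes first-order recursive smoothing, a log-magnitude feature, and a context of two past frames and one future frame. All function names and the `*_MEAN` / `*_VARIANCE` constants are hypothetical placeholders for statistics that would come from the training data.

```python
import math


def smoothed_psd(frames, alpha=0.9):
    """Assumed smoothing: P[t] = alpha * P[t-1] + (1 - alpha) * |X[t]|^2 per bin."""
    smoothed, previous = [], None
    for frame in frames:
        power = [abs(x) ** 2 for x in frame]
        if previous is None:
            current = power
        else:
            current = [alpha * p + (1 - alpha) * q for p, q in zip(previous, power)]
        smoothed.append(current)
        previous = current
    return smoothed


def normalize(frames, mean, variance, eps=1e-12):
    """Apply a fixed global mean and variance to every value."""
    scale = math.sqrt(variance + eps)
    return [[(v - mean) / scale for v in frame] for frame in frames]


def buffer_context(frames, past=2, future=1):
    """Concatenate each frame with its neighbours, repeating edge frames as padding."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-past, future + 1):
            window.extend(frames[min(max(t + offset, 0), last)])
        stacked.append(window)
    return stacked


# Toy complex spectrum for one signal: 5 frames x 3 frequency bins.
spectrum = [[complex(t + 1, b) for b in range(3)] for t in range(5)]

psd = smoothed_psd(spectrum)
log_magnitude = [[math.log(abs(x)) for x in frame] for frame in spectrum]

# Placeholder global statistics (in practice computed once over the training data).
PSD_MEAN, PSD_VARIANCE = 10.0, 25.0
FEATURE_MEAN, FEATURE_VARIANCE = 1.0, 0.5

psd_context = buffer_context(normalize(psd, PSD_MEAN, PSD_VARIANCE))
feature_context = buffer_context(normalize(log_magnitude, FEATURE_MEAN, FEATURE_VARIANCE))
```

Each row of `psd_context` and `feature_context` now holds four frames of three bins. In a full system one such pair would be produced for each of the four input signals before being passed to the DNN.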

4. (canceled)

10. A system for performing speech enhancement using a Deep Neural Network (DNN)-based signal comprising:
- a loudspeaker to output a loudspeaker signal, wherein the loudspeaker is being driven by a reference signal;
- at least one microphone to receive at least one of: a near-end speaker signal, an ambient noise signal, or the loudspeaker signal and to generate a microphone signal;
- an acoustic-echo-canceller (AEC) to receive the reference signal and the microphone signal, and to generate an AEC echo-cancelled signal;
- a loudspeaker signal estimator to receive the microphone signal and the AEC echo-cancelled signal and to generate an estimated loudspeaker signal; and
- a deep neural network (DNN) to receive the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal, and to generate a speech reference signal that includes signal statistics for residual echo or signal statistics for noise, wherein the DNN is trained offline by exciting the at least one microphone using a target training signal that includes a signal approximation of clean speech.
- Dependent claims (11, 12, 14, 15, 16, 17)
  - 11. The system of claim 10, wherein the speech reference signal that includes signal statistics for residual echo or signal statistics for noise includes at least one of:
    - an estimate of non-linear echo in the microphone signal that is not cancelled by the AEC, an estimate of residual echo in the microphone signal, or an estimate of ambient noise power level in the microphone signal.
  - 12. The system of claim 10, wherein the DNN is one of a deep feed-forward neural network, a deep recursive neural network, or a deep convolutional neural network.
  - 14. The system of claim 10, further comprising:
    - a time-frequency transformer to transform the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal from a time domain to a frequency domain, wherein the DNN receives and processes the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal in the frequency domain, and the DNN to generate the speech reference in the frequency domain.
  - 15. The system of claim 14, further comprising:
    - a noise suppressor to receive the AEC echo-cancelled signal and the speech reference in the frequency domain, to suppress noise or residual echo in the microphone signal based on the speech reference and to output a clean speech signal in the frequency domain; and
    - a frequency-time transformer to transform the clean speech signal in the frequency domain to a clean speech signal in the time domain.
  - 16. The system of claim 15, further comprising a plurality of feature processors, each feature processor to respectively extract and transmit features of the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal to the DNN.
  - 17. The system of claim 16, wherein each of the feature processors include:
    - a smoothed power spectral density (PSD) unit to calculate a smoothed PSD, and a feature extractor to extract one of the features of the microphone signal, the reference signal, the AEC echo-cancelled signal and the estimated loudspeaker signal,
    - a first normalization unit to normalize the smoothed PSD using a global mean and variance from the training data, and
    - a second normalization unit to normalize the extracted one of the features using a global mean and variance from the training data, and
    - wherein the system further includes: a plurality of feature buffers to receive the normalized smoothed PSD and the normalized extracted feature from each of the feature processors, respectively, and to respectively buffer the extracted features with a number of past or future frames.
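Claims 14 and 15 describe a frequency-domain noise suppressor driven by the speech reference, followed by a frequency-time transformer. The claims do not give a gain rule, so the sketch below assumes a power-subtraction gain with a lower limit, applied per frequency bin to the AEC echo-cancelled spectrum. A naive single-frame DFT stands in for the time-frequency transformer. The names `dft`, `idft`, `suppress` and the `floor` parameter are hypothetical, and the residual-echo PSD is supplied by hand where a DNN output would normally be used.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (time to frequency)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT returning the real part (frequency to time)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def suppress(cancelled_spectrum, residual_echo_psd, noise_psd, floor=0.1):
    """Assumed rule: gain = max(1 - (echo + noise power) / observed power, floor)."""
    output = []
    for value, echo_power, noise_power in zip(cancelled_spectrum,
                                              residual_echo_psd, noise_psd):
        observed = abs(value) ** 2
        if observed == 0.0:
            gain = floor
        else:
            gain = max(1.0 - (echo_power + noise_power) / observed, floor)
        output.append(gain * value)
    return output


# Toy frame: a unit tone in bin 2 plus a weaker residual-echo tone in bin 5.
n = 16
tone = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
interferer = [0.3 * math.cos(2 * math.pi * 5 * t / n) for t in range(n)]
cancelled = [a + b for a, b in zip(tone, interferer)]

# Hand-written stand-in for the speech reference: residual-echo power per bin.
residual_echo_psd = [0.0] * n
residual_echo_psd[5] = residual_echo_psd[n - 5] = (0.3 * n / 2) ** 2
noise_psd = [0.0] * n

clean = idft(suppress(dft(cancelled), residual_echo_psd, noise_psd))
```

With a perfect residual-echo estimate, the gain in bins 5 and 11 falls to the floor while the tone bins pass unchanged, so `clean` stays close to `tone`. The floor is a common way to limit audible artifacts, but its value here is arbitrary.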

13. (canceled)

18. A method for performing speech enhancement using a Deep Neural Network (DNN)-based signal comprising:
- training a deep neural network (DNN) offline by exciting at least one microphone using a target training signal that includes a signal approximation of clean speech;
- driving a loudspeaker with a reference signal, wherein the loudspeaker outputs a loudspeaker signal;
- generating by the at least one microphone a microphone signal based on at least one of: a near-end speaker signal, an ambient noise signal, or the loudspeaker signal;
- generating by an acoustic-echo-canceller (AEC) an AEC echo-cancelled signal based on the reference signal and the microphone signal;
- generating by a loudspeaker signal estimator an estimated loudspeaker signal based on the microphone signal and the AEC echo-cancelled signal;
- receiving by the DNN the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal; and
- generating by the DNN a speech reference signal that includes signal statistics for residual echo or signal statistics for noise based on the microphone signal, the reference signal, the AEC echo-cancelled signal, and the estimated loudspeaker signal.
- Dependent claims (19, 20)
  - 19. The method of claim 18, wherein the speech reference signal that includes signal statistics for residual echo includes at least one of:
    - an estimate of non-linear echo in the microphone signal that is not cancelled by the AEC, an estimate of residual echo in the microphone signal, or an estimate of ambient noise power level in the microphone signal.
  - 20. The method of claim 19, further comprising:
    - generating by a noise suppressor a clean speech signal by suppressing noise or residual echo in the microphone signal based on speech reference signal.
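The offline training step in claim 18 can be illustrated with a deliberately simplified sketch. The claims do not describe the network, the loss, or the data, so everything here is an assumption: a single linear layer stands in for the DNN, the loss is mean squared error against a clean-speech target, and the four input signals are synthetic per-frame magnitudes. The names `train_linear_stand_in` and `mse` are hypothetical.

```python
import random


def train_linear_stand_in(inputs, targets, epochs=500, rate=0.2):
    """Gradient descent on mean squared error; a linear stand-in for a DNN."""
    weights = [0.0] * len(inputs[0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, target in zip(inputs, targets):
            error = sum(w * f for w, f in zip(weights, features)) - target
            for i, f in enumerate(features):
                gradient[i] += 2.0 * error * f / len(inputs)
        weights = [w - rate * g for w, g in zip(weights, gradient)]
    return weights


def mse(weights, inputs, targets):
    """Mean squared error of the linear prediction against the targets."""
    return sum((sum(w * f for w, f in zip(weights, x)) - y) ** 2
               for x, y in zip(inputs, targets)) / len(inputs)


# Synthetic per-frame magnitudes for the four signals the DNN receives.
rng = random.Random(1)
inputs, targets = [], []
for _ in range(200):
    speech, echo = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
    microphone = speech + echo
    reference = echo
    cancelled = speech + 0.2 * echo   # imperfect echo cancellation (assumed)
    loudspeaker_estimate = 0.8 * echo
    inputs.append([microphone, reference, cancelled, loudspeaker_estimate])
    targets.append(speech)            # training target: the clean-speech value

weights = train_linear_stand_in(inputs, targets)
final_error = mse(weights, inputs, targets)
```

Because the toy inputs are exact linear mixtures of speech and echo, the stand-in can recover the clean-speech value almost perfectly. A real DNN would be needed where the residual echo is non-linear, which is the case the claims single out.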

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Wung, Jason; Pishehvar, Ramin; Giacobello, Daniele; Atkins, Joshua D.

Granted Patent: US 10,074,380 B2
CPC Class Codes

G10L 2021/02082   the noise being echo, rever...

G10L 21/0232   Processing in the frequency...

G10L 25/30   using neural networks

G10L 25/87   Detection of discrete point...
