Multi-sensory speech enhancement using a speech-state model

US 7,680,656 B2
Filed: 06/28/2005
Issued: 03/16/2010
Est. Priority Date: 06/28/2005
Status: Expired due to Fees

Abstract

A method and apparatus determine a likelihood of a speech state based on an alternative sensor signal and an air conduction microphone signal. The likelihood of the speech state is used, together with the alternative sensor signal and the air conduction microphone signal, to estimate a clean speech value for a clean speech signal.

79 Citations

13 Claims

1. A method of determining an estimate for a noise-reduced value representing a portion of a noise-reduced speech signal, the method comprising:
- generating an alternative sensor signal using an alternative sensor;
  
  generating an air conduction microphone signal;
  
  using the alternative sensor signal and the air conduction microphone signal to estimate a likelihood, $L(S_t)$, of a speech state, $S_t$, by estimating a separate likelihood of the speech state for each of a set of frequency components and combining the separate likelihoods to form the likelihood of the speech state; and
  
  using the likelihood of the speech state to estimate the noise-reduced value, $\hat{X}_t$, as:
  
  $\hat{X}_t = \sum_{s \in S} \pi_s E(X_t \mid Y_t, B_t, S_t = s)$
  
  where $\pi_s$ is a posterior on the state and is given by:
  
  $\pi_s = \frac{L(S_t = s)}{\sum_{s \in S} L(S_t = s)}$
  
  and where:
  
  $E(X_t \mid Y_t, B_t, S_t = s) = \sigma_s^2 \left( \frac{\sigma_p^2 Y_t + M^* \left( (\sigma_u^2 + g^2 \sigma_v^2) B_t - g^2 \sigma_v^2 G Y_t \right)}{\sigma_p^2 (\sigma_u^2 + g^2 \sigma_v^2 + \sigma_s^2) + |M|^2 \sigma_s^2 (\sigma_u^2 + g^2 \sigma_v^2)} \right)$
  
  where:
  
  $\sigma_p^2 = \sigma_w^2 + \frac{g^2 \sigma_v^2 \sigma_u^2}{\sigma_u^2 + g^2 \sigma_v^2} |G|^2$
  
  and
  
  $M = H - \frac{g^2 \sigma_v^2}{\sigma_u^2 + g^2 \sigma_v^2} G$
  
  where $M^*$ is the complex conjugate of $M$, $X_t$ is a noise reduced value, $Y_t$ is a value for a frame $t$ of the air conduction microphone signal, $B_t$ is a value for a frame $t$ of the alternative sensor signal, $\sigma_u^2$ is a variance of sensor noise in the air conduction microphone, $\sigma_w^2$ is a variance of sensor noise in the alternative sensor, $g^2 \sigma_v^2$ is the variance of ambient noise, $G$ is the channel response of the alternative sensor to ambient noise, $H$ is the channel response of the alternative sensor to a clean speech signal, $S$ is the set of all speech states, $\sigma_s^2$ is a variance for a distribution that models a probability of a noise-reduced value given a speech state and $E(X_t \mid Y_t, B_t, S_t = s)$ is the expectation of $X_t$ given $Y_t$, $B_t$, and a speech state of $s$.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1 further comprising using the estimate of the likelihood of a speech state to determine if a frame of the air conduction microphone signal contains speech.
  - 3. The method of claim 2 further comprising using a frame of the air conduction microphone signal that is determined to not contain speech to determine a variance for a noise source and using the variance for the noise source to estimate the noise-reduced value.
  - 4. The method of claim 1 further comprising estimating the variance of the distribution as a linear combination of an estimate of a noise-reduced value for a preceding frame and a filtered version of the air conduction microphone signal for a current frame.
  - 5. The method of claim 4 wherein the filtered version of the air conduction microphone signal is formed using a filter that is frequency dependent.
  - 6. The method of claim 4 wherein the filtered version of the air conduction microphone signal is formed using a filter that is dependent on a signal-to-noise ratio.
  - 7. The method of claim 1 further comprising performing an iteration by using the estimate of the noise-reduced value to form a new estimate of the noise-reduced value.
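
The estimate recited in claim 1 can be sketched in a few lines of Python for a single frame and a single frequency component. This is only an illustration under stated assumptions, not the patented implementation: the function and parameter names are hypothetical, `var_amb` stands for the product $g^2 \sigma_v^2$, the microphone noise term $\sigma_u^2 + g^2 \sigma_v^2$ is assumed to be positive, and the state likelihoods are taken as given inputs.

```python
def conditional_mean(Y, B, G, H, var_s, var_u, var_w, var_amb):
    """E(X_t | Y_t, B_t, S_t = s) for one frame and one frequency component.

    var_amb is a hypothetical name for g^2 * sigma_v^2 (ambient noise variance).
    """
    G, H = complex(G), complex(H)
    var_mic = var_u + var_amb                        # sigma_u^2 + g^2 sigma_v^2
    M = H - (var_amb / var_mic) * G
    var_p = var_w + (var_amb * var_u / var_mic) * abs(G) ** 2
    numerator = var_p * Y + M.conjugate() * (var_mic * B - var_amb * G * Y)
    denominator = var_p * (var_mic + var_s) + abs(M) ** 2 * var_s * var_mic
    return var_s * numerator / denominator


def noise_reduced_value(Y, B, G, H, state_variances, state_likelihoods,
                        var_u, var_w, var_amb):
    """Posterior-weighted sum of the per-state conditional means."""
    total = sum(state_likelihoods.values())
    estimate = 0j
    for state, likelihood in state_likelihoods.items():
        posterior = likelihood / total               # pi_s
        estimate += posterior * conditional_mean(
            Y, B, G, H, state_variances[state], var_u, var_w, var_amb)
    return estimate


# Illustrative values only (not taken from the patent).
x_hat = noise_reduced_value(
    Y=3 + 0j, B=3 + 0j, G=0j, H=1 + 0j,
    state_variances={"speech": 1.0, "non-speech": 0.0},
    state_likelihoods={"speech": 3.0, "non-speech": 1.0},
    var_u=1.0, var_w=1.0, var_amb=0.0)
print(x_hat)
```

With no ambient noise and $H = 1$, the speech-state term reduces to the familiar average of two equally noisy observations shrunk toward a zero-mean prior, which is a convenient sanity check on the formula.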

8. A computer storage medium having stored thereon computer-executable instructions that when executed by a processor cause the processor to perform steps comprising:
- receiving an alternative sensor signal generated using an alternative sensor;
  
  receiving an air conduction microphone signal generated using an air conduction microphone;
  
  determining a likelihood of a speech state based on the alternative sensor signal and the air conduction microphone signal by estimating a separate likelihood of the speech state for each frequency, L(S_t(f)), of a set of frequency components and forming a product of the separate likelihoods to form the likelihood of the speech state, L(S_t) as;
  
  $L (S_{t}) = \prod_{f} L (S_{t} (f)),$ where the product is taken across all frequency components f in the set of frequency components; and
  
  using the likelihood of the speech state to estimate a clean speech value.
- Dependent claims (9, 10):
  - 9. The computer storage medium of claim 8 wherein using the likelihood of the speech state to estimate a clean speech value comprises weighting an expectation value.
  - 10. The computer storage medium of claim 8 wherein using the likelihood of the speech state to estimate a clean speech value comprises:
    - using the likelihood of the speech state to identify a frame of a signal as a non-speech frame;
      
      using the non-speech frame to estimate a variance for a noise; and
      
      using the variance for the noise to estimate the clean speech value.
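
The combination step in claim 8 multiplies the per-frequency likelihoods $L(S_t(f))$ into a single state likelihood. The sketch below shows that product directly, followed by a log-domain variant. The log-domain form and the normalisation helper are assumptions added for illustration (a common way to avoid underflow when many small values are multiplied); they are not recited in the claim, and all function names are hypothetical.

```python
import math


def state_likelihood(per_frequency_likelihoods):
    """L(S_t) as the product of L(S_t(f)) over all frequency components f."""
    return math.prod(per_frequency_likelihoods)


def state_log_likelihood(per_frequency_likelihoods):
    """Same quantity in the log domain: a sum of logs instead of a product."""
    return sum(math.log(value) for value in per_frequency_likelihoods)


def state_posteriors(log_likelihoods):
    """Normalise per-state log-likelihoods into weights that sum to one."""
    peak = max(log_likelihoods.values())
    # Subtracting the peak keeps exp() in a safe numeric range.
    weights = {s: math.exp(v - peak) for s, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


# Illustrative values only (not taken from the patent).
speech = [0.9, 0.8, 0.7]
non_speech = [0.1, 0.3, 0.2]
print(state_likelihood(speech), state_likelihood(non_speech))
print(state_posteriors({
    "speech": state_log_likelihood(speech),
    "non-speech": state_log_likelihood(non_speech),
}))
```

The normalised weights play the same role as the posterior $\pi_s$ defined in claim 1.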

11. A method of identifying a clean speech value for a clean speech signal, the method comprising:
- receiving an alternative sensor signal generated using an alternative sensor;
  
  receiving an air conduction microphone signal generated using an air conduction microphone;
  
  forming a model wherein the clean speech signal is dependent upon a speech state, the alternative sensor signal is dependent upon the clean speech signal, and the air conduction microphone signal is dependent upon the clean speech signal, wherein forming the model comprises modeling a probability of a value of the clean speech signal given a speech state as a distribution having a variance; and
  
  determining a filtered value of the air conduction microphone signal by applying a value for a current frame of the air conduction microphone signal to a frequency-dependent noise suppression filter that is a function of a variance of ambient noise;
  
  determining the variance of the distribution as a linear combination of an estimate of a value for a clean speech signal for a preceding frame and the filtered value of the air conduction microphone signal as $\hat{\sigma}_s^2 = \tau |\hat{X}_{t-1}|^2 + (1 - \tau) K_s^2 |Y_t|^2$, where $\hat{\sigma}_s^2$ is the variance of the distribution, $\hat{X}_{t-1}$ is the clean speech estimate from the preceding frame, $\tau$ is a smoothing factor, $|Y_t|^2$ is the value for the current frame of the air conduction microphone signal and $K_s$ is the noise suppression filter;
  
  determining an estimate of the clean speech value for the current frame based on the model, the variance of the distribution, a value for the alternative sensor signal for the current frame, and a value for the air conduction microphone signal for the current frame.
- Dependent claims (12, 13):
  - 12. The method of claim 11 further comprising determining a likelihood for a state and wherein determining an estimate of the clean speech value further comprises using the likelihood for the state.
  - 13. The method of claim 11 wherein forming the model comprises forming a model wherein the alternative sensor signal and the air conduction microphone signal are dependent upon a noise source.
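
The variance update in claim 11 is a simple recursive smoothing step, sketched below for one frequency component. The claim states only that $K_s$ is a frequency-dependent noise suppression filter that depends on the ambient noise variance, so the sketch takes the filter value as an input rather than guessing its form. The function and parameter names are hypothetical.

```python
def update_state_variance(prev_clean_estimate, mic_value, suppression_gain,
                          smoothing):
    """sigma_s^2 = tau * |X_{t-1}|^2 + (1 - tau) * K_s^2 * |Y_t|^2.

    prev_clean_estimate: clean speech estimate from the preceding frame
    mic_value:           air conduction microphone value for the current frame
    suppression_gain:    K_s, supplied by the caller (its form is not assumed)
    smoothing:           tau, expected to lie between 0 and 1
    """
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing factor must lie in [0, 1]")
    previous_power = abs(prev_clean_estimate) ** 2
    filtered_power = suppression_gain ** 2 * abs(mic_value) ** 2
    return smoothing * previous_power + (1.0 - smoothing) * filtered_power


# Illustrative values only (not taken from the patent).
variance = update_state_variance(
    prev_clean_estimate=2 + 0j, mic_value=3 + 4j,
    suppression_gain=0.5, smoothing=0.9)
print(variance)
```

A smoothing factor close to 1 leans on the previous frame's estimate, while a value close to 0 tracks the filtered microphone power of the current frame.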

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Zhang, Zhengyou; Droppo, James G.; Acero, Alejandro; Liu, Zicheng; Subramanya, Amarnag
Primary Examiner(s): Wozniak, James S
Application Number: US11/168,770
Publication Number: US 20060293887A1
Time in Patent Office: 1,722 Days
Field of Search: 704/231, 704/233, 704/240
US Class Current: 704/233
CPC Class Codes:
- G10L 2021/02165 Two microphones, one receiv...
- G10L 21/0208 Noise filtering
