Method for reducing noise distortions in a speech recognition system

US 6,173,258 B1
Filed: 10/22/1998
Issued: 01/09/2001
Est. Priority Date: 09/09/1998
Status: Expired due to Fees

Abstract

A method for reducing noise distortions in a speech recognition system comprises a feature extractor that includes a noise-suppressor, one or more time cosine transforms, and a normalizer. The noise-suppressor preferably performs a spectral subtraction process early in the feature extraction procedure. The time cosine transforms preferably operate in a centered-mode to each perform a transformation in the time domain. The normalizer calculates and utilizes normalization values to generate normalized features for speech recognition. The calculated normalization values preferably include mean values, left variances and right variances.

54 Claims

1. A system for reducing noise distortions in speech data, comprising:
- a feature extractor configured to perform a manipulation process on said speech data, wherein said feature extractor is comprised of;
  
  a noise suppressor that performs a spectral subtraction procedure on said speech data, which is expressed by a formula;
  
  $Y_{D}(Y) = \begin{cases} Y - \alpha N & \text{if } Y - \alpha N > \beta Y \\ \beta Y & \text{otherwise} \end{cases}$
  
  where $Y_{D}(Y)$ is a signal-to-noise ratio or a distorted estimation of clean speech, $Y$ is a power or magnitude spectrum of noisy speech, $N$ is an estimate of a power or magnitude noise spectrum, $\alpha$ is an over-estimation factor, and $\beta$ is a spectral flooring parameter; and
  
  a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
  
  $\frac{\partial^{o}}{\partial_{t}} C_{t}(p) = \sum_{k=-M}^{M} C_{t+k}(p) \cos\left(\frac{k + M + 0.5}{2M + 1}\, o \pi\right)$
  
  where $C_{t}(p)$ is a $p$-th cepstral coefficient at a time frame $t$, $M$ is half of a window size used to estimate differential coefficients, and $o$ is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14)
- - 2. The system of claim 1 wherein said feature extractor is part of a speech module configured to analyze and manipulate said speech data.
  - 3. The system of claim 1 wherein said feature extractor includes program instructions that are stored in a memory device coupled to said processor.
  - 4. The system of claim 1 wherein said feature extractor further includes a normalizer configured to perform a normalization procedure on said speech data.
  - 5. The system of claim 2 wherein said speech data includes digital source speech data that is provided to said speech module by an analog sound sensor and an analog-to-digital converter.
  - 6. The system of claim 5 wherein said digital source speech data is converted to frequency-domain speech data by a fast Fourier transform.
  - 7. The system of claim 5 wherein a filter bank generates filtered channel energy by separating said noise-suppressed speech data into discrete frequency channels.
  - 8. The system of claim 7 wherein said filtered channel energy is converted into logarithmic channel energy by a logarithmic compressor.
  - 9. The system of claim 8 wherein said logarithmic channel energy is converted into static features by a frequency cosine transform.
  - 10. The system of claim 9 wherein said static features are cepstral features that decorrelate said channels in said logarithmic channel energy.
  - 13. The system of claim 4 wherein said normalization procedure converts said static features into normalized static features, converts said delta features into normalized delta features, and converts said delta-delta features into normalized delta-delta features.
  - 14. The system of claim 13 wherein said normalized static features, said normalized delta features, and said normalized delta-delta features are provided to a recognizer that responsively generates a speech recognition result.
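
The spectral subtraction rule recited in claim 1 can be sketched in Python with NumPy. This is only an illustration: the function name `spectral_subtract` and the default values `alpha=2.0` and `beta=0.1` are assumptions made for the example, since the claim does not fix either parameter.

```python
import numpy as np

def spectral_subtract(Y, N, alpha=2.0, beta=0.1):
    """Piecewise rule: Y - alpha*N where that exceeds beta*Y, otherwise beta*Y.

    Y: power or magnitude spectrum of noisy speech (per frequency bin).
    N: estimate of the power or magnitude noise spectrum.
    alpha, beta: over-estimation factor and spectral floor (assumed values).
    """
    Y = np.asarray(Y, dtype=float)
    N = np.asarray(N, dtype=float)
    subtracted = Y - alpha * N   # over-subtract the noise estimate
    floor = beta * Y             # spectral floor tied to the noisy spectrum
    return np.where(subtracted > floor, subtracted, floor)

noisy = np.array([10.0, 4.0, 1.0])
noise = np.array([1.0, 1.5, 1.0])
clean_est = spectral_subtract(noisy, noise)  # array([8. , 1. , 0.1])
```

In the last bin the subtraction would go negative, so the floor `beta * Y` is used instead; this is the case the "otherwise" branch of the formula covers.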

11. A system for reducing noise distortions in audio data, comprising:
- a feature extractor configured to perform a manipulation process on said audio data; and
  
  a processor configured to control said feature extractor, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor performs a normalization procedure to convert said static cepstral features into normalized static cepstral features, to convert said delta cepstral features into normalized delta cepstral features, and to convert said delta-delta cepstral features into normalized delta-delta cepstral features.

12. A system for reducing noise distortions in audio data, comprising:
- a feature extractor configured to perform a manipulation process on said audio data; and
  
  a processor configured to control said feature extractor, said feature extractor including a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
  
  $\frac{\partial^{o}}{\partial_{t}} C_{t}(p) = \sum_{k=-M}^{M} C_{t+k}(p) \cos\left(\frac{k + M + 0.5}{2M + 1}\, o \pi\right)$
  
  where $C_{t}(p)$ is a $p$-th cepstral coefficient at a time frame $t$, $M$ is half of a window size used to estimate differential coefficients, and $o$ is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
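
A minimal sketch of the centered-mode time cosine transform in claim 12, written in Python with NumPy. The function names `tct_weights` and `time_cosine_transform` are hypothetical, and the handling of the first and last `M` frames (repeating the edge frame) is an assumption, since the claim does not say how the window is filled at the boundaries.

```python
import numpy as np

def tct_weights(M, order):
    """Cosine weights cos(((k + M + 0.5) / (2M + 1)) * order * pi) for k = -M..M."""
    k = np.arange(-M, M + 1)
    return np.cos((k + M + 0.5) / (2 * M + 1) * order * np.pi)

def time_cosine_transform(C, M, order):
    """Apply the weighted sum over a centered window of 2M + 1 frames.

    C: array of shape (T, P), static cepstral coefficients per frame.
    order: 1 for delta features, 2 for delta-delta features.
    Edge frames are repeated to fill the window (assumption).
    """
    C = np.asarray(C, dtype=float)
    T = C.shape[0]
    w = tct_weights(M, order)
    padded = np.pad(C, ((M, M), (0, 0)), mode="edge")
    out = np.zeros_like(C)
    for j, wk in enumerate(w):       # j = k + M, so padded[t + j] is C[t + k]
        out += wk * padded[j:j + T]
    return out

static = np.random.default_rng(0).standard_normal((20, 13))
delta = time_cosine_transform(static, M=2, order=1)
delta_delta = time_cosine_transform(static, M=2, order=2)
```

Both delta and delta-delta features are computed directly from the static features, as the claim describes. For either order the weights sum to zero, so a constant cepstral trajectory yields zero delta and delta-delta features.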

15. A system for reducing noise distortions in audio data, comprising:
- a feature extractor configured to perform a manipulation process on said audio data; and
  
  a processor configured to control said feature extractor, wherein said feature extractor generates features, and a normalizer uses normalization values to perform a normalization procedure, said normalization values including a mean value, a left variance, and a right variance, said mean value being an average energy for a frame of feature energy, said right variance being a difference between said mean value and a maximum energy for said frame, and said left variance being a difference between said mean value and a noise level for said frame.
- View Dependent Claims (16, 17, 18, 19, 20, 21, 22, 23, 24)
- - 16. The system of claim 15 wherein, when a current energy for said frame is less than said mean value, said normalization procedure may be expressed by a formula: $\overline{x}_{i} = \frac{a_{i} - x_{i}}{l_{v_{i}}}, \quad x_{i} < a_{i}$
  - 17. The system of claim 15 wherein, when a current energy for said frame is greater than said mean value, said normalization procedure may be expressed by a following formula: $\overline{x}_{i} = \frac{x_{i} - a_{i}}{r_{v_{i}}}, \quad x_{i} > a_{i}$ where $x_{i}$ is an “ith” component of an original feature vector, and $a_{i}$ and $r_{v_{i}}$ are a respective mean value and a right variance of said ith component.
  - 18. The system of claim 15 wherein said mean value is calculated in an off-line mode and is expressed by a formula: $a = \sum_{t}^{N + P} \frac{x(t)}{N + P}$ where $x(t)$ is a signal energy at a time frame $t$, $N$ is a total number of feature components associated with said left variance, and $P$ is a total number of feature components associated with said right variance.
  - 19. The system of claim 15 wherein said right variance is calculated in an off-line mode and is expressed by a formula: $r_{v} = \sum_{t}^{P} \frac{x(t) - a}{P}$ where $x(t)$ is a signal energy at a time frame $t$ and $P$ is a total number of feature components associated with said right variance.
  - 20. The system of claim 15 wherein said left variance is calculated in an off-line mode and is expressed by a formula: $l_{v} = \sum_{t}^{N} \frac{a - x(t)}{N}$ where $x(t)$ is a signal energy at a time frame $t$ and $N$ is a total number of feature components associated with said left variance.
  - 21. The system of claim 15 wherein said mean value for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 22. The system of claim 15 wherein said right variance “$r_{v_{i}}(t)$” for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 23. The system of claim 15 wherein said left variance “$l_{v_{i}}(t)$” for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 24. The system of claim 21 wherein said forgetting factor of β is equal to a value of 0.995.
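
The off-line normalization values of claims 18 to 20 and the piecewise normalization of claims 16 and 17 can be sketched as follows in Python with NumPy. The function names are hypothetical. Two choices here are assumptions: frames exactly equal to the mean are counted in the mean but in neither variance, and the sign convention for values below the mean follows claim 16 as it is printed in this listing.

```python
import numpy as np

def normalization_values(x):
    """Off-line mean, left variance, and right variance of an energy sequence.

    x: 1-D sequence of signal energies x(t) for one feature component.
    """
    x = np.asarray(x, dtype=float)
    a = x.mean()
    above = x[x > a]   # the P frames associated with the right variance
    below = x[x < a]   # the N frames associated with the left variance
    r_v = (above - a).mean() if above.size else 0.0
    l_v = (a - below).mean() if below.size else 0.0
    return a, l_v, r_v

def normalize(x, a, l_v, r_v):
    """Piecewise normalization, as printed in claims 16 and 17."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    lo, hi = x < a, x > a
    out[lo] = (a - x[lo]) / l_v   # claim 16 as printed: (a_i - x_i) / l_v_i
    out[hi] = (x[hi] - a) / r_v   # claim 17: (x_i - a_i) / r_v_i
    return out

energies = np.array([1.0, 2.0, 3.0, 10.0])
a, l_v, r_v = normalization_values(energies)   # a = 4.0, l_v = 2.0, r_v = 6.0
normalized = normalize(energies, a, l_v, r_v)  # array([1.5, 1. , 0.5, 1. ])
```

Using separate left and right values lets an asymmetric energy distribution, such as a long tail above the mean, be scaled differently on each side. The on-line formulas of claims 21 to 23 are not reproduced in this listing, so they are not sketched here.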

25. A method for reducing noise distortions in speech data, comprising the steps of:
- suppressing noise in said speech data using a spectral subtraction procedure that is expressed by a formula;
  
  $Y_{D}(Y) = \begin{cases} Y - \alpha N & \text{if } Y - \alpha N > \beta Y \\ \beta Y & \text{otherwise} \end{cases}$
  
  where $Y_{D}(Y)$ is a signal-to-noise ratio or a distorted estimation of clean speech, $Y$ is a power or magnitude spectrum of noisy speech, $N$ is an estimate of a power or magnitude noise spectrum, $\alpha$ is an over-estimation factor, and $\beta$ is a spectral flooring parameter; and
  
  converting said static features into delta features using a first time cosine transform, and converting said static features into delta-delta features using a second time cosine transform, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
  
  $\frac{\partial^{o}}{\partial_{t}} C_{t}(p) = \sum_{k=-M}^{M} C_{t+k}(p) \cos\left(\frac{k + M + 0.5}{2M + 1}\, o \pi\right)$
  
  where $C_{t}(p)$ is a $p$-th cepstral coefficient at a time frame $t$, $M$ is half of a window size used to estimate differential coefficients, and $o$ is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
- View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
- - 26. The method of claim 25 wherein said feature extractor includes program instructions that are stored in a memory device coupled to said processor.
  - 27. The method of claim 25 wherein said feature extractor is part of a speech module configured to analyze and manipulate said speech data.
  - 28. The method of claim 27 wherein said speech data includes digital source speech data that is provided to said speech module by an analog sound sensor and an analog-to-digital converter.
  - 29. The method of claim 28 wherein said digital source speech data is converted to frequency-domain speech data by a fast Fourier transform.
  - 30. The method of claim 29 wherein a filter bank generates filtered channel energy by separating said noise-suppressed speech data into discrete frequency channels.
  - 31. The method of claim 30 wherein said filtered channel energy is converted into logarithmic channel energy by a logarithmic compressor.
  - 32. The method of claim 31 wherein said logarithmic channel energy is converted into static features by a frequency cosine transform.
  - 33. The method of claim 32 wherein said static features are cepstral features that decorrelate said channels in said logarithmic channel energy.
  - 34. The method of claim 25 further comprising the step of performing a normalization procedure on said speech data.
  - 35. The method of claim 34 wherein said step of performing a normalization procedure comprises the steps of converting said static features into normalized static features, converting said delta features into normalized delta features, and converting said delta-delta features into normalized delta-delta features.
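
The chain recited in claims 28 to 33 (fast Fourier transform, noise suppression, filter bank, logarithmic compressor, frequency cosine transform) can be sketched for a single frame in Python with NumPy. Everything specific below is an assumption made for illustration: the function name `static_features`, the equal-width rectangular filter bank, the DCT-II form of the frequency cosine transform, the small constant added before the logarithm, and the parameter defaults. The claims do not specify any of these.

```python
import numpy as np

def static_features(frame, noise_power=None, alpha=2.0, beta=0.1,
                    n_channels=8, n_ceps=4):
    """Compute static cepstral features for one frame of digital speech data."""
    # Fast Fourier transform to frequency-domain data (power spectrum).
    power = np.abs(np.fft.rfft(frame)) ** 2

    # Spectral subtraction with a floor, applied early in the chain.
    if noise_power is not None:
        power = np.maximum(power - alpha * noise_power, beta * power)

    # Filter bank: split the bins into discrete frequency channels
    # (rectangular, equal-width bands are an assumption).
    bands = np.array_split(power, n_channels)
    channel_energy = np.array([band.sum() for band in bands])

    # Logarithmic compressor; the offset avoids log(0).
    log_energy = np.log(channel_energy + 1e-10)

    # Frequency cosine transform (DCT-II basis) to decorrelate the channels.
    n = np.arange(n_channels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_channels)
    return basis @ log_energy

frame = np.random.default_rng(0).standard_normal(64)
features = static_features(frame)  # shape (4,)
```

Taking `np.maximum` of the two branches is equivalent to the piecewise spectral subtraction formula of claim 25, because the first branch is selected exactly when it is the larger of the two.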

36. A method for reducing noise distortions in audio data, comprising the steps of:
- performing a manipulation process on said audio data using a feature extractor;
  
  controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
- View Dependent Claims (37, 38, 39)
- - 37. The method of claim 36 wherein a normalizer performs a normalization procedure to convert said static features into normalized static features, to convert said delta features into normalized delta features, and to convert said delta-delta features into normalized delta-delta features.
  - 38. The method of claim 37 wherein said normalized static features, said normalized delta features, and said normalized delta-delta features are provided to a recognizer that responsively generates a speech recognition result.
  - 39. The method of claim 38 wherein said recognizer is a Hidden Markov Model recognizer.

40. A method for reducing noise distortions in audio data, comprising the steps of:
- performing a manipulation process on said audio data using a feature extractor; and
  
  controlling said feature extractor with a processor to thereby reduce said noise distortions, said feature extractor including a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
  
  $\frac{\partial^{o}}{\partial_{t}} C_{t}(p) = \sum_{k=-M}^{M} C_{t+k}(p) \cos\left(\frac{k + M + 0.5}{2M + 1}\, o \pi\right)$
  
  where $C_{t}(p)$ is a $p$-th cepstral coefficient at a time frame $t$, $M$ is half of a window size used to estimate differential coefficients, and $o$ is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.

41. A method for reducing noise distortions in audio data, comprising the steps of:
- performing a manipulation process on said audio data using a feature extractor; and
  
  controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates features, and a normalizer uses normalization values to perform a normalization procedure, said normalization values including a mean value, a left variance, and a right variance, said mean value being an average energy magnitude of feature energy, said right variance being an average right dispersion above said mean value, and said left variance being an average left dispersion below said mean value.
- View Dependent Claims (42, 43, 44, 45, 46, 47, 48, 49, 50)
- - 42. The method of claim 41 wherein, when a current energy for said frame is less than said mean value, said normalization procedure may be expressed by a formula: $\overline{x}_{i} = \frac{a_{i} - x_{i}}{l_{v_{i}}}, \quad x_{i} < a_{i}$
  - 43. The method of claim 41 wherein, when a current energy for said frame is greater than said mean value, said normalization procedure may be expressed by a following formula: $\overline{x}_{i} = \frac{x_{i} - a_{i}}{r_{v_{i}}}, \quad x_{i} > a_{i}$ where $x_{i}$ is an “ith” component of an original feature vector, and $a_{i}$ and $r_{v_{i}}$ are a respective mean value and a right variance of said ith component.
  - 44. The method of claim 41 wherein said mean value is calculated in an off-line mode and is expressed by a formula: $a = \sum_{t}^{N + P} \frac{x(t)}{N + P}$ where $x(t)$ is a signal energy at a time frame $t$, $N$ is a total number of feature components associated with said left variance, and $P$ is a total number of feature components associated with said right variance.
  - 45. The method of claim 41 wherein said right variance is calculated in an off-line mode and is expressed by a formula: $r_{v} = \sum_{t}^{P} \frac{x(t) - a}{P}$ where $x(t)$ is a signal energy at a time frame $t$ and $P$ is a total number of feature components associated with said right variance.
  - 46. The method of claim 41 wherein said left variance is calculated in an off-line mode and is expressed by a formula: $l_{v} = \sum_{t}^{N} \frac{a - x(t)}{N}$ where $x(t)$ is a signal energy at a time frame $t$ and $N$ is a total number of feature components associated with said left variance.
  - 47. The method of claim 41 wherein said right variance “$r_{v_{i}}(t)$” for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 48. The method of claim 41 wherein said left variance “$l_{v_{i}}(t)$” for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 49. The method of claim 41 wherein said mean value for a given feature component “t” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
  - 50. The method of claim 49 wherein said forgetting factor of β is equal to a value of 0.95.

51. A computer-readable medium comprising program instructions for reducing noise distortions in audio data by performing the steps of:
- performing a manipulation process on said audio data using a feature extractor; and
  
  controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
- View Dependent Claims (52)
- - 52. The computer-readable medium of claim 51 wherein said feature extractor further performs a spectral subtraction procedure on said audio data.

53. A system for reducing noise distortions in audio data, comprising:
- means for performing a manipulation process on said audio data using a feature extractor; and
  
  means for controlling said feature extractor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
- View Dependent Claims (54)
- - 54. The system of claim 53 wherein said feature extractor further performs a spectral subtraction procedure on said audio data.

Current Assignee
Nippon Kayaku Company Limited, Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Original Assignee
Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Inventors
Tanaka, Miyuki; Wu, Duanpei; Chen, Ruxin; Menendez-Pidal, Xavier
Primary Examiner(s)
Zele, Krista
Assistant Examiner(s)
Sax, Robert L
Application Number
US09/177,461
Time in Patent Office
810 Days
Field of Search
704/226, 704/233, 704/234, 704/243, 704/240, 379/93.03, 379/93.12, 381/94.2, 381/56, 702/70-74
US Class Current
704/233
CPC Class Codes
G10L 15/02   Feature extraction for spee...
G10L 21/0208   Noise filtering
G10L 21/0264   characterised by the type o...
