Method for reducing noise distortions in a speech recognition system
Abstract
A method for reducing noise distortions in a speech recognition system comprises a feature extractor that includes a noise-suppressor, one or more time cosine transforms, and a normalizer. The noise-suppressor preferably performs a spectral subtraction process early in the feature extraction procedure. The time cosine transforms preferably operate in a centered-mode to each perform a transformation in the time domain. The normalizer calculates and utilizes normalization values to generate normalized features for speech recognition. The calculated normalization values preferably include mean values, left variances and right variances.
54 Claims
-
1. A system for reducing noise distortions in speech data, comprising:
-
a feature extractor configured to perform a manipulation process on said speech data, wherein said feature extractor is comprised of;
a noise suppressor that performs a spectral subtraction procedure on said speech data, which is expressed by a formula;
where YD(Y) is a signal-to-noise ratio or a distorted estimation of clean speech, Y is a power or magnitude spectrum of noisy speech, N is an estimate of a power or magnitude noise spectrum, α is an over-estimation factor, and β is a spectral flooring parameter; and
a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
where Ct(p) is a pth cepstral coefficient at a time frame t, M is half of a window size used to estimate differential coefficients, and o is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
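The claim's spectral subtraction formula is not reproduced in this text. As a rough illustration only, the sketch below uses one widely used form that matches the named parameters: subtract α·N from Y, and fall back to a floor of β·Y wherever the difference would drop below that floor. The piecewise form itself, the function name `spectral_subtract`, and the default values of `alpha` and `beta` are assumptions, not taken from the claim.

```python
def spectral_subtract(Y, N, alpha=2.0, beta=0.01):
    """Per-bin spectral subtraction with a spectral floor (assumed form).

    Y: power or magnitude spectrum of noisy speech (list of floats)
    N: estimate of the noise spectrum, same length as Y
    alpha: over-estimation factor (default value is an assumption)
    beta: spectral flooring parameter (default value is an assumption)
    """
    cleaned = []
    for y, n in zip(Y, N):
        diff = y - alpha * n   # over-subtract the noise estimate
        floor = beta * y       # floor proportional to the noisy bin
        cleaned.append(diff if diff > floor else floor)
    return cleaned


if __name__ == "__main__":
    noisy = [10.0, 4.0, 1.0]
    noise = [1.0, 1.0, 1.0]
    print(spectral_subtract(noisy, noise))
```

The floor keeps every bin positive, which matters if a logarithm is taken later in feature extraction.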
-
-
11. A system for reducing noise distortions in audio data, comprising:
-
a feature extractor configured to perform a manipulation process on said audio data; and
a processor configured to control said feature extractor, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor performs a normalization procedure to convert said static cepstral features into normalized static cepstral features, to convert said delta cepstral features into normalized delta cepstral features, and to convert said delta-delta cepstral features into normalized delta-delta cepstral features.
-
-
12. A system for reducing noise distortions in audio data, comprising:
-
a feature extractor configured to perform a manipulation process on said audio data; and
a processor configured to control said feature extractor, said feature extractor including a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
where Ct(p) is a pth cepstral coefficient at a time frame t, M is half of a window size used to estimate differential coefficients, and o is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
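The formula for the centered-mode time cosine transform is not reproduced in this text. The sketch below shows one plausible reading of the definitions given: for frame t, the coefficients Ct+k(p) for k from -M to M are weighted by a cosine basis of order o and summed. The exact cosine argument, the edge clamping, and the name `time_cosine_transform` are assumptions made for illustration.

```python
import math


def time_cosine_transform(frames, t, p, M=2, order=1):
    """Centered-mode time cosine transform sketch (assumed formula).

    frames: list of cepstral vectors, frames[t][p] is Ct(p)
    M: half of the window size; the window covers frames t-M .. t+M
    order: 1 for delta features, 2 for delta-delta features
    """
    last = len(frames) - 1
    width = 2 * M + 1
    total = 0.0
    for k in range(-M, M + 1):
        idx = min(max(t + k, 0), last)  # clamp at utterance edges (assumption)
        weight = math.cos((k + M + 0.5) * order * math.pi / width)
        total += frames[idx][p] * weight
    return total


if __name__ == "__main__":
    ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
    print(time_cosine_transform(ramp, t=2, p=0, M=2, order=1))
    print(time_cosine_transform(ramp, t=2, p=0, M=2, order=2))
```

With this basis, a constant sequence yields zero for both orders, and a linear ramp yields zero for order two, which is the behavior expected of first and second derivative estimates.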
-
-
15. A system for reducing noise distortions in audio data, comprising:
-
a feature extractor configured to perform a manipulation process on said audio data; and
a processor configured to control said feature extractor, wherein said feature extractor generates features, and a normalizer uses normalization values to perform a normalization procedure, said normalization values including a mean value, a left variance, and a right variance, said mean value being an average energy for a frame of feature energy, said right variance being a difference between said mean value and a maximum energy for said frame, and said left variance being a difference between said mean value and a noise level for said frame.
where xi is an “ith” component of an original feature vector, and ai and lvi are a respective mean value and a left variance of said ith component.
-
-
17. The system of claim 15 wherein, when a current energy for said frame is greater than said mean value, said normalization procedure may be expressed by a following formula:
-
where xi is an “ith” component of an original feature vector, and ai and rvi are a respective mean value and a right variance of said ith component.
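The normalization formulas themselves are not reproduced in this text, but the definitions pair the mean with the left variance in one case and with the right variance when the current energy exceeds the mean. A minimal sketch under that reading follows; the form (x - a) / variance, the treatment of a value exactly at the mean, and the name `normalize_component` are assumptions.

```python
def normalize_component(x, mean, left_var, right_var):
    """Asymmetric normalization of one feature component (assumed form).

    Values below the mean are scaled by the left variance; values at or
    above the mean are scaled by the right variance.
    """
    if x < mean:
        return (x - mean) / left_var
    return (x - mean) / right_var


if __name__ == "__main__":
    # Hypothetical statistics: mean 4.0, left variance 2.0, right variance 8.0
    for value in (2.0, 4.0, 12.0):
        print(value, normalize_component(value, 4.0, 2.0, 8.0))
```

Using separate scales on each side of the mean lets the noise floor and the speech peaks map to comparable ranges even when the energy distribution is skewed.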
-
-
18. The system of claim 15 wherein said mean value is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t, N is a total number of feature components associated with said left variance, and P is a total number of feature components associated with said right variance.
-
-
19. The system of claim 15 wherein said right variance is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t and P is a total number of feature components associated with said right variance.
-
-
20. The system of claim 15 wherein said left variance is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t and N is a total number of feature components associated with said left variance.
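The off-line formulas of claims 18 to 20 are not reproduced in this text. The sketch below assumes a reading consistent with their definitions: the mean is the average of x(t) over all N + P frames, the right variance is the average excess of the P frames above the mean, and the left variance is the average deficit of the N frames below it. The exact formulas, the handling of frames equal to the mean, and the name `offline_stats` are assumptions.

```python
def offline_stats(energies):
    """Compute mean, left variance, and right variance in one pass over
    a complete utterance (assumed formulas).

    energies: list of signal energies x(t), one per time frame
    """
    mean = sum(energies) / len(energies)
    right = [x - mean for x in energies if x > mean]  # P frames above mean
    left = [mean - x for x in energies if x < mean]   # N frames below mean
    right_var = sum(right) / len(right) if right else 0.0
    left_var = sum(left) / len(left) if left else 0.0
    return mean, left_var, right_var


if __name__ == "__main__":
    print(offline_stats([1.0, 2.0, 3.0, 10.0]))
```

Off-line mode needs the whole utterance before any frame can be normalized, so it suits batch processing rather than streaming.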
-
-
21. The system of claim 15 wherein said mean value for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
22. The system of claim 15 wherein said right variance “rvi(t)” for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
23. The system of claim 15 wherein said left variance “lvi(t)” for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
24. The system of claim 21 wherein said forgetting factor of β is equal to a value of 0.995.
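Claims 21 to 24 describe on-line updates with a forgetting factor β (0.995 in claim 24), but the update formulas are not reproduced in this text. One common way to realize such updates is exponential smoothing, sketched below. The recursions, the rule that only the variance on the same side as the current value is updated, and the class name `OnlineStats` are assumptions.

```python
class OnlineStats:
    """Running mean, left variance, and right variance for one feature
    component, updated frame by frame (assumed recursions)."""

    def __init__(self, mean, left_var, right_var, beta=0.995):
        self.mean = mean
        self.left_var = left_var
        self.right_var = right_var
        self.beta = beta  # forgetting factor; closer to 1 adapts more slowly

    def update(self, x):
        b = self.beta
        self.mean = b * self.mean + (1.0 - b) * x
        if x > self.mean:
            # Frame lies above the running mean: adapt the right variance.
            self.right_var = b * self.right_var + (1.0 - b) * (x - self.mean)
        elif x < self.mean:
            # Frame lies below the running mean: adapt the left variance.
            self.left_var = b * self.left_var + (1.0 - b) * (self.mean - x)
        return self.mean, self.left_var, self.right_var


if __name__ == "__main__":
    stats = OnlineStats(mean=0.0, left_var=1.0, right_var=1.0)
    for energy in (4.0, 3.0, 0.5):
        print(stats.update(energy))
```

The initial values could come from an off-line pass over earlier data; the sketch leaves that choice to the caller.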
-
25. A method for reducing noise distortions in speech data, comprising the steps of:
-
suppressing noise in said speech data using a spectral subtraction procedure that is expressed by a formula;
where YD(Y) is a signal-to-noise ratio or a distorted estimation of clean speech, Y is a power or magnitude spectrum of noisy speech, N is an estimate of a power or magnitude noise spectrum, α is an over-estimation factor, and β is a spectral flooring parameter; and
converting said static features into delta features using a first time cosine transform, and converting said static features into delta-delta features using a second time cosine transform, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
where Ct(p) is a pth cepstral coefficient at a time frame t, M is half of a window size used to estimate differential coefficients, and o is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
-
-
36. A method for reducing noise distortions in audio data, comprising the steps of:
-
performing a manipulation process on said audio data using a feature extractor;
controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
-
-
40. A method for reducing noise distortions in audio data, comprising the steps of:
-
performing a manipulation process on said audio data using a feature extractor; and
controlling said feature extractor with a processor to thereby reduce said noise distortions, said feature extractor including a first time cosine transform that converts said static features into delta features, and a second time cosine transform that converts said static features into delta-delta features, wherein said first time cosine transform and said second time cosine transform each perform a centered-mode time cosine transform procedure that may be expressed by a following formula;
where Ct(p) is a pth cepstral coefficient at a time frame t, M is half of a window size used to estimate differential coefficients, and o is a derivatives order with a value of one when corresponding to said delta features and a value of two when corresponding to said delta-delta features.
-
-
41. A method for reducing noise distortions in audio data, comprising the steps of:
-
performing a manipulation process on said audio data using a feature extractor; and
controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates features, and a normalizer uses normalization values to perform a normalization procedure, said normalization values including a mean value, a left variance, and a right variance, said mean value being an average energy magnitude of feature energy, said right variance being an average right dispersion above said mean value, and said left variance being an average left dispersion below said mean value.
where xi is an “ith” component of an original feature vector, and ai and lvi are a respective mean value and a left variance of said ith component.
-
-
43. The method of claim 41 wherein, when a current energy for said frame is greater than said mean value, said normalization procedure may be expressed by a following formula:
-
where xi is an “ith” component of an original feature vector, and ai and rvi are a respective mean value and a right variance of said ith component.
-
-
44. The method of claim 41 wherein said mean value is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t, N is a total number of feature components associated with said left variance, and P is a total number of feature components associated with said right variance.
-
-
45. The method of claim 41 wherein said right variance is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t and P is a total number of feature components associated with said right variance.
-
-
46. The method of claim 41 wherein said left variance is calculated in an off-line mode and is expressed by a formula:
-
where x(t) is a signal energy at a time frame t and N is a total number of feature components associated with said left variance.
-
-
47. The method of claim 41 wherein said right variance “rvi(t)” for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
48. The method of claim 41 wherein said left variance “lvi(t)” for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
49. The method of claim 41 wherein said mean value for a given feature component “i” at a time frame “t” is calculated in an on-line mode and is expressed by a formula;
-
50. The method of claim 49 wherein said forgetting factor of β is equal to a value of 0.95.
-
51. A computer-readable medium comprising program instructions for reducing noise distortions in audio data by performing the steps of:
-
performing a manipulation process on said audio data using a feature extractor; and
controlling said feature extractor with a processor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
-
-
53. A system for reducing noise distortions in audio data, comprising:
-
means for performing a manipulation process on said audio data using a feature extractor; and
means for controlling said feature extractor to thereby reduce said noise distortions, wherein said feature extractor generates static cepstral features, a first centered-mode time cosine transform converts said static cepstral features into delta cepstral features, and a second centered-mode time cosine transform converts said static cepstral features into delta-delta cepstral features, and wherein said feature extractor converts said static cepstral features into normalized static cepstral features, converts said delta cepstral features into normalized delta cepstral features, and converts said delta-delta cepstral features into normalized delta-delta cepstral features.
-
Specification