System for detecting voice activity and background noise/silence in a speech signal using pitch and signal to noise ratio information
Abstract
A method and apparatus for generating frame voicing decisions for an incoming speech signal having periods of active voice and non-active voice for a speech encoder in a speech communications system. A predetermined set of parameters is extracted from the incoming speech signal, including a pitch gain and a pitch lag. A frame voicing decision is made for each frame of the incoming speech signal according to values calculated from the extracted parameters. The predetermined set of parameters further includes a partial residual frame full band energy, and a set of spectral parameters called Line Spectral Frequencies (LSF). A signal-to-noise value is estimated and tracked to adaptively set threshold values, thereby improving performance under various noise conditions.
16 Claims
1. In a speech communication system comprising:
(a) a speech encoder for receiving and encoding an incoming speech signal to generate a bit stream for transmission to a speech decoder;
(b) a communication channel for transmission; and
(c) a speech decoder for receiving the bit stream from the speech encoder to decode the bit stream to generate a reconstructed speech signal,
the incoming speech signal comprising periods of active voice and non-active voice, a method for generating a frame voicing decision comprising the steps of:
i. extracting a predetermined set of parameters, including a pitch gain and a pitch lag, from the incoming speech signal for each frame;
ii. estimating a signal-to-noise ratio; and
iii. making a frame voicing decision according to the predetermined set of parameters and the signal-to-noise ratio.
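Step i of claim 1 can be illustrated with a minimal sketch of pitch lag and pitch gain extraction. The claim does not say how these parameters are obtained, so the open-loop correlation search below is an assumption, as are the function name `pitch_lag_and_gain` and the lag range of 20 to 143 samples.

```python
import math

def pitch_lag_and_gain(frame, min_lag=20, max_lag=143):
    """Return (lag, gain) of the best one-tap long-term predictor for `frame`.

    Assumed method: the lag maximizes num^2 / den, where num is the correlation
    between the frame and its copy delayed by `lag`, and den is the energy of
    the delayed copy. The gain is the matching predictor gain num / den.
    """
    n = len(frame)
    best_lag, best_score, best_gain = min_lag, 0.0, 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        num = sum(frame[i] * frame[i - lag] for i in range(lag, n))
        den = sum(frame[i - lag] ** 2 for i in range(lag, n))
        if den <= 0.0 or num <= 0.0:
            continue  # no positive correlation at this lag
        score = num * num / den
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, num / den
    return best_lag, best_gain

# Usage: one 240-sample frame of a sinusoid with a 40-sample period.
frame = [math.sin(2 * math.pi * i / 40) for i in range(240)]
lag, gain = pitch_lag_and_gain(frame)
```

A strongly periodic frame yields a gain near 1 at its period, while a silent frame yields a gain of 0, which is why the two parameters are useful inputs to a voicing decision.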
i. calculating a standard deviation σ of the pitch lag;
ii. calculating a long-term mean of pitch gain;
iii. calculating a short-term average of energy E, $\bar{E}_s$;
iv. calculating a short-term average of LSF, $\overline{\mathrm{LSF}}_s$;
v. calculating an average energy $\bar{E}$; and
vi. calculating an average LSF value, $\overline{\mathrm{LSF}}_N$.
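The running statistics listed above might be tracked frame by frame as in the following sketch. The window length, the smoothing factor `alpha`, the use of an exponential average for the long-term mean, and the class name `ParameterTracker` are all assumptions; this section does not specify them.

```python
from collections import deque
import statistics

class ParameterTracker:
    """Tracks per-frame statistics of pitch lag, pitch gain, and energy."""

    def __init__(self, window=4, alpha=0.9):
        self.lags = deque(maxlen=window)      # recent pitch lags
        self.energies = deque(maxlen=window)  # recent frame energies
        self.alpha = alpha                    # smoothing factor (assumed)
        self.mean_gain = None                 # long-term mean of pitch gain

    def update(self, pitch_lag, pitch_gain, energy):
        """Feed one frame; return (lag std dev, mean gain, short-term energy)."""
        self.lags.append(pitch_lag)
        self.energies.append(energy)
        if self.mean_gain is None:
            self.mean_gain = pitch_gain
        else:
            self.mean_gain = (self.alpha * self.mean_gain
                              + (1.0 - self.alpha) * pitch_gain)
        sigma = statistics.pstdev(self.lags)
        short_term_energy = sum(self.energies) / len(self.energies)
        return sigma, self.mean_gain, short_term_energy
```

A small standard deviation of the pitch lag together with a high mean pitch gain suggests a stable pitch track, which is typical of voiced speech.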
4. A method according to claim 3, wherein the step of making a frame voicing decision further comprises the steps of:
i) calculating a spectral difference SD1 using a normalized Itakura-Saito measure;
ii) calculating a spectral difference SD2 using a mean square error method;
iii) calculating a spectral difference SD3 using a mean square error method; and
iv) calculating a long-term mean of SD2.
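The distance measures named in claim 4 can be sketched as follows. This section does not state which vectors each of SD1, SD2, and SD3 compares, nor how the Itakura-Saito measure is normalized, so the code shows only the generic building blocks: the standard Itakura-Saito divergence averaged over spectral bins, a mean square error between two parameter vectors, and an exponential long-term mean. All function names and the factor `alpha` are assumptions.

```python
import math

def mean_square_error(a, b):
    """Mean square error between two equal-length vectors (e.g. LSF sets)."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def itakura_saito(p, q):
    """Itakura-Saito divergence between two strictly positive power spectra,
    averaged over bins. It is zero only when the spectra are identical."""
    if len(p) != len(q) or not p:
        raise ValueError("spectra must be non-empty and of equal length")
    return sum(x / y - math.log(x / y) - 1.0 for x, y in zip(p, q)) / len(p)

def update_long_term_mean(previous, value, alpha=0.9):
    """Exponential running mean; the first value initializes the mean."""
    if previous is None:
        return value
    return alpha * previous + (1.0 - alpha) * value
```

The Itakura-Saito divergence is asymmetric in its arguments, so the order in which the current and reference spectra are passed matters.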
5. A method according to claim 4, wherein an initial frame voicing decision is made according to the calculated values.
6. A method according to claim 5, wherein the initial frame voicing decision is smoothed.
7. A method according to claim 6, wherein an initialization routine is performed for a predetermined number of initial frames, such that the voicing decision is set to active voice.
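Claims 5 to 7 describe an initial decision, a smoothing stage, and an initialization period. The sketch below shows one plausible arrangement. The hangover-style smoothing rule, the values of `init_frames` and `hangover`, and the function name `smooth_decisions` are assumptions, since the claims do not define the smoothing rule or the number of initial frames.

```python
def smooth_decisions(initial, init_frames=32, hangover=4):
    """Turn initial per-frame decisions into smoothed ones.

    initial     -- iterable of booleans, True meaning active voice
    init_frames -- frames at the start that are forced to active voice
    hangover    -- frames kept active after the last active initial decision
    """
    smoothed = []
    remaining = 0  # hangover frames still available
    for index, active in enumerate(initial):
        if index < init_frames:
            smoothed.append(True)   # initialization: always active voice
        elif active:
            remaining = hangover    # re-arm the hangover counter
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1          # bridge a short gap after active voice
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed

# Usage: two forced frames, then one active frame followed by silence.
result = smooth_decisions(
    [False, False, True, False, False, False, False],
    init_frames=2, hangover=2)
```

Forcing the first frames to active voice avoids misclassifying speech before the running averages have settled, and the hangover reduces clipping of weak speech endings.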
8. A method according to claim 1, wherein the step of estimating the signal-to-noise ratio comprises the step of subtracting a running mean of energy of a noise signal, $\bar{E}_N$, from a running mean of energy of a voice signal, $\mathrm{RMEAN}_E$.
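The signal-to-noise estimate of claim 8 can be sketched as a difference of two running means. The sketch assumes that energies are expressed in dB, so that the subtraction yields a ratio in dB, and that each running mean is an exponential average with factor `alpha`. Neither assumption is stated in this section, and the class name `SnrTracker` is hypothetical.

```python
class SnrTracker:
    """Estimates SNR as (running mean of voice energy) minus
    (running mean of noise energy), with both energies in dB."""

    def __init__(self, alpha=0.95):
        self.alpha = alpha
        self.voice_mean = None  # running mean of voice frame energy
        self.noise_mean = None  # running mean of noise frame energy

    def _blend(self, mean, value):
        # The first observation initializes the running mean.
        if mean is None:
            return value
        return self.alpha * mean + (1.0 - self.alpha) * value

    def update(self, energy_db, is_voice):
        """Feed one frame energy, labeled by the current voicing decision."""
        if is_voice:
            self.voice_mean = self._blend(self.voice_mean, energy_db)
        else:
            self.noise_mean = self._blend(self.noise_mean, energy_db)

    def snr_db(self):
        """Return the SNR estimate, or None until both means exist."""
        if self.voice_mean is None or self.noise_mean is None:
            return None
        return self.voice_mean - self.noise_mean
```

Because each mean is updated only by frames of its own class, the estimate depends on the voicing decision it later helps to set, so a conservative start (for example, the initialization period of claim 7) is useful.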
9. A voice activity detector (VAD) for making a voicing decision on an incoming speech signal frame, the VAD comprising:
an extractor for extracting a predetermined set of parameters, including a pitch gain and a pitch lag, from the incoming speech signal for each frame;
a calculator unit for calculating a set of predetermined values, including a signal-to-noise ratio SNR, based on the extracted predetermined set of parameters and for adaptively determining threshold values according to the SNR value; and
a decision unit for making a frame voicing decision according to the predetermined set of values.
a standard deviation σ of the pitch lag;
a long-term mean of pitch gain;
a short-term average of energy E, $\bar{E}_s$;
a short-term average of LSF, $\overline{\mathrm{LSF}}_s$;
an average energy $\bar{E}$; and
an average LSF value, $\overline{\mathrm{LSF}}_N$.
12. The VAD according to claim 11, wherein the calculator unit further calculates:
a spectral difference SD1 using a normalized Itakura-Saito measure;
a spectral difference SD2 using a mean square error method;
a spectral difference SD3 using a mean square error method; and
a long-term mean of SD2.
13. The VAD according to claim 12, wherein the decision unit makes an initial frame voicing decision according to the values calculated by the calculator unit.
14. The VAD according to claim 13, wherein the initial frame voicing decision is smoothed.
15. A voice activity detection method for detecting voice activity in an incoming speech signal frame, the improvement comprising making a voicing decision based on a pitch lag and a pitch gain of the speech signal frame and using a signal-to-noise ratio to adaptively set threshold values.
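The adaptive thresholding of claim 15 can be illustrated with a hypothetical decision rule. The threshold values, their linear dependence on SNR, and the names `adaptive_thresholds` and `is_active_voice` are assumptions chosen for illustration only; the claims do not give the actual mapping.

```python
from typing import NamedTuple

class Thresholds(NamedTuple):
    min_pitch_gain: float  # mean pitch gain must exceed this
    max_lag_std: float     # pitch lag standard deviation must stay below this

def adaptive_thresholds(snr_db: float) -> Thresholds:
    """Hypothetical mapping: stricter thresholds at low SNR and looser
    ones at high SNR, interpolated linearly between 0 dB and 30 dB."""
    t = min(max(snr_db / 30.0, 0.0), 1.0)  # clamp to [0, 1]
    return Thresholds(min_pitch_gain=0.7 - 0.3 * t,
                      max_lag_std=1.0 + 2.0 * t)

def is_active_voice(mean_pitch_gain: float, lag_std: float,
                    snr_db: float) -> bool:
    """Declare active voice when the pitch is strong and stable enough
    for the thresholds selected by the current SNR estimate."""
    th = adaptive_thresholds(snr_db)
    return mean_pitch_gain > th.min_pitch_gain and lag_std < th.max_lag_std
```

With these illustrative values, a frame with mean pitch gain 0.5 and lag deviation 0.5 is classified as active voice at 30 dB SNR but not at 0 dB, showing how the same measurements can lead to different decisions under different noise conditions.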
Specification