Speech detection using stochastic confidence measures on the frequency spectrum
Abstract
An accurate and reliable method is provided for detecting speech from an input speech signal. A probabilistic approach is used to classify each frame of the speech signal as speech or non-speech. The speech detection method is based on a frequency spectrum extracted from each frame, such that the value for each frequency band is considered to be a random variable and each frame is considered to be an occurrence of these random variables. Using the frequency spectra from a non-speech part of the speech signal, a known set of random variables is constructed. Next, each unknown frame is evaluated as to whether or not it belongs to this known set of random variables. To do so, a unique random variable (preferably a chi-square value) is formed from the set of random variables associated with the unknown frame. The unique variable is normalized with respect to the known set of random variables and then classified as either speech or non-speech using the “Test of Hypothesis”. Thus, each frame that belongs to the known set of random variables is classified as non-speech and each frame that does not belong to the known set of random variables is classified as speech.
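The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the band energies come from a naive DFT pooled into equal-width bands, the chi-square value is assumed to be the sum of squared normalized band energies (the standard construction, which this section does not spell out), the spread of each band is estimated as a sample standard deviation, and the names `band_energies`, `noise_model`, `chi_square_value`, `is_speech` and the value of `THRESHOLD` are all hypothetical.

```python
import cmath
import math
import random
import statistics


def band_energies(frame, num_bands):
    """Energy content M(f) per band: naive DFT power pooled into equal-width bands."""
    n = len(frame)
    half = n // 2  # keep the non-redundant half of the spectrum
    power = [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))) ** 2
        for k in range(half)
    ]
    width = half // num_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(num_bands)]


def noise_model(noise_frames, num_bands):
    """Per-band mean and spread of the energies over frames assumed to be non-speech."""
    per_band = list(zip(*(band_energies(fr, num_bands) for fr in noise_frames)))
    mu = [statistics.mean(band) for band in per_band]
    sigma = [statistics.stdev(band) for band in per_band]  # assumption: std. deviation
    return mu, sigma


def chi_square_value(frame, mu, sigma):
    """Assumed chi-square statistic: sum of squared normalized band energies."""
    energies = band_energies(frame, len(mu))
    normalized = [(m - m_mu) / m_sigma for m, m_mu, m_sigma in zip(energies, mu, sigma)]
    return sum(v * v for v in normalized)


def is_speech(frame, mu, sigma, threshold):
    """A frame that does not fit the noise model is classified as speech."""
    return chi_square_value(frame, mu, sigma) > threshold


# Synthetic demo data (not from the source): low-level noise and a noisy tone.
random.seed(0)
FRAME_LEN, NUM_BANDS = 64, 8
THRESHOLD = 1000.0  # illustrative value only, chosen for this demo


def noise_frame():
    return [random.gauss(0.0, 0.01) for _ in range(FRAME_LEN)]


mu, sigma = noise_model([noise_frame() for _ in range(10)], NUM_BANDS)
quiet = noise_frame()
tone = [math.sin(2 * math.pi * 5 * t / FRAME_LEN) + e for t, e in enumerate(noise_frame())]

print(is_speech(quiet, mu, sigma, THRESHOLD), is_speech(tone, mu, sigma, THRESHOLD))
```

In practice the threshold would be derived from the chi-square distribution at a chosen significance level rather than fixed by hand; the constant here only separates the two synthetic frames.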
10 Claims
1. A method for detecting speech from an input speech signal, comprising the steps of:
sampling the input speech signal over a plurality of frames, each of the frames having a plurality of samples;
determining an energy content value, M(f), for each of a plurality of frequency bands in a first frame of the input speech signal;
normalizing each of the energy content values for the first frame with respect to energy content values from a non-speech part of the input speech signal;
determining a chi-square value for each of the normalized energy content values associated with the first frame; and
comparing the chi-square value to a threshold value, thereby determining if the first frame correlates to the non-speech part of the input speech signal.
determining an energy content value for each of a plurality of frequency bands in at least ten (10) frames at the beginning of the input signal, each of the ten frames being associated with the non-speech part of the input speech signal;
determining a mean value, $\mu_N(f)$, at each of the plurality of frequency bands for the energy content values associated with the ten frames of the non-speech part of the input speech signal; and
determining a variance value, $\sigma_N(f)$, for each mean value associated with the ten frames of the non-speech part of the input speech signal, thereby constructing a noise model from the non-speech part of the input speech signal.
5. The method of claim 4 wherein the step of normalizing each of the energy content values is according to
$$M_{Norm}(n,f) = \frac{M(n,f) - \mu_N(f)}{\sigma_N(f)}.$$
6. The method of claim 5 further comprises the step of using the first frame to verify the validity of the noise model.
7. The method of claim 6 wherein the step of using the unknown frame further comprises using an over-estimation measure according to
$$\sum_{f} M_{Norm}(n,f).$$
8. The method of claim 1 further comprises the step of normalizing the chi-square value, X, for the unknown frame, prior to comparing the chi-square value to the threshold value, whereby the normalizing is according to
F , where F is the degrees of freedom for the chi-square distribution.
9. The method of claim 1 further comprises the steps of:
determining chi-square values for each of the frames associated with the non-speech part of the input speech signal;
determining a mean value, $\mu_x$, and a variance value, $\sigma_x$, for the chi-square values associated with the non-speech part of the input speech signal; and
normalizing the chi-square value for the first frame using the mean value and the variance value of the chi-square values, prior to comparing the chi-square value of the first frame to the threshold value.
10. The method of claim 9 wherein the step of normalizing the chi-square value is according to
$$X_{Norm}(n) = \frac{X(n) - \mu_x}{\sigma_x}.$$
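The normalization recited in claims 9 and 10 can be illustrated with a short Python sketch. It is only an illustration under assumptions: the chi-square values of the non-speech frames are made-up numbers, $\sigma_x$ is computed as a sample standard deviation (the claims call it a variance value but divide by it directly), and the function name `normalize_chi_square` and the threshold of 3.0 are hypothetical.

```python
import statistics


def normalize_chi_square(x, noise_chi_squares):
    """Normalize a frame's chi-square value X(n) against the non-speech frames."""
    mu_x = statistics.mean(noise_chi_squares)
    sigma_x = statistics.stdev(noise_chi_squares)  # assumption: standard deviation
    return (x - mu_x) / sigma_x


# Made-up chi-square values for ten non-speech frames (not from the source).
noise_chi_squares = [7.2, 8.9, 6.5, 9.4, 8.0, 7.7, 10.1, 6.9, 8.3, 7.0]

THRESHOLD = 3.0  # hypothetical threshold, in units of sigma_x

for x in (8.0, 40.0):
    x_norm = normalize_chi_square(x, noise_chi_squares)
    label = "speech" if x_norm > THRESHOLD else "non-speech"
    print(f"X = {x:5.1f}  normalized = {x_norm:6.2f}  ->  {label}")
```

Normalizing against the observed noise statistics makes the threshold independent of the number of frequency bands, since the comparison is expressed in deviations from the typical non-speech chi-square value.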
Specification