Signal processing apparatus and method
Abstract
A signal processing apparatus and method for robust endpoint detection of a signal are provided. An input signal sequence is divided into frames, each of which has a predetermined time length. The presence of the signal in each frame is detected. A filter process that smooths the detection result for the current frame by using the detection result for a past frame is then applied. The filter output is compared with a predetermined threshold value, and the state of the signal sequence in the current frame is determined on the basis of the comparison result.
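The first two steps of the abstract (framing, then per-frame detection) can be illustrated with a minimal Python sketch. The abstract does not specify how presence is detected; mean squared amplitude compared against a fixed threshold is used here purely as a stand-in. The function names, the non-overlapping framing, the dropped trailing partial frame, and all numeric values are assumptions of this sketch.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames of frame_len samples.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude, a stand-in detection metric (not from the source)."""
    return sum(x * x for x in frame) / len(frame)


def detect_flags(samples: Sequence[float], frame_len: int,
                 energy_threshold: float) -> List[int]:
    """Return one detection flag per frame: 1 if signal is present, else 0."""
    return [1 if frame_energy(frame) > energy_threshold else 0
            for frame in split_into_frames(samples, frame_len)]


# Illustrative input: two silent frames, two active frames, one silent frame,
# plus one leftover sample that does not fill a frame.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 5
flags = detect_flags(signal, frame_len=4, energy_threshold=0.01)
print(flags)
```

The resulting per-frame flags are the raw detection results that the smoothing filter then operates on.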
126 Citations
3 Claims
1. A speech signal processing apparatus comprising:
a dividing unit which divides an input speech signal into frames, each of which has a predetermined time length;
a calculation unit which calculates a VAD metric for a current frame;
a determination unit which determines whether a signal in the current frame contains speech or non-speech by using the VAD metric and outputs a VAD flag of 1 or 0 indicating whether the current frame contains speech or non-speech, respectively;
a filter unit which smooths the VAD flags output from said determination unit, wherein said filter unit executes a filter process expressed as follows:
$$V_f = \rho V_{f-1} + (1 - \rho) X_f$$
where:
f is a frame index;
V_f is the filter output of the frame f;
X_f is the filter input of the frame f, which is the VAD flag of the frame f; and
ρ is a constant value as a pole of the filter; and
a state evaluation unit which, according to the output from said filter unit, V_f, evaluates a current state of the speech signal from among a silence state, a speech state, a possible speech state representing an intermediate state from the silence state to the speech state, and a possible silence state representing an intermediate state from the speech state to the silence state, wherein said state evaluation unit performs the following operations:
in the silence state, when the VAD flag becomes 1, the state moves to the possible speech state,
in the possible speech state, when V_f exceeds a first threshold value, the state moves to the speech state and V_f is set to 1, and when V_f is below a second threshold value that is smaller than the first threshold value, the state moves to the silence state,
in the speech state, when the VAD flag becomes 0, the state moves to the possible silence state, and
in the possible silence state, when V_f is below the second threshold value, the state moves to the silence state and V_f is set to 0, and when the VAD flag becomes 1, the state moves to the speech state.
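The filter recited in claim 1 is a first-order recursive smoother applied to the sequence of VAD flags. The following Python sketch applies the recurrence directly; the function name, the initial value of 0, and the choice ρ = 0.5 are illustrative assumptions, not values taken from the claims.

```python
from typing import List, Sequence


def smooth_flags(flags: Sequence[int], rho: float, v_init: float = 0.0) -> List[float]:
    """Apply V_f = rho * V_{f-1} + (1 - rho) * X_f to a sequence of 0/1 flags.

    v_init is the filter output assumed before the first frame (hypothetical
    parameter; the claims do not state an initial value).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must satisfy 0 <= rho < 1 for a stable filter")
    outputs = []
    v = v_init
    for x in flags:
        v = rho * v + (1.0 - rho) * x  # the recurrence from the claim
        outputs.append(v)
    return outputs


# Three speech frames followed by two non-speech frames.
smoothed = smooth_flags([1, 1, 1, 0, 0], rho=0.5)
print(smoothed)  # [0.5, 0.75, 0.875, 0.4375, 0.21875]
```

A value of ρ closer to 1 makes the output respond more slowly to changes in the flags, so isolated flag errors have less effect at the cost of a longer delay before the thresholds are crossed.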
2. A speech signal processing method comprising the steps of:
(a) dividing an input speech signal into frames, each of which has a predetermined time length;
(b) calculating a VAD metric for a current frame;
(c) determining whether a signal in the current frame contains speech or non-speech by using the VAD metric and outputting a VAD flag of 1 or 0 indicating whether the current frame contains speech or non-speech, respectively;
(d) smoothing the VAD flags output from said determination step, wherein said smoothing step executes a filter process expressed as follows:
$$V_f = \rho V_{f-1} + (1 - \rho) X_f$$
where:
f is a frame index;
V_f is the filter output of the frame f;
X_f is the filter input of the frame f, which is the VAD flag of the frame f; and
ρ is a constant value as a pole of the filter; and
(e) evaluating, according to the output of said smoothing step, V_f, a current state of the speech signal from among a silence state, a speech state, a possible speech state representing an intermediate state from the silence state to the speech state, and a possible silence state representing an intermediate state from the speech state to the silence state, wherein said evaluating step performs the following operations:
in the silence state, when the VAD flag becomes 1, the state moves to the possible speech state,
in the possible speech state, when V_f exceeds a first threshold value, the state moves to the speech state and V_f is set to 1, and when V_f is below a second threshold value that is smaller than the first threshold value, the state moves to the silence state,
in the speech state, when the VAD flag becomes 0, the state moves to the possible silence state, and
in the possible silence state, when V_f is below the second threshold value, the state moves to the silence state and V_f is set to 0, and when the VAD flag becomes 1, the state moves to the speech state.
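Steps (d) and (e) of claim 2 can be sketched as a small four-state machine driven by the VAD flag and the smoothed value. This is one possible reading of the claim, not a definitive implementation: the names `State` and `track_states`, the initial silence state with an initial filter value of 0, the rule of at most one transition per frame, and the numeric parameters are all assumptions.

```python
from enum import Enum
from typing import List, Sequence


class State(Enum):
    SILENCE = "silence"
    POSSIBLE_SPEECH = "possible speech"
    SPEECH = "speech"
    POSSIBLE_SILENCE = "possible silence"


def track_states(flags: Sequence[int], rho: float,
                 t_speech: float, t_silence: float) -> List[State]:
    """Return the state after each frame.

    t_speech is the first threshold and t_silence the second (smaller) one.
    """
    if not t_silence < t_speech:
        raise ValueError("the second threshold must be smaller than the first")
    state = State.SILENCE  # assumed starting state
    v = 0.0                # assumed initial filter output
    states = []
    for x in flags:
        v = rho * v + (1.0 - rho) * x  # step (d): smooth the VAD flag
        if state is State.SILENCE:
            if x == 1:
                state = State.POSSIBLE_SPEECH
        elif state is State.POSSIBLE_SPEECH:
            if v > t_speech:
                state, v = State.SPEECH, 1.0  # confirm speech, set V_f to 1
            elif v < t_silence:
                state = State.SILENCE         # onset was spurious
        elif state is State.SPEECH:
            if x == 0:
                state = State.POSSIBLE_SILENCE
        else:  # State.POSSIBLE_SILENCE
            if v < t_silence:
                state, v = State.SILENCE, 0.0  # confirm silence, set V_f to 0
            elif x == 1:
                state = State.SPEECH
        states.append(state)
    return states


states = track_states([1, 1, 1, 0, 0, 0], rho=0.5, t_speech=0.8, t_silence=0.2)
print([s.value for s in states])
```

With these illustrative parameters, three consecutive speech flags are needed before the speech state is confirmed, and a single isolated flag of 1 returns to the silence state without ever reaching the speech state.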
3. A computer-readable medium storing program code for causing a computer to perform the steps of:
(a) dividing an input speech signal sequence into frames, each of which has a predetermined time length;
(b) calculating a VAD metric for a current frame;
(c) determining whether a signal in the current frame contains speech or non-speech by using the VAD metric and outputting a VAD flag of 1 or 0 indicating whether the current frame contains speech or non-speech, respectively;
(d) smoothing the VAD flags output from said determination step, wherein said smoothing step executes a filter process expressed as follows:
$$V_f = \rho V_{f-1} + (1 - \rho) X_f$$
where:
f is a frame index;
V_f is the filter output of the frame f;
X_f is the filter input of the frame f, which is the VAD flag of the frame f; and
ρ is a constant value as a pole of the filter; and
(e) evaluating, according to the output of said smoothing step, V_f, a current state of the speech signal from among a silence state, a speech state, a possible speech state representing an intermediate state from the silence state to the speech state, and a possible silence state representing an intermediate state from the speech state to the silence state, wherein said evaluating step performs the following operations:
in the silence state, when the VAD flag becomes 1, the state moves to the possible speech state,
in the possible speech state, when V_f exceeds a first threshold value, the state moves to the speech state and V_f is set to 1, and when V_f is below a second threshold value that is smaller than the first threshold value, the state moves to the silence state,
in the speech state, when the VAD flag becomes 0, the state moves to the possible silence state, and
in the possible silence state, when V_f is below the second threshold value, the state moves to the silence state and V_f is set to 0, and when the VAD flag becomes 1, the state moves to the speech state.
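The delay that the smoothing filter introduces before each threshold is crossed follows directly from the recurrence. The derivation below is a worked illustration, not part of the claims. It assumes $0 < \rho < 1$, writes $T_1$ for the first threshold and $T_2$ for the second with $0 < T_2 < T_1 < 1$ (symbols introduced here), and counts frames from the point where the filter output is 0 (onset) or has been set to 1 (offset).

```latex
% Onset: V_0 = 0 and X_f = 1 for every frame f >= 1.
\begin{align*}
V_f &= \rho V_{f-1} + (1 - \rho)
     \;\Longrightarrow\; V_f = 1 - \rho^{f}, \\
V_f > T_1 &\iff \rho^{f} < 1 - T_1
     \iff f > \frac{\ln(1 - T_1)}{\ln \rho}.
\end{align*}

% Offset: V_0 = 1 and X_f = 0 for every frame f >= 1.
\begin{align*}
V_f &= \rho V_{f-1}
     \;\Longrightarrow\; V_f = \rho^{f}, \\
V_f < T_2 &\iff f > \frac{\ln T_2}{\ln \rho}.
\end{align*}
```

The inequalities flip direction when dividing by $\ln \rho$ because $\ln \rho < 0$. As an illustrative check with assumed values $\rho = 0.5$, $T_1 = 0.8$, and $T_2 = 0.2$, both bounds equal $\ln 0.2 / \ln 0.5 \approx 2.32$, so the third consecutive frame is the first to cross either threshold.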
Specification