Method for detecting voice section from time-space by using audio and video information and apparatus thereof
First Claim
1. A method for detecting a time-space voice section using audio and video information, comprising:
detecting a voice section from an audio signal input to a microphone array;
performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
detecting a speaker's face by using a video signal input to a camera in response to the verification of the voice section;
estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
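The sub-steps of the "wherein" clause can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a two-microphone array, estimates the arrival direction from the inter-microphone delay by brute-force cross-correlation, treats frames whose direction falls outside a tolerance around the reference direction as noise and zeroes them (a crude stand-in for noise removal), and then applies a plain energy threshold to one channel. All function names, the 343 m/s speed of sound, the sign convention of the angle, the tolerance and the energy threshold are assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def estimate_delay(left, right, max_lag):
    """Return the lag d (in samples) that maximizes sum(left[i] * right[i + d]).

    Lags are tried in order of increasing magnitude, so ties (for example a
    silent frame) resolve to the smallest absolute lag.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag, sample_rate, mic_spacing):
    """Convert a delay in samples to a far-field arrival angle in degrees."""
    s = lag / sample_rate * SPEED_OF_SOUND / mic_spacing
    s = max(-1.0, min(1.0, s))  # clamp so asin stays defined
    return math.degrees(math.asin(s))


def voice_frames(left, right, frame_len, sample_rate, mic_spacing,
                 reference_deg, tolerance_deg, energy_threshold, max_lag):
    """Return one boolean per frame: True where voice is detected."""
    flags = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        # Estimate the sound source direction for this frame.
        lag = estimate_delay(l, r, max_lag)
        angle = delay_to_angle(lag, sample_rate, mic_spacing)
        # Frames arriving from outside the reference direction count as noise.
        if abs(angle - reference_deg) > tolerance_deg:
            l = [0.0] * frame_len  # crude noise removal: silence the frame
        # Single-channel detection on the cleaned signal by mean energy.
        energy = sum(x * x for x in l) / frame_len
        flags.append(energy > energy_threshold)
    return flags
```

A practical system would more likely use a larger array, a robust localization method and beamforming or spectral filtering for the noise removal step; the sketch only shows how the three sub-steps chain together.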
Abstract
The present invention relates to a method for detecting a voice section in time-space by using audio and video information. According to an embodiment of the present invention, the method comprises the steps of: detecting a voice section in an audio signal input to a microphone array; verifying a speaker from the detected voice section; if the speaker is successfully verified, detecting the speaker's face by using a video signal input to a camera and then estimating the direction of the speaker's face; and determining the detected voice section to be the speaker's voice section if the estimated face direction corresponds to a previously stored reference direction.
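The order of the checks described in the abstract can be sketched as a short decision function. This is an illustrative Python sketch under stated assumptions, not the disclosed implementation: the detector, verifier and face-direction estimator are passed in as hypothetical callables, directions are plain angles in degrees, and "corresponds to the reference direction" is modeled as falling within an assumed tolerance.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Decision:
    accepted: bool
    reason: str


def decide_voice_section(
    audio: Any,
    video: Any,
    detect_voice: Callable[[Any], bool],
    verify_speaker: Callable[[Any], bool],
    estimate_face_direction: Callable[[Any], Optional[float]],
    reference_deg: float,
    tolerance_deg: float = 15.0,  # assumed tolerance, not from the source
) -> Decision:
    """Apply the audio checks first, then the video check, in that order."""
    if not detect_voice(audio):
        return Decision(False, "no voice section")
    if not verify_speaker(audio):
        return Decision(False, "speaker not verified")
    # The video is only examined once the speaker has been verified.
    direction = estimate_face_direction(video)
    if direction is None:
        return Decision(False, "no face detected")
    if abs(direction - reference_deg) > tolerance_deg:
        return Decision(False, "face direction mismatch")
    return Decision(True, "accepted")
```

Passing the stages as callables keeps the ordering visible: the face-direction estimator is never invoked for audio that fails voice detection or speaker verification.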
16 Claims
1. A method for detecting a time-space voice section using audio and video information, comprising:
detecting a voice section from an audio signal input to a microphone array;
performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
detecting a speaker's face by using a video signal input to a camera in response to the verification of the voice section;
estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 2, 3, 4
5. A method for detecting a time-space voice section using audio and video information, comprising:
estimating a position of a sound source by using an audio signal input to a microphone array;
detecting a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
detecting a speaker's face using a video signal input to a camera in response to the verification of the voice section;
estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 6, 7
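Claim 5 compares the estimated sound source position with the stored reference direction against a threshold value. One common way to make such a comparison, shown here only as a hedged Python sketch, is the wrapped angular difference, which treats 350 degrees and 10 degrees as 20 degrees apart. The function names and the representation of directions as angles in degrees are assumptions; how the result gates voice section detection is governed by the claim language above, not by this example.

```python
def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in [0, 180]."""
    d = (a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def within_threshold(estimated_deg: float, reference_deg: float,
                     threshold_deg: float) -> bool:
    """True when the estimate lies within threshold_deg of the reference."""
    return angular_difference(estimated_deg, reference_deg) <= threshold_deg
```

For example, `within_threshold(350.0, 10.0, 25.0)` is true because the wrapped difference is 20 degrees, while `within_threshold(90.0, 0.0, 25.0)` is false.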
8. An apparatus for detecting a time-space voice section using audio and video information, comprising:
a voice section detection unit configured to detect a voice section in an audio signal input to a microphone array;
a speaker verification unit configured to perform speaker verification by comparing the detected voice section to a voice model constructed in advance; and
a face direction verification unit configured to detect a speaker's face using a video signal input to a camera in response to the verification of the voice section, configured to estimate a speaker's face direction based on the direction of the speaker's face in the video signal, and configured to determine the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 9
10. An apparatus for detecting a time-space voice section using audio and video information, comprising:
a sound source position tracking unit configured to estimate a position of a sound source by using an audio signal input to a microphone array;
a voice section detection unit configured to detect a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
a speaker verification unit configured to perform speaker verification by comparing the detected voice section to a voice model constructed in advance; and
a face direction verification unit configured to detect a speaker's face using a video signal input to a camera in response to the verification of the voice section, configured to estimate a speaker's face direction based on the direction of the speaker's face in the video signal, and configured to determine the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 11, 12
13. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code portions stored therein, the computer-executable program code portions comprising program code instructions for:
detecting a voice section from an audio signal input to a microphone array;
performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
detecting a speaker's face by using a video signal input to a camera in response to the verification of the voice section;
estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 14
15. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code portions stored therein, the computer-executable program code portions comprising program code instructions for:
estimating a position of a sound source by using an audio signal input to a microphone array;
detecting a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
detecting a speaker's face using a video signal input to a camera in response to the verification of the voice section;
estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
wherein the detecting the voice section includes:
estimating a position of a sound source by using the audio signal input to the microphone array;
distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
removing the distinguished noise; and
detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.
Dependent claims: 16
Specification