Method for detecting voice section from time-space by using audio and video information and apparatus thereof

US 9,431,029 B2
Filed: 02/10/2010
Issued: 08/30/2016
Est. Priority Date: 02/27/2009
Status: Active Grant


Abstract

The present invention relates to a method for detecting a voice section in time-space by using audio and video information. According to an embodiment of the present invention, the method comprises the steps of: detecting a voice section in an audio signal input to a microphone array; verifying a speaker from the detected voice section; if the speaker is successfully verified, detecting the face of the speaker by using a video signal input to a camera and then estimating the direction of the face; and determining the detected voice section to be the voice section of the speaker if the estimated face direction corresponds to a previously stored reference direction.
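The order of the checks in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name, the callables standing in for speaker verification and face direction estimation, the use of degrees, and the 15 degree tolerance are all hypothetical choices made for illustration. The one point taken from the text is that the video stage runs only after the speaker has been verified.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Decision:
    accepted: bool
    reason: str


def decide_voice_section(
    voice_section: Optional[Tuple[float, float]],   # (start_s, end_s), or None if nothing detected
    verify_speaker: Callable[[Tuple[float, float]], bool],   # hypothetical verifier
    estimate_face_direction: Callable[[], Optional[float]],  # hypothetical; degrees, or None if no face
    reference_direction_deg: float,
    tolerance_deg: float = 15.0,                    # assumed matching tolerance
) -> Decision:
    """Apply the audio, speaker, and face-direction checks in order."""
    if voice_section is None:
        return Decision(False, "no voice section")
    if not verify_speaker(voice_section):
        return Decision(False, "speaker not verified")
    # The video stage is only reached once verification has succeeded.
    face_deg = estimate_face_direction()
    if face_deg is None:
        return Decision(False, "no face detected")
    # Smallest absolute difference between two angles, handling wrap-around.
    diff = abs((face_deg - reference_direction_deg + 180.0) % 360.0 - 180.0)
    if diff > tolerance_deg:
        return Decision(False, "face direction mismatch")
    return Decision(True, "speaker voice section")
```

Passing the later stages as callables keeps the face detector idle for audio that fails verification, which mirrors the "in response to the verification" wording.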

16 Claims

1. A method for detecting a time-space voice section using audio and video information, comprising:
- detecting a voice section from an audio signal input to a microphone array;
- performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
- detecting a speaker's face by using a video signal input to a camera in response to the verification of the voice section;
- estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
- determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

2. The method for detecting the time-space voice section using the audio and video information according to claim 1, wherein the performing the speaker verification includes:
- changing a value of the reference position as the estimated position of the sound source when the speaker verification succeeds.

3. The method for detecting the time-space voice section using the audio and video information according to claim 1, wherein the estimating the position of the sound source is using a signal with a certain SNR or more in the audio signal input to the microphone array.

4. The method for detecting the time-space voice section using the audio and video information according to claim 1, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.
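The noise-handling steps of claims 1, 2, and 4 can be pictured with the following Python sketch. It rests on several assumptions that the claims do not state: audio is processed in frames, each frame carries a direction estimate in degrees (or `None` when no estimate was made, for example because the frame fell below the SNR requirement of claim 3), "removing" is modelled as zeroing the frame, and the single-microphone detector is a plain energy threshold. All function names and default values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def angle_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def remove_off_axis_frames(
    frames: Sequence[Sequence[float]],
    directions_deg: Sequence[Optional[float]],
    reference_deg: float,
    tolerance_deg: float = 20.0,      # assumed tolerance
) -> List[List[float]]:
    """Zero every frame whose estimated direction differs from the reference."""
    cleaned = []
    for frame, direction in zip(frames, directions_deg):
        on_axis = (direction is not None
                   and angle_diff_deg(direction, reference_deg) <= tolerance_deg)
        cleaned.append(list(frame) if on_axis else [0.0] * len(frame))
    return cleaned


def detect_voice_sections(
    frames: Sequence[Sequence[float]],
    energy_threshold: float = 0.01,   # assumed threshold
) -> List[Tuple[int, int]]:
    """Energy-based detection on one channel; returns (start, end) frame ranges, end exclusive."""
    sections, start = [], None
    for i, frame in enumerate(frames):
        energy = sum(x * x for x in frame) / len(frame) if frame else 0.0
        active = energy >= energy_threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(frames)))
    return sections


def update_reference(reference_deg: float, estimated_deg: float, verified: bool) -> float:
    """Claim 2 style update: adopt the estimated position once verification succeeds."""
    return estimated_deg if verified else reference_deg
```

For example, with a reference direction of 0 degrees, a frame estimated at 90 degrees is zeroed before the energy detector runs, so it cannot start or extend a voice section.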

5. A method for detecting a time-space voice section using audio and video information, comprising:
- estimating a position of a sound source by using an audio signal input to a microphone array;
- detecting a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
- performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
- detecting a speaker's face using a video signal input to a camera in response to the verification of the voice section;
- estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
- determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

6. The method for detecting the time-space voice section using the audio and video information according to claim 5, wherein the performing the speaker verification includes:
- changing a value of the reference position as the estimated position of the sound source when the speaker verification succeeds.

7. The method for detecting a time-space voice section using audio and video information according to claim 5, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.
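The claims do not say how the position of the sound source is estimated from the microphone array. One common approach, shown here purely as an illustration and not as the patent's method, is to find the time difference of arrival between two microphones by cross-correlation and convert it to an angle under a far-field assumption. The two-microphone geometry, the function names, and the speed-of-sound constant are assumptions of this sketch.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at room temperature


def estimate_delay_samples(mic_a: Sequence[float], mic_b: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) at which mic_b best aligns with mic_a.

    A positive lag means the sound reached mic_b later than mic_a.
    """
    n = min(len(mic_a), len(mic_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_deg(delay_samples: int, sample_rate_hz: float, mic_spacing_m: float) -> float:
    """Convert an inter-microphone delay to an angle from broadside (far-field model)."""
    tau = delay_samples / sample_rate_hz
    s = SPEED_OF_SOUND_M_S * tau / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp so measurement noise cannot break asin
    return math.degrees(math.asin(s))
```

The resulting angle is the kind of value that could then be compared against the previously stored reference direction. A practical system would more likely use a weighted correlation in the frequency domain and more than two microphones.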

8. An apparatus for detecting a time-space voice section using audio and video information, comprising:
- a voice section detection unit configured to detect a voice section in an audio signal input to a microphone array;
- a speaker verification unit configured to perform speaker verification by comparing the detected voice section to a voice model constructed in advance; and
- a face direction verification unit configured to detect a speaker's face using a video signal input to a camera in response to the verification of the voice section, configured to estimate a speaker's face direction based on the direction of the speaker's face in the video signal, and configured to determine the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

9. The apparatus for detecting a time-space voice section using audio and video information according to claim 8, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.
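The speaker verification unit compares the detected voice section to "a voice model constructed in advance", but the claims leave the model type open. The sketch below assumes, for illustration only, a diagonal-covariance Gaussian fitted to enrollment feature vectors and a second Gaussian acting as a background model, with acceptance decided by the average log-likelihood ratio. Feature extraction is taken as done elsewhere, and every name and threshold here is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagonalGaussian:
    mean: List[float]
    var: List[float]

    def log_likelihood(self, x: Sequence[float]) -> float:
        """Log density of one feature vector under the diagonal Gaussian."""
        total = 0.0
        for xi, m, v in zip(x, self.mean, self.var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        return total


def fit_model(features: Sequence[Sequence[float]], var_floor: float = 1e-3) -> DiagonalGaussian:
    """Build a model in advance from enrollment feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [max(var_floor, sum((f[d] - mean[d]) ** 2 for f in features) / n)
           for d in range(dim)]
    return DiagonalGaussian(mean, var)


def verify_speaker(
    features: Sequence[Sequence[float]],
    speaker_model: DiagonalGaussian,
    background_model: DiagonalGaussian,
    threshold: float = 0.0,           # assumed decision threshold
) -> bool:
    """Accept when the average log-likelihood ratio exceeds the threshold."""
    score = sum(
        speaker_model.log_likelihood(f) - background_model.log_likelihood(f)
        for f in features
    ) / len(features)
    return score > threshold
```

Scoring against a background model rather than using the raw likelihood makes the threshold less sensitive to the length and loudness of the section, which is one reason likelihood ratios are a common choice for this kind of check.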

10. An apparatus for detecting a time-space voice section using audio and video information, comprising:
- a sound source position tracking unit configured to estimate a position of a sound source by using an audio signal input to a microphone array;
- a voice section detection unit configured to detect a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
- a speaker verification unit configured to perform speaker verification by comparing the detected voice section to a voice model constructed in advance; and
- a face direction verification unit configured to detect a speaker's face using a video signal input to a camera in response to the verification of the voice section, configured to estimate a speaker's face direction based on the direction of the speaker's face in the video signal, and configured to determine the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

11. The apparatus for detecting the time-space voice section using the audio and video information according to claim 10, wherein the speaker verification unit is configured to change a value of the reference position as the position of the estimated sound source when the speaker verification succeeds.

12. The apparatus for detecting a time-space voice section using audio and video information according to claim 10, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.

13. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code portions stored therein, the computer-executable program code portions comprising program code instructions for:
- detecting a voice section from an audio signal input to a microphone array;
- performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
- detecting a speaker's face by using a video signal input to a camera in response to the verification of the voice section;
- estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
- determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches a previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

14. The computer program product according to claim 13, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.

15. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-executable program code portions stored therein, the computer-executable program code portions comprising program code instructions for:
- estimating a position of a sound source by using an audio signal input to a microphone array;
- detecting a voice section in the audio signal in an instance in which the estimated position of the sound source does not match a previously stored reference direction by at least a threshold value;
- performing speaker verification by comparing the detected voice section to a voice model constructed in advance;
- detecting a speaker's face using a video signal input to a camera in response to the verification of the voice section;
- estimating a speaker's face direction based on the direction of the speaker's face in the video signal; and
- determining the detected voice section as a speaker's voice section for voice recognition when the estimated face direction matches the previously stored reference direction so that voices generated from speakers that do not match the previously stored reference direction are not recognized,
- wherein the detecting the voice section includes:
  - estimating a position of a sound source by using the audio signal input to the microphone array;
  - distinguishing noise by comparing the estimated position of the sound source and the previously stored reference direction with each other;
  - removing the distinguished noise; and
  - detecting a voice section on the basis of a single microphone in the signal of which the noise is removed.

16. The computer program product according to claim 15, wherein the removing the distinguished noise includes:
- removing a signal of a sound source estimated as a position different from the previously stored position.

Current Assignee: Korea University Industrial & Academic Collaboration Foundation
Original Assignee: Korea University Industrial & Academic Collaboration Foundation
Inventors: Lee, Hyeopwoo; Yook, Dongsuk
Primary Examiner(s): Azad, Abul
Application Number: US13/203,387
Publication Number: US 20120078624A1
Time in Patent Office: 2,393 Days
Field of Search: 701/30.5, 701/30.7, 701/30.8, 701/30.1, 701/23, 701/30.3, 701/30.4, 318/568.12, 318/568.16, 318/568.24, 704/233, 704/243, 704/244, 704/246, 704/247, 704/252, 704/E15.04, 704/E15.041, 704/E15.042, 704/E15.044, 704/E15.045, 704/E15.046, 704/E15.047, 704/E15.049, 704/E17.001, 704/E17.003, 704/E17.015, 704/E21.019, 704/E21.02, 704/253, 704/254
US Class Current: 1/1
CPC Class Codes:
- G10L 17/00   Speaker identification or v...
- G10L 2021/02166   Microphone arrays; Beamforming
- G10L 25/78   Detection of presence or ab...
