Multichannel voice detection in adverse environments

US 20040042626A1
Filed: 08/30/2002
Published: 03/04/2004
Est. Priority Date: 08/30/2002
Status: Active Grant

Abstract

A multichannel source activity detection system, e.g., a voice activity detection (VAD) system, and method that exploits spatial localization of a target audio source is provided. The method includes the steps of receiving a mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output a signal corresponding to a spatial signature of a source; summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and comparing the sum to a threshold to determine if a voice is present. Additionally, the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.
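
The steps in the abstract can be written compactly. The block below is a sketch in notation chosen here, not taken from the patent: X(ω) is the vector of transformed microphone signals, R_n(ω) the noise spectral power matrix, K(ω) the vector of channel transfer function ratios, R_s(ω) the source signal spectral power, Ω the predetermined range of frequencies, and τ the threshold. The order of the factors and the conjugate transpose on K are assumptions; the abstract only states that the three quantities multiply the transformed signals.

```latex
\begin{aligned}
Z(\omega) &= R_s(\omega)\, K(\omega)^{H}\, R_n(\omega)^{-1}\, X(\omega), \\
E &= \sum_{\omega \in \Omega} \lvert Z(\omega) \rvert^{2}, \\
\text{voice present} &\iff E \ge \tau .
\end{aligned}
```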

51 Citations

22 Claims

1. A method for determining if a voice is present in a mixed sound signal, the method comprising the steps of:
- receiving the mixed sound signal by at least two microphones;
  
  Fast Fourier transforming each received mixed sound signal into the frequency domain;
  
  filtering the transformed signals to output a signal corresponding to a spatial signature of a source;
  
  summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and
  
  comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.
- Dependent claims (2, 3, 4, 5):
- - 2. The method as in claim 1, further comprising the step of determining the threshold, wherein the determining the threshold step comprises:
    - summing an absolute value squared of the transformed signals over the at least two microphones;
      
      summing the summed transformed signals over a predetermined range of frequencies to produce a second sum; and
      
      multiplying the second sum by a boosting factor.
  - 3. The method as in claim 1, wherein the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.
  - 4. The method as in claim 3, wherein the channel transfer function ratios are determined by a direct path mixing model.
  - 5. The method as in claim 3, wherein the source signal spectral power is determined by spectrally subtracting the noise spectral power matrix from a measured signal spectral covariance matrix.
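
The decision rule of claims 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: `detect_voice`, `weights`, `bins`, and `boost` are hypothetical names, the filter is passed in as one weight per bin and microphone, and the averaging filter in the usage lines is only a stand-in for a spatial-signature filter.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def detect_voice(frames, weights, bins, boost):
    """frames: one time-domain frame per microphone.
    weights: hypothetical filter, weights[k][i] for bin k and microphone i.
    bins: the predetermined range of frequency bins.
    boost: the boosting factor applied to the threshold.
    Returns (voice_present, filtered_energy, threshold)."""
    spectra = [fft(f) for f in frames]
    energy = 0.0  # sum over bins of |Z(k)|^2 for the filtered signal Z
    total = 0.0   # |X_i(k)|^2 summed over microphones, then over bins
    for k in bins:
        z = sum(weights[k][i] * spectra[i][k] for i in range(len(frames)))
        energy += abs(z) ** 2
        total += sum(abs(s[k]) ** 2 for s in spectra)
    threshold = boost * total
    return energy >= threshold, energy, threshold


# Usage: a tone seen identically by both microphones passes an averaging
# filter, while the same tone in anti-phase is cancelled by it.
n = 8
tone = [math.cos(2 * math.pi * t / n) for t in range(n)]
average = [[0.5, 0.5] for _ in range(n)]
in_phase = detect_voice([tone, tone], average, range(1, 4), boost=0.25)
anti_phase = detect_voice([tone, [-v for v in tone]], average, range(1, 4), boost=0.25)
print(in_phase[0], anti_phase[0])  # True False
```

Because the threshold is built from the unfiltered microphone energy, it rises and falls with the overall signal level, so the comparison measures how much of that energy survives the filter rather than how loud the input is.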

6. A method for determining if a voice is present in a mixed sound signal, the method comprising the steps of:
- receiving the mixed sound signal by at least two microphones;
  
  Fast Fourier transforming each received mixed sound signal into the frequency domain;
  
  filtering the transformed signals to output signals corresponding to a spatial signature for each of a predetermined number of users;
  
  summing separately for each of the users an absolute value squared of the filtered signals over a predetermined range of frequencies;
  
  determining a maximum of the sums; and
  
  comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.
- Dependent claims (7, 8, 9, 10, 11):
- - 7. The method as in claim 6, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker.
  - 8. The method as in claim 6, further comprising the step of determining the threshold, wherein the determining the threshold step comprises:
    - summing an absolute value squared of the transformed signals over the at least two microphones;
      
      summing the summed transformed signals over a predetermined range of frequencies to produce a second sum; and
      
      multiplying the second sum by a boosting factor.
  - 9. The method as in claim 6, wherein the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.
  - 10. The method as in claim 9, wherein the filtering step is performed for each of the predetermined number of users and the channel transfer function ratio is measured for each user during a calibration.
  - 11. The method as in claim 9, wherein the source signal spectral power is determined by spectrally subtracting the noise spectral power matrix from a measured signal spectral covariance matrix.
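
The multi-user variant of claims 6 and 7 differs from the single-source case only in that one sum is computed per user and the largest is compared to the threshold. A minimal sketch, assuming the transformed signals and one hypothetical filter per user are already available (`detect_active_speaker` and `user_weights` are illustrative names, not from the patent):

```python
def detect_active_speaker(spectra, user_weights, bins, threshold):
    """spectra[i][k]: transformed signal of microphone i at bin k.
    user_weights[u][k][i]: hypothetical filter for user u, bin k, microphone i.
    Returns (voice_present, active_user_index_or_None, per_user_sums)."""
    sums = []
    for weights in user_weights:
        energy = 0.0
        for k in bins:
            z = sum(w * s[k] for w, s in zip(weights[k], spectra))
            energy += abs(z) ** 2
        sums.append(energy)
    # The user with the maximum sum is the candidate active speaker.
    best = max(range(len(sums)), key=sums.__getitem__)
    if sums[best] >= threshold:
        return True, best, sums
    return False, None, sums


# Usage: two microphones, two bins, two users with different filters.
spectra = [[1 + 0j, 2 + 0j], [1 + 0j, -2 + 0j]]
user_weights = [
    [[0.5, 0.5], [0.5, 0.5]],    # user 0: responds to in-phase content
    [[0.5, -0.5], [0.5, -0.5]],  # user 1: responds to anti-phase content
]
result = detect_active_speaker(spectra, user_weights, range(2), threshold=3.0)
print(result)  # (True, 1, [1.0, 4.0])
```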

12. A voice activity detector for determining if a voice is present in a mixed sound signal comprising:
- at least two microphones for receiving the mixed sound signal;
  
  a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain;
  
  a filter for filtering the transformed signals to output a signal corresponding to a spatial signature for each of the transformed signals;
  
  a first summer for summing an absolute value squared of the filtered signals over a predetermined range of frequencies; and
  
  a comparator for comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.
- Dependent claims (13, 14, 15):
- - 13. The voice activity detector as in claim 12, further comprising:
    - a second summer for summing an absolute value squared of the transformed signals over the at least two microphones and for summing the summed transformed signals over a predetermined range of frequencies to produce a second sum; and
      
      a multiplier for multiplying the second sum by a boosting factor to determine the threshold.
  - 14. The voice activity detector as in claim 12, wherein the filter includes a multiplier for multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power to determine the signal corresponding to a spatial signature.
  - 15. The voice activity detector as in claim 14, further including a spectral subtractor for spectrally subtracting the noise spectral power matrix from a measured signal spectral covariance matrix to determine the signal spectral power.
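
The multiplier of claim 14 and the spectral subtractor of claim 15 can be illustrated for a single frequency bin and two microphones. The sketch below rests on several assumptions that the claims do not state: the factors are applied as z = R_s · Kᴴ · R_n⁻¹ · x, microphone 0 is the reference so that K[0] = 1, and the source power is read from the (0, 0) entry of the subtracted matrix and floored at zero. All function names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 complex matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("noise spectral power matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def source_power(r_x, r_n):
    """Spectral subtraction of the noise matrix r_n from the measured
    covariance r_x. Assumes K[0] = 1, so entry (0, 0) of the difference
    is the source power; negative estimates are floored at zero."""
    return max(complex(r_x[0][0] - r_n[0][0]).real, 0.0)


def filter_bin(x, k, r_x, r_n):
    """One bin of the assumed filter: z = R_s * K^H * R_n^{-1} * x."""
    inv = inverse_2x2(r_n)
    # y = R_n^{-1} x
    y = [inv[0][0] * x[0] + inv[0][1] * x[1],
         inv[1][0] * x[0] + inv[1][1] * x[1]]
    # K^H y: conjugate each transfer function ratio, then sum the products.
    projected = (complex(k[0]).conjugate() * y[0]
                 + complex(k[1]).conjugate() * y[1])
    return source_power(r_x, r_n) * projected


# Usage: white noise of unit power, source power 2, K = [1, j].
k = [1, 1j]
r_n = [[1, 0], [0, 1]]
r_x = [[3, -2j], [2j, 3]]  # 2 * K K^H + r_n
on_target = filter_bin([1, 1j], k, r_x, r_n)    # x aligned with K
off_target = filter_bin([1, -1j], k, r_x, r_n)  # x orthogonal to K
print(abs(on_target), abs(off_target))
```

A signal whose inter-microphone relationship matches K passes with gain, while one orthogonal to K is cancelled, which is what lets the summed output energy separate the target source from other sounds.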

16. A voice activity detector for determining if a voice is present in a mixed sound signal comprising:
- at least two microphones for receiving the mixed sound signal;
  
  a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain;
  
  at least one filter for filtering the transformed signals to output a signal corresponding to a spatial signature for each of a predetermined number of users;
  
  at least one first summer for summing separately for each of the users an absolute value squared of the filtered signals over a predetermined range of frequencies;
  
  a processor for determining a maximum of the sums; and
  
  a comparator for comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.
- Dependent claims (17, 18, 19, 20, 21):
- - 17. The voice activity detector as in claim 16, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker.
  - 18. The voice activity detector as in claim 16, further comprising a second summer for summing an absolute value squared of the transformed signals over the at least two microphones and for summing the summed transformed signals over a predetermined range of frequencies to produce a second sum; and a multiplier for multiplying the second sum by a boosting factor to determine the threshold.
  - 19. The voice activity detector as in claim 16, wherein the at least one filter includes a multiplier for multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power to determine the signal corresponding to a spatial signature.
  - 20. The voice activity detector as in claim 19, further comprising a calibration unit for determining the channel transfer function ratio for each user during a calibration.
  - 21. The voice activity detector as in claim 19, further including a spectral subtractor for spectrally subtracting the noise spectral power matrix from a measured signal spectral covariance matrix to determine the signal spectral power.
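
Claims 10 and 20 say the channel transfer function ratio is measured for each user during a calibration, without fixing an estimator. One common choice, shown here purely as an assumption, is a least-squares ratio of each microphone to a reference microphone, averaged over calibration frames in which only that user speaks. `estimate_ratios` and its `eps` guard are hypothetical.

```python
def estimate_ratios(calibration_frames, eps=1e-12):
    """calibration_frames[t][i][k]: spectrum of frame t, microphone i, bin k,
    recorded while only the user being calibrated speaks.
    Returns ratios[k][i], a least-squares estimate of X_i / X_0 at bin k."""
    n_mics = len(calibration_frames[0])
    n_bins = len(calibration_frames[0][0])
    ratios = []
    for k in range(n_bins):
        # Power of the reference microphone, accumulated over frames.
        ref_power = sum(abs(f[0][k]) ** 2 for f in calibration_frames)
        row = []
        for i in range(n_mics):
            # Cross-spectrum between microphone i and the reference.
            cross = sum(f[i][k] * complex(f[0][k]).conjugate()
                        for f in calibration_frames)
            row.append(cross / (ref_power + eps))  # eps avoids division by zero
        ratios.append(row)
    return ratios


# Usage: microphone 1 always sees 0.5j times what microphone 0 sees.
frames = [
    [[1 + 0j, 2 + 0j], [0.5j, 1j]],
    [[1j, -1 + 0j], [-0.5 + 0j, -0.5j]],
]
ratios = estimate_ratios(frames)
print(ratios[0][1], ratios[1][1])  # both approximately 0.5j
```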

22. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for determining if a voice is present in a mixed sound signal, the method steps comprising:
- receiving the mixed sound signal by at least two microphones;
  
  Fast Fourier transforming each received mixed sound signal into the frequency domain;
  
  filtering the transformed signals to output a signal corresponding to a spatial signature of a source;
  
  summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and
  
  comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.

Current Assignee: Siemens Corp. (Siemens AG)
Original Assignee: Siemens Corporate Research Incorporated (Siemens AG)
Inventors: Beaugeant, Christophe; Rosca, Justinian; Balan, Radu Victor

Granted Patent: US 7,146,315 B2
US Class Current: 381/110
CPC Class Codes:

G10L 2021/02165 Two microphones, one receiv...

G10L 25/78 Detection of presence or ab...
