Energy-based sound source localization and gain normalization

US 7,924,655 B2
Filed: 01/16/2007
Issued: 04/12/2011
Est. Priority Date: 01/16/2007
Status: Active Grant

Abstract

An energy based technique to estimate the positions of people speaking from an ad hoc network of microphones. The present technique does not require accurate synchronization of the microphones. In addition, a technique to normalize the gains of the microphones based on people's speech is presented, which allows aggregation of various audio channels from the ad hoc microphone network into a single stream for audio conferencing. The technique is invariant of the speaker's volumes thus making the system easy to deploy in practice.
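
The gain normalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the relative energy gain of each microphone has already been estimated, that "energy" means mean-square amplitude (so an energy gain g corresponds to an amplitude gain of sqrt(g)), that the channels are already time-aligned, and that the channels are combined by plain averaging. None of these choices is stated in the text above, and the function name `normalize_and_mix` is hypothetical.

```python
import numpy as np

def normalize_and_mix(channels, rel_energy_gain):
    """Equalize per-microphone gains, then average into a single stream.

    channels: array of shape (n_mics, n_samples), assumed time-aligned.
    rel_energy_gain: energy gain of each microphone relative to a reference.
    """
    x = np.asarray(channels, dtype=float)
    g = np.asarray(rel_energy_gain, dtype=float)
    # Assumption: energy is mean-square amplitude, so divide by sqrt(gain).
    equalized = x / np.sqrt(g)[:, None]
    # Assumption: simple averaging is used as the aggregation rule.
    return equalized.mean(axis=0)
```

With equalized channels, a speaker sounds equally loud regardless of which device captured the speech, which is what makes a single aggregated stream usable.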

20 Claims

1. A computer-implemented process for determining the location of one or more people speaking in a room captured by an ad hoc microphone network, comprising the process actions of:
- inputting audio streams of people speaking, each audio signal being captured with a microphone on a computing device; and
  
  segmenting each audio stream to find the person closest to each microphone;
  
  finding the average energy of the person closest to each microphone;
  
  using the average energy of the person closest to each microphone, to compute the gain of each microphone;
  
  using the average energy of the person closest to each microphone, computing the attenuation of each person's speech when it reaches each microphone;
  
  using the attenuation of each person's speech to find the distance between each microphone; and
  
  using the distance between each microphone to find the coordinates of each microphone and the person closest to each microphone, assuming that the person closest to each microphone is at the same location as the microphone.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The computer-implemented process of claim 1 further comprising using at least one of the coordinates of each microphone and the person closest to each microphone, and the gain of each microphone to improve captured audio or video of the people speaking.
  - 3. The computer-implemented process of claim 1 wherein Metric Multidimensional Scaling is used to obtain the coordinates for each microphone.
  - 4. The computer-implemented process of claim 1 wherein coordinates of the microphone and the person closest to each microphone are used for sound source localization to improve the audio stream of a person speaking.
  - 5. The computer-implemented process of claim 1 wherein coordinates of the microphone and the person closest to each microphone are used for selecting and displaying video of the person closest to each microphone speaking.
  - 6. The computer-implemented process of claim 1 wherein coordinates of the microphone and the person closest to each microphone are used for displaying contact information of the person closest to each microphone when that person is speaking.
  - 7. The computer-implemented process of claim 1 wherein the gain of at least one microphone is used for gain normalization.
  - 8. The computer-implemented process of claim 1, further comprising:
    - computing an average energy ratio, the ratio of the average energy of the audio stream of a speaker that does not have a microphone to a first microphone over the average energy of the audio stream of the speaker that does not have a microphone to a second microphone;
      
      using the average energy ratio to compute an attenuation ratio, the ratio of the attenuation of the audio stream of the speaker that does not have a microphone to a first microphone over the attenuation of the audio stream of the speaker that does not have a microphone to a second microphone;
      
      using the attenuation ratio to find a distance ratio, the ratio of the distance of the speaker that does not have a microphone to a first microphone over the distance of the speaker that does not have a microphone to a second microphone; and
      
      using the distance ratio to find the coordinates of the speaker that does not have a microphone.
  - 9. The computer-implemented process of claim 1 wherein segmenting each audio stream to find the person closest to each microphone, comprises:
    - recording each person speaking in an audio file;
      
      segmenting all audio files into segments by detecting the first speech frame and aligning the segments across the audio files; and
      
      for each segment, finding the audio file that has the highest signal to noise ratio and designating this as the speaker that corresponds to the microphone that captured that audio file.
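
The steps of claims 1 and 3 can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes a multiplicative model a_ij = m_i * s_j * c_ij (average energy at microphone i of speaker j equals microphone gain times speech energy times attenuation), with c_ii = 1 because each speaker is taken to be at their own microphone, and c_ij = c_ji. It also assumes that attenuation falls off as 1/d^p with a configurable exponent p, so distances are recovered only up to a common scale. The function names and the exponent parameter are hypothetical.

```python
import numpy as np

def attenuations_and_gains(a):
    """a[i, j]: average energy of speaker j as captured by microphone i."""
    a = np.asarray(a, dtype=float)
    diag = np.diag(a)
    # Under the assumed model, a_ij * a_ji / (a_ii * a_jj) = c_ij ** 2.
    c = np.sqrt(a * a.T / np.outer(diag, diag))
    # Gain of microphone i relative to microphone 0:
    # (m_i / m_0) ** 2 = a_ii * a_i0 / (a_00 * a_0i).
    rel_gain = np.sqrt(diag * a[:, 0] / (diag[0] * a[0, :]))
    return c, rel_gain

def distances_from_attenuation(c, exponent=1.0):
    """Assumption: c = 1 / d ** exponent, so d = c ** (-1 / exponent)."""
    d = np.asarray(c, dtype=float) ** (-1.0 / exponent)
    np.fill_diagonal(d, 0.0)  # a microphone is at distance 0 from itself
    return d

def classical_mds(d, dim=2):
    """Classical metric multidimensional scaling of a distance matrix."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    w, v = np.linalg.eigh(b)
    top = np.argsort(w)[::-1][:dim]  # indices of the largest eigenvalues
    return v[:, top] * np.sqrt(np.maximum(w[top], 0.0))
```

The coordinates returned by `classical_mds` are determined only up to rotation, reflection, and translation, so they should be compared through pairwise distances rather than directly.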

10. A computer-implemented process for determining and using the location of people speaking in a room captured by an ad hoc microphone network, comprising:
- inputting audio streams of people speaking, each audio signal being captured with a microphone on a computing device; and
  
  segmenting each audio stream to find the person closest to each microphone;
  
  finding the average energy of the person closest to each microphone;
  
  using the average energy of the person closest to each microphone, to compute the gain of each microphone;
  
  using the average energy of the person closest to each microphone, computing the attenuation of each person's speech when it reaches each microphone;
  
  using the attenuation of each person's speech that is closest to each microphone to find the distance between each microphone;
  
  using the distance between each microphone to find the coordinates of each microphone and the person closest to each microphone assuming the microphone and the person closest to it are co-located;
  
  computing an average energy ratio, the ratio of the average energy of the audio stream of a speaker that does not have a microphone to a first microphone over the average energy of the audio stream of the speaker that does not have a microphone to a second microphone;
  
  using the average energy ratio to compute an attenuation ratio, the ratio of the attenuation of the audio stream of the speaker that does not have a microphone to a first microphone over the attenuation of the audio stream of the speaker that does not have a microphone to a second microphone;
  
  using the attenuation ratio to find a distance ratio, the ratio of the distance of the speaker that does not have a microphone to a first microphone over the attenuation of the distance of the speaker that does not have a microphone to a second microphone; and
  
  using the distance ratio to find the coordinates of the speaker that does not have a microphone.
- Dependent claims (11, 12, 13, 14)
  - 11. The computer-implemented process of claim 10 further comprising using at least one of the coordinates of each microphone and the person closest to each microphone, the coordinates of a person that does not have a microphone, and the gain of each microphone to improve captured audio or video of the people speaking.
  - 12. The computer-implemented process of claim 10 wherein the gain of at least two microphones is used to perform gain normalization.
  - 13. The computer-implemented process of claim 10 wherein using the distance ratio to find the coordinates of the speaker that does not have a microphone is solved by a nonlinear least square solver.
  - 14. A computer-readable medium having computer-executable instructions for performing the process recited in claim 10.
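
The distance-ratio steps of claim 10 and the nonlinear least-squares solve of claim 13 can be sketched as below. This is an illustrative sketch under stated assumptions, not the solver used in the patent. It assumes microphone coordinates and gains are already known, that attenuation behaves as 1/d^p, and it uses a small damped Gauss-Newton (Levenberg-Marquardt style) iteration on log distance ratios as one possible solver. The names `distance_ratio` and `locate_speaker` are hypothetical.

```python
import numpy as np

def distance_ratio(a_ik, a_jk, gain_i, gain_j, exponent=1.0):
    """d_ik / d_jk for speaker k, from energies at microphones i and j."""
    # Energy ratio divided by the gain ratio gives the attenuation ratio.
    attenuation_ratio = (a_ik / a_jk) * (gain_j / gain_i)  # c_ik / c_jk
    # Assumption: c = 1 / d ** exponent.
    return attenuation_ratio ** (-1.0 / exponent)

def locate_speaker(mics, ratios, guess, iters=100):
    """ratios[(i, j)] = d_i / d_j; returns the estimated (x, y)."""
    mics = np.asarray(mics, dtype=float)
    p = np.asarray(guess, dtype=float)
    pairs = list(ratios)
    target = np.log([ratios[pair] for pair in pairs])

    def residuals(q):
        log_d = 0.5 * np.log(np.sum((q - mics) ** 2, axis=1))
        return np.array([log_d[i] - log_d[j] for i, j in pairs]) - target

    damping = 1e-3
    r = residuals(p)
    for _ in range(iters):
        diff = p - mics
        grad = diff / np.sum(diff ** 2, axis=1, keepdims=True)  # d(log d_i)/dp
        jac = np.array([grad[i] - grad[j] for i, j in pairs])
        step = np.linalg.solve(jac.T @ jac + damping * np.eye(2), -jac.T @ r)
        r_new = residuals(p + step)
        if r_new @ r_new < r @ r:      # accept only steps that reduce the error
            p, r = p + step, r_new
            damping = max(damping * 0.1, 1e-12)
        else:
            damping *= 10.0
    return p
```

Distance ratios alone leave an ambiguity when all microphones lie on a common circle (which always holds for three or fewer), so this sketch expects at least four microphones that are not concyclic and an initial guess inside the microphone layout.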

15. A system for improving the audio and video quality of a recorded event, comprising:
- a general purpose computing device;
  
  a computer program comprising program modules executable by the general purpose computing device, wherein the computing device is directed by the program modules of the computer program to, find one or more speakers' positions by using the average energy of a captured audio segment for each person speaking; and
  
  apply the one or more speakers' positions to improve the audio or video of a captured event.
- Dependent claims (16, 17, 18, 19, 20)
  - 16. The system of claim 15 wherein the module to find the one or more speakers' positions comprises sub-modules to:
    - compute an average energy ratio, the ratio of the average energy of the audio stream of a speaker to a first microphone over the average energy of the audio stream of the speaker to a second microphone;
      
      use the average energy ratio to compute an attenuation ratio, the ratio of the attenuation of the audio stream of the speaker to a first microphone over the attenuation of the audio stream of the speaker to a second microphone;
      
      use the attenuation ratio to find a distance ratio, the ratio of the distance of the speaker to a first microphone over the distance of the speaker to a second microphone; and
      
      use the distance ratio to find the coordinates of the speaker.
  - 17. The system of claim 15 further comprising program modules to, find one or more speakers' positions, microphone positions, and gain of the microphones where each person speaking has a computing device with a microphone by using the average energy of a captured audio segment for each person speaking;
    - and, apply at least one of the speakers' positions, microphone positions, and the gain of the microphones to improve the audio or video of a captured event.
  - 18. The system of claim 17 wherein the module to find the one or more speakers' positions, microphone positions, and gain of the microphones where each person speaking has a computing device with a microphone comprises sub-modules to:
    - segment received audio streams from each person in a room that speaks to find the average energy of an audio segment for each of the people speaking;
      
      compute the attenuation of a person's speech when it reaches each of the microphones and the gain of each of the microphones;
      
      use the attenuations to find the distance between each microphone relative to the other microphones; and
      
      use the distances between each microphone relative to the other microphones to find the coordinates of each microphone and each person speaking, assuming the microphones and the people speaking are co-located.
  - 19. The system of claim 17 wherein the module to apply the speakers' positions, microphone positions and microphone gains to improve the audio or video of a captured event performs gain normalization to create a single audio stream of the captured event.
  - 20. The system of claim 15 wherein the module to find the one or more speakers' positions comprises sub-modules to:
    - express the average energy of speaker j, a_ij, in an audio segment in an audio file y_i(t) in the log domain using the coordinates of microphones used to capture the audio segment (u_i,v_i), the coordinates of the speaker j (x_j,y_j), the average energy of j's original speech, s_j, the gain of a microphone i, m_i, and the noise measurements of microphone i, N(0,σ_i²); and
      
      minimize a sum of error functions weighted by the variance of the noise measurements of each microphone to find one or more speakers' positions.
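
One possible reading of the objective in claim 20 is sketched below. It assumes the log-domain model ln a_ij = ln m_i + ln s_j - ln d_ij + noise, where d_ij is the Euclidean distance between microphone (u_i, v_i) and speaker (x_j, y_j), and it assumes that "weighted by the variance" means each squared error is divided by σ_i², as in a Gaussian maximum-likelihood fit. Both the 1/d attenuation and the inverse-variance weighting are assumptions of this sketch, and `weighted_log_cost` is a hypothetical name.

```python
import numpy as np

def weighted_log_cost(a, mic_xy, spk_xy, log_gain, log_speech, sigma):
    """Sum over i, j of (ln a_ij - model_ij) ** 2 / sigma_i ** 2.

    a[i, j]: average energy of speaker j at microphone i.
    mic_xy[i] = (u_i, v_i); spk_xy[j] = (x_j, y_j).
    log_gain[i] = ln m_i; log_speech[j] = ln s_j; sigma[i] = noise std dev.
    """
    a = np.asarray(a, dtype=float)
    mic_xy = np.asarray(mic_xy, dtype=float)
    spk_xy = np.asarray(spk_xy, dtype=float)
    log_gain = np.asarray(log_gain, dtype=float)
    log_speech = np.asarray(log_speech, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # d[i, j]: distance from microphone i to speaker j.
    d = np.linalg.norm(mic_xy[:, None, :] - spk_xy[None, :, :], axis=-1)
    # Assumed model: attenuation is 1 / d, so it subtracts ln d in the log domain.
    model = log_gain[:, None] + log_speech[None, :] - np.log(d)
    err = np.log(a) - model
    return float(np.sum(err ** 2 / sigma[:, None] ** 2))
```

A general-purpose nonlinear least-squares routine could then minimize this cost over the unknown speaker coordinates, and optionally over the gains and speech energies as well.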

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chou, Philip A.; He, Li-wei; Zhang, Zhengyou; Chen, Minghua; Liu, Zicheng
Primary Examiner(s): Pihulic; Dan
Application Number: US11/623,643
Publication Number: US 20080170717A1
Time in Patent Office: 1,547 Days
Field of Search: 367/124, 367/129, 367/118
US Class Current: 367/124
CPC Class Codes:
G01S 11/14   using ultrasonic, sonic, or...
G01S 5/30   Determining absolute distan...
H04M 3/56   Arrangements for connecting...
H04N 7/141   between two video terminals...
H04N 7/15   Conference systems
H04R 2420/07   Applications of wireless lo...
H04R 3/005   for combining the signals o...
