System and process for locating a speaker using 360 degree sound source localization

US 7,305,095 B2
Filed: 07/15/2005
Issued: 12/04/2007
Est. Priority Date: 08/26/2002
Status: Expired due to Fees

Abstract

A system and process is described for estimating the location of a speaker using signals output by a microphone array characterized by multiple pairs of audio sensors. The location of a speaker is estimated by first determining whether the signal data contains human speech components and filtering out noise attributable to stationary sources. The location of the person speaking is then estimated using a time-delay-of-arrival based SSL technique on those parts of the data determined to contain human speech components. A consensus location for the speaker is computed from the individual location estimates associated with each pair of microphone array audio sensors taking into consideration the uncertainty of each estimate. A final consensus location is also computed from the individual consensus locations computed over a prescribed number of sampling periods using a temporal filtering technique.
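
The abstract says the consensus location takes the uncertainty of each per-pair estimate into account, but this excerpt does not give the formula. As a minimal sketch, assuming each sensor pair reports a bearing in degrees together with a standard deviation, one plausible approach is an inverse-variance weighted circular mean. The function name `consensus_angle` and the `1/sigma^2` weighting are assumptions made for illustration, not the patented method.

```python
import math


def consensus_angle(estimates):
    """Combine per-pair bearing estimates into one consensus bearing.

    estimates: list of (angle_degrees, sigma_degrees) tuples, one per
    sensor pair. Weighting each estimate by 1/sigma^2 is an assumption
    of this sketch; the source only says uncertainty is considered.
    """
    x = y = 0.0
    for angle_deg, sigma_deg in estimates:
        weight = 1.0 / (sigma_deg ** 2)
        # Sum unit vectors so that bearings near 0/360 degrees average
        # correctly instead of collapsing toward 180 degrees.
        x += weight * math.cos(math.radians(angle_deg))
        y += weight * math.sin(math.radians(angle_deg))
    if x == 0.0 and y == 0.0:
        raise ValueError("estimates cancel out; no consensus direction")
    return math.degrees(math.atan2(y, x)) % 360.0
```

With two equally certain estimates at 0 and 90 degrees the result is close to 45 degrees; if the second estimate is ten times less certain, the result stays within one degree of the first.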

5 Claims

1. A computer-readable medium having computer-executable instructions for estimating the location of a person speaking using signals output by a microphone array having a plurality of synchronized audio sensor pairs, said computer-executable instructions comprising:
- simultaneously sampling the signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signals sampled at the same time;
  
  for each group of contemporaneous blocks of signal data,
  - determining whether a block contains human speech data for each block of signal data,
  - filtering out noise attributable to stationary sources in each of the blocks determined to contain human speech data,
  - estimating the location of the person speaking using a time-delay-of-arrival (TDOA) based sound source localization (SSL) technique on those contemporaneous blocks of signal data determined to contain human speech data for each pair of synchronized audio sensors, and
  - computing a consensus estimated location for the person speaking from the individual location estimates determined from the contemporaneous blocks of filtered signal data found to contain human speech data of each pair of synchronized audio sensors;
  
  computing a final consensus location of the person speaking using a temporal filtering technique to combine the individual consensus locations computed over a prescribed number of sampling periods; and
  
  designating the final consensus location as the location of the person speaking.
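
The TDOA step in claim 1 can be illustrated for a single sensor pair. The sketch below uses a plain time-domain cross-correlation to find the delay between two contemporaneous blocks and a far-field model to turn that delay into a bearing. The claim does not say which correlation method is used, so this is a generic stand-in; the function names, the `max_lag` parameter, and the speed-of-sound constant are all assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air


def estimate_delay_samples(block_a, block_b, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means block_b is a delayed copy of block_a.
    Both blocks are assumed to have the same length.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(block_a)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += block_a[i] * block_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag, sample_rate, mic_spacing):
    """Convert a delay in samples to a far-field bearing in degrees.

    The angle is measured from the broadside of the pair. The ratio is
    clamped to [-1, 1] so that noisy delays never break asin().
    """
    tau = lag / sample_rate
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing))
    return math.degrees(math.asin(ratio))
```

A single pair only yields a bearing with a front/back ambiguity, which is one reason the claim combines the estimates from several pairs into a consensus.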

2. A system for estimating the location of a person speaking, comprising:
- a microphone array having two or more audio sensor pairs;
  
  a general purpose computing device;
  
  a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to,
  
  input signals generated by each audio sensor of the microphone array;
  
  simultaneously sample the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signals sampled at the same time;
  
  for each block of signal data, determine whether the block contains human speech data;
  
  filter out noise attributable to stationary sources in each of the blocks of the signal data determined to contain human speech data;
  
  estimate the location of the person speaking using a time-delay-of-arrival (TDOA) based sound source localization (SSL) technique on the contemporaneous blocks of filtered signal data determined to contain human speech data for each pair of audio sensors; and
  
  compute a consensus estimated location for the person speaking from the individual location estimates determined from the contemporaneous blocks of filtered signal data found to contain human speech data of each pair of audio sensors.

3. The system of claim 2, further comprising a program module for refining the identified location of the person speaking, said refining module comprising sub-modules for:
- computing said consensus location whenever the sensor signal data captured in a prescribed sampling period contains human speech data, for a prescribed number of consecutive sampling periods; and
  
  combining the individual computed consensus locations to produce a refined estimate using a temporal filtering technique.

4. The system of claim 3, wherein the temporal filtering technique is one of (i) a median filtering technique, (ii) a Kalman filtering technique, and (iii) a particle filtering technique.

5. The system of claim 2, wherein the computing device comprises a separate stereo-pair sound card for each of said pairs of audio sensors, and wherein for each sound card, the output of each sensor in the associated pair of sensors is input to the sound card and the outputs of the sensor pair are synchronized by the sound card.
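
Claims 3 and 4 describe refining the location by temporally filtering the consensus estimates from consecutive sampling periods, with median filtering as one option. A minimal sketch of that option follows, assuming each consensus location is a bearing in degrees. The class name `MedianLocationFilter`, the window size, and the wrap-around handling are assumptions for illustration; the claims do not specify them.

```python
from collections import deque
from statistics import median


class MedianLocationFilter:
    """Sliding-window median over consensus bearings (degrees)."""

    def __init__(self, window=5):
        # Keep only the most recent `window` consensus estimates.
        self.history = deque(maxlen=window)

    def update(self, consensus_deg):
        """Add one consensus bearing and return the filtered bearing."""
        self.history.append(consensus_deg % 360.0)
        ref = self.history[-1]
        # Work with signed offsets from the newest value, wrapped to
        # [-180, 180), so bearings on either side of 0/360 stay close.
        offsets = [(a - ref + 180.0) % 360.0 - 180.0 for a in self.history]
        return (ref + median(offsets)) % 360.0
```

Feeding the bearings 358, 2, 0, 90, 1 through a five-sample window returns 1 after the last update: the isolated 90 degree outlier is rejected, and the values on either side of the 0/360 boundary are treated as neighbors.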

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Rui, Yong
Primary Examiner(s): Mei; Xu
Application Number: US11/182,142
Publication Number: US 20050265562A1
Time in Patent Office: 872 Days
Field of Search: 381/91, 381/122, 381/92, 704/214, 704/233, 704/248, 704/253, 348/14.08
US Class Current: 381/92
CPC Class Codes:
H04R 2201/401 2D or 3D arrays of transducers
H04R 3/005 for combining the signals o...
