System and process for locating a speaker using 360 degree sound source localization
Abstract
A system and process is described for estimating the location of a speaker using signals output by a microphone array characterized by multiple pairs of audio sensors. The location of a speaker is estimated by first determining whether the signal data contains human speech components and filtering out noise attributable to stationary sources. The location of the person speaking is then estimated using a time-delay-of-arrival based SSL technique on those parts of the data determined to contain human speech components. A consensus location for the speaker is computed from the individual location estimates associated with each pair of microphone array audio sensors taking into consideration the uncertainty of each estimate. A final consensus location is also computed from the individual consensus locations computed over a prescribed number of sampling periods using a temporal filtering technique.
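The consensus step described in the abstract can be illustrated with a short sketch. The abstract does not say how the uncertainty of each estimate is used or which temporal filter is applied, so everything below is an assumption: each per-pair estimate is taken to be a bearing in degrees with a variance-like uncertainty, estimates are fused by inverse-variance weighting on unit vectors (so that bearings near 0 and 360 degrees average correctly), and the temporal filter is stood in for by the same weighted circular mean over a window of sampling periods. The function names are hypothetical.

```python
import math


def consensus_angle(estimates):
    """Fuse (angle_deg, variance) pairs into one bearing in [0, 360).

    Assumption: each estimate is weighted by 1/variance, and angles are
    averaged as unit vectors so that 350 and 10 degrees average to 0.
    """
    x = y = 0.0
    for angle_deg, variance in estimates:
        weight = 1.0 / variance
        x += weight * math.cos(math.radians(angle_deg))
        y += weight * math.sin(math.radians(angle_deg))
    angle = math.degrees(math.atan2(y, x)) % 360.0
    # Guard against floating-point rounding up to exactly 360.0.
    return 0.0 if angle >= 360.0 else angle


def smooth_over_periods(per_period_angles, per_period_variances):
    """Stand-in temporal filter: weighted circular mean over a window
    of per-period consensus angles (an assumption, not the patent's filter)."""
    return consensus_angle(list(zip(per_period_angles, per_period_variances)))


def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


if __name__ == "__main__":
    # Two sensor pairs disagree slightly; the low-variance estimate dominates.
    print(consensus_angle([(30.0, 1.0), (40.0, 4.0)]))
```

Averaging unit vectors rather than raw angles matters for a 360 degree system, because a naive arithmetic mean of 350 and 10 degrees would point in the opposite direction from the speaker.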
25 Claims
1. A computer-implemented process for finding the location of a person speaking using signals output by a microphone array having a plurality of audio sensors, comprising using a computer to perform the following process actions:
- inputting the signal generated by each audio sensor of the microphone array;
- distinguishing the portion of each of the array sensor signals that contains human speech data from non-speech portions;
- reducing noise attributable to stationary sources in each of the array sensor signals;
- locating the position of the person speaking using a time-delay-of-arrival (TDOA) based sound source localization (SSL) technique on those portions of the array sensor signals that contain human speech data; and
wherein, distinguishing the portion of each of the array sensor signals that contains human speech data from the non-speech portions, comprises, for each array sensor signal, the actions of,
- sampling the signal to produce a sequence of consecutive blocks of the signal data representing the output of the sensor over a prescribed period of time,
- converting each block of signal data to the frequency domain,
- initializing the distinguishing action using three consecutive blocks of signal data, said initializing comprising the actions of,
  - computing the total energy of the blocks,
  - computing the delta energy of the third block in the sequence by computing the difference between the total energy of said third block and that of the second block in the sequence,
  - computing a noise floor energy for the second and third blocks, and
  - computing the delta energy of the noise floor for the third block which represents the difference of the noise floor energy value computed for the third and that computed for the second block, and
- for each consecutive block of signal data starting with the third block employed in the initialization action,
  - computing the total energy of the block if not previously computed,
  - computing the delta energy of the block if not previously computed, wherein the delta energy represents the difference in total energy between the block under consideration and that of the immediately preceding block of signal data,
  - computing the delta energy of the noise floor of the block if not previously computed, wherein the delta noise floor energy represents the difference between the last-computed noise floor energy value and that associated with the immediately preceding block of signal data,
  - determining whether the total energy of the block exceeds a prescribed multiple of the energy of the noise floor of the block and whether the delta energy of the block exceeds a prescribed multiple of the delta energy of the noise floor of the block, and
  - whenever it is determined that the total energy of the block exceeds the prescribed multiple of the energy of the noise floor of the block and the delta energy of the block exceeds the prescribed multiple of the delta energy of the noise floor of the block, designating the block as one containing human speech components.
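The speech/non-speech test recited in claim 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the patented implementation: the claim does not say how the noise floor energy is computed, so the sketch assumes it starts as the mean energy of the first two blocks and then follows the block energy through slow exponential smoothing (`alpha`). The multiples `energy_multiple` and `delta_multiple` stand for the "prescribed multiples" and their values are arbitrary. A naive DFT is used so the example needs only the standard library.

```python
import cmath


def total_energy(block):
    """Total energy of one block, computed from its frequency-domain form."""
    n = len(block)
    spectrum = [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(block))
        for k in range(n)
    ]
    return sum(abs(c) ** 2 for c in spectrum) / n


def detect_speech(blocks, energy_multiple=2.0, delta_multiple=2.0, alpha=0.05):
    """Return one flag per block; True marks a block designated as speech.

    The first two blocks are used only for initialization and stay False.
    """
    if len(blocks) < 3:
        raise ValueError("need at least three blocks to initialize")
    energy = [total_energy(b) for b in blocks]
    # Assumed noise floor for the second block: mean of the first two energies.
    floor_prev = (energy[0] + energy[1]) / 2.0
    flags = [False, False]
    for i in range(2, len(blocks)):
        # Assumed noise floor update: slow exponential tracking of block energy.
        floor = floor_prev + alpha * (energy[i] - floor_prev)
        delta_energy = energy[i] - energy[i - 1]
        delta_floor = floor - floor_prev
        flags.append(
            energy[i] > energy_multiple * floor
            and delta_energy > delta_multiple * delta_floor
        )
        floor_prev = floor
    return flags


if __name__ == "__main__":
    noise = [0.01, -0.01] * 4
    speech = [1.0, -1.0] * 4
    print(detect_speech([noise] * 4 + [speech] + [noise]))
```

Because both conditions must hold, this bare test flags the onset of a loud block rather than every block of sustained speech; a practical detector would presumably add further logic that the claim text quoted here does not describe.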
20. A system for estimating the location of a person speaking, comprising:
- a microphone array having two or more audio sensor pairs, wherein at least two of said two or more pairs of audio sensors are located such that each sensor of each of the two sensor pairs is separated from the other by a prescribed distance, which need not be the same distance for both pairs, and wherein said two pairs of sensors have baselines defined as the line connecting the two sensors of the audio sensor pair which intersect at an intersection point;
- a general purpose computing device, comprising a separate stereo-pair sound card for each of said pairs of audio sensors, and wherein for each sound card, the output of each sensor in the associated pair of sensors is input to the sound card and the outputs of the sensor pair are synchronized by the sound card;
- a computer program comprising program modules executable by the computing device, wherein the computing device is directed by the program modules of the computer program to,
  - input signals generated by each audio sensor of the microphone array;
  - simultaneously sample the inputted signals to produce a sequence of consecutive blocks of the signal data from each signal, wherein each block of signal data is captured over a prescribed period of time and is at least substantially contemporaneous with blocks of the other signals sampled at the same time;
  - for each block of signal data, determine whether the block contains human speech data;
  - filter out noise attributable to stationary sources in each of the blocks of the signal data determined to contain human speech data;
  - estimate the location of the person speaking using a time-delay-of-arrival (TDOA) based sound source localization (SSL) technique on the contemporaneous blocks of filtered signal data determined to contain human speech data for each pair of audio sensors; and
  - compute a consensus estimated location for the person speaking from the individual location estimates determined from the contemporaneous blocks of filtered signal data found to contain human speech data of each pair of audio sensors.
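The per-pair TDOA step in claim 20 can be illustrated as follows. The claim does not specify how the delay is estimated, so this sketch substitutes a plain time-domain cross-correlation peak search, and it converts the delay to a bearing with the usual far-field relation sin(theta) = c * tau / d, where d is the sensor spacing. The function names, the speed-of-sound constant, and the sample rate and spacing in the usage lines are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def estimate_delay_samples(sig_a, sig_b, max_lag):
    """Return the lag (in samples) by which sig_b trails sig_a.

    Searches lags in [-max_lag, max_lag] for the cross-correlation peak.
    """
    n = len(sig_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            sig_a[t] * sig_b[t + lag] for t in range(n) if 0 <= t + lag < n
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(lag_samples, sample_rate_hz, spacing_m):
    """Far-field bearing in degrees, measured from the pair's broadside.

    Positive values point toward sensor A (sound reaches A first).
    """
    tau = lag_samples / sample_rate_hz
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / spacing_m))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    pulse = [1.0, 2.0, 3.0, 2.0, 1.0]
    sig_a = [0.0] * 10 + pulse + [0.0] * 10
    sig_b = [0.0] * 13 + pulse + [0.0] * 7  # same pulse, 3 samples later
    lag = estimate_delay_samples(sig_a, sig_b, max_lag=5)
    print(lag, bearing_from_delay(lag, sample_rate_hz=16000, spacing_m=0.2))
```

A single pair resolves the bearing only relative to its own baseline and cannot tell which side of the baseline the source is on, which is one plausible reading of why the claim calls for at least two pairs with intersecting baselines.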
Specification