Speech recognition apparatus, speech recognition apparatus and program thereof
First Claim
1. A speech recognition apparatus comprising:
a microphone array, comprising at least three microphones, each microphone measuring a delay and a sum of peak power for each of a plurality of angles from a horizontal axis and from a vertical axis in response to a white noise source located at a plurality of locations about said microphone array;
a first directional sound source profile database for storing a plurality of first directional sound source profiles, each of said plurality of first directional sound source profiles determining a first direction sound source profile for each of said plurality of locations based on said measuring;
a target location for said microphone array, where a voice and noise are recorded;
a noise suppressor, receiving a voice signal and a noise signal recorded at said target location by said microphone array, said noise suppressor comprising:
an array of delay and sum units, each delay and sum unit introducing a different delay from a range of negative and positive delays into said recording of said voice and said noise signal and producing a sum of peak power for said voice signal associated with each of said plurality of angles from said horizontal axis and with each of said plurality of angles from said vertical axis;
wherein said voice signal associated with an angle of said horizontal axis and an angle of said vertical axis, corresponding to said target location, produces a maximal in-phase sum of peak power signal associated with said target location;
an array of Fourier transform units, each Fourier transform unit corresponding to one of said array of delay and sum units and converting said voice signal from said one of said array of delay and sum units to a voice power distribution for each of a plurality of frequency bands correspondingly associated with each of said plurality of angles from said horizontal axis and from said vertical axis;
an array of second profile fitting units, each said second profile fitting unit approximately decomposing said voice power distribution for each of said plurality of frequency bands, received from each Fourier transform unit, providing a number of second profiles corresponding to said plurality of frequency bands, and selecting one of said second profiles based on correlating each of said voice power distributions that are approximately decomposed to each of said plurality of first directional sound source profiles, stored in said first directional sound source profile database, to one direction corresponding to said voice recorded at said target location;
wherein said approximately decomposing comprises evaluating a directional target voice profile that equals a weighted sum of a first directional sound source profile for said white noise source in said one direction of said target location and a non-directional noise profile;
wherein a weight coefficient of said first directional sound source profile and a weight coefficient for said non-directional noise profile are obtained by minimizing an evaluative function; and
wherein a power of only a voice signal, without noise components, is determined for each of said plurality of frequency bands, based on said weight coefficient of said first directional sound source profile and said weight coefficient for said non-directional noise profile;
a spectrum reconstruction unit that receives said power of only a voice signal for each of said plurality of frequency bands for reconstructing said power of only a voice recorded at said target location; and
an output device that outputs said reconstruction of said power of only a voice recorded at said target location as a voice recording, without noise, from said target location.
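The delay and sum step recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a uniformly spaced linear array in which microphone m hears the source m times a fixed integer lag later, it scans only one steering dimension, and the names delay_and_sum_power, angle_power_profile and simulate_array are hypothetical. Each steering lag plays the role of one delay and sum unit, and the lag that matches the true arrival delay gives the maximal in-phase sum.

```python
import random

def delay_and_sum_power(channels, lag):
    """Mean power of the summed channels after advancing microphone m by m*lag samples."""
    n = len(channels[0])
    total, count = 0.0, 0
    for t in range(n):
        idx = [t + m * lag for m in range(len(channels))]
        if all(0 <= i < n for i in idx):          # skip samples that fall off the frame
            s = sum(ch[i] for ch, i in zip(channels, idx))
            total += s * s
            count += 1
    return total / count if count else 0.0

def angle_power_profile(channels, max_lag):
    """Power for every steering lag in a range of negative and positive delays."""
    return {lag: delay_and_sum_power(channels, lag)
            for lag in range(-max_lag, max_lag + 1)}

def simulate_array(source, true_lag, n_mics=3):
    """Hypothetical array: microphone m receives the source delayed by m*true_lag samples."""
    n = len(source)
    return [[source[t - m * true_lag] if 0 <= t - m * true_lag < n else 0.0
             for t in range(n)]
            for m in range(n_mics)]

rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # white noise test source
channels = simulate_array(source, true_lag=2)
profile = angle_power_profile(channels, max_lag=4)
best_lag = max(profile, key=profile.get)              # steering lag with the in-phase sum
```

With a white noise source, the aligned lag sums three identical samples while every other lag sums three uncorrelated ones, which is why the profile shows a clear peak at the true lag.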
Abstract
Provided is a method for canceling background noise from sound sources other than a target direction sound source in order to realize highly accurate speech recognition, and a system using the same. Owing to the directional characteristics of a microphone array, the power distribution over angle for each of the various possible sound source directions can be approximated by a sum of coefficient multiples of two base forms: an angle power distribution of a target sound source, measured beforehand for each base form angle by using a base form sound, and a power distribution of a non-directional background sound. Using this approximation, a noise suppression part extracts only the component of the target sound source direction. In addition, when the target sound source direction is unknown, a sound source localization part selects, from the base form angle power distributions of the various sound source directions, the distribution that minimizes the approximation residual, and thereby estimates the target sound source direction. Further, maximum likelihood estimation is executed by using voice data of the component of the sound source direction passed through these processes and a voice model obtained by predetermined modeling of the voice data, and speech recognition is carried out based on the obtained estimate.
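The approximation and localization described in the abstract can be sketched as a two-term least-squares fit. This is a hedged illustration only: the abstract does not state the form of the evaluative function, so a plain squared error is assumed here, and the function names, the three-entry profile database and the five-angle profiles are all made-up example values.

```python
def fit_two_profiles(observed, directional, flat):
    """Least-squares weights (a, b) so that observed ~= a*directional + b*flat.

    Assumption: squared error stands in for the unspecified evaluative function.
    """
    pp = sum(p * p for p in directional)
    qq = sum(q * q for q in flat)
    pq = sum(p * q for p, q in zip(directional, flat))
    xp = sum(x * p for x, p in zip(observed, directional))
    xq = sum(x * q for x, q in zip(observed, flat))
    det = pp * qq - pq * pq
    if abs(det) < 1e-12:
        raise ValueError("profiles are linearly dependent")
    a = (xp * qq - xq * pq) / det      # weight of the directional profile
    b = (xq * pp - xp * pq) / det      # weight of the non-directional profile
    residual = sum((x - a * p - b * q) ** 2
                   for x, p, q in zip(observed, directional, flat))
    return a, b, residual

def localize(observed, profile_db, flat):
    """Return (direction, a, b, residual) for the stored profile with the smallest residual."""
    best = None
    for direction, stored in profile_db.items():
        a, b, res = fit_two_profiles(observed, stored, flat)
        if best is None or res < best[3]:
            best = (direction, a, b, res)
    return best

# Hypothetical angle power profiles over five steering angles, for one frequency band.
profile_db = {
    "left":   [1.0, 0.6, 0.2, 0.1, 0.1],
    "center": [0.2, 0.6, 1.0, 0.6, 0.2],
    "right":  [0.1, 0.1, 0.2, 0.6, 1.0],
}
flat = [1.0] * 5                                   # non-directional background profile
observed = [2.0 * c + 0.5 * f for c, f in zip(profile_db["center"], flat)]
direction, a, b, residual = localize(observed, profile_db, flat)
```

In this toy case the observation was built as 2.0 times the "center" profile plus 0.5 times the flat profile, so the fit recovers those weights and selects "center". A practical fit would likely also constrain the weights to be non-negative, which this sketch does not do.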
3 Claims
2. A speech recognition method measuring for each microphone of a microphone array, comprising at least three microphones, a delay and a sum of peak power for each of a plurality of angles from a horizontal axis and from a vertical axis in response to a white noise source located at a plurality of locations about said microphone array;
determining a first direction sound source profile for each of said plurality of locations based on said measuring;
storing a plurality of first directional sound source profiles in a first directional sound source profile database;
subsequently, recording a voice, located at a target location, and noise from said microphone array;
inputting a voice signal, recorded from said target location, and a noise signal from said recording into a noise suppressor for noise suppressing, said noise suppressing comprising:
introducing a different delay, from a range of negative and positive delays, into said recording of said voice signal and said noise signal by an array of delay and sum units, each said delay producing a sum of peak power for said voice signal associated with each of said plurality of angles from said horizontal axis and with each of said plurality of angles from said vertical axis;
wherein said voice signal associated with an angle of said horizontal axis and an angle of said vertical axis, corresponding to said target location, produces a maximal in-phase sum of peak power signal associated with said target location;
performing Fourier transforms by an array of Fourier transform units on signals received from said array of delay and sum units, each Fourier transform unit corresponding to one of said array of delay and sum units and converting said voice signal from said one of said array of delay and sum units to a voice power distribution for each of a plurality of frequency bands correspondingly associated with each of said plurality of angles from said horizontal axis and from said vertical axis;
approximately decomposing said voice power distributions, received from each of said Fourier transform units for each one of said plurality of frequency bands, by an array of second profile fitting units, each said second profile fitting unit providing a number of second profiles corresponding to said plurality of frequency bands and selecting one of said second profiles based on correlating each of said voice power distributions that are approximately decomposed to each of said plurality of first directional sound source profiles, stored in said first directional sound source profile database, to one direction corresponding to said voice recorded at said target location;
wherein said approximately decomposing comprises evaluating a directional target voice profile that equals a weighted sum of a first directional sound source profile for said white noise source in said one direction of said target location and a non-directional noise profile;
wherein a weight coefficient of said first directional sound source profile and a weight coefficient for said non-directional noise profile are obtained by minimizing an evaluative function; and
wherein a power of only a voice signal, without noise, is determined for each of said plurality of frequency bands, based on said weight coefficient of said first directional sound source profile and said weight coefficient for said non-directional noise profile;
transferring said power of only a voice signal for each of said plurality of frequency bands to a spectrum reconstruction unit for reconstructing said power of only a voice recorded at said target location; and
outputting said reconstructing of said power of only a voice recorded at said target location as a voice recording, without noise, from said target location.
3. A program storage device readable by machine, tangibly embodying a program of instructions executable by said machine to perform a method of speech recognition, said method comprising:
measuring for each microphone of a microphone array, comprising at least three microphones, a delay and a sum of peak power for each of a plurality of angles from a horizontal axis and from a vertical axis in response to a white noise source located at a plurality of locations about said microphone array;
determining a first direction sound source profile for each of said plurality of locations based on said measuring;
storing a plurality of first directional sound source profiles in a first directional sound source profile database;
subsequently, recording a voice, located at a target location, and noise from said microphone array;
inputting a voice signal, recorded from said target location, and a noise signal from said recording into a noise suppressor for noise suppressing, said noise suppressing comprising:
introducing a different delay, from a range of negative and positive delays, into said recording of said voice signal and said noise signal by an array of delay and sum units, each said delay producing a sum of peak power for said voice signal associated with each of said plurality of angles from said horizontal axis and with each of said plurality of angles from said vertical axis;
wherein said voice signal associated with an angle of said horizontal axis and an angle of said vertical axis, corresponding to said target location, produces a maximal in-phase sum of peak power signal associated with said target location;
performing Fourier transforms by an array of Fourier transform units on signals received from said array of delay and sum units, each Fourier transform unit corresponding to one of said array of delay and sum units and converting said voice signal from said one of said array of delay and sum units to a voice power distribution for each of a plurality of frequency bands correspondingly associated with each of said plurality of angles from said horizontal axis and from said vertical axis;
approximately decomposing said voice power distributions, received from each of said Fourier transform units for each one of said plurality of frequency bands, by an array of second profile fitting units, each said second profile fitting unit providing a number of second profiles corresponding to said plurality of frequency bands and selecting one of said second profiles based on correlating each of said voice power distributions that are approximately decomposed to each of said plurality of first directional sound source profiles, stored in said first directional sound source profile database, to one direction corresponding to said voice recorded at said target location;
wherein said approximately decomposing comprises evaluating a directional target voice profile that equals a weighted sum of a first directional sound source profile for said white noise source in said one direction of said target location and a non-directional noise profile;
wherein a weight coefficient of said first directional sound source profile and a weight coefficient for said non-directional noise profile are obtained by minimizing an evaluative function; and
wherein a power of only a voice signal, without noise, is determined for each of said plurality of frequency bands, based on said weight coefficient of said first directional sound source profile and said weight coefficient for said non-directional noise profile;
transferring said power of only a voice signal for each of said plurality of frequency bands to a spectrum reconstruction unit for reconstructing said power of only a voice recorded at said target location; and
outputting said reconstructing of said power of only a voice recorded at said target location as a voice recording, without noise, from said target location.
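The Fourier transform step in the claims converts each delay and sum output into a power value per frequency band. A minimal sketch of that conversion follows, assuming a naive discrete Fourier transform over one frame and equal-width bands; the frame length, band count and function names are illustrative assumptions rather than values taken from the patent.

```python
import cmath
import math

def power_spectrum(frame):
    """Power at each non-negative frequency bin of a real frame (naive DFT, scaled by 1/N)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def band_powers(frame, n_bands):
    """Sum the bin powers into n_bands contiguous, roughly equal-width bands."""
    spec = power_spectrum(frame)
    size = math.ceil(len(spec) / n_bands)
    return [sum(spec[i:i + size]) for i in range(0, len(spec), size)]

# Example frame: a pure tone centred on bin 4 of a 64-sample frame.
frame = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
spec = power_spectrum(frame)
bands = band_powers(frame, n_bands=4)
```

Running this once per delay and sum output would give, for each band, one power value per steering angle, which is the kind of angle power distribution the profile fitting units are described as decomposing.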
Specification