Speech recognition system and method for variable noise environment
Abstract
An automotive vehicle speech recognition system sets a recording-mode reference threshold level to be lower than a recognition-mode reference threshold level. Therefore, a spoken instruction supplied to the system in a low voice while the vehicle is parked in a quiet place during a record mode can be correlated with an instruction uttered in a loud voice during a recognition mode while the vehicle is running in a noisy environment. During the record and recognition modes, reference threshold signals derived by smoothing a spoken instruction are multiplied by two different factors, such that the reference threshold in the recognition mode has a greater multiplication factor than the reference threshold in the record mode. While the driver is uttering a command, a first smoothed version of the utterance power spectrum is compared with a fixed reference threshold level, set to the value of a second smoothed version of the utterance power spectrum at the time the utterance began. While no command is being uttered, the reference threshold varies in accordance with the second smoothed version. The first smoothed version includes higher frequency components than the second smoothed version.
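The two smoothed versions and the mode-dependent multiplication described in the abstract can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the one-pole smoother, the time constants `FAST_TAU` and `SLOW_TAU`, the sample period, and the two gains in `GAIN` are hypothetical and are not taken from the patent.

```python
import math

def smooth(samples, time_constant, dt=0.001):
    """One-pole low-pass smoothing of the rectified samples (assumed smoother)."""
    alpha = 1.0 - math.exp(-dt / time_constant)
    level, out = 0.0, []
    for s in samples:
        level += alpha * (abs(s) - level)   # rectify, then smooth
        out.append(level)
    return out

# Hypothetical values chosen only for this sketch.
FAST_TAU = 0.010   # first smoother: short time constant, in seconds
SLOW_TAU = 0.500   # second smoother: longer time constant, in seconds
GAIN = {"record": 1.5, "recognition": 3.0}   # recognition factor > record factor

def threshold(slow_level, mode):
    """Threshold = mode-dependent factor times the slowly smoothed level."""
    return GAIN[mode] * slow_level

# Two seconds of steady background followed by 30 ms of a soft utterance.
signal = [0.1] * 2000 + [0.25] * 30
fast = smooth(signal, FAST_TAU)
slow = smooth(signal, SLOW_TAU)

soft_passes_record = fast[-1] > threshold(slow[-1], "record")
soft_passes_recognition = fast[-1] > threshold(slow[-1], "recognition")
```

With these assumed numbers, the soft utterance clears the lower record-mode threshold but not the higher recognition-mode threshold, which is the asymmetry the abstract describes.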
42 Citations
20 Claims
-
1. A speech recognition system for activating an automotive vehicle actuator in response to a command spoken instruction signal coupled through and transduced by a microphone while the system is responsive to a recognition-mode command signal, the system responding to the instruction to determine the extent to which the instruction resembles a reference spoken instruction signal previously coupled to the system through the microphone while the system was responsive to a recording-mode command signal, which comprises:
-
(a) first means for smoothing the spoken instruction signal coupled through the microphone;
(b) second means for smoothing the spoken instruction signal coupled through the microphone, said second smoothing means having a time constant longer than that of said first smoothing means;
(c) means for switching a smoothed spoken instruction signal derived by said second smoothing means to first and second terminals while the system is respectively activated to the recording and recognition modes in response to the recognition-mode and recording-mode command signals;
(d) first means connected to said first terminal for multiplying the level of the spoken instruction signal smoothed by said second smoothing means while the system is in the recording mode and for deriving a recording-mode threshold level signal corresponding thereto;
(e) second means connected to said second terminal for multiplying the level of the spoken instruction signal smoothed by said second smoothing means while the system is in the recognition mode for deriving a recognition-mode threshold level signal corresponding thereto, the multiplication rate of said second multiplying means being higher than that of said first multiplying means;
(f) means for comparing the smoothed spoken instruction signal level derived by said first smoothing means with (i) the recording-mode threshold level signal derived from said first voltage level multiplying means while the system is in the recording mode and (ii) the recognition-mode threshold level signal derived from said second voltage level multiplying means while the system is in the recognition mode and for deriving (i) a spoken instruction start command signal when the smoothed spoken instruction signal level derived by said first smoothing means exceeds one of the recording-mode and recognition-mode threshold levels for more than a reference start time and (ii) a spoken instruction end command signal when the smoothed spoken instruction signal level derived by said first smoothing means drops below one of the recording-mode and recognition-mode threshold levels for more than a reference end time; and
(g) a speech recognizer for starting recognition of the spoken instruction signal coupled through the microphone in response to the start command signal and stopping recognition of the same signal in response to the end command signal.
- View Dependent Claims (3)
-
-
2. A speech recognition system for activating an automotive vehicle actuator in response to a command spoken instruction signal coupled through and transduced by a microphone while the system is responsive to a recognition-mode command signal, the system responding to the instruction to determine the extent to which the instruction resembles a reference spoken instruction signal previously coupled to the system through the microphone while the system was responsive to a recording-mode command signal, which comprises:
-
(a) a spectrum normalizing amplifier connected to said microphone for amplifying and spectrum-normalizing the spoken instruction signal transduced by the microphone;
(b) a rectifier connected to said spectrum normalizing amplifier for rectifying the amplified spoken instruction signal;
(c) a first smoother connected to said rectifier for smoothing the rectified spoken instruction signal and deriving a first smoothed signal;
(d) a second smoother connected to said rectifier for smoothing the rectified spoken instruction signal with a time constant longer than that of said first smoother and deriving a second smoothed signal;
(e) an analog switch having a movable contact connected to said second smoother, the movable contact contacting a first fixed contact in response to the recording-mode command signal and to a second fixed contact in response to the recognition-mode command signal;
(f) a record multiplier connected to the first fixed contact of said analog switch for multiplying the second smoothed spoken instruction signal while the recording mode instruction command signal is derived and for deriving the recording-mode threshold level signal;
(g) a recognition multiplier connected to the second fixed contact of said analog switch for multiplying the second smoothed spoken instruction signal while the recognition mode instruction command signal is derived and for deriving the recognition-mode threshold level signal, the multiplication rate of said recognition multiplier being greater than that of said record multiplier;
(h) a holding circuit connected to said record and recognition multipliers for (i) passing the multiplied spoken instruction signal as a reference start threshold level when no holding signal is applied thereto and (ii) holding the multiplied spoken instruction signal as a constant end threshold level when the holding signal is applied thereto and for deriving the held signal thereafter until no holding signal is applied thereto;
(i) a level comparator having first and second input terminals respectively connected to said first smoother and to said holding circuit for comparing (i) the first smoothed spoken instruction signal with the reference start threshold level when no holding signal is applied to said holding circuit, and (ii) the first smoothed spoken instruction signal with the reference end threshold level when the holding signal is applied to said holding circuit, said comparator deriving a signal respectively having H and L levels when the amplitude of the signal at the first terminal is greater than the amplitude of the signal at the second terminal and vice versa, in each of the recording mode and recognition mode;
(j) a duration comparator connected to said level comparator and said holding circuit for comparing the pulse width of the H-voltage level signal with a reference start time and for deriving a spoken instruction start command signal when the H-voltage level pulse width exceeds the reference start time and for comparing the pulse width of the L-voltage level signal with a reference end time and for deriving a spoken instruction end command signal when the L-voltage level pulse width exceeds the reference end time, the H-voltage level signal from said duration comparator being applied to said holding circuit as the holding signal; and
(k) a speech recognizer connected to said duration comparator for starting recognition of the spoken instruction signal coupled through the microphone in response to the start command signal and stopping recognition of the same signal in response to the end command signal.
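The interplay of the holding circuit, level comparator, and duration comparator in elements (h) to (j) can be sketched as a small state machine. This is only an illustrative software analogue of the claimed circuit: the function name `detect_endpoints`, the use of sample counts as the time unit, and the test data are assumptions, not part of the patent.

```python
def detect_endpoints(fast, threshold, start_time, end_time):
    """Return (start_index, end_index); an entry is None if not detected.

    fast       : first smoothed signal, one value per sample
    threshold  : multiplied second smoothed signal, one value per sample
    start_time : reference start time, in samples (assumed unit)
    end_time   : reference end time, in samples (assumed unit)
    """
    holding = False   # holding signal fed back from the duration comparator
    held = 0.0        # level at the output of the holding circuit
    run = 0           # width of the current H (or L) pulse, in samples
    start = end = None
    for i, (x, th) in enumerate(zip(fast, threshold)):
        if not holding:
            held = th                 # pass-through: threshold follows its input
        high = x > held               # level comparator output: H if above
        if not holding:
            run = run + 1 if high else 0
            if run > start_time:      # H pulse longer than reference start time
                holding, start, run = True, i, 0
        else:
            run = run + 1 if not high else 0
            if run > end_time:        # L pulse longer than reference end time
                end = i
                break
    return start, end

# The threshold input jumps to 20 after the start is confirmed, but the held
# value stays at 1, so the utterance (level 10) is still judged to be present.
fast_demo = [0.0] * 5 + [10.0] * 20 + [0.0] * 20
threshold_demo = [1.0] * 9 + [20.0] * 36
endpoints = detect_endpoints(fast_demo, threshold_demo, start_time=3, end_time=4)
```

In this sketch the start index is the sample at which the start is confirmed, not the first sample of the utterance. Freezing the threshold at that point keeps the slowly smoothed speech level from raising the end threshold during the utterance.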
-
-
4. A speech recognition system for activating an automotive vehicle actuator in response to a command spoken instruction signal coupled through and transduced by a microphone while the system is responsive to a recognition-mode command signal, the system responding to the instruction to determine the extent to which the instruction resembles a reference spoken instruction signal previously coupled to the system through the microphone while the system was responsive to a recording-mode command signal, which comprises:
-
(a) a record switch for deriving a recording-mode command signal when closed;
(b) a recognition switch for deriving a recognition-mode command signal when closed; and
(c) a microcomputer connected to said record switch and said recognition switch for:
(1) deriving a spoken instruction and sampling spoken instruction signal level data Sn for each of plural sampling time intervals,
(2) calculating the power Pn of a spoken instruction signal as a first function of the sampled signal level data Sn,
(3) determining whether the system is in recording mode or in recognition mode in response to the value of at least one of the recording-mode command signal and the recognition-mode command signal,
(4) calculating a recording-mode reference threshold level En1 as a second function of the calculated sample power level data Pn when the microcomputer determines that the system is in the recording-mode,
(5) calculating a recognition-mode reference threshold level En2 as a third function of the calculated sample power level data Pn when the microcomputer determines that the system is in the recognition mode, levels En1 and En2 being calculated in such a way that En2 is higher than En1,
(6) comparing the sample power level data Pn with the calculated start reference threshold voltage level En,
(7) counting the number M1 of sample power level data Pn exceeding the reference threshold voltage level En while a sampled sample power level data Pn exceeds the calculated start reference threshold voltage level En,
(8) comparing the counted number M1 with a reference start number W1,
(9) deriving a spoken instruction start command signal M3 and storing the sampled signal power level data Pn in a memory sequentially, while the counted number M1 exceeds the reference start number W1,
(10) sampling the spoken instruction signal power level data Pn again while the counted number M1 does not exceed the reference start number W1,
(11) counting the number M2 of the sample power level data Pn dropping below the threshold level En while the sample power level data Pn is less than the calculated start threshold level En,
(12) comparing the counted number M2 with a reference end number W2,
(13) deriving a spoken instruction end command signal and storing the sample power level data Pn in a memory sequentially while the counted number M1 exceeds the reference end number W2,
(14) sampling the spoken instruction signal power level data Pn while the counted number M2 does not exceed the reference end number W2, and
(15) starting recognition of the spoken instruction signal coupled through the microphone in response to the start command signal and stopping recognition of the same signal in response to the end command signal.
-
-
5. A method of detecting the start and end of a spoken instruction coupled through a microphone to a speech recognition system capable of operating in a recognition mode and in a recording mode, the system previously being responsive to a spoken instruction coupled to it while comprising the steps of:
-
(a) sampling spoken instruction signal level data Sn during each of plural sampling time intervals;
(b) calculating the power Pn of a spoken instruction signal as a first function of the sampled signal voltage level data Sn;
(c) determining whether the system is in recording mode or in recognition mode;
(d) calculating a recording-mode reference threshold level En1 as a second function of the calculated sample power level data Pn while the system is determined to be in the recording-mode;
(e) calculating a recognition-mode reference threshold level En2 as a third function of the calculated sample power level data Pn while the system is determined to be in the recognition-mode, the second and third functions having constants causing the calculated recording-mode reference threshold level to be lower than the calculated recognition-mode reference threshold level;
(f) comparing the sampled power level data Pn with the calculated start reference threshold voltage levels En;
(g) counting the number M1 of power level data Pn exceeding the threshold level En while the sample power level data Pn exceeds the calculated start reference threshold level En;
(h) comparing the counted number M1 with a reference start number W1;
(i) deriving a spoken instruction start command signal M3 and storing the sampled power level data Pn in a memory sequentially while the counted number M1 exceeds the reference start number W1;
(j) returning to step (a) above in response to the counted number M1 not exceeding the reference start number W1 in step (i);
(k) counting the number M2 of power level data Pn dropping below the threshold level En while the sample power level data Pn drops below the calculated end reference threshold voltage level En;
(l) comparing the counted number M2 with a reference end number W2;
(m) deriving a spoken instruction end command signal and storing the sample power level data Pn in a memory sequentially while the counted number M2 exceeds the reference end number W2; and
(n) returning to step (k) in response to the counted number M2 not exceeding the reference end number W2 in step (m).
- View Dependent Claims (6, 7)
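Steps (a) to (n) can be followed in a short sampled-data sketch. The claim leaves the first, second, and third functions unspecified, so the choices here are assumptions: power is taken as `Sn * Sn`, the threshold `En` is a mode-dependent factor times a running average of the background power, and the factors in `GAIN`, the `decay` value, and the defaults for `w1` and `w2` are hypothetical.

```python
# Hypothetical constants: the recognition threshold En2 ends up above En1.
GAIN = {"record": 2.0, "recognition": 4.0}

def find_utterance(samples, mode, w1=3, w2=5, decay=0.9):
    """Return (start_index, end_index, stored_power) for one utterance."""
    avg = None        # running average of background power (assumed estimator)
    m1 = m2 = 0       # counters M1 and M2
    start = None
    stored = []       # power data Pn stored after the start command
    for n, s in enumerate(samples):
        p = s * s                         # step (b): assumed first function
        if avg is None:
            avg = p                       # seed the background estimate
            continue
        en = GAIN[mode] * avg             # steps (c) to (e): En1 or En2
        if start is None:
            if p > en:                    # steps (f), (g): count M1
                m1 += 1
            else:
                m1 = 0
                avg = decay * avg + (1.0 - decay) * p
            if m1 > w1:                   # steps (h), (i): start command
                start = n
        else:
            stored.append(p)
            m2 = m2 + 1 if p < en else 0  # step (k): count M2
            if m2 > w2:                   # steps (l), (m): end command
                return start, n, stored
    return start, None, stored

# A soft utterance (level 0.18) over steady background (level 0.1).
soft = [0.1] * 20 + [0.18] * 10 + [0.1] * 20
record_result = find_utterance(soft, "record")
recognition_result = find_utterance(soft, "recognition")
```

Under these assumptions the background estimate stops updating once the start is declared, so `En` stays fixed until the end is found. The same soft utterance is detected in record mode and ignored in recognition mode.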
-
-
8. A speech recognition system capable of operating in environments having different background noise levels such that during a record operating mode the noise level has a tendency to be considerably lower than during a recognition operating mode, the system comprising means for establishing the record and recognition operating modes, means for deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition and recording by the system in the presence of background noise in the environment where the utterance occurs, the first signal including frequency components higher than those of the second signal, means for changing the relative values of the first and second signals, signal amplitude comparison means, means coupled to the establishing means, the deriving means, the changing means and the comparison means for activating the comparison means so it derives:
- (a) a first signal level indicating that an utterance is occurring when (i) a first function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval while the system is in the record operating mode, and (ii) a second function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than the predetermined interval while the system is in the recognition operation mode, (b) a second signal level indicating that an utterance is no longer occurring when (i) the first function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval while the system is in the recording operating mode, and (ii) the second function of the magnitude associated with the first signal is less than the magnitude of the second signal for the set interval while the system is in the recognition operating mode, the first and second functions being respectively of the form
$$f_1 = \alpha_1 P + \beta_1$$
$$f_2 = \alpha_2 P + \beta_2$$
where
$f_1$ = the first function,
$f_2$ = the second function,
$\alpha_1$ = a first predetermined, non-zero constant,
$\alpha_2$ = a second predetermined, non-zero constant,
$\alpha_1 > \alpha_2$,
$P$ = a function of the power spectrum of the utterance and the background noise,
$\beta_1$ = a third predetermined constant that may be zero,
$\beta_2$ = a fourth predetermined constant that may be zero,
$\beta_1 > \beta_2$ unless $\beta_1 = \beta_2 = 0$,
means for recording signals representing tonal characteristics of the utterance while the first signal level is derived and while the system is in the record operating mode, so that several different tonal characteristics representing signals are recorded for several different utterances, and means for comparing signals representing tonal characteristics of the utterance with the recorded tonal characteristics for the several different utterances while the first signal level is derived and while in the recognition operating mode, the tonal recording and tonal comparing means being disabled while the second signal level is derived.
- View Dependent Claims (9, 10, 11, 12, 13)
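The constraints on the two linear functions in this claim can be checked mechanically. The sketch below is illustrative only: the class `LinearFunction`, the helpers `validate` and `utterance_present`, and the numeric constants are hypothetical, and it assumes one reading of the claim in which the selected function is applied to the magnitude of the first signal before comparison with the second signal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinearFunction:
    """f(P) = alpha * P + beta, the form recited in the claim."""
    alpha: float
    beta: float = 0.0

    def __call__(self, p: float) -> float:
        return self.alpha * p + self.beta

def validate(f1: LinearFunction, f2: LinearFunction) -> None:
    """Raise ValueError unless the claimed constraints on the constants hold."""
    if f1.alpha == 0 or f2.alpha == 0:
        raise ValueError("alpha_1 and alpha_2 must be non-zero")
    if not f1.alpha > f2.alpha:
        raise ValueError("alpha_1 must exceed alpha_2")
    both_zero = f1.beta == 0 and f2.beta == 0
    if not (both_zero or f1.beta > f2.beta):
        raise ValueError("beta_1 must exceed beta_2 unless both are zero")

def utterance_present(mode, first_magnitude, second_magnitude, f1, f2):
    """Apply f1 in record mode and f2 in recognition mode (assumed reading)."""
    f = f1 if mode == "record" else f2
    return f(first_magnitude) > second_magnitude

# Hypothetical constants that satisfy the constraints.
f1 = LinearFunction(alpha=2.0)
f2 = LinearFunction(alpha=1.0)
validate(f1, f2)
in_record = utterance_present("record", 0.3, 0.5, f1, f2)
in_recognition = utterance_present("recognition", 0.3, 0.5, f1, f2)
```

Because the first function has the larger slope, the same pair of signal magnitudes satisfies the comparison more readily in the record mode than in the recognition mode.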
-
14. A speech recognition system capable of operating in environments having different background noise levels, the system comprising means for establishing record and recognition operating modes, means for deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition and recording by the system in the presence of background noise in the environment where the utterance occurs, the first signal including frequency components higher than those of the second signal, signal amplitude comparison means, means coupled to the establishing means, the deriving means, and the comparison means for activating the comparison means so it derives:
- (a) a first signal level indicating that an utterance is occurring when a function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval while the system is in the record and recognition operating modes, and (b) a second signal level indicating that an utterance is no longer occurring when the function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval while the system is in the recording and recognition modes, the function f, being of the form
$$f = \alpha P + \beta$$
where
$\alpha$ = a first predetermined non-zero constant,
$\beta$ = a second predetermined constant that may be zero,
$P$ = a function of the power spectrum of the utterance and the background noise such that P is a replica of the utterance and background noise power spectrum while the second signal level is derived, P being a constant value commensurate with the power spectrum of the utterance and background noise at the time the first signal level is initially derived, the constant value of P being derived throughout the interval of the first signal level;
means for recording signals representing a voice power spectrum of the utterance while the first signal level is derived and while the system is in the record operating mode so that several different tonal characteristics representing signals are recorded for several different utterances, and means for comparing signals representing voice power spectrum of the utterance with the recorded voice power spectrum for the several different utterances while the first signal level is derived and while the system is in the recognition operating mode, the voice power spectrum recording and voice power spectrum comparing means being disabled while the second signal level is derived.
- View Dependent Claims (15)
-
16. A speech recognition method that is performed by a system in environments having different background noise levels such that during a record operating mode the noise level has a tendency to be considerably lower than during a recognition operating mode, the method comprising:
- establishing the record and recognition operating modes in the system,
deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition and recording by the system in the presence of background noise in the environment where the utterance is occurring, deriving a first signal level indicating that an utterance is occurring when (i) a first function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval while the system is in the record operating mode, and (ii) a second function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than the predetermined interval while the system is in the recognition operation mode, deriving a second signal level indicating that an utterance is no longer occurring when (i) the first function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval while the system is in the recording operating mode, and (ii) the second function of the magnitude associated with the first signal is less than the magnitude of the second signal for the set interval while the system is in the recognition operating mode, the first and second functions being respectively of the form
$$f_1 = \alpha_1 P + \beta_1$$
$$f_2 = \alpha_2 P + \beta_2$$
where
$f_1$ = the first function,
$f_2$ = the second function,
$\alpha_1$ = a first predetermined, non-zero constant,
$\alpha_2$ = a second predetermined, non-zero constant,
$\alpha_1 > \alpha_2$,
$P$ = a function of the power spectrum of the utterance and the background noise,
$\beta_1$ = a third predetermined constant that may be zero,
$\beta_2$ = a fourth predetermined constant that may be zero,
$\beta_1 > \beta_2$ unless $\beta_1 = \beta_2 = 0$,
recording signals representing tonal characteristics of the utterance while the first signal level is derived and while the system is in the record operating mode, so that several different tonal characteristics representing signals are recorded for several different utterances, comparing signals representing tonal characteristics of the utterance with the recorded tonal characteristics for the several different utterances while the first signal level is derived and while the system is in the recognition operating mode, the tonal recording and comparing steps being disabled while the second signal level is derived.
- View Dependent Claims (17)
-
18. A speech recognition method that is performed by a system in environments having different background noise levels, the method comprising:
-
establishing record and recognition operating modes in the system, deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition and recording by the system in the presence of background noise in the environment where the utterance is occurring, deriving a first signal level indicating that an utterance is occurring when a function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval while the system is in the record and recognition operating modes, deriving a second signal level indicating that an utterance is no longer occurring when the function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval while the system is in the recording and recognition operating modes, the function f being of the form
$$f = \alpha P + \beta$$
where
$\alpha$ = a first predetermined non-zero constant,
$\beta$ = a second predetermined constant that may be zero,
$P$ = a function of the power spectrum of the utterance and the background noise such that P is a replica of the utterance and background noise power spectrum while the second signal level is derived, and P is a constant value commensurate with the power spectrum of the utterance and background noise at the time the first signal level is initially derived, the constant value of P being derived throughout the interval of the first signal level,
recording signals representing a voice power spectrum of the utterance while the first signal level is derived and while the system is in the record operating mode so that several different voice power spectrum representing signals are recorded for several different utterances, and comparing signals representing a voice power spectrum of the utterance with the recorded voice power spectrum for the several different utterances while the first signal level is derived and while the system is in the recognition operating mode, the voice power spectrum recording and voice power spectrum comparing steps being disabled while the second signal level is derived.
-
-
19. A speech recognition system capable of operating in environments having different background noise levels, the system previously having recorded therein signals representing a voice power spectrum of speech utterances to be recognized thereby, the system comprising means for deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition by the system in the presence of background noise in the environment where the utterance occurs, the first signal including frequency components higher than those of the second signal, signal amplitude comparison means, means coupled to the deriving means and the comparison means for activating the comparison means so it:
-
derives (a) a first signal level indicating that an utterance is occurring when a function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval and (b) a second signal level indicating that an utterance is no longer occurring when the function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval, the function f, being of the form
$$f = \alpha P + \beta$$
where
$\alpha$ = a first predetermined non-zero constant,
$\beta$ = a second predetermined constant that may be zero,
$P$ = a function of the power spectrum of the utterance and the background noise such that P is a replica of the utterance and background noise power spectrum while the second signal level is derived, P being a constant value commensurate with the power spectrum of the utterance and background noise at the time the first signal is initially derived, the constant value of P being derived throughout the interval of the first signal level;
means for comparing signals representing the voice power spectrum of the utterance with the recorded voice power spectrum for the several different utterances while the first signal level is derived; and means for disabling the voice power spectrum comparing means while the second signal level is derived.
-
-
20. A speech recognition method that is performed by a system in environments having different background noise levels, the system previously having recorded therein signals representing a voice power spectrum of speech utterances to be recognized thereby, the method comprising:
-
deriving first and second signals respectively representing power in two different low pass spectra as derived in response to speech uttered for recognition by the system in the presence of background noise in the environment where the utterance is occurring; deriving a first signal level indicating that an utterance is occurring when a function of the magnitude associated with the first signal exceeds the magnitude of the second signal for more than a predetermined interval, deriving a second signal level indicating that an utterance is no longer occurring when the function of the magnitude associated with the first signal is less than the magnitude of the second signal for a set interval, the function f being of the form
$$f = \alpha P + \beta$$
where
$\alpha$ = a first predetermined non-zero constant,
$\beta$ = a second predetermined constant that may be zero,
$P$ = a function of the power spectrum of the utterance and the background noise such that P is a replica of the utterance and background noise power spectrum while the second signal level is derived, and P is a constant value commensurate with the power spectrum of the utterance and background noise at the time the first signal level is initially derived, the constant value of P being derived throughout the interval of the first signal level, and
comparing signals representing the voice power spectrum of the utterance with signals representing the recorded voice power spectrum for the several different utterances while the first signal level is derived, the voice power spectrum comparing step being disabled while the second signal level is derived.
-
Specification