Speech detection for noisy conditions
Abstract
The input signal is transformed into the frequency domain and then subdivided into bands corresponding to different frequency ranges. Adaptive thresholds are applied to the data from each frequency band separately. Thus the short-term band-limited energies are tested for the presence or absence of a speech signal. The adaptive threshold values are independently updated for each of the signal paths, using a histogram data structure to accumulate long-term data representing the mean and variance of energy within the respective frequency band. Endpoint detection is performed by a state machine that transitions from the speech absent state to the speech present state, and vice versa, depending on the results of the threshold comparisons. A partial speech detection system handles cases in which the input signal is truncated.
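As a rough illustration of the first two steps in the abstract (frequency-domain transform, then per-band threshold tests), the sketch below computes band-limited energies for one short frame and compares each band with its own threshold. The function names, the naive DFT, the bin ranges and the threshold values are all assumptions made for illustration; none of them is taken from the patent.

```python
import cmath
import math


def band_energies(frame, band_edges):
    """Return the energy of `frame` in each DFT-bin range in `band_edges`.

    `band_edges` is a list of (low_bin, high_bin) pairs, high bin exclusive.
    The bin ranges are hypothetical; the source does not specify them.
    """
    n = len(frame)
    # Naive DFT over the non-negative frequencies; adequate for a short frame.
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    power = [abs(c) ** 2 / n for c in spectrum]
    return [sum(power[lo:hi]) for lo, hi in band_edges]


def detect(frame, band_edges, thresholds):
    """Flag each band whose short-term energy exceeds that band's threshold."""
    energies = band_energies(frame, band_edges)
    return [e > t for e, t in zip(energies, thresholds)]
```

For example, a 32-sample sine wave centred on DFT bin 3 puts nearly all of its energy into a low band covering bins 0 to 7, so only that band is flagged.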
16 Claims
1. A speech detection system for examining an input signal to determine whether a speech signal is present or absent, comprising:
a frequency band splitter for splitting said input signal into a plurality of frequency bands, each band representing a band-limited signal energy corresponding to a different range of frequencies;
an energy comparator system for comparing the band-limited signal energy of said plurality of frequency bands with a plurality of thresholds such that each frequency band is compared with at least one threshold associated with that band;
a speech signal state machine coupled to said energy comparator system that switches:
(a) from a speech-absent state to a speech-present state when the band-limited signal energy of at least one of said bands is above at least one of its associated thresholds, and (b) from a speech-present state to a speech-absent state when the band-limited signal energy of at least one of said bands is below at least one of its associated thresholds;
a histogram data structure residing in computer memory accessible to said speech detection system wherein said histogram data structure initially has a size based at least in part on the energy level of the non-speech portion of the input signal, and wherein said histogram data structure is organized by a predetermined number of histogram steps having a step size based at least in part on a mean of accumulated historical data;
a histogram updating module operable to periodically update said histogram data structure based on a portion of the input signal having an energy level falling within the size of the histogram data structure, said histogram updating module further operable to adjust the size of said histogram data structure based on actual operating conditions wherein said histogram updating module periodically adjusts the step size to reflect a change in said mean, thereby affecting adjustment of the size of the histogram data structure based on actual operating conditions; and
an adaptive threshold updating system that employs said histogram data structure to accumulate historical data indicative of a pre-speech silence portion of said input signal within at least one of said frequency bands such that an energy level of greatest magnitude among all energy levels of the historical data defines a noise floor, the updating system using the noise floor to adjust at least one of said plurality of thresholds used by said energy comparator, said historical data being initially limited to a non-speech portion of the input signal.
a first threshold as a predetermined offset above the noise floor;
a second threshold as a predetermined percent of said first threshold, said second threshold being less than said first threshold; and
a third threshold as a predetermined multiple of said first threshold, said third threshold being greater than said first threshold; and
wherein said first threshold controls switching from said speech-absent state to said speech-present state; and
wherein said second and third thresholds control switching from said speech-present state to said speech-absent state.
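The relationship among the three thresholds recited above can be sketched as follows. The function name `derive_thresholds`, the `BandThresholds` container and the default values for the offset, percent and multiple are hypothetical; the claim language says only that these quantities are predetermined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandThresholds:
    first: float   # offset above the noise floor
    second: float  # fraction of the first threshold (lower)
    third: float   # multiple of the first threshold (higher)


def derive_thresholds(noise_floor, offset=10.0, percent=0.5, multiple=2.0):
    """Derive three thresholds for one band from its noise floor.

    The default offset, percent and multiple are illustrative assumptions.
    """
    if not 0.0 < percent < 1.0:
        raise ValueError("percent must lie strictly between 0 and 1")
    if multiple <= 1.0:
        raise ValueError("multiple must be greater than 1")
    first = noise_floor + offset
    if first <= 0.0:
        raise ValueError("first threshold must be positive to keep the ordering")
    return BandThresholds(first=first, second=percent * first, third=multiple * first)
```

With a noise floor of 40.0 and the assumed defaults, this yields thresholds of 50.0, 25.0 and 100.0, which preserves the required ordering of second below first below third.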
6. The system of claim 5 wherein said state machine switches from said speech-present state to said speech-absent state if the band-limited signal energy of at least one of said bands is below said second threshold and if the band-limited signal energy of at least one of said bands is below said third threshold.
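One possible reading of the switching rules, with the first threshold governing onset and the second and third thresholds governing the return to silence, is sketched below. The names `State` and `next_state` and the tuple layout of the thresholds are assumptions, and the "at least one of said bands" conditions are modelled with `any` as a literal reading of the claim wording rather than as a confirmed implementation.

```python
from enum import Enum


class State(Enum):
    SPEECH_ABSENT = 0
    SPEECH_PRESENT = 1


def next_state(state, band_energies, band_thresholds):
    """Advance the detector by one frame.

    `band_thresholds` holds one (first, second, third) tuple per band;
    this layout is hypothetical.
    """
    pairs = list(zip(band_energies, band_thresholds))
    if state is State.SPEECH_ABSENT:
        # Onset: some band rises above its first threshold.
        if any(e > first for e, (first, _, _) in pairs):
            return State.SPEECH_PRESENT
        return State.SPEECH_ABSENT
    # Offset: some band is below its second threshold and some band is
    # below its third threshold.
    below_second = any(e < second for e, (_, second, _) in pairs)
    below_third = any(e < third for e, (_, _, third) in pairs)
    if below_second and below_third:
        return State.SPEECH_ABSENT
    return State.SPEECH_PRESENT
```

Because the second threshold sits below the first, energy has to fall further to end speech than it had to rise to start it, which gives the detector hysteresis.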
7. The system of claim 1 further comprising a delayed decision buffer that stores data representing a predetermined time increment of said input signal and that inhibits state machine switching from said speech-absent state to said speech-present state if the band-limited signal energy of at least one of said plurality of frequency bands does not exceed at least one threshold throughout said predetermined time increment.
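A delayed decision buffer of this kind might be sketched as a fixed-length window of per-frame comparison results, where onset is confirmed only once every frame in the window exceeded the threshold. The class name, the `push` method and the reduction of the multi-band condition to a single boolean per frame are simplifying assumptions.

```python
from collections import deque


class DelayedDecisionBuffer:
    """Hold the last `frames_required` threshold comparisons for one detector.

    `frames_required` stands in for the predetermined time increment,
    expressed as a frame count (an assumption).
    """

    def __init__(self, frames_required):
        if frames_required < 1:
            raise ValueError("frames_required must be at least 1")
        self._recent = deque(maxlen=frames_required)

    def push(self, above_threshold):
        """Record one frame; return True only if onset may now be confirmed."""
        self._recent.append(bool(above_threshold))
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)
```

A single frame that drops below the threshold keeps the result False until it has aged out of the window, so short noise bursts do not trigger a switch to the speech-present state.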
8. A method of determining whether a speech signal is present or absent in an input signal, comprising the steps of:
splitting said input signal into a plurality of frequency bands, each band representing a band-limited signal energy corresponding to a different range of frequencies;
comparing the band-limited signal energy of said plurality of frequency bands with a plurality of thresholds such that each frequency band is compared with at least one threshold associated with that band;
accumulating historical data indicative of a pre-speech portion of said input signal within at least one of said frequency bands, using said accumulated historical data to define a noise floor based on an energy level of greatest magnitude among all energy levels of said accumulated historical data, and using the noise floor to adjust at least one of said plurality of thresholds, said historical data being initially limited to a non-speech portion of the input signal;
periodically updating a histogram data structure based on a portion of the input signal having an energy level falling within the size of the histogram data structure, said histogram data structure initially having a size based at least in part on the energy level of a non-speech portion of said input signal, wherein said histogram data structure is organized by a predetermined number of histogram steps having a step size based at least in part on a mean of said accumulated historical data, said updating further adjusting the size of said histogram data structure based on actual operating conditions wherein said histogram updating module periodically adjusts the step size to reflect a change in said mean, thereby affecting adjustment of the size of the histogram data structure based on actual operating conditions; and
determining that:
(a) a speech-present state exists when the band-limited signal energy of at least one of said bands is above at least one of its associated thresholds, and (b) a speech-absent state exists when the band-limited signal energy of at least one of said bands is below at least one of its associated thresholds, wherein at least one threshold confirms a validity of said speech-present state determination.
a first threshold as a predetermined offset above the noise floor;
a second threshold as a predetermined percent of said first threshold, said second threshold being less than said first threshold; and
a third threshold as a predetermined multiple of said first threshold, said third threshold being greater than said first threshold; and
determining said speech-present state to exist based on said first threshold and determining said speech-absent state to exist based on said second and third thresholds.
13. The method of claim 12 wherein said speech-absent state is determined to exist if the band-limited signal energy of at least one of said bands is above said second threshold and if the band-limited signal energy of at least one of said bands is above said third threshold.
14. The method of claim 8 wherein, in said determining step, said speech-present state does not exist if the band-limited signal energy of at least one of said plurality of frequency bands does not exceed at least one threshold throughout a predetermined increment of time.
15. An adaptive threshold updating system for use with a speech detection system, said system comprising:
a histogram data structure residing in computer memory accessible to said speech detection system wherein said histogram data structure initially has a size based at least in part on the energy level of the non-speech portion of the input signal, and wherein said histogram data structure is organized by a predetermined number of histogram steps having a step size based at least in part on a mean of accumulated historical data;
a histogram updating module operable to periodically update said histogram data structure based on a portion of the input signal having an energy level falling within the size of the histogram data structure, said histogram updating module further operable to adjust the size of said histogram data structure based on actual operating conditions wherein said histogram updating module periodically adjusts the step size to reflect a change in said mean, thereby affecting adjustment of the size of the histogram data structure based on actual operating conditions;
accumulated historical data residing in said histogram data structure, said accumulated historical data indicative of a pre-speech silence portion of an input signal within at least one frequency band split from the input signal, the frequency band representing a band-limited signal energy corresponding to a different range of frequencies, said accumulated historical data initially limited to a non-speech portion of the input signal; and
a threshold updating module operable to define a noise floor based on an energy level of greatest magnitude among all energy levels of said accumulated historical data, and further operable to use the noise floor to adjust at least one threshold used by said speech detection system.
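The histogram elements recited above can be illustrated with a minimal sketch, under several stated assumptions: the histogram spans `span_factor` times the running mean, the step size is that span divided by a fixed number of steps, existing counts are re-binned by bin centre when the step changes, and the noise floor is read literally as the centre of the highest occupied bin. The class `AdaptiveHistogram` and all of its parameters are hypothetical and are not taken from the specification.

```python
class AdaptiveHistogram:
    """Fixed number of steps; step size tracks the mean of accepted energies."""

    def __init__(self, initial_energies, num_steps=20, span_factor=2.0):
        if not initial_energies or sum(initial_energies) <= 0:
            raise ValueError("need non-speech energies with a positive mean")
        self.num_steps = num_steps
        self.span_factor = span_factor  # assumed: size = span_factor * mean
        self._sum = float(sum(initial_energies))
        self._count = len(initial_energies)
        self.step = self._step_from_mean()
        self.counts = [0] * num_steps
        for energy in initial_energies:
            self._bin(energy)

    @property
    def size(self):
        return self.step * self.num_steps

    def _step_from_mean(self):
        return self.span_factor * (self._sum / self._count) / self.num_steps

    def _bin(self, energy):
        if not 0.0 <= energy < self.size:
            return False
        index = min(int(energy / self.step), self.num_steps - 1)
        self.counts[index] += 1
        return True

    def update(self, energy):
        """Accept `energy` only if it falls within the current histogram size."""
        if not self._bin(energy):
            return False
        self._sum += energy
        self._count += 1
        return True

    def adjust_step(self):
        """Recompute the step from the current mean and re-bin by bin centre."""
        old_step, old_counts = self.step, self.counts
        self.step = self._step_from_mean()
        self.counts = [0] * self.num_steps
        for i, count in enumerate(old_counts):
            centre = (i + 0.5) * old_step
            if count and centre < self.size:
                index = min(int(centre / self.step), self.num_steps - 1)
                self.counts[index] += count

    def noise_floor(self):
        """Centre of the highest occupied bin (a literal reading of the claim)."""
        for i in range(self.num_steps - 1, -1, -1):
            if self.counts[i]:
                return (i + 0.5) * self.step
        return 0.0
```

In this sketch an energy that lies outside the current size is rejected, but the same value may be accepted after `adjust_step` has widened the histogram in response to a higher mean.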
Specification