Morphological pure speech detection using valley percentage
Abstract
A human speech detection method detects pure-speech signals in an audio signal containing a mixture of pure-speech and non-speech or mixed-speech signals. The method accurately detects the pure-speech signals by computing a novel Valley Percentage feature from the audio signal and then classifying the audio signals into pure-speech and non-speech (or mixed-speech) classifications. The Valley Percentage is a measurement of the low energy parts of the audio signal (the valley) in comparison to the high energy parts of the audio signal (the mountain). To classify the audio signal, the method performs a threshold decision on the value of the Valley Percentage. Using a binary mask, a high Valley Percentage is classified as pure-speech and a low Valley Percentage is classified as non-speech (or mixed-speech). The method further employs morphological filters to improve the accuracy of human speech detection. Before detection, a morphological closing filter may be employed to eliminate unwanted noise from the audio signal. After detection, a combination of morphological closing and opening filters may be employed to remove aberrant pure-speech and non-speech classifications from the binary mask resulting from impulsive audio signals in order to more accurately detect the boundaries between the pure-speech and non-speech portions of the audio signal. A number of parameters may be employed by the method to further improve the accuracy of human speech detection. For implementation in supervised digital audio signal applications, these parameters may be optimized by training the application a priori. For implementation in an unsupervised environment, adaptive determination of these parameters is also possible.
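As an illustration only, the stages named in the abstract can be strung together as below. This is a minimal Python sketch, not the patented implementation: the function names, the clipping of windows at the signal edges, the use of a fraction of the window maximum as the energy threshold (a rule taken from the dependent claims), and every parameter name (w1, w2, fraction, vp_threshold, w3, w4) are assumptions made for the example.

```python
def slide(values, size, pick):
    """Apply pick (min or max) to the window of about `size` points centered on each index."""
    half = size // 2
    return [pick(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def closing(values, size):
    """Morphological closing: dilation (max) followed by erosion (min)."""
    return slide(slide(values, size, max), size, min)

def opening(values, size):
    """Morphological opening: erosion (min) followed by dilation (max)."""
    return slide(slide(values, size, min), size, max)

def detect_pure_speech(samples, w1, w2, fraction, vp_threshold, w3, w4):
    # Energy component: absolute value of each sample, cleaned by a closing filter.
    energy = closing([abs(s) for s in samples], w1)
    half = w2 // 2
    mask = []
    for i in range(len(energy)):
        seen = energy[max(0, i - half): i + half + 1]
        limit = fraction * max(seen)  # assumed threshold rule
        vp = sum(1 for e in seen if e < limit) / len(seen)  # Valley Percentage
        mask.append(1 if vp > vp_threshold else 0)  # 1 = pure-speech
    # Remove aberrant classifications, then mark where the mask changes value.
    mask = closing(opening(mask, w3), w4)
    boundaries = [i for i in range(1, len(mask)) if mask[i] != mask[i - 1]]
    return mask, boundaries

# Toy input: bursts separated by silence, followed by a steady signal.
samples = [9, 0, 0] * 4 + [5, -5] * 6
mask, boundaries = detect_pure_speech(samples, 1, 5, 0.5, 0.3, 3, 3)
```

Window sizes are given in samples and are assumed to be odd so that each point sits at the mid-point of its window.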
35 Claims
-
1. A method for detecting pure speech signals in an audio signal having pure speech and non-speech or mixed-speech signals, the method comprising:
-
computing from the audio signal a valley percentage feature, the valley percentage feature representing for a point in the audio signal a proportion of plural surrounding points that are low energy surrounding points, wherein a low energy surrounding point has an energy level that falls below a threshold energy level for the plural surrounding points;
classifying the audio signal into either a pure-speech or non-speech classification according to the valley percentage feature; and
determining the boundaries between a portion of the audio signal classified as pure-speech and a portion of the audio signal classified as non-speech.
Dependent claims: 2, 3, 4, 5
converting the audio signal into an energy component having a plurality of energy levels, wherein each energy level corresponds to an audio sample of the audio signal; and
applying a morphological closing filter to each energy level of the energy component to produce a filtered energy component of the audio signal.
-
-
4. The method of claim 3 wherein the energy component of the audio signal is constructed by assigning to each energy level of the energy component, the absolute value of the corresponding audio sample of the audio signal.
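The energy component of claim 4 amounts to a per-sample absolute value. A minimal Python sketch, in which the function name and the use of a plain list for the audio samples are assumptions:

```python
def energy_component(samples):
    """Return one energy level per audio sample: the absolute value of that sample."""
    return [abs(s) for s in samples]

energy = energy_component([0.5, -0.25, 0.0, -1.0])
```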
-
5. A computer-readable medium having instructions for performing the steps of claim 1.
-
6. A method for detecting pure speech signals in an audio signal having pure speech and non-speech or mixed-speech signals, the method comprising:
-
filtering the audio signal to produce a clean audio signal, where the clean audio signal retains distinct boundaries between the pure-speech and non-speech portions, yet with less noise, wherein the filtering includes:
converting the audio signal into an energy component having a plurality of energy levels, wherein each energy level corresponds to an audio sample of the audio signal; and
applying a morphological closing filter to each energy level of the energy component to produce a filtered energy component of the audio signal by:
positioning a first window over a plurality of energy levels such that a first energy level is positioned near a mid-point of the first window;
dilating the first energy level to a maximum energy level of the surrounding energy levels viewed through the first window;
repositioning the first window over a plurality of energy levels to a next consecutive energy level such that the next consecutive energy level is positioned near a mid-point of the first window;
repeatedly performing the dilating and repositioning until all of the energy levels of the energy component have been dilated;
repositioning the first window over the first energy level;
eroding the first energy level to a minimum energy level of the surrounding energy levels viewed through the first window;
repositioning the first window over a plurality of energy levels to the next consecutive energy level; and
repeatedly performing the eroding and repositioning until all of the energy levels of the energy component have been eroded, resulting in a plurality of filtered energy levels of the energy component;
computing from the audio signal a valley percentage feature;
classifying the audio signal into either a pure-speech or non-speech classification according to the valley percentage feature; and
determining the boundaries between a portion of the audio signal classified as pure-speech and a portion of the audio signal classified as non-speech.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14, 15
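The closing filter recited in claim 6 (dilate every energy level to the window maximum, then erode every dilated level to the window minimum) might be sketched as follows. The function names, the odd window size, and the clipping of the window at the ends of the signal are assumptions of this sketch, not details from the claim.

```python
def _window(values, i, size):
    """Values seen through a window of `size` points centered on index i (clipped at the edges)."""
    half = size // 2
    return values[max(0, i - half): i + half + 1]

def dilate(values, size):
    """Replace each value with the maximum seen through its window."""
    return [max(_window(values, i, size)) for i in range(len(values))]

def erode(values, size):
    """Replace each value with the minimum seen through its window."""
    return [min(_window(values, i, size)) for i in range(len(values))]

def closing_filter(energy, size):
    """Dilate all energy levels first, then erode the dilated levels."""
    return erode(dilate(energy, size), size)

energy = [5, 5, 0, 5, 5, 0, 0, 0, 0, 5]
filtered = closing_filter(energy, 3)
```

In this toy input the single-sample dip at index 2 is filled in, while the four-sample valley, which is wider than the window, is preserved.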
positioning a second window over a plurality of filtered energy levels such that a first filtered energy level is positioned near a mid-point of the second window;
assigning to the valley percentage feature the percentage of the number of filtered energy levels that fall below a threshold energy level of the surrounding filtered energy levels viewed through the second window, as compared to the total number of filtered energy levels viewed through the second window;
repositioning the second window over a plurality of filtered energy levels to a next consecutive filtered energy level such that the next consecutive filtered energy level is positioned near a mid-point of the second window; and
repeatedly performing the assigning and repositioning until all of the filtered energy levels of the energy component have been assigned, resulting in the valley percentage feature of the audio signal.
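The sliding-window computation above could look like the following Python sketch. The steps above do not say how the threshold energy level is obtained. Here it is assumed to be a fraction of the maximum energy in the window, a rule borrowed from the later claims that multiply the maximum energy level by a fraction. The function and parameter names are hypothetical.

```python
def valley_percentage(energy, window, fraction):
    """For each point, the share of windowed energy levels that fall below the threshold."""
    half = window // 2
    feature = []
    for i in range(len(energy)):
        seen = energy[max(0, i - half): i + half + 1]  # window clipped at the edges
        threshold = fraction * max(seen)               # assumed threshold rule
        low = sum(1 for e in seen if e < threshold)
        feature.append(low / len(seen))
    return feature

bursty = valley_percentage([10, 0, 10, 0, 10], window=5, fraction=0.5)
steady = valley_percentage([4, 4, 4], window=3, fraction=0.5)
```

A signal with pauses between bursts yields a high value, while a steady signal has no points below the threshold and yields zero.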
-
-
9. The method of claim 8 wherein the threshold energy level is selected by minimizing a difference between a known boundary of pure-speech and non-speech portions of a training audio signal and a test boundary determined across a parameter space.
-
10. The method of claim 8 wherein the second window is a duration of time selected by minimizing a difference between a known boundary of pure-speech and non-speech portions of a training audio signal and a test boundary determined across a parameter space.
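Claims 9 and 10 select a parameter by minimizing the difference between a known boundary in a training signal and the boundary found while sweeping a parameter space. One way to picture this is an exhaustive grid search, sketched below in Python. The error measure (distance from each known boundary to the nearest detected one), the `toy_detect` stand-in detector, and all names are assumptions. The claims do not prescribe a particular search or distance.

```python
from itertools import product

def boundary_error(known, found):
    """Sum of distances from each known boundary to the nearest detected boundary."""
    if not found:
        return float("inf")
    return sum(min(abs(k - f) for f in found) for k in known)

def select_parameters(detect, signal, known_boundaries, grid):
    """Try every combination in `grid` (name -> candidate values) and keep the best one."""
    names = list(grid)
    best, best_err = None, float("inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        err = boundary_error(known_boundaries, detect(signal, **params))
        if err < best_err:
            best, best_err = params, err
    return best, best_err

def toy_detect(signal, level):
    """Hypothetical stand-in detector: boundaries are where the signal crosses `level`."""
    mask = [1 if s > level else 0 for s in signal]
    return [i for i in range(1, len(mask)) if mask[i] != mask[i - 1]]

best, err = select_parameters(toy_detect, [0, 0, 1, 3, 3, 3, 1, 0], [3, 6],
                              {"level": [0.5, 2, 5]})
```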
-
11. The method of claim 8 wherein the pure speech versus non-speech classification is determined by assigning to a speech decision mask corresponding to each audio sample of the audio signal, a binary value of either:
-
zero to signify the presence of non-speech or mixed-speech, when the corresponding valley percentage feature is equal to or falls below a predetermined threshold valley percentage; or
one to signify the presence of pure-speech, when the corresponding valley percentage feature rises above the predetermined threshold valley percentage.
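The threshold decision of claim 11 reduces to a single comparison per point. A minimal Python sketch, with hypothetical function and variable names:

```python
def speech_decision_mask(valley_pct, threshold):
    """1 where the valley percentage rises above the threshold, 0 where it is equal or below."""
    return [1 if v > threshold else 0 for v in valley_pct]

decision = speech_decision_mask([0.1, 0.5, 0.6], threshold=0.5)
```

A value exactly equal to the threshold is classified as zero, matching the "equal to or falls below" wording.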
-
-
12. The method of claim 11 wherein a boundary between the pure-speech and non-speech classifications is determined by:
-
discarding the values of the speech decision mask that are isolated, wherein the isolated value's neighboring values have an opposite value; and
marking the boundaries between the remaining values of the speech decision mask equal to a binary one and the remaining values of the speech decision mask equal to a binary zero.
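The two steps of claim 12 might be sketched as below. "Discarding" an isolated value is interpreted here as replacing it with the value of its two neighbors. That interpretation, like the function names, is an assumption of the sketch.

```python
def discard_isolated(mask):
    """Flip any value whose two immediate neighbors both hold the opposite value."""
    out = list(mask)
    for i in range(1, len(mask) - 1):
        if mask[i - 1] == mask[i + 1] and mask[i] != mask[i - 1]:
            out[i] = mask[i - 1]
    return out

def mark_boundaries(mask):
    """Indices at which the mask switches between zero and one."""
    return [i for i in range(1, len(mask)) if mask[i] != mask[i - 1]]

raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
cleaned = discard_isolated(raw)
boundaries = mark_boundaries(cleaned)
```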
-
-
13. The method of claim 11 wherein the boundary between the pure-speech and non-speech classifications is determined by applying a morphological opening filter and a morphological closing filter to a speech decision mask, and marking the boundaries between a portion of the filtered speech decision mask having consecutive binary values of one, and a portion of the filtered speech decision mask having consecutive binary values of zero.
-
14. The method of claim 13 wherein the application of the morphological opening filter includes:
-
positioning a third window over a consecutive stream of values in the speech decision mask such that a first value is positioned near a mid-point of the third window;
eroding the first value to a minimum binary value of the surrounding values viewed through the third window;
repositioning the third window over a consecutive stream of values in the speech decision mask to a next consecutive value such that the next consecutive value is positioned near a mid-point of the third window;
repeatedly performing the eroding and repositioning until all of the values of the speech decision mask corresponding to each audio sample of the audio signal have been eroded;
positioning the third window over a consecutive stream of eroded values such that a first eroded value is positioned near a mid-point of the third window;
dilating the first eroded value to a maximum binary value of the surrounding eroded values viewed through the third window;
repositioning the third window over a consecutive stream of eroded values in the speech decision mask to a next consecutive value such that the next consecutive value is positioned near a mid-point of the third window; and
repeatedly performing the dilating and repositioning until all of the values in the speech decision mask corresponding to each audio sample of the audio signal have been dilated, resulting in an opened speech decision mask corresponding to the audio signal.
-
-
15. The method of claim 14 wherein the application of the morphological closing filter includes:
-
positioning a fourth window over a consecutive stream of values in the opened speech decision mask such that a first opened value is positioned near a mid-point of the fourth window;
dilating the first opened value to a maximum binary value of the surrounding opened values viewed through the fourth window;
repositioning the fourth window over a consecutive stream of values in the opened speech decision mask to a next consecutive opened value such that the next consecutive opened value is positioned near a mid-point of the fourth window;
repeatedly performing the dilating and repositioning until all of the values in the opened speech decision mask corresponding to each audio sample of the audio signal have been dilated, resulting in a dilated opened speech decision mask corresponding to the audio signal;
positioning the fourth window over a consecutive stream of values in the dilated opened speech decision mask such that a first dilated opened value is positioned near a mid-point of the fourth window;
eroding the first dilated opened value to a minimum binary zero value of the surrounding dilated opened values viewed through the fourth window;
repositioning the fourth window over a consecutive stream of dilated opened values such that the next consecutive dilated opened value is positioned near a mid-point of the fourth window; and
repeatedly performing the eroding and repositioning until all of the values in the dilated opened speech decision mask corresponding to each audio sample of the audio signal have been eroded, resulting in a closed speech decision mask corresponding to the audio signal.
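Claims 13 to 15 apply an opening filter and then a closing filter to the binary speech decision mask. The Python sketch below follows that order. The helper names, the odd window sizes, and the clipping of the third and fourth windows at the ends of the mask are assumptions.

```python
def _slide(values, size, pick):
    """Apply pick (min or max) to the window of about `size` values centered on each index."""
    half = size // 2
    return [pick(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def opening_filter(mask, third_window):
    """Erode every value to the window minimum, then dilate the eroded values."""
    return _slide(_slide(mask, third_window, min), third_window, max)

def closing_filter(mask, fourth_window):
    """Dilate every value to the window maximum, then erode the dilated values."""
    return _slide(_slide(mask, fourth_window, max), fourth_window, min)

raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
opened = opening_filter(raw, 3)
closed = closing_filter(opened, 3)
```

In this toy mask the opening removes the lone one at index 2 and the closing fills the lone zero at index 8, leaving a single run of ones with a boundary at each end.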
-
-
16. A computer-readable medium on which is stored software for performing speech detection on an audio signal, the software, when executed by a computer, comprising instructions for:
-
storing a plurality of predetermined parameters for detecting pure-speech signals in an audio signal having pure-speech and non-speech or mixed-speech signals;
cleaning the audio signal to remove noise, wherein the audio signal comprises a plurality of audio samples in a first window, the first window having a size equal to one of the predetermined parameters;
computing from the clean audio signal a valley percentage, wherein the valley percentage is computed from a plurality of audio samples in a second window, the second window having a size equal to another one of the predetermined parameters, and wherein the valley percentage represents for an audio sample the number of audio samples in the second window having an energy level falling below a threshold energy level as compared to the total number of audio samples in the second window;
classifying the value of the valley percentage into either the pure-speech or non-speech classifications according to another one of the predetermined parameters; and
determining the boundaries between a plurality of pure-speech and non-speech classifications by eliminating isolated pure-speech and non-speech classifications in a respective third and fourth windows, the third and fourth windows having sizes equal to another two of the predetermined parameters.
Dependent claims: 17, 18
converting each audio sample in the first window into a corresponding energy level, the energy levels comprising an energy component; and
applying a closing filter to the energy component resulting in a corresponding clean audio signal, where the clean audio signal retains distinct boundaries between pure-speech and non-speech portions, yet with less noise.
-
-
18. The computer-readable medium of claim 16 wherein the size of the first window is selected by minimizing a difference between a known boundary of pure-speech and non-speech portions of a training audio signal and a test boundary determined across a parameter space.
-
19. A computer-readable medium on which is stored software for performing speech detection on an audio signal, the software, when executed by a computer, comprising instructions for:
-
storing a plurality of predetermined parameters for detecting pure-speech signals in an audio signal having pure-speech and non-speech or mixed-speech signals;
cleaning the audio signal to remove noise, wherein the audio signal comprises a plurality of audio samples in a first window, the first window having a size equal to one of the predetermined parameters, the cleaning comprising:
converting each audio sample in the first window into a corresponding energy level, the energy levels comprising an energy component;
applying a closing filter to the energy component resulting in a corresponding clean audio signal, where the audio signal retains distinct boundaries between pure-speech and non-speech portions, yet with less noise;
computing from the clean audio signal a valley percentage, wherein the valley percentage is computed from a plurality of audio samples in a second window, the second window having a size equal to another one of the predetermined parameters, wherein the computing comprises:
determining a number of audio samples in the second window having an energy level falling below a threshold energy level, according to another one of the predetermined parameters; and
setting the valley percentage equal to a percentage of the number of audio samples in the second window having an energy level falling below the threshold energy level, as compared to the total number of audio samples in the second window;
classifying the value of the valley percentage into either the pure-speech or non-speech classifications according to another one of the predetermined parameters; and
determining the boundaries between a plurality of pure-speech and non-speech classifications by eliminating isolated pure-speech and non-speech classifications in a respective third and fourth windows, the third and fourth windows having sizes equal to another two of the predetermined parameters.
Dependent claims: 20, 21, 22
determining a maximum energy level in the second window; and
multiplying the maximum energy level by a fraction, the fraction having a value equal to another one of the predetermined parameters.
-
-
22. The computer-readable medium of claim 21 wherein the fraction is selected by minimizing a difference between a known boundary of pure-speech and non-speech portions of a training audio signal and a test boundary determined across a parameter space.
-
23. A computer-readable medium on which is stored software for performing speech detection on an audio signal, the software, when executed by a computer, comprising instructions for:
-
storing a plurality of predetermined parameters for detecting pure-speech signals in an audio signal having pure-speech and non-speech or mixed-speech signals;
cleaning the audio signal to remove noise, wherein the audio signal comprises a plurality of audio samples in a first window, the first window having a size equal to one of the predetermined parameters;
computing from the clean audio signal a valley percentage, wherein the valley percentage is computed from a plurality of audio samples in a second window, the second window having a size equal to another one of the predetermined parameters;
classifying the value of the valley percentage into either the pure-speech or non-speech classifications according to another one of the predetermined parameters, wherein the classifying comprises:
comparing the value of the valley percentage to a threshold valley percentage, the threshold valley percentage having a value equal to another one of the predetermined parameters; and
setting a value in a binary decision mask corresponding to the value of the valley percentage to a value of zero where the valley percentage is equal to or less than the threshold valley percentage, or to a value of one where the valley percentage is greater than the threshold valley percentage; and
determining the boundaries between a plurality of pure-speech and non-speech classifications by eliminating isolated pure-speech and non-speech classifications in a respective third and fourth windows, the third and fourth windows having sizes equal to another two of the predetermined parameters.
Dependent claims: 24, 25, 26
-
-
27. A computer-readable medium on which is stored software for performing speech detection on an audio signal, the software, when executed by a computer, comprising instructions for:
-
storing a plurality of predetermined parameters for detecting pure-speech signals in an audio signal having pure-speech and non-speech or mixed-speech signals;
cleaning the audio signal to remove noise, wherein the audio signal comprises a plurality of audio samples in a first window, the first window having a size equal to one of the predetermined parameters, the cleaning comprising:
converting each audio sample in the first window into a corresponding energy level, the energy levels comprising an energy component;
applying a closing filter to the energy component resulting in a corresponding clean audio signal, where the audio signal retains distinct boundaries between pure-speech and non-speech portions, yet with less noise, wherein the applying includes:
dilating the energy levels of the energy component in the first window; and
eroding the dilated energy levels of the energy component in the first window;
computing from the clean audio signal a valley percentage, wherein the valley percentage is computed from a plurality of audio samples in a second window, the second window having a size equal to another one of the predetermined parameters;
classifying the value of the valley percentage into either the pure-speech or non-speech classifications according to another one of the predetermined parameters; and
determining the boundaries between a plurality of pure-speech and non-speech classifications by eliminating isolated pure-speech and non-speech classifications in a respective third and fourth windows, the third and fourth windows having sizes equal to another two of the predetermined parameters.
-
-
28. A computer-readable medium on which is stored software for performing speech detection on an audio signal, the software, when executed by a computer, comprising instructions for:
-
storing a plurality of predetermined parameters for detecting pure-speech signals in an audio signal having pure-speech and non-speech or mixed-speech signals;
cleaning the audio signal to remove noise, wherein the audio signal comprises a plurality of audio samples in a first window, the first window having a size equal to one of the predetermined parameters;
computing from the clean audio signal a valley percentage, wherein the valley percentage is computed from a plurality of audio samples in a second window, the second window having a size equal to another one of the predetermined parameters;
classifying the value of the valley percentage into either the pure-speech or non-speech classifications according to another one of the predetermined parameters; and
determining the boundaries between a plurality of pure-speech and non-speech classifications by eliminating isolated pure-speech and non-speech classifications in a respective third and fourth windows, the third and fourth windows having sizes equal to another two of the predetermined parameters, wherein the determining comprises:
applying a morphological opening filter to the plurality of pure-speech and non-speech classifications in the third window; and
applying a morphological closing filter to the plurality of pure-speech and non-speech classifications in the fourth window.
-
-
29. A method for extracting speech detection features in an audio signal having a mixture of speech and non-speech audio samples, the method comprising:
-
determining an energy level for each of plural audio samples in an audio signal;
extracting a speech detection feature for each of the plural audio samples by:
determining a maximum energy level in a range of plural surrounding audio samples;
calculating a threshold energy level as a fraction of the maximum energy level; and
setting the speech detection feature based upon a percentage of the plural surrounding audio samples that have an energy level falling below the threshold energy level.
Dependent claims: 30, 31
before extracting, filtering the audio signal to clean the audio signal while preserving boundary distinctions in the audio signal.
-
-
31. The method of claim 29 further comprising:
after extracting, classifying the plural audio samples of the audio signal as speech or non-speech based upon comparison of the extracted speech detection features to a speech detection feature threshold.
-
32. A computer readable medium on which is stored software for extracting speech detection features for an audio signal having a mixture of speech and non-speech audio portions, the software comprising instructions for:
-
determining an energy level for each of plural audio samples in an audio signal;
filtering the audio signal to clean the audio signal while preserving boundary distinctions in the audio signal; and
extracting a speech detection feature for each of plural portions of the filtered audio signal, each portion including one or more audio samples, each speech detection feature based upon a percentage of surrounding portions of the filtered audio signal that have an energy level falling below a threshold energy level for the surrounding portions.
Dependent claims: 33
-
-
34. A method for extracting speech detection features for an audio signal having a mixture of speech and non-speech audio portions, the method comprising:
-
determining an energy level for each of plural audio samples in an audio signal;
extracting a speech detection feature for each of plural portions of the audio signal, each portion including one or more audio samples, each speech detection feature based upon a percentage of surrounding portions of the audio signal that have an energy level falling below a threshold energy level for the surrounding portions;
setting a classification for each of the plural portions as speech or non-speech based upon a comparison of the extracted speech detection feature for the portion to a speech detection feature threshold; and
filtering the classifications to remove isolated classifications, wherein an isolated classification has a value differing from a predominant value for surrounding classifications, and wherein the filtering uses one or more morphological filters.
Dependent claims: 35
-
Specification