Wavelet-based energy binning cepstral features for automatic speech recognition
Abstract
Systems and methods for processing acoustic speech signals that use the wavelet transform (or, alternatively, the Fourier transform) as a fundamental tool. The method involves “synchrosqueezing” the spectral component data obtained by performing a wavelet transform (or Fourier transform) on digitized speech signals. In one aspect, spectral components of the synchrosqueezed plane are dynamically tracked via a K-means clustering algorithm, and the amplitude, frequency and bandwidth of each component are thereby extracted. The cepstrum generated from this information is referred to as “K-mean Wastrum.” In another aspect, the result of the K-means clustering process is further processed to limit the set of primary components to formants. The resulting features are referred to as “formant-based wastrum.” Formants are interpolated in unvoiced regions, and the contribution of the unvoiced turbulent part of the spectrum is added. This method requires adequate formant tracking. The resulting robust formant extraction has a number of applications in speech processing and analysis, including vocal tract normalization.
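The synchrosqueezing idea described above can be illustrated with a minimal sketch for the Fourier alternative. It is not the patented implementation: the function names, the Hann window, the naive DFT, and the one-sample phase-advance estimate of instantaneous frequency are all assumptions chosen for brevity.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) DFT returning only the non-negative-frequency bins."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def synchrosqueeze_frame(signal, start, size, sample_rate):
    """Move each bin's energy to the bin nearest its instantaneous frequency.

    Hypothetical helper: needs at least start + size + 1 samples in `signal`.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / size) for t in range(size)]
    a = dft([signal[start + t] * window[t] for t in range(size)])
    b = dft([signal[start + 1 + t] * window[t] for t in range(size)])
    bin_hz = sample_rate / size
    squeezed = [0.0] * len(a)
    for k in range(len(a)):
        energy = abs(a[k]) ** 2
        if energy < 1e-12:
            continue  # skip numerically empty bins
        # Phase advance over one sample gives the instantaneous frequency (assumed estimator).
        dphi = cmath.phase(b[k] * a[k].conjugate())
        inst_hz = abs(dphi) * sample_rate / (2 * math.pi)
        target = min(len(squeezed) - 1, int(round(inst_hz / bin_hz)))
        squeezed[target] += energy
    return squeezed
```

For a pure tone, the window spreads energy over neighbouring bins, but every one of those bins reports the same instantaneous frequency, so the squeezed output collects the energy in a single bin.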
20 Claims
1. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting spectral features from acoustic speech signals for use in automatic speech recognition, said method steps comprising:
digitizing acoustic speech signals for at least one of a plurality of frames of speech;
performing a first transform on each of said frames of digitized acoustic speech signals to extract spectral parameters for each frame;
performing a squeezing transform on said spectral parameters of each frame by grouping spectral components having similar instantaneous frequencies such that acoustic energy is concentrated at the instantaneous frequency values;
clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering;
mapping frequency, bandwidth and weight values to each element for each frame of speech;
mapping each element with its corresponding frame; and
generating spectral features from said element for each frame.
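The clustering and mapping steps of claim 1 can be sketched as an energy-weighted one-dimensional K-means over a squeezed spectrum. This is an illustrative assumption, not the claimed method itself: the function name, the even seeding of centers, and the use of the energy-weighted standard deviation as "bandwidth" are choices made here for the example.

```python
import math

def cluster_elements(freqs_hz, energies, k, iterations=20):
    """Energy-weighted 1-D K-means.

    Returns one (frequency, bandwidth, weight) tuple per non-empty cluster:
    frequency = cluster center, bandwidth = weighted standard deviation
    (an assumed definition), weight = total cluster energy.
    """
    lo, hi = min(freqs_hz), max(freqs_hz)
    # Seed centers evenly across the band (assumption; any seeding could be used).
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for f, e in zip(freqs_hz, energies):
            nearest = min(range(k), key=lambda i: abs(f - centers[i]))
            groups[nearest].append((f, e))
        for i, g in enumerate(groups):
            total = sum(e for _, e in g)
            if total > 0:
                centers[i] = sum(f * e for f, e in g) / total
    elements = []
    for i, g in enumerate(groups):
        weight = sum(e for _, e in g)
        if weight <= 0:
            continue  # cluster attracted no energy in this frame
        var = sum(e * (f - centers[i]) ** 2 for f, e in g) / weight
        elements.append((centers[i], math.sqrt(var), weight))
    return elements
```

Running this once per frame yields a per-frame list of elements, which corresponds to the step of mapping each element with its frame.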
9. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting spectral features from acoustic speech signals for use in automatic speech recognition, said method steps comprising:
digitizing acoustic speech signals for at least one of a plurality of frames of speech;
performing a first transform on each of said frames of digitized acoustic speech signals to extract spectral parameters for each frame;
performing a squeezing transform on said spectral parameters of each frame by grouping spectral components having similar instantaneous frequencies such that acoustic energy is concentrated at the instantaneous frequency values;
clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering;
mapping frequency, bandwidth and weight values to each element for each frame of speech;
mapping each element with its corresponding frame;
partitioning the elements of each frame to determine at least one centroid;
designating said determined centroids as formants;
generating spectral features for each frame of speech from said formants.
14. A system for processing acoustic speech signals, comprising:
means for digitizing input acoustic speech signals, said input acoustic speech signal being divided into a plurality of successive frames;
first transform means for transforming said digitized speech signal of each frame into a plurality of spectral components;
synchrosqueezing transform means for assigning each of said spectral components for each frame into a corresponding one of a plurality of pseudo-frequency groups, said pseudo-frequency groups being representative of primary spectral components for each frame of speech; and
mel binning means for clustering said synchrosqueezed data in each frame to produce a feature vector having n-parameters for each frame.
18. A system for processing acoustic speech signals, comprising:
means for digitizing input acoustic speech signals, said input acoustic speech signal being divided into a plurality of successive frames;
first transform means for transforming said digitized speech signal of each frame into a plurality of spectral components;
synchrosqueezing transform means for assigning each of said spectral components for each frame into a corresponding one of a plurality of pseudo-frequency groups, said pseudo-frequency groups being representative of primary spectral components for each frame of speech;
means for clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering; and
cepstra generating means for generating feature vectors from said elements.
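A cepstra generating step of the kind recited in claim 18 is commonly realised as a log followed by a discrete cosine transform. The sketch below assumes the elements have already been reduced to a fixed-length vector of per-band energies, a step this section does not spell out; the function name, the energy floor, and the unnormalised DCT-II are likewise assumptions.

```python
import math

def cepstra(band_energies, n_coeffs, floor=1e-10):
    """Log band energies followed by an unnormalised DCT-II (hypothetical helper)."""
    # Floor the energies so that empty bands do not produce log(0).
    logs = [math.log(max(e, floor)) for e in band_energies]
    n = len(logs)
    return [sum(logs[j] * math.cos(math.pi * c * (j + 0.5) / n) for j in range(n))
            for c in range(n_coeffs)]
```

A flat spectrum puts all of its log energy in coefficient 0 and leaves the higher coefficients at zero, which is why the higher coefficients are read as describing spectral shape.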
Specification