Wavelet-based energy binning cepstal features for automatic speech recognition

US 6,253,175 B1
Filed: 11/30/1998
Issued: 06/26/2001
Est. Priority Date: 11/30/1998
Status: Expired due to Term

Abstract

Systems and methods for processing acoustic speech signals which utilize the wavelet transform (and alternatively, the Fourier transform) as a fundamental tool. The method essentially involves “synchrosqueezing” spectral component data obtained by performing a wavelet transform (or Fourier transform) on digitized speech signals. In one aspect, spectral components of the synchrosqueezed plane are dynamically tracked via a K-means clustering algorithm. The amplitude, frequency and bandwidth of each of the components are thus extracted. The cepstrum generated from this information is referred to as “K-mean Wastrum.” In another aspect, the result of the K-means clustering process is further processed to limit the set of primary components to formants. The resulting features are referred to as “formant-based wastrum.” Formants are interpolated in unvoiced regions, and the contribution of the unvoiced turbulent part of the spectrum is added. This method requires adequate formant tracking. The resulting robust formant extraction has a number of applications in speech processing and analysis, including vocal tract normalization.
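
The squeezing idea in the abstract can be illustrated with a minimal sketch. It assumes the Fourier alternative rather than a wavelet transform, a Hann window, and a one-sample phase difference as the instantaneous-frequency estimate; the function names `windowed_dft` and `synchrosqueeze_frame` are hypothetical and do not come from the patent.

```python
import cmath
import math


def windowed_dft(samples, start, size):
    """Hann-windowed DFT of samples[start:start + size], bins 0..size // 2."""
    frame = [
        samples[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / size))
        for n in range(size)
    ]
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / size) for n, x in enumerate(frame))
        for k in range(size // 2 + 1)
    ]


def synchrosqueeze_frame(samples, start, size, sample_rate):
    """Move each bin's energy to the bin nearest its instantaneous frequency."""
    current = windowed_dft(samples, start, size)
    shifted = windowed_dft(samples, start + 1, size)  # same window, one sample later
    squeezed = [0.0] * len(current)
    for a, b in zip(current, shifted):
        energy = abs(a) ** 2
        if energy == 0.0:
            continue
        # Assumption: the phase advance over one sample (radians) approximates
        # the instantaneous angular frequency of the component in this bin.
        phase_step = cmath.phase(b * a.conjugate())
        inst_hz = phase_step * sample_rate / (2 * math.pi)
        target = round(inst_hz * size / sample_rate)
        if 0 <= target < len(squeezed):
            squeezed[target] += energy
    return squeezed
```

For a steady tone that falls between two bins, the plain windowed spectrum spreads energy over several neighbouring bins, while the squeezed output gathers nearly all of it in the single bin closest to the tone.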

20 Claims

1. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting spectral features from acoustic speech signals for use in automatic speech recognition, said method steps comprising:
- digitizing acoustic speech signals for at least one of a plurality of frames of speech;
  
  performing a first transform on each of said frames of digitized acoustic speech signals to extract spectral parameters for each frame;
  
  performing a squeezing transform on said spectral parameters of each frame by grouping spectral components having similar instantaneous frequencies such that acoustic energy is concentrated at the instantaneous frequency values;
  
  clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering;
  
  mapping frequency, bandwidth and weight values to each element for each frame of speech;
  
  mapping each element with its corresponding frame; and
  
  generating spectral features from said element for each frame.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further including the step of applying constraints to filter said elements having values that fall below a determined threshold.
  - 3. The method of claim 1 further comprising the step of iteratively performing said clustering step in order to obtain convergence of said cluster centers for each frame.
  - 4. The method of claim 1, wherein said first transform step is performed using a windowed Fourier transform.
  - 5. The method of claim 1, wherein said first transform step is performed using a wavelet transform.
  - 6. The method of claim 5, wherein said wavelet transform is implemented as a quasi-continuous wavelet transform.
  - 7. The method of claim 1, wherein said clustering step is performed using K-means clustering.
  - 8. The method of claim 1, wherein said spectral features are generated by processing said element data with Schroeder's formula.
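
The clustering and mapping steps of claims 1, 3 and 7 can be sketched as a one-dimensional, energy-weighted K-means. This is an illustration under stated assumptions, not the patented implementation: the bandwidth is taken to be the energy-weighted standard deviation of a cluster, the weight is the summed energy, and the names `nearest_center` and `cluster_elements` are hypothetical.

```python
import math


def nearest_center(freq, centers):
    """Index of the cluster center closest to freq."""
    return min(range(len(centers)), key=lambda i: abs(freq - centers[i]))


def cluster_elements(freqs, energies, seeds, max_iter=50, tol=1e-6):
    """Cluster squeezed (frequency, energy) points into per-frame elements."""
    centers = list(seeds)
    for _ in range(max_iter):
        totals = [0.0] * len(centers)
        moments = [0.0] * len(centers)
        for f, e in zip(freqs, energies):
            i = nearest_center(f, centers)
            totals[i] += e
            moments[i] += f * e
        # An empty cluster keeps its previous center.
        updated = [m / t if t > 0 else c for m, t, c in zip(moments, totals, centers)]
        converged = max(abs(u - c) for u, c in zip(updated, centers)) < tol
        centers = updated
        if converged:  # iterate until the centers stop moving (claim 3)
            break

    elements = []
    for i, center in enumerate(centers):
        members = [(f, e) for f, e in zip(freqs, energies)
                   if nearest_center(f, centers) == i]
        weight = sum(e for _, e in members)
        if weight == 0:
            continue
        # Assumption: bandwidth is the energy-weighted spread around the center.
        spread = math.sqrt(sum(e * (f - center) ** 2 for f, e in members) / weight)
        elements.append({"frequency": center, "bandwidth": spread, "weight": weight})
    return elements
```

A threshold filter in the spirit of claim 2 could then drop elements whose `weight` falls below a chosen value.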

9. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for extracting spectral features from acoustic speech signals for use in automatic speech recognition, said method steps comprising:
- digitizing acoustic speech signals for at least one of a plurality of frames of speech;
  
  performing a first transform on each of said frames of digitized acoustic speech signals to extract spectral parameters for each frame;
  
  performing a squeezing transform on said spectral parameters of each frame by grouping spectral components having similar instantaneous frequencies such that acoustic energy is concentrated at the instantaneous frequency values;
  
  clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering;
  
  mapping frequency, bandwidth and weight values to each element for each frame of speech;
  
  mapping each element with its corresponding frame;
  
  partitioning the elements of each frame to determine at least one centroid;
  
  designating said determined centroids as formants;
  
  generating spectral features for each frame of speech from said formants.
- Dependent claims (10, 11, 12, 13):
  - 10. The method of claim 9, further comprising the step of iteratively performing said partitioning step in order to obtain convergence of said centroids for each frame.
  - 11. The method of claim 9, wherein said first transform step is performed using a windowed Fourier transform.
  - 12. The method of claim 9, wherein said first transform step is performed using a wavelet transform.
  - 13. The method of claim 9, further comprising the step of selecting centroids for a previous frame as seeds for partitioning a successive frame.
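
The seeding idea of claim 13 can be sketched as follows: the centroids found for one frame start the partitioning of the next, so each formant track follows the same spectral region over time. The representation of a frame as a list of `(frequency, weight)` pairs and the names `partition` and `track_formants` are assumptions made for this illustration.

```python
def partition(elements, seeds, iterations=20):
    """Weighted 1-D partitioning of (frequency, weight) pairs; returns centroids."""
    centroids = list(seeds)
    for _ in range(iterations):
        sums = [0.0] * len(centroids)
        weights = [0.0] * len(centroids)
        for freq, weight in elements:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(freq - centroids[i]))
            sums[nearest] += freq * weight
            weights[nearest] += weight
        # A centroid with no assigned elements stays where it was.
        updated = [s / w if w > 0 else c
                   for s, w, c in zip(sums, weights, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def track_formants(frames, initial_seeds):
    """Partition each frame, seeding it with the previous frame's centroids."""
    tracks = []
    seeds = list(initial_seeds)
    for elements in frames:
        seeds = partition(elements, seeds)
        tracks.append(seeds)
    return tracks
```

Because the seeds carry over, the position of a centroid in the returned list identifies the same track in every frame.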

14. A system for processing acoustic speech signals, comprising:
- means for digitizing input acoustic speech signals, said input acoustic speech signal being divided into a plurality of successive frames;
  
  first transform means for transforming said digitized speech signal of each frame into a plurality of spectral components;
  
  synchrosqueezing transform means for assigning each of said spectral components for each frame into a corresponding one of a plurality of pseudo-frequency groups, said pseudo-frequency groups being representative of primary spectral components for each frame of speech; and
  
  mel binning means for clustering said synchrosqueezed data in each frame to produce a feature vector having n-parameters for each frame.
- Dependent claims (15, 16, 17):
  - 15. The system of claim 14, further comprising decorrelation means for processing the feature vector for each frame by decorrelating the n parameters comprising the feature vector.
  - 16. The method of claim 15, wherein said decorrelation means is implemented as one of a linear discriminant analysis algorithm and a discrete cosine transform algorithm.
  - 17. The method of claim 14, wherein said first transform means is a wavelet transform.
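
The mel binning and decorrelation stages of claims 14 to 16 can be sketched as below. The sketch assumes the common 2595·log10(1 + f/700) mel mapping, rectangular (non-overlapping) bins, a log compression step, and an unnormalized DCT-II as the decorrelation option; none of these specifics is stated in the claims, and all function names are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Common mel-scale mapping (an assumption; the claims do not fix one)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_bin_energies(freqs, energies, n_bins, max_hz):
    """Sum squeezed energies into n_bins bins equally spaced on the mel axis."""
    top = hz_to_mel(max_hz)
    bins = [0.0] * n_bins
    for f, e in zip(freqs, energies):
        if 0 <= f <= max_hz:
            index = min(int(hz_to_mel(f) / top * n_bins), n_bins - 1)
            bins[index] += e
    return bins


def dct2(values):
    """Unnormalized DCT-II, used here as the decorrelating transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def feature_vector(freqs, energies, n_bins=8, max_hz=4000.0, floor=1e-10):
    """n-parameter feature vector: mel-binned log energies, then DCT."""
    binned = mel_bin_energies(freqs, energies, n_bins, max_hz)
    log_bins = [math.log(max(b, floor)) for b in binned]  # floor avoids log(0)
    return dct2(log_bins)
```

Linear discriminant analysis, the other option named in claim 16, would replace `dct2` with a projection learned from labelled training data.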

18. A system for processing acoustic speech signals, comprising:
- means for digitizing input acoustic speech signals, said input acoustic speech signal being divided into a plurality of successive frames;
  
  first transform means for transforming said digitized speech signal of each frame into a plurality of spectral components;
  
  synchrosqueezing transform means for assigning each of said spectral components for each frame into a corresponding one of a plurality of pseudo-frequency groups, said pseudo-frequency groups being representative of primary spectral components for each frame of speech;
  
  means for clustering said squeezed spectral parameters to determine elements corresponding to each frame, the location of the elements being determined by cluster centers resulting from said clustering;
  
  and cepstra generating means for generating feature vectors from said elements.
- Dependent claims (19, 20):
  - 19. The system of claim 18, further comprising means for partitioning the elements of each frame to determine formants for each frame, the formants being equal to centroids computed by said partitioning means, whereby said formants are used by said cepstra generating means to produce feature vectors.
  - 20. The system of claim 19 wherein said cepstra is generated using Schroeder's formula.
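
The claims name Schroeder's formula without reproducing it. A minimal sketch of the kind of relation involved is the standard all-pole identity, in which each formant is treated as a conjugate pole pair with radius exp(-π·B/fs) and angle 2π·F/fs, giving c_n = (2/n) Σ_k exp(-π·n·B_k/fs) · cos(2π·n·F_k/fs). Whether the patent's specification uses exactly this form, or additionally applies the element weights, is an assumption left open here; `formant_cepstrum` is a hypothetical name.

```python
import math


def formant_cepstrum(formants, sample_rate, n_coeffs):
    """Cepstral coefficients c_1..c_n from (frequency_hz, bandwidth_hz) pairs.

    Assumes each formant is a conjugate pole pair of an all-pole model;
    element weights are not applied in this sketch.
    """
    cepstrum = []
    for n in range(1, n_coeffs + 1):
        total = 0.0
        for freq, bandwidth in formants:
            # Pole radius raised to the n-th power.
            radius_n = math.exp(-math.pi * bandwidth * n / sample_rate)
            total += radius_n * math.cos(2 * math.pi * freq * n / sample_rate)
        cepstrum.append(2.0 * total / n)
    return cepstrum
```

Wider bandwidths shrink the pole radius, so their contribution to higher-order coefficients decays faster than that of narrow formants.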

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Basu, Sankar; Maes, Stephane H.
Primary Examiner(s): Tsang, Fan
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US09/201,055
Time in Patent Office: 939 Days
Field of Search: 704/231, 704/236, 704/237, 704/245, 704/250
US Class Current: 704/231
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 2015/0631   Creating reference template...
G10L 25/27   characterised by the analys...
