Line spectral frequencies and energy features in a robust signal recognition system

US 6,009,391 A
Filed: 08/06/1997
Issued: 12/28/1999
Est. Priority Date: 06/27/1997
Status: Expired due to Term

Abstract

One embodiment of a speech recognition system is organized with speech input signal preprocessing and feature extraction followed by a fuzzy matrix quantizer (FMQ). Frames of the speech input signal are represented in a matrix by a vector f of line spectral pair frequencies and energy coefficients and are fuzzy matrix quantized to respective vector f̂ entries of a matrix codeword in a codebook of the FMQ. The energy coefficients include the original energy and the first and second derivatives of the original energy, which increase recognition accuracy by, for example, being generally distinctive speech input signal parameters and providing noise signal suppression, especially when the noise signal has a relatively constant energy over at least two time frame intervals. To reduce data while maintaining sufficient resolution, the energy coefficients may be normalized and logarithmically represented. A distance measure between f and f̂, d(f, f̂), is defined as ##EQU1## where the constants α₁, α₂, β₁ and β₂ are set to substantially minimize quantization error, e_i is the error power spectrum of the speech input signal and a predicted speech input signal at the ith line spectral pair frequency of the speech input signal, the first G LSP frequencies are most likely to be frequency shifted by noise, and the last three coefficients, P+1 to P+3, represent the three energy coefficients. This robust distance measure can be used to enhance speech recognition performance in generally any speech recognition system using line spectral pair based distance measures.
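
The noise suppression attributed to the energy derivatives can be illustrated with a short worked derivation. This is only a sketch under an assumption the abstract does not state: that the frame energy splits additively into a speech part S_y and a noise part N_y (for example, when speech and noise are uncorrelated). The symbols S_y and N_y are introduced here for illustration.

```latex
\begin{align*}
E_y   &= S_y + N_y \\
E'_y  &= E_y - E_{y-1} = (S_y - S_{y-1}) + (N_y - N_{y-1}) \\
E''_y &= E'_y - E'_{y-1} = E_y - 2E_{y-1} + E_{y-2}
\end{align*}
% If the noise energy is constant over two frames, N_y = N_{y-1}, then
% E'_y = S_y - S_{y-1}: the noise term cancels in the first derivative.
% The second derivative spans three frames, so it is noise-free when
% N_y = N_{y-1} = N_{y-2}.
```

Under this assumption the original energy still carries the noise term, while the difference terms drop it whenever the noise energy is steady across the frames involved.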

35 Claims

1. A speech recognition system comprising:
- a line spectral pair frequency coefficient generator;
  
  an energy coefficients generator; and
  
  a first speech classifier capable of using Nth order vectors to generate first speech classification output data for classifying a speech input signal as recognized speech, wherein the speech input signal is represented by a number of frames with each frame represented by one of the Nth order vectors, wherein components of each Nth order vector include respective line spectral pair frequency coefficients for P orders generated by the line spectral pair frequency coefficient generator, a first energy coefficient generated by the energy coefficients generator and representing original energy of the speech input signal for the respective frame, and a second energy coefficient generated by the energy coefficients generator and representing a first derivative of the original energy of the speech input signal for the respective frame, wherein N and P are integers.
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
- - 2. The speech recognition system of claim 1 wherein each frame is further represented by a third energy coefficient representing a second derivative of the original energy of the speech input signal for the respective frame.
  - 3. The speech recognition system of claim 1 wherein the first energy coefficient representing the original energy, E_y, of the speech input signal for the respective yth frame is defined as:
    - ##EQU30## wherein s(n)_y is a discrete time representation of the speech input signal in the yth frame, N represents the number of samples of the speech input signal in the yth frame, and TO is an integer representing the total number of frames representing the speech input signal; and
      wherein the second energy coefficient representing the energy, E'_y, in the first derivative of E_y for the respective yth frame is defined as:
      
      E'_y = (E_y)' = E_y - E_{y-1};
      
      wherein E_{y-1} is the original energy in the frame of the speech input signal preceding the yth frame.
  - 4. The speech recognition system of claim 1 wherein the first and second energy coefficients are normalized with respect to energy in a frame of the speech input signal having a maximum energy with respect to the remaining frames.
  - 5. The speech recognition system of claim 4 wherein the normalized first and second energy coefficients are further represented logarithmically.
  - 6. The speech recognition system of claim 1 wherein each frame is further represented by a third energy coefficient representing a second derivative, E"_y, of the speech input signal original energy for the respective frame and is defined as:
    - E"_y = (E'_y)' = E'_y - E'_{y-1};
      wherein E'_{y-1} represents the first derivative of the energy in the frame preceding the yth frame.
  - 7. The speech recognition system of claim 1 further comprising:
    - a quantizer for determining respective distance measures for each respective frame of the speech input signal between the first G line spectral pair frequencies of the speech input signal and G corresponding order line spectral pair frequencies of a plurality of respective reference vectors, wherein the distance measure for an ith line spectral pair frequency and an ith reference speech signal line spectral pair frequency, for each of i=1 to G line spectral pair frequencies, is proportional to (i) a difference between the ith line spectral pair frequencies and the ith reference speech signal line spectral pair frequencies and (ii) a shift of the difference by an ith frequency shifting factor to at least partially compensate for frequency shifting of the ith speech input signal line spectral pair frequency by speech noise, wherein G is greater than or equal to one and less than or equal to P.
  - 8. The speech recognition system of claim 7 wherein the quantizer is for further determining respective distance measures between ith speech input signal line spectral pair frequencies and the ith reference speech signal line spectral pair frequencies of the reference vectors, wherein the respective distance measures, for i=G+1 to P, are derived from (i) a difference between the ith speech input signal line spectral pair frequencies of each reference vector and the ith reference speech signal line spectral pair frequency and (ii) a weighting of the respective differences by an ith frequency weighting factor.
  - 9. The speech recognition system of claim 8 wherein the quantizer is for further determining a distance measure, d(f, f̂), between the speech input signal, f, and each of the reference speech signals, f̂, wherein d(f, f̂) is defined by:
    - ##EQU31## wherein f_i and f̂_i are the ith line spectral pair frequencies in the speech input signal and the reference speech signal, respectively, E_i and Ê_i are the ith energy coefficients, the constants α₁, α₂, α₃, β₁, and β₂ are set to substantially minimize quantization error, and e_i is the error power spectrum of the speech input signal and a predicted speech input signal at the ith line spectral pair frequency of the speech input signal.
  - 10. The speech recognition system of claim 9 wherein the i=1 to G line spectral pair frequencies are in the 0 to 400 Hz range.
  - 11. The speech recognition system of claim 9 wherein α₁ is set to 1.6, α₂ is set to 0.68, β₁ is set to 0.5, and β₂ is set to 0.25.
  - 12. The speech recognition system of claim 7 wherein the ith frequency shifting factor is proportional to a power spectrum of a linear prediction error at the ith line spectral pair frequency.
  - 13. The speech recognition system of claim 7 wherein the quantizer includes a codebook having C codewords, wherein C is an integer and each codeword is comprised of a set of reference vectors wherein the first speech classification output data is based on the distance measures, the speech recognition system further comprising:
    - a second speech classifier to receive the first speech classification output data based on the distance measures and to generate second speech classification output data to classify the speech input signal as one of u vocabulary words, wherein u is an integer.
  - 14. The speech recognition system of claim 13 wherein the quantizer is a single codebook quantizer having C times u codewords representing a vocabulary of u words.
  - 15. The speech recognition system of claim 13 further comprising:
    - a third speech classifier to receive the second speech classification output data from the second speech classifier and classify the speech input signal as one of the u vocabulary words.
  - 16. The speech recognition system of claim 13 wherein the second speech classifier is a neural network.
  - 17. The speech recognition system of claim 13 wherein the quantizer is a fuzzy matrix quantizer further for generating respective fuzzy distance measures between the respective speech input signal and reference speech signal P line spectral pair frequencies and corresponding energy coefficients using the corresponding generated distance measures;
    - and wherein the second speech classifier includes a neural network and the output data is a fuzzy distance measure proportional to a combination of the generated fuzzy distance measures.
  - 18. The speech recognition system of claim 17 wherein the quantizer is a fuzzy matrix quantizer further for generating an observation sequence of indices indicating the relative closeness between the respective speech input signal and reference speech signal P line spectral pair frequencies and corresponding energy coefficients;
    - and wherein the second speech classifier includes u hidden Markov models and a fuzzy Viterbi algorithm module for determining a respective probability for each of the u hidden Markov models that the respective hidden Markov model produced the observation sequence.
  - 19. The speech recognition system of claim 1 further comprising a computer system having a memory to store the speech processing module and a processor coupled to the memory for executing the speech processing module.
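
The energy coefficients of claims 2 through 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the equation for E_y is not reproduced in this text, so the sketch assumes the frame energy is the sum of squared samples. The function names are hypothetical, and setting the first frame's differences to 0.0 is also an assumption, since the first frame has no predecessor. The logarithmic representation of claim 5 is left out because the text does not say how negative derivative values would be handled.

```python
def frame_energy(frame):
    """Energy of one frame, assumed here to be the sum of squared samples."""
    return sum(float(s) ** 2 for s in frame)


def backward_difference(values):
    """Difference against the preceding frame.

    The first frame has no predecessor, so its difference is set to 0.0
    (an assumption made for this sketch).
    """
    if not values:
        return []
    return [0.0] + [values[y] - values[y - 1] for y in range(1, len(values))]


def energy_coefficients(frames):
    """Return per-frame lists (E, E', E'') for a list of sample frames."""
    energy = [frame_energy(frame) for frame in frames]
    first = backward_difference(energy)   # E'_y  = E_y  - E_{y-1}
    second = backward_difference(first)   # E''_y = E'_y - E'_{y-1}
    return energy, first, second


def normalize_by_peak(energy, first, second):
    """Scale all three coefficient tracks by the maximum frame energy."""
    peak = max(energy, default=0.0)
    if peak == 0.0:
        # Nothing to scale by; return unchanged copies.
        return list(energy), list(first), list(second)
    return (
        [v / peak for v in energy],
        [v / peak for v in first],
        [v / peak for v in second],
    )


frames = [[1, 1], [2, 0], [0, 0]]
E, dE, d2E = energy_coefficients(frames)
En, dEn, d2En = normalize_by_peak(E, dE, d2E)
print(E, dE, d2E)
print(En, dEn, d2En)
```

Normalizing by the peak frame energy follows the wording of claim 4, which normalizes with respect to the frame having the maximum energy.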

20. An apparatus comprising:
- means for generating P order line spectral pair frequencies for an acoustic input signal;
  
  means for determining a difference, for i=1 to G, between the ith line spectral pair frequency and an ith line spectral frequency of a reference acoustic signal;
  
  means for shifting the difference by an ith frequency shifting factor, for i=1 to G, to at least partially compensate for frequency shifting of the ith acoustic input signal line spectral pair frequency by acoustic noise;
  
  means for determining a difference, for i=G+1 to P, between the ith acoustic input signal line spectral pair frequency and the ith reference acoustic signal line spectral pair frequency;
  
  means for weighting of the difference by an ith frequency weighting factor, for i=G+1 to P, wherein ith frequency shifting and weighting factor is the error power spectrum of the acoustic input signal and a predicted acoustic input signal at the ith line spectral pair frequency of the acoustic input signal;
  
  means for determining an energy of the acoustic input signal;
  
  means for determining a first derivative of the acoustic input signal energy; and
  
  means for utilizing the shifted and weighted differences for each of the P line spectral pair frequencies, the energy of the acoustic input signal, and the first derivative of the acoustic input signal energy to classify the acoustic input signal.
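
The shifted and weighted differences described in claims 7 through 12 and claim 20 can be sketched in code. The patent's own distance equation is not reproduced in this text, so the functional form below is an assumption chosen to match the claim wording: an additive shift e_i^β₁ on the first G differences, a multiplicative weight e_i^β₂ on differences G+1 to P, and plain squared differences for the energy coefficients. The function name, the sign of the shift, and the default `alpha3=1.0` are hypothetical. The other default constants are the values given in claim 11.

```python
def robust_lsp_distance(f, f_ref, e, G, P,
                        alpha1=1.6, alpha2=0.68, alpha3=1.0,
                        beta1=0.5, beta2=0.25):
    """Assumed form of a shifted/weighted LSP distance plus energy terms.

    f, f_ref : input and reference vectors (P LSP frequencies followed by
               the energy coefficients)
    e        : P error power spectrum values, one per LSP frequency
    G        : number of low-order LSP frequencies that get the shift
    """
    if not 1 <= G <= P:
        raise ValueError("G must satisfy 1 <= G <= P")
    if len(f) != len(f_ref) or len(f) < P or len(e) != P:
        raise ValueError("vector lengths are inconsistent with P")

    # i = 1..G: difference shifted by the ith frequency shifting factor.
    shifted = sum(((f[i] - f_ref[i]) + e[i] ** beta1) ** 2 for i in range(G))
    # i = G+1..P: difference weighted by the ith frequency weighting factor.
    weighted = sum(((f[i] - f_ref[i]) * e[i] ** beta2) ** 2
                   for i in range(G, P))
    # Remaining components: energy coefficient differences.
    energy = sum((f[i] - f_ref[i]) ** 2 for i in range(P, len(f)))

    return alpha1 * shifted + alpha2 * weighted + alpha3 * energy


d = robust_lsp_distance(
    f=[1.0, 2.0, 3.0, 0.5],
    f_ref=[1.0, 1.0, 2.0, 0.0],
    e=[4.0, 16.0, 1.0],
    G=1, P=3,
    alpha1=1.0, alpha2=1.0, alpha3=1.0,
)
print(d)
```

With an additive shift of this kind, the distance between identical vectors is not zero. That is a property of the assumed form, not something the text confirms about the patented measure.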

21. A method of generating a robust distance measure in a speech recognition system comprising the steps of:
- determining energy coefficients of each of X frames of a speech input signal, wherein the step of determining energy coefficients comprises the steps of:
  
  determining a first energy coefficient for each of the X frames, wherein the first energy coefficient represents original energy of the speech input signal for a respective one of the X frames; and
  
  determining a second energy coefficient for each of the X frames, wherein the second energy coefficient represents a first derivative of the original energy of the respective one of the X frames;
  
  determining P order line spectral pair frequencies for the speech input signal;
  
  representing the energy coefficients and line spectral pair frequencies as components of a vector;
  
  determining respective differences between the energy coefficients of the speech input signal and corresponding energy coefficients of a plurality of reference codewords;
  
  determining respective differences between the respective P line spectral frequencies of the speech input signal and corresponding P line spectral frequencies of the reference codewords; and
  
  utilizing the energy coefficients and line spectral pair frequencies respective differences to classify the speech input signal as one of the reference codewords.
- Dependent claims: 22, 23, 24, 25
- - 22. The method of claim 21 wherein the step of determining the energy coefficients comprises the steps of:
    - for each of the X frames of the speech input signal, sampling the frame at a rate of n samples per second to represent the speech input signal as s(n)_y, y=1, 2, . . . , X;
      wherein determining a first energy coefficient for each of the X frames comprises generating an original energy coefficient, E_y, for each frame of speech input signal, wherein E_y is defined as ##EQU32## wherein N represents a number of the samples in the yth frame;
      
      wherein determining a second energy coefficient for each of the X frames comprises generating a first derivative of the original energy coefficient, E'_y, for each frame of speech input signal, wherein E'_y is defined as
      
      E'_y = (E_y)' = E_y - E_{y-1}; and
      
      generating a second derivative of the original energy coefficient, E"_y, for each frame of speech input signal, wherein E"_y is defined as
      
      E"_y = (E'_y)' = E'_y - E'_{y-1}.
  - 23. The method of claim 22 further comprising the steps of:
    - normalizing the original energy coefficient, E_y;
      
      normalizing the first derivative of the original energy coefficient, E'_y; and
      
      normalizing the second derivative of the original energy coefficient, E"_y.
  - 24. The method of claim 21 further comprising the steps of:
    - shifting the respective differences of the first G line spectral pair frequencies by respective frequency shifting factors to at least partially compensate for frequency shifting of the respective speech input signal line spectral pair frequencies by acoustic noise; and
      
      weighting the respective differences for the remaining G+1 to P line spectral pair frequencies with respective frequency weighting factors.
  - 25. The method of claim 24 further comprising the steps of:
    - weighting the respective differences of the first G line spectral pair frequencies by a first weighting constant, α₁;
      
      weighting the respective differences of the remaining G+1 to P line spectral pair frequencies by a second weighting constant, α₂;
      
      adding the respective differences together to generate a distance measure between the speech input signal and the reference speech signal; and
      
      utilizing the P line spectral pair frequency differences and energy coefficient differences to classify the speech input signal.
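
The method of claim 21 (form a vector from LSP frequencies and energy coefficients, take differences against reference codewords, and classify) can be sketched as below. This is a simplified illustration under stated assumptions: it combines the differences as an unweighted sum of squares and picks the closest codeword, whereas the claims leave the combination rule open. The function names and the sample values are hypothetical.

```python
def build_feature_vector(lsp_frequencies, energy_coeffs):
    """Concatenate P LSP frequencies and the energy coefficients."""
    return ([float(v) for v in lsp_frequencies]
            + [float(v) for v in energy_coeffs])


def squared_differences(vector, codeword):
    """Component-wise squared differences between input and reference."""
    if len(vector) != len(codeword):
        raise ValueError("vector and codeword must have the same length")
    return [(a - b) ** 2 for a, b in zip(vector, codeword)]


def classify(vector, codewords):
    """Return (index, total) of the codeword with the smallest total
    squared difference. The unweighted sum is an assumption."""
    if not codewords:
        raise ValueError("at least one codeword is required")
    totals = [sum(squared_differences(vector, cw)) for cw in codewords]
    best = min(range(len(totals)), key=totals.__getitem__)
    return best, totals[best]


# Two LSP frequencies followed by two energy coefficients (E_y and E'_y).
vector = build_feature_vector([0.3, 0.9], [1.0, 0.1])
codewords = [
    [0.2, 1.0, 0.5, 0.0],
    [0.3, 0.8, 1.0, 0.0],
]
index, total = classify(vector, codewords)
print(index, total)
```

The shifting and weighting of claims 24 and 25 would replace the unweighted sum in `classify`; they are left out here to keep the sketch short.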

26. A method of robust speech recognition in an automotive environment comprising the steps of:
- receiving a speech input signal corrupted by automotive environment noise;
  
  representing each frame of the speech input signal with a vector f of P line spectral pair frequencies and X energy coefficients;
  
  representing each of n codewords in a quantizer codebook as a respective vector f̂ of P line spectral pair frequencies and X energy coefficients, wherein n is a nonnegative integer; and
  
  determining a distance measure between the vector f and each respective vector f̂, wherein the distance measure, d(f, f̂), is defined by:
  
  ##EQU33##
  
  using the distance measure to classify the speech input signal as recognized speech;
  
  wherein the constants α₁, α₂, α₃, β₁ and β₂ are set to substantially minimize quantization error, and e_i is the error power spectrum of the input signal and a predicted input signal at the ith line spectral pair frequency of the input signal.
- Dependent claims: 27, 28, 29, 30
- - 27. The method as in claim 26 further comprising the steps of:
    - using the distance measure, d(f, f̂), to generate fuzzy distance measures in an FMQ/HMM speech recognition system.
  - 28. The method as in claim 26 comprising the steps of:
    - using the distance measure, d(f, f̂), to generate fuzzy distance measures in an FMQ/HMM/MLP speech recognition system.
  - 29. The method as in claim 26 wherein the FMQ includes codebooks for each of u speech recognition system vocabulary words.
  - 30. The method as in claim 26 wherein X is three, and the three energy coefficients are the original energy of a respective frame of the speech input signal, a first derivative of the original energy, and a second derivative of the original energy.

31. An apparatus comprising:
- a first classifier capable of using Nth order vectors to generate first speech classification output data for classifying the input signal, wherein the input signal is represented by a number of frames with each frame represented by an Nth order vector, wherein components of each Nth order vector include respective line spectral pair frequency coefficients for P orders, a first energy coefficient representing original energy of the input signal for the respective frame, and a second energy coefficient representing a first derivative of the original energy of the speech input signal for the respective frame, wherein N and P are integers.
- Dependent claims: 32, 33, 34
- - 32. The apparatus of claim 31 further comprising:
    - a quantizer for determining respective distance measures for each respective frame of the input signal between the first G line spectral pair frequencies of the input signal and G corresponding order line spectral pair frequencies of a plurality of respective reference vectors, wherein the distance measure for an ith line spectral pair frequency and an ith reference signal line spectral pair frequency, for each of i=1 to G line spectral pair frequencies, is proportional to (i) a difference between the ith line spectral pair frequencies and the ith reference signal line spectral pair frequencies and (ii) a shift of the difference by an ith frequency shifting factor to at least partially compensate for frequency shifting of the ith input signal line spectral pair frequency by noise, wherein G is greater than or equal to one and less than or equal to P.
  - 33. The apparatus of claim 32 wherein the quantizer is for further determining respective distance measures between ith input signal line spectral pair frequencies and the ith reference signal line spectral pair frequencies of the reference vectors, wherein the respective distance measures, for i=G+1 to P, are derived from (i) a difference between the ith input signal line spectral pair frequencies of each reference vector and the ith reference signal line spectral pair frequency and (ii) a weighting of the respective differences by an ith frequency weighting factor.
  - 34. The apparatus of claim 33 wherein the quantizer is for further determining a distance measure, d(f, f̂), between the input signal, f, and each of the reference speech signals, f̂, wherein d(f, f̂) is defined by:
    - ##EQU34## wherein f_i and f̂_i are the ith line spectral pair frequencies in the input signal and the reference signal, respectively, E_i and Ê_i are the ith energy coefficients, the constants α₁, α₂, α₃, β₁, and β₂ are set to substantially minimize quantization error, and e_i is the error power spectrum of the input signal and a predicted input signal at the ith line spectral pair frequency of the input signal.

35. A method comprising the steps of:
- determining energy coefficients of each of X frames of an input signal, wherein the step of determining energy coefficients comprises the steps of:
  
  determining a first energy coefficient for each of the X frames, wherein the first energy coefficient represents original energy of the input signal for a respective one of the X frames; and
  
  determining a second energy coefficient for each of the X frames, wherein the second energy coefficient represents a first derivative of the original energy of the respective one of the X frames;
  
  determining P order line spectral pair frequencies for the input signal;
  
  representing the energy coefficients and line spectral pair frequencies as components of a vector;
  
  determining respective differences between the energy coefficients of the input signal and corresponding energy coefficients of a plurality of reference codewords;
  
  determining respective differences between the respective P line spectral frequencies of the input signal and corresponding P line spectral frequencies of the reference codewords; and
  
  utilizing the energy coefficients and line spectral pair frequencies respective differences to classify the input signal as one of the reference codewords.

Current Assignee: RPX Corporation
Original Assignee: Advanced Micro Devices, Inc.
Inventors: Cong, Lin; Asghar, Safdar M.
Primary Examiner(s): Hudspeth, David R.
Assistant Examiner(s): Storm, Donald L.
Application Number: US08/907,145
Time in Patent Office: 874 Days
Field of Search: 704/243, 704/238, 704/236, 704/222
US Class Current: 704/243
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/10   using distance or distortio...
G10L 15/20   Speech recognition techniqu...
