Methods and devices for source controlled variable bit-rate wideband speech coding

US 20050177364A1
Filed: 01/19/2005
Published: 08/11/2005
Est. Priority Date: 10/11/2002
Status: Active Grant

Abstract

Speech signal classification and encoding systems and methods are disclosed herein. The signal classification is done in three steps each of them discriminating a specific signal class. First, a voice activity detector (VAD) discriminates between active and inactive speech frames. If an inactive speech frame is detected (background noise signal) then the classification chain ends and the frame is encoded with comfort noise generation (CNG). If an active speech frame is detected, the frame is subjected to a second classifier dedicated to discriminate unvoiced frames. If the classifier classifies the frame as unvoiced speech signal, the classification chain ends, and the frame is encoded using a coding method optimized for unvoiced signals. Otherwise, the speech frame is passed through to the “stable voiced” classification module. If the frame is classified as stable voiced frame, then the frame is encoded using a coding method optimized for stable voiced signals. Otherwise, the frame is likely to contain a non-stationary speech segment such as a voiced onset or rapidly evolving voiced speech signal. In this case a general-purpose speech coder is used at a high bit rate for sustaining good subjective quality.
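
The three-step chain described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the `Coder` labels, the `classify_frame` function, and the three predicate callables are hypothetical stand-ins for the VAD, the unvoiced classifier, and the stable voiced classifier, whose internals the abstract does not specify.

```python
from enum import Enum
from typing import Callable, Sequence

Frame = Sequence[float]  # hypothetical frame type: a sequence of samples


class Coder(Enum):
    """Hypothetical labels for the four outcomes named in the abstract."""
    CNG = "comfort noise generation"
    UNVOICED = "coder optimized for unvoiced signals"
    STABLE_VOICED = "coder optimized for stable voiced signals"
    GENERIC = "general-purpose coder at a high bit rate"


def classify_frame(
    frame: Frame,
    is_active: Callable[[Frame], bool],         # stand-in for the VAD
    is_unvoiced: Callable[[Frame], bool],       # stand-in for the unvoiced classifier
    is_stable_voiced: Callable[[Frame], bool],  # stand-in for the stable voiced classifier
) -> Coder:
    # Step 1: an inactive frame (background noise) ends the chain with CNG.
    if not is_active(frame):
        return Coder.CNG
    # Step 2: an unvoiced frame ends the chain with the unvoiced coder.
    if is_unvoiced(frame):
        return Coder.UNVOICED
    # Step 3: a stable voiced frame gets the stable voiced coder.
    if is_stable_voiced(frame):
        return Coder.STABLE_VOICED
    # Otherwise the frame likely holds a non-stationary segment
    # (for example a voiced onset), so the general-purpose coder is used.
    return Coder.GENERIC
```

Each stage runs only when the previous one did not settle the frame, which mirrors the phrase "the classification chain ends" in the text.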

85 Claims

1. A source-controlled Variable bit-rate Multi-mode WideBand (VMR-WB) codec comprising a unit operable with an Adaptive Multi-Rate wideband (AMR-WB) codec, where in a VMR-WB encoding/AMR-WB decoding case, speech frames are encoded in an AMR-WB interoperable mode of a VMR-WB encoder using one of bit rates corresponding to Interoperable-Full Rate (I-FR) for active speech frames, Interoperable-Half Rate (I-HR) at least for dim-and-burst signaling, Quarter Rate-Comfort Noise Generator (CNG-QR) to encode at least relevant background noise frames and Eighth Rate-Comfort Noise Generator (CNG-ER) frames for background noise frames not encoded as CNG-QR frames, said unit responsive to a case that voice activity is not detected for using CNG-ER encoding, further responsive to a case that voice activity is detected, and responsive to a voiced versus unvoiced classification such that if a frame is classified as unvoiced, the frame is encoded with one of Unvoiced HR or Unvoiced QR encoding, further responsive to a frame not being classified as unvoiced for using a stable voiced classification, and if the frame is classified as stable voiced, encoded the frame using Voiced HR encoding, else assuming the frame to likely contain a non-stationary speech segment for using an appropriate FR encoding, whereas a frame with low energy, and not detected as at least a background or an unvoiced frame, is encoded using generic HR coding to reduce the average data rate;
- an unvoiced classification decision being based on at least some of a voicing measure {overscore (r)}_x, a spectral tilt e_t, an energy variation within a frame dE, and a relative frame energy E_rel, where decision thresholds are set based at least in part on an operating mode comprising a required average data rate.
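
Claim 1 names four decision parameters, and dependent claims 3, 4, 13 and 15 to 18 give formulas for three of them. The Python sketch below only illustrates those formulas. The function and argument names are hypothetical, the logarithm is assumed to be base 10 because the result is stated in dB, and the energy variation dE is left out because the claims shown here do not define how it is computed.

```python
import math
from typing import Sequence


def voicing_measure(r0: float, r1: float, r2: float, r_e: float = 0.0) -> float:
    """Mean of the three normalized correlations (claim 3), plus the
    noise correction factor r_e (claim 4)."""
    return (r0 + r1 + r2) / 3.0 + r_e


def spectral_tilt(e_low: float, e_high: float, n_low: float, n_high: float) -> float:
    """Ratio of noise-subtracted low- to high-frequency energy (claim 13).

    Assumes the noise-subtracted high-frequency energy is positive; the
    claims do not say how a zero or negative value is handled.
    """
    return (e_low - n_low) / (e_high - n_high)


def average_tilt(e_old: float, e_tilt0: float, e_tilt1: float) -> float:
    """Mean of the previous half-frame tilt and the two current tilts (claim 15)."""
    return (e_old + e_tilt0 + e_tilt1) / 3.0


def frame_energy_db(band_energies: Sequence[float]) -> float:
    """E_t = 10 log10(sum of E_CB(i) for i = 0..19), in dB (claim 17)."""
    return 10.0 * math.log10(sum(band_energies[:20]))


def update_long_term_energy(e_f: float, e_t: float) -> float:
    """One step of the long-term average recursion (claim 18)."""
    return 0.99 * e_f + 0.01 * e_t


def relative_energy(e_t: float, e_f: float) -> float:
    """Frame energy in dB minus the long-term average (claim 16)."""
    return e_t - e_f
```

The claims do not state whether the relative energy uses the long-term average before or after the update for the current frame, so the two steps are kept as separate functions here.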

2. A method for encoding a sampled speech signal comprising speech frames, the method comprising:
- determining whether a current frame of the sampled speech signal is an active speech frame or an inactive speech frame, if said current frame is an active speech frame, performing a classification procedure to determine whether the current frame is an unvoiced frame, said classification procedure comprising examining at least three of the following parameters in order to determine whether the current frame is an unvoiced frame;
  
  a) a voicing measure (r_x,{overscore (r)}_x);
  
  b) a spectral tilt measure (e_tilt, e_t);
  
  c) an energy variation within the current frame (dE);
  
  d) a relative energy of the current frame (E_rel);
  
  and when the current frame is classified as an unvoiced frame by said classification procedure, encoding the current frame using an unvoiced signal coding algorithm.
- - 3. A method according to claim 2, wherein the voicing measure ({overscore (r)}_x) is defined as ${\overline{r}}_{x} = \frac{1}{3} (r_{x} (0) + r_{x} (1) + r_{x} (2))$ where r_x(0), r_x(1) and r_x(2) are respectively a normalized correlation of the first half of said current frame, a normalized correlation of the second half of said current frame, and a normalized correlation of the first half of a frame following said current frame.
  - 4. A method according to claim 3, further comprising adding a noise correction factor (r_e) to said voicing measure ({overscore (r)}_x).
  - 5. A method according to claim 2, comprising defining a number of perceptual critical bands representative of frequency ranges within an energy spectrum of the current frame, ordered according to increasing frequency from a first perceptual critical band representative of a lowest frequency range to a last perceptual critical band representative of a highest frequency range, and performing a spectral analysis of the current frame to determine a distribution of energy amongst the perceptual critical bands.
  - 6. A method according to claim 2, wherein the spectral tilt is proportional to a ratio between the energy of the current frame at low frequencies and the energy of the current frame at high frequencies.
  - 7. A method according to claim 5, comprising computing a measure ({overscore (E)}_h) representative of the energy of the current frame at high frequencies by calculating an average of the energies of the last two perceptual critical bands.
  - 8. A method according to claim 5, comprising computing a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands.
  - 9. A method according to claim 5, comprising computing a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands excluding the first perceptual critical band.
  - 10. A method according to claim 8, further comprising determining a speech pitch period and for speech pitch periods shorter than a predetermined value, computing the low frequency energy measure ({overscore (E)}_l) by summing the energy within frequency bins resulting from spectral analysis of the current frame and only frequency bins sufficiently close to the speech harmonics are taken into account in the summation according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{cnt} \sum_{k = K_{\min}}^{24} E_{BIN} (k) w_{h} (k)$ where E_BIN(k) are energies within frequency bins, K_min is the index of the first frequency bin taken into account in the summation, cnt is the number of non-zero terms in the summation, and w_h(k) is set to 1 when the distance between the frequency bin and the nearest harmonic is not larger than a predetermined frequency threshold and w_h(k) is set to zero otherwise.
  - 11. A method according to claim 8, further comprising determining a speech pitch period and for pitch values larger than a predetermined value, computing the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 12. A method according to claim 8, further comprising identifying an a priori unvoiced sound when r_x(0)+r_x(1)+r_e<0.6 and computing the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 13. A method according to claim 7, further comprising:
    - computing a measure (N_h) representative of a noise energy of the current frame at high frequencies by calculating an average of the noise energies of the last two perceptual critical bands;
      
      computing a measure (N_l) representative of a noise energy of the current frame at low frequencies by calculating an average of the noise energies in the first i perceptual critical bands;
      
      subtracting the high frequency noise measure (N_h) from the high frequency energy measure ({overscore (E)}_h) to obtain a high frequency energy (E_h);
      
      subtracting the low frequency noise measure (N_l) from the low frequency energy measure ({overscore (E)}_l) to obtain a low frequency energy (E_l); and
      
      computing the spectral tilt measure (e_tilt) as a ratio of the low frequency energy (E_l) divided by the high frequency energy (E_h).
  - 14. A method according to claim 13, comprising performing the spectral analysis of claim 4 twice for the current frame, once for a first half of the current frame and once for a second half of the current frame and further computing the spectral tilt measure (e_tilt) twice for the current frame, once for each spectral analysis, to obtain a first spectral tilt measure (e_tilt(0)) for the first half of the current frame and a second spectral tilt measure (e_tilt(1)) for the second half of the current frame.
  - 15. A method according to claim 14, further comprising computing an average spectral tilt (e_t) according to the formula:
    - $e_{t} = \frac{1}{3} (e_{old} + e_{tilt} (0) + e_{tilt} (1))$ where e_old is a spectral tilt measure obtained from spectral analysis of the second half of the previous frame.
  - 16. A method according to claim 2, comprising calculating the relative energy (E_rel) of the current frame as a difference between a frame energy (E_t) in dB and a long-term average energy value ({overscore (E)}_f).
  - 17. A method according to claim 16, comprising computing the frame energy (E_t) as according to the formula:
    - $E_{t} = 10 \log (\sum_{i = 0}^{19} E_{CB} (i)), dB$ where E_CB(i) are the average energies per critical band.
  - 18. A method according to claim 16, comprising computing the long-term average energy value according to the formula:
    - {overscore (E)}_f=0.99{overscore (E)}_f+0.01E_t where {overscore (E)}_f has an initial value of 45 dB.
  - 19. A method according to claim 2, comprising selecting an encoding bit-rate from a set of available encoding bit-rates, and encoding the current frame in accordance with the selected encoding bit-rate.
  - 20. A method according to claim 19, wherein the set of available encoding bit-rates includes a full-rate encoding bit-rate, a half-rate encoding bit-rate, a quarter-rate encoding bit-rate and an eighth-rate encoding bit-rate.
  - 21. A method according to claim 20, wherein when the current frame is classified as an unvoiced frame, encoding the current frame at said half-rate encoding bit-rate using an unvoiced half-rate encoding algorithm.
  - 22. A method according to claim 20, wherein said classification procedure to determine whether the current frame is an unvoiced frame further includes determining whether the current frame is located at a transition between voiced and unvoiced speech and, when the current frame is classified as an unvoiced frame and is located at a transition between voiced and unvoiced speech, encoding the current frame at said half-rate encoding bit-rate using an unvoiced half-rate encoding algorithm and, when the current frame is classified as unvoiced speech and is not located at a transition between voiced and unvoiced speech, encoding the current frame at said quarter-rate encoding bit-rate using an unvoiced quarter-rate encoding algorithm.
  - 23. A method according to claim 2, comprising using a comfort noise generation algorithm when it is determined that the current frame is an inactive speech frame.
  - 24. A method according to claim 2, comprising using a discontinuous transmission mode when it is determined that the current frame is an inactive speech frame.
  - 25. A method according to claim 20, comprising defining a set of operating modes, each operating mode providing a predetermined average bit-rate, selecting an operating mode and encoding the sampled speech signal according to the selected operating mode.
  - 26. A method according to claim 25, wherein the set of operating modes comprises a Premium mode having a highest average bit-rate, a Standard mode having an intermediate average bit-rate and an Economy mode having a lowest average bit-rate.
  - 27. A method according to claim 26, wherein when the sampled speech signal is encoded in Premium mode and the current frame is classified as an unvoiced frame, the current frame is encoded at said half-rate encoding bit-rate when the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined first threshold value; and
      
      said spectral tilt measure is smaller than a predetermined second threshold value; and
      
      said energy variation is smaller than a predetermined third threshold value.
  - 28. A method according to claim 26, wherein when the sampled speech signal is encoded in Standard mode and the current frame is classified as an unvoiced frame, the current frame is encoded at said half-rate encoding bit-rate when the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined fourth threshold value; and
      
      said spectral tilt measure is smaller than a predetermined fifth threshold value; and
      
      said energy variation is smaller than a predetermined sixth threshold value or said relative energy is smaller than a predetermined seventh threshold value.
  - 29. A method according to claim 28, wherein said fourth threshold value is 0.695, said fifth threshold value is 4, said sixth threshold value is 40, and said seventh threshold value is −14.
  - 30. A method according to claim 26, wherein when the sampled speech signal is encoded in Economy mode and the current frame is classified as an unvoiced frame, the current frame is encoded at said half-rate encoding bit-rate when the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined eighth threshold value; and
      
      said spectral tilt measure is smaller than a predetermined ninth threshold value; and
      
      said energy variation is smaller than a predetermined tenth threshold value or said relative energy is smaller than a predetermined eleventh threshold value.
  - 31. A method according to claim 30, wherein said eighth threshold value is 0.695, said ninth threshold value is 4, said tenth threshold value is 60, and said eleventh threshold value is −14.
  - 32. A method according to claim 26, wherein when the sampled speech signal is encoded in Economy mode and the current frame is classified as an unvoiced frame the current frame is encoded at said quarter-rate encoding bit-rate when the following further conditions are fulfilled:
    - the normalized correlation in a lookahead frame (r_x(2)) is smaller than a predetermined twelfth threshold value; and
      
      the second spectral tilt measure (e_tilt(1)) for the second half of the current frame is smaller than a predetermined thirteenth threshold value.
  - 33. A method according to claim 32, wherein said twelfth threshold value is 0.73 and said thirteenth threshold value is 3.
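
The threshold tests of claims 28 to 33 can be sketched as below. This is a hedged Python illustration: `UnvoicedThresholds`, `THRESHOLDS`, and `unvoiced_rate` are hypothetical names, and Premium mode is omitted because the claims give no numeric thresholds for it. Treating the quarter-rate conditions of claim 32 as applying on top of the Economy half-rate conditions of claim 30 is one reading of "further conditions", not something the claims state outright.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnvoicedThresholds:
    voicing: float           # upper bound on the voicing measure
    tilt: float              # upper bound on the spectral tilt e_t
    energy_variation: float  # upper bound on dE
    relative_energy: float   # upper bound on E_rel


THRESHOLDS = {
    "standard": UnvoicedThresholds(0.695, 4.0, 40.0, -14.0),  # claim 29
    "economy": UnvoicedThresholds(0.695, 4.0, 60.0, -14.0),   # claim 31
}

QR_LOOKAHEAD_CORR = 0.73   # twelfth threshold, applied to r_x(2) (claim 33)
QR_SECOND_HALF_TILT = 3.0  # thirteenth threshold, applied to e_tilt(1) (claim 33)


def unvoiced_rate(
    mode: str,
    voicing: float,
    tilt: float,
    d_e: float,
    e_rel: float,
    r_x2: float,
    e_tilt1: float,
) -> Optional[str]:
    """Return "HR", "QR", or None when the half-rate conditions fail."""
    t = THRESHOLDS[mode]
    # Claims 28 and 30: voicing AND tilt AND (energy variation OR relative energy).
    half_rate_ok = (
        voicing < t.voicing
        and tilt < t.tilt
        and (d_e < t.energy_variation or e_rel < t.relative_energy)
    )
    if not half_rate_ok:
        return None
    # Claim 32: in Economy mode, two further tests select the quarter rate.
    if mode == "economy" and r_x2 < QR_LOOKAHEAD_CORR and e_tilt1 < QR_SECOND_HALF_TILT:
        return "QR"
    return "HR"
```

For example, `unvoiced_rate("standard", 0.5, 2.0, 50.0, -20.0, 0.5, 1.0)` passes the half-rate test through the relative energy branch even though the energy variation exceeds 40.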

34. A device for encoding a sampled speech signal comprising speech frames, the device comprising:
- a voice activity detector for determining whether frames of the sampled speech signal are active speech frames or inactive speech frames;
  
  a classification unit arranged to perform a classification procedure on active speech frames to determine whether said active speech frames are unvoiced frames, said classification procedure comprising examining at least three of the following parameters in order to determine whether a current frame is an unvoiced frame;
  
  a) a voicing measure (r_x,{overscore (r)}_x);
  
  b) a spectral tilt measure (e_tilt,e_t);
  
  c) an energy variation within the current frame (dE);
  
  d) a relative energy of the current frame (E_rel);
  
  said device being arranged to encode the current frame using an unvoiced signal coding algorithm when the classification unit classifies the current frame as an unvoiced frame.
- - 35. A device according to claim 34, wherein the voicing measure ({overscore (r)}_x) is defined as ${\overline{r}}_{x} = \frac{1}{3} (r_{x} (0) + r_{x} (1) + r_{x} (2))$ where r_x(0), r_x(1) and r_x(2) are respectively a normalized correlation of the first half of said current frame, a normalized correlation of the second half of said current frame, and a normalized correlation of the first half of a frame following said current frame.
  - 36. A device according to claim 35, further arranged to add a noise correction factor (r_e) to said voicing measure ({overscore (r)}_x).
  - 37. A device according to claim 34, arranged to define a number of perceptual critical bands representative of frequency ranges within an energy spectrum of the current frame, ordered according to increasing frequency from a first perceptual critical band representative of a lowest frequency range to a last perceptual critical band representative of a highest frequency range, and to perform a spectral analysis of the current frame to determine a distribution of energy amongst the perceptual critical bands.
  - 38. A device according to claim 34, wherein the spectral tilt is proportional to a ratio between the energy of the current frame at low frequencies and the energy of the current frame at high frequencies.
  - 39. A device according to claim 37, arranged to compute a measure ({overscore (E)}_h) representative of the energy of the current frame at high frequencies by calculating an average of the energies of the last two perceptual critical bands.
  - 40. A device according to claim 37, arranged to compute a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands.
  - 41. A device according to claim 37, arranged to compute a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands excluding the first perceptual critical band.
  - 42. A device according to claim 40, further arranged to determine a speech pitch period and, for speech pitch periods shorter than a predetermined value, to compute the low frequency energy measure ({overscore (E)}_l) by summing the energy within frequency bins resulting from spectral analysis of the current frame and to take only frequency bins sufficiently close to the speech harmonics into account in the summation according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{cnt} \sum_{k = K_{\min}}^{24} E_{BIN} (k) w_{h} (k)$ where E_BIN(k) are energies within frequency bins, K_min is the index of the first frequency bin taken into account in the summation, cnt is the number of non-zero terms in the summation, and w_h(k) is set to 1 when the distance between the frequency bin and the nearest harmonic is not larger than a predetermined frequency threshold and w_h(k) is set to zero otherwise.
  - 43. A device according to claim 40, further arranged to determine a speech pitch period and for pitch values larger than a predetermined value, and to compute the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 44. A device according to claim 40, further arranged to identify an a priori unvoiced sound when r_x(0)+r_x(1)+r_e<0.6 and to compute the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 45. A device according to claim 39, further arranged:
    - to compute a measure (N_h) representative of a noise energy of the current frame at high frequencies by calculating an average of the noise energies of the last two perceptual critical bands;
      
      to compute a measure (N_l) representative of a noise energy of the current frame at low frequencies by calculating an average of the noise energies in the first i perceptual critical bands;
      
      to subtract the high frequency noise measure (N_h) from the high frequency energy measure ({overscore (E)}_h) to obtain a high frequency energy (E_h);
      
      to subtract the low frequency noise measure (N_l) from the low frequency energy measure ({overscore (E)}_l) to obtain a low frequency energy (E_l); and
      
      to compute the spectral tilt measure (e_tilt) as a ratio of the low frequency energy (E_l) divided by the high frequency energy (E_h).
  - 46. A device according to claim 45, arranged to perform the spectral analysis of claim 36 twice for the current frame, once for a first half of the current frame and once for a second half of the current frame and further to compute the spectral tilt measure (e_tilt) twice for the current frame, once for each spectral analysis, to obtain a first spectral tilt measure (e_tilt(0)) for the first half of the current frame and a second spectral tilt measure (e_tilt(1)) for the second half of the current frame.
  - 47. A device according to claim 46, further arranged to compute an average spectral tilt (e_t) according to the formula:
    - $e_{t} = \frac{1}{3} (e_{old} + e_{tilt} (0) + e_{tilt} (1))$ where e_old is a spectral tilt measure obtained from spectral analysis of the second half of the previous frame.
  - 48. A device according to claim 34, arranged to calculate the relative energy (E_rel) of the current frame as a difference between a frame energy (E_t) in dB and a long-term average energy value ({overscore (E)}_f).
  - 49. A device according to claim 48, arranged to compute the frame energy (E_t) as according to the formula:
    - $E_{t} = 10 \log (\sum_{i = 0}^{19} E_{CB} (i)), dB$ where E_CB(i) are the average energies per critical band.
  - 50. A device according to claim 48, arranged to compute the long-term average energy value according to the formula:
    - {overscore (E)}_f=0.99{overscore (E)}_f+0.01E_t where {overscore (E)}_f has an initial value of 45 dB.
  - 51. A device according to claim 34, arranged to select an encoding bit-rate from a set of available encoding bit-rates, and to encode the current frame in accordance with the selected encoding bit-rate.
  - 52. A device according to claim 51, wherein the set of available encoding bit-rates includes a full-rate encoding bit-rate, a half-rate encoding bit-rate, a quarter-rate encoding bit-rate and an eighth-rate encoding bit-rate.
  - 53. A device according to claim 52, arranged to encode the current frame at said half-rate encoding bit-rate using an unvoiced half-rate encoding algorithm when the current frame is classified as an unvoiced frame.
  - 54. A device according to claim 52, further arranged to determine whether the current unvoiced frame is located at a transition between voiced and unvoiced speech and, when the current frame is classified as an unvoiced frame and is located at a transition between voiced and unvoiced speech, to encode the current frame at said half-rate encoding bit-rate using an unvoiced half-rate encoding algorithm and, when the current frame is classified as unvoiced speech and is not located at a transition between voiced and unvoiced speech, to encode the current frame at said quarter-rate encoding bit-rate using an unvoiced quarter-rate encoding algorithm.
  - 55. A device according to claim 34, arranged to use a comfort noise generation algorithm when it is determined that the current frame is an inactive speech frame.
  - 56. A device according to claim 34, arranged to use a discontinuous transmission mode when it is determined that the current frame is an inactive speech frame.
  - 57. A device according to claim 52, arranged to define a set of operating modes, each operating mode providing a predetermined average bit-rate, to select an operating mode and to encode the sampled speech signal according to the selected operating mode.
  - 58. A device according to claim 57, wherein the set of operating modes comprises a Premium mode having a highest average bit-rate, a Standard mode having an intermediate average bit-rate and an Economy mode having a lowest average bit-rate.
  - 59. A device according to claim 58, arranged to encode the current frame at said half-rate encoding bit-rate when the sampled speech signal is encoded in Premium mode and the current frame is classified as an unvoiced frame, and the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined first threshold value; and
      
      said spectral tilt measure is smaller than a predetermined second threshold value; and
      
      said energy variation is smaller than a predetermined third threshold value.
  - 60. A device according to claim 58, arranged to encode the current frame at said half-rate encoding bit-rate when the sampled speech signal is encoded in Standard mode and the current frame is classified as an unvoiced frame, and the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined fourth threshold value; and
      
      said spectral tilt measure is smaller than a predetermined fifth threshold value; and
      
      said energy variation is smaller than a predetermined sixth threshold value or said relative energy is smaller than a predetermined seventh threshold value.
  - 61. A device according to claim 60, wherein said fourth threshold value is 0.695, said fifth threshold value is 4, said sixth threshold value is 40, and said seventh threshold value is −14.
  - 62. A device according to claim 58, arranged to encode the current frame at said half-rate encoding bit-rate when the sampled speech signal is encoded in Economy mode and the current frame is classified as an unvoiced frame, and the following conditions are fulfilled:
    - said voicing measure is smaller than a predetermined eighth threshold value; and
      
      said spectral tilt measure is smaller than a predetermined ninth threshold value; and
      
      said energy variation is smaller than a predetermined tenth threshold value or said relative energy is smaller than a predetermined eleventh threshold value.
  - 63. A device according to claim 62, wherein said eighth threshold value is 0.695, said ninth threshold value is 4, said tenth threshold value is 60, and said eleventh threshold value is −14.
  - 64. A device according to claim 58, arranged to encode the current frame at said quarter-rate encoding bit-rate when the sampled speech signal is encoded in Economy mode and the current frame is classified as an unvoiced frame, and the following further conditions are fulfilled:
    - the normalized correlation in a lookahead frame (r_x(2)) is smaller than a predetermined twelfth threshold value; and
      
      the second spectral tilt measure (e_tilt(1)) for the second half of the current frame is smaller than a predetermined thirteenth threshold value.
  - 65. A device according to claim 64, wherein said twelfth threshold value is 0.73 and said thirteenth threshold value is 3.
  - 80. A program according to claim 79, wherein the actions further comprise performing the spectral analysis of claim 36 twice for the current frame, once for a first half of the current frame and once for a second half of the current frame and further computing the spectral tilt measure (e_tilt) twice for the current frame, once for each spectral analysis, to obtain a first spectral tilt measure (e_tilt(0)) for the first half of the current frame and a second spectral tilt measure (e_tilt(1)) for the second half of the current frame.
  - 81. A program according to claim 80, wherein the actions further comprise computing an average spectral tilt (e_t) according to the formula:
    - $e_{t} = \frac{1}{3} (e_{old} + e_{tilt} (0) + e_{tilt} (1))$ where e_old is a spectral tilt measure obtained from spectral analysis of the second half of the previous frame.
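
The two low-frequency energy formulas recited in claims 42 and 43 above can be illustrated with a short Python sketch. Several details are assumptions that the claims do not supply: that bin k sits at frequency k times a bin spacing `bin_hz`, that harmonics lie at integer multiples of a fundamental `f0_hz` derived from the pitch period, and that 0.0 is returned when no bin qualifies. All function and parameter names are hypothetical.

```python
from typing import Sequence


def low_freq_energy_bands(band_energies: Sequence[float], i: int) -> float:
    """Average energy of the first i critical bands:
    (1/i) * sum of E_CB(k) for k = 0 .. i-1 (claim 43)."""
    return sum(band_energies[:i]) / i


def low_freq_energy_harmonic(
    bin_energies: Sequence[float],  # E_BIN(k); needs at least 25 entries
    k_min: int,                     # index of the first bin in the summation
    bin_hz: float,                  # assumed spacing between frequency bins
    f0_hz: float,                   # assumed fundamental frequency
    max_dist_hz: float,             # the "predetermined frequency threshold"
) -> float:
    """(1/cnt) * sum of E_BIN(k) * w_h(k) for k = K_min .. 24 (claim 42)."""
    total = 0.0
    cnt = 0
    for k in range(k_min, 25):  # upper limit 24 is inclusive
        freq = k * bin_hz
        # Nearest harmonic, never below the first one (no harmonic at 0 Hz).
        nearest = max(1, round(freq / f0_hz)) * f0_hz
        w_h = 1.0 if abs(freq - nearest) <= max_dist_hz else 0.0
        term = bin_energies[k] * w_h
        if term != 0.0:
            total += term
            cnt += 1  # cnt counts the non-zero terms, as in the claim
    # Assumption: fall back to 0.0 when no term is non-zero.
    return total / cnt if cnt else 0.0
```

Averaging only over bins near a harmonic keeps the energy between widely spaced harmonics from diluting the low-frequency measure, which is why the claims reserve this form for short pitch periods.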

66. A device for encoding a sampled speech signal comprising speech frames, the device comprising:
- means for determining whether a current frame of the sampled speech signal is an active speech frame or an inactive speech frame, means, responsive to said current frame being an active speech frame, for performing a classification procedure to determine whether the current frame is an unvoiced frame, said classification procedure comprising examining at least three of the following parameters in order to determine whether the current frame is an unvoiced frame;
  
  a) a voicing measure (r_x,{overscore (r)}_x);
  
  b) a spectral tilt measure (e_tilt, e_t);
  
  c) an energy variation within the current frame (dE);
  
  d) a relative energy of the current frame (E_rel);
  
  and means for encoding the current frame using an unvoiced signal coding algorithm when the current frame is classified as an unvoiced frame by said classification procedure.

67. A speech encoder, responsive to a current frame being classified as an active speech frame, for encoding said current frame using an unvoiced signal coding algorithm, wherein an active speech frame is further classified as an active unvoiced speech frame by examining at least three parameters selected from the set:
- a voicing measure (r_x,{overscore (r)}_x), a spectral tilt measure (e_tilt,e_t), an energy variation within the current frame (dE), and a relative energy of the current frame (E_rel).
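
A rough Python sketch of a decision that examines at least three of the four parameters is shown below. The claims name the parameters but not the comparison directions or the threshold values, so the "less than" tests, the `UnvoicedThresholds` container, and the function name are all assumptions made for illustration. Thresholds must be supplied by the caller, for example per operating mode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnvoicedThresholds:
    """Hypothetical threshold set; real values depend on the operating mode."""
    voicing: float
    tilt: float
    energy_variation: float
    # None means the relative energy is not examined (three-parameter case).
    relative_energy: Optional[float] = None


def is_unvoiced(r_x: float, e_t: float, d_e: float, e_rel: float,
                th: UnvoicedThresholds) -> bool:
    """Classify an active speech frame as unvoiced.

    Assumption: a frame is unvoiced when every examined parameter is
    below its threshold. The claims do not state this rule explicitly.
    """
    decision = (r_x < th.voicing) and (e_t < th.tilt) and (d_e < th.energy_variation)
    if th.relative_energy is not None:
        decision = decision and (e_rel < th.relative_energy)
    return decision
```

Leaving `relative_energy` unset gives the three-parameter case; setting it adds the fourth test.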

68. A program of machine-readable instructions, tangibly embodied on an information bearing medium and executable by a digital data processor, to perform actions directed toward encoding a sampled speech signal comprising speech frames, the actions comprising:
- determining whether a current frame of the sampled speech signal is an active speech frame or an inactive speech frame, performing a classification procedure on an active speech frame to determine whether the current frame is an unvoiced frame, said classification procedure comprising examining at least three of the following parameters in order to determine whether the current frame is an unvoiced frame;
  
  a) a voicing measure (r_x,{overscore (r)}_x);
  
  b) a spectral tilt measure (e_tilt,e_t);
  
  c) an energy variation within the current frame (dE);
  
  d) a relative energy of the current frame (E_rel);
  
  and encoding the current frame using an unvoiced signal coding algorithm when the current frame is classified as an unvoiced frame by said classification procedure.
- View Dependent Claims (69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 82, 83, 84, 85)
  - 69. A program according to claim 68, wherein the voicing measure ({overscore (r)}_x) is defined as
    - ${\overline{r}}_{x} = \frac{1}{3} (r_{x} (0) + r_{x} (1) + r_{x} (2))$ where r_x(0), r_x(1) and r_x(2) are respectively a normalized correlation of the first half of said current frame, a normalized correlation of the second half of said current frame, and a normalized correlation of the first half of a frame following said current frame.
  - 70. A program according to claim 69, wherein the actions further comprise adding a noise correction factor (r_e) to said voicing measure ({overscore (r)}_x).
  - 71. A program according to claim 68, wherein the actions further comprise defining a number of perceptual critical bands representative of frequency ranges within an energy spectrum of the current frame, ordered according to increasing frequency from a first perceptual critical band representative of a lowest frequency range to a last perceptual critical band representative of a highest frequency range, and performing a spectral analysis of the current frame to determine a distribution of energy amongst the perceptual critical bands.
  - 72. A program according to claim 68, wherein the spectral tilt is proportional to a ratio between the energy of the current frame at low frequencies and the energy of the current frame at high frequencies.
  - 73. A program according to claim 71, wherein the actions further comprise computing a measure ({overscore (E)}_h) representative of the energy of the current frame at high frequencies by calculating an average of the energies of the last two perceptual critical bands.
  - 74. A program according to claim 71, wherein the actions further comprise computing a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands.
  - 75. A program according to claim 71, wherein the actions further comprise computing a measure ({overscore (E)}_l) representative of the energy of the current frame at low frequencies by calculating an average of the energies in the first i perceptual critical bands excluding the first perceptual critical band.
  - 76. A program according to claim 74, wherein the actions further comprise determining a speech pitch period and for speech pitch periods shorter than a predetermined value, and computing the low frequency energy measure ({overscore (E)}_l) by summing the energy within frequency bins resulting from spectral analysis of the current frame and only frequency bins sufficiently close to the speech harmonics are taken into account in the summation according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{cnt} \sum_{k = K_{\min}}^{24} E_{BIN} (k) w_{h} (k)$ where E_BIN(k) are energies within frequency bins, K_min is the index of the first frequency bin taken into account in the summation, cnt is the number of non-zero terms in the summation, and w_h(k) is set to 1 when the distance between the frequency bin and the nearest harmonic is not larger than a predetermined frequency threshold and w_h(k) is set to zero otherwise.
  - 77. A program according to claim 74, wherein the actions further comprise determining a speech pitch period and for pitch values larger than a predetermined value, and computing the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 78. A program according to claim 74, wherein the actions further comprise identifying an a priori unvoiced sound when r_x(0)+r_x(1)+r_e<0.6 and computing the low frequency energy measure ({overscore (E)}_l) according to the formula:
    - ${\overline{E}}_{l} = \frac{1}{i} \sum_{k = 0}^{i - 1} E_{CB} (k)$ where E_CB(k) is the energy of perceptual critical band k.
  - 79. A program according to claim 73, wherein the actions further comprise:
    - computing a measure (N_h) representative of a noise energy of the current frame at high frequencies by calculating an average of the noise energies of the last two perceptual critical bands;
      
      computing a measure (N_l) representative of a noise energy of the current frame at low frequencies by calculating an average of the noise energies in the first i perceptual critical bands;
      
      subtracting the high frequency noise measure (N_h) from the high frequency energy measure ({overscore (E)}_h) to obtain a high frequency energy (E_h);
      
      subtracting the low frequency noise measure (N_l) from the low frequency energy measure ({overscore (E)}_l) to obtain a low frequency energy (E_l); and
      
      computing the spectral tilt measure (e_tilt) as a ratio of the low frequency energy (E_l) divided by the high frequency energy (E_h).
  - 82. A program according to claim 68, wherein the actions further comprise computing the relative energy (E_rel) of the current frame as a difference between a frame energy (E_t) in dB and a long-term average energy value ({overscore (E)}_f).
  - 83. A program according to claim 82, wherein the actions further comprise computing the frame energy (E_t) according to the formula:
    - $E_{t} = 10 \log (\sum_{i = 0}^{19} E_{CB} (i)), dB$ where E_CB(i) are the average energies per critical band.
  - 84. A program according to claim 82, wherein the actions further comprise computing the long-term average energy value according to the formula:
    - {overscore (E)}_f=0.99{overscore (E)}_f+0.01E_t where {overscore (E)}_f has an initial value of 45 dB.
  - 85. The program of claim 68, wherein the information bearing medium and the digital data processor are disposed within a mobile station.
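
The parameter computations recited in claims 69, 70, 77, 79 and 82 to 84 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the codec's actual code: the function names are hypothetical, the number of low bands `i` is left as a caller-supplied argument because the claims do not give its value, `log` is taken to be base 10 since the result is in dB, and no guard is added for a non-positive high frequency energy because the claims do not describe one.

```python
import math
from typing import Sequence


def voicing_measure(r0: float, r1: float, r2: float, r_e: float = 0.0) -> float:
    """Average of three normalized correlations (claim 69), plus the
    optional noise correction factor r_e (claim 70)."""
    return (r0 + r1 + r2) / 3.0 + r_e


def spectral_tilt(e_cb: Sequence[float], n_cb: Sequence[float], i: int) -> float:
    """e_tilt = E_l / E_h with noise subtraction (claims 73, 77, 79).

    e_cb: critical band energies, n_cb: critical band noise energies,
    both ordered by increasing frequency. i: number of low bands averaged.
    """
    e_h_bar = sum(e_cb[-2:]) / 2.0   # average of the last two bands
    e_l_bar = sum(e_cb[:i]) / i      # average of the first i bands
    n_h = sum(n_cb[-2:]) / 2.0
    n_l = sum(n_cb[:i]) / i
    e_h = e_h_bar - n_h
    e_l = e_l_bar - n_l
    # Assumption: e_h > 0; the claims specify no handling for other cases.
    return e_l / e_h


def frame_energy_db(e_cb: Sequence[float]) -> float:
    """E_t = 10 log10(sum of E_CB(0..19)), in dB (claim 83)."""
    return 10.0 * math.log10(sum(e_cb[:20]))


def update_long_term_energy(e_f: float, e_t: float) -> float:
    """One step of E_f = 0.99 E_f + 0.01 E_t (claim 84); E_f starts at 45 dB."""
    return 0.99 * e_f + 0.01 * e_t


def relative_energy(e_t: float, e_f: float) -> float:
    """E_rel = E_t - E_f (claim 82)."""
    return e_t - e_f
```

The claims do not say whether E_rel is computed before or after the long-term average is updated for the current frame, so the two functions are kept separate and the ordering is left to the caller.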


Current Assignee
Nokia Technologies Oy (Nokia Corporation)
Original Assignee
Nokia Corporation
Inventors
Jelinek, Milan

Granted Patent

US 7,657,427 B2
US Class Current

704/214
CPC Class Codes

G10L 19/012   Comfort noise or silence co...

G10L 19/20   using sound class specific ...

G10L 19/24   Variable rate codecs, e.g. ...

G10L 25/93   Discriminating between voic...
