Harmonic adaptive speech coding method and system

US 5,787,387 A
Filed: 07/11/1994
Issued: 07/28/1998
Est. Priority Date: 07/11/1994
Status: Expired due to Term

Abstract

A method and system is provided for encoding and decoding of speech signals at a low bit rate. The continuous input speech is divided into voiced and unvoiced time segments of a predetermined length. The encoder of the system uses a linear predictive coding model for the unvoiced speech segments and harmonic frequencies decomposition for the voiced speech segments. Only the magnitudes of the harmonic frequencies are determined using the discrete Fourier transform of the voiced speech segments. The decoder synthesizes voiced speech segments using the magnitudes of the transmitted harmonics and estimates the phase of each harmonic from the signal in the preceding speech segments. Unvoiced speech segments are synthesized using linear prediction coding (LPC) coefficients obtained from codebook entries for the poles of the LPC coefficient polynomial. Boundary conditions between voiced and unvoiced segments are established to insure amplitude and phase continuity for improved output speech quality.

52 Claims

1. A method for processing an audio signal comprising the steps of:
- dividing the signal into segments, each segment representing one of a succession of time intervals;
  
  detecting for each segment the presence of a fundamental frequency;
  
  if such a fundamental frequency is detected, estimating the amplitudes of a set of sinusoids harmonically related to the detected fundamental frequency, the set of sinusoids being representative of the signal in the time segment; and
  
  encoding for subsequent storage and transmission the set of the estimated harmonic amplitudes, each amplitude being normalized by the sum of all amplitudes.
- - 2. The method of claim 1 wherein the audio signal is a speech signal and following the step of detecting the method further comprises the step of determining whether a segment represents voiced or unvoiced speech on the basis of the detected fundamental frequency.
  - 3. The method of claim 2 further comprising the steps of:
    - computing a set of linear predictive coding (LPC) coefficients for each segment determined to be unvoiced; and
      
      encoding the LPC coefficients by computing the roots of a LPC coefficients polynomial.
  - 4. The method of claim 3 further comprising the step of encoding the linear prediction error power associated with the computed LPC coefficients.
  - 5. The method of claim 4 wherein the step of encoding the LPC coefficients comprises the step of computing the roots of a LPC coefficients polynomial and encoding the computed polynomial roots.
  - 6. The method of claim 5 wherein the step of encoding the computed polynomial roots comprises the steps of:
    - forming a vector of the computed polynomial roots; and
      
      vector quantizing the formed vector using a neural network to determine a vector codebook entry.
  - 7. The method of claim 5 further comprising the step of forming a data packet corresponding to each unvoiced segment for subsequent transmission or storage, the packet comprising a flag indicating that the speech segment is unvoiced, the vector codebook entry for the roots of the LPC coefficients polynomial and the linear prediction error power associated with the computed LPC coefficients.
  - 8. The method of claim 3 wherein each segment determined to be unvoiced is windowed with a normalized Hamming window prior to the step of computing the LPC coefficients.
  - 9. The method of claim 2 wherein the step of estimating harmonic amplitudes comprises the steps of:
    - performing a discrete Fourier transform (DFT) of the speech signal; and
      
      computing a root sum square of the samples of the power DFT of said speech signal in the neighborhood of each harmonic frequency to obtain an estimate of the corresponding harmonic amplitude.
  - 10. The method of claim 9 wherein prior to the step of performing a DFT the speech signal is windowed by a window function providing reduced spectral leakage.
  - 11. The method of claim 10 wherein the used window is a normalized Kaiser window.
  - 12. The method of claim 10 wherein the computation of the DFT is accomplished using a fast Fourier transform (FFT) of the windowed segment.
  - 13. The method of claim 10 wherein the estimates of the harmonic amplitudes A_H (h,F₀) are computed according to the equation:
    - ##EQU12## where A_H (h,F₀) is the estimated amplitude of the h-th harmonic frequency;
      
      F₀ is the fundamental frequency;
      
      B is the half bandwidth of the main lobe of the Fourier transform of the window function; and
      
      Y_2N (n) is the windowed input signal padded with N zeros.
  - 14. The method of claim 13 wherein following the computation of the harmonic amplitudes A_H (h,F₀) each amplitude is normalized by the sum of all amplitudes and is encoded to obtain a harmonic amplitude vector having H elements representative of the signal segment.
  - 15. The method of claim 14 further comprising the step of forming a data packet corresponding to each voiced segment for subsequent transmission or storage, the packet comprising a flag indicating that the speech segment is voiced, the fundamental frequency, the normalized harmonic amplitude vector and the sum of all harmonic amplitudes.
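
The amplitude estimation of claims 9 to 14 can be illustrated with a minimal sketch. The equation of claim 13 is not reproduced in this text, so the code below is only one plausible reading of the claim language: window the segment, pad it with N zeros, take the root sum square of the DFT samples within ±B of each harmonic frequency (h+1)·F₀, then normalize by the sum of all amplitudes. The function names, the use of a unit-sum Hann window in place of the normalized Kaiser window of claim 11, and the direct per-bin DFT in place of an FFT are all assumptions made to keep the example dependency-free.

```python
import cmath
import math

def hann_window(n):
    """Hann window scaled to unit sum (assumed stand-in for the normalized Kaiser window)."""
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    total = sum(w)
    return [v / total for v in w]

def dft_bin(y, k):
    """Direct evaluation of a single DFT bin of the sequence y."""
    size = len(y)
    return sum(v * cmath.exp(-2j * math.pi * k * i / size) for i, v in enumerate(y))

def harmonic_amplitudes(segment, f0, fs, num_harmonics, half_bw_hz):
    """Root sum square of DFT samples within +/- half_bw_hz of each harmonic."""
    n = len(segment)
    window = hann_window(n)
    # Windowed segment padded with N zeros, giving a 2N-point sequence.
    padded = [s * c for s, c in zip(segment, window)] + [0.0] * n
    size = 2 * n
    amps = []
    for h in range(num_harmonics):
        centre = (h + 1) * f0  # h = 0 is the fundamental
        lo = max(0, math.ceil((centre - half_bw_hz) * size / fs))
        hi = min(size // 2, math.floor((centre + half_bw_hz) * size / fs))
        power = sum(abs(dft_bin(padded, k)) ** 2 for k in range(lo, hi + 1))
        amps.append(math.sqrt(power))
    return amps

def normalize(amps):
    """Return (amplitudes divided by their sum, the sum itself)."""
    total = sum(amps)
    if total == 0.0:
        return [0.0] * len(amps), 0.0
    return [a / total for a in amps], total
```

With this reading, the absolute scale of the estimates depends on the window, but the normalized vector depends only on the relative strength of the harmonics, which is what claim 14 encodes alongside the separately transmitted sum.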

16. A method for synthesizing audio signals from data packets, at least one of the data packets representing a time segment of a signal characterized by the presence of a fundamental frequency, said at least one data packet comprising a sequence of encoded amplitudes of harmonic frequencies related to the fundamental frequency, the method comprising the steps of:
- for each data packet detecting the presence of a fundamental frequency; and
  
  synthesizing an audio signal in response only to the detected fundamental frequency and the sequence of amplitudes of harmonic frequencies in said at least one data packet.
- - 17. The method of claim 16 wherein the audio signals being synthesized are speech signals and wherein following the step of detecting the method further comprises the steps of:
    - determining whether a data packet represents a voiced or unvoiced speech segment on the basis of the detected fundamental frequency;
      
      synthesizing unvoiced speech in response to encoded information in a data packet determined to represent unvoiced speech; and
      
      providing amplitude and phase continuity on the boundary between adjacent synthesized speech segments.
  - 18. The method of claim 17 wherein the step of synthesizing unvoiced speech comprises the step of passing a white noise signal through an autoregressive digital filter the coefficients of which are the LPC coefficients corresponding to the unvoiced speech segment and the gain of the filter is adjusted on the basis of the prediction error power associated with the LPC coefficients.
  - 19. The method of claim 17 wherein the step of synthesizing a voiced speech comprises the steps of:
    - determining the initial phase offsets for each harmonic frequency; and
      
      synthesizing voiced speech using the encoded sequence of amplitudes of harmonic frequencies and the determined phase offsets.
  - 20. The method of claim 19 wherein the voiced speech is synthesized using the equation:
    - ##EQU13## where A^- (h) is the amplitude of the signal at the end of the previous segment;
      
      φ(m) = 2π m F₀ / f_s, where F₀ is the fundamental frequency and f_s is the sampling frequency; and
      
      ξ(h) is the initial phase of the h-th harmonic.
  - 21. The method of claim 20 wherein phase continuity for each harmonic frequency in adjacent voiced segments is insured using the boundary condition:
    - ξ(h) = (h+1) φ^-(M) + ξ^-(h),
      
      where φ^-(M) and ξ^-(h) are the corresponding quantities of the previous segment.
  - 22. The method of claim 20 wherein the initial phase for each harmonic frequency in an unvoiced-to-voiced transition is computed using the condition:
    - ξ(h) = sin^-1(α);
      
      ##EQU14## where S(M) is the M-th sample of the unvoiced speech segment;
      
      A are the harmonic amplitudes for i=0, . . . , H-1; and
      
      |α| < 1, and φ(m) is evaluated at the M+1 sample.
  - 23. The method of claim 22 further comprising the step of generating sound effects by changing the fundamental frequency F₀ and the values of the harmonic amplitudes encoded in the data packet.
  - 24. The method of claim 22 further comprising the step of generating sound effects by changing the length of the synthesized signal segments.
  - 25. The method of claim 17 wherein the step of synthesizing voiced speech comprises the steps of:
    - computing the frequencies of the harmonics on the basis of the fundamental frequency of the segment;
      
      generating voiced speech as a superposition of harmonic frequencies with amplitudes corresponding to the encoded amplitudes in the voiced data packet and phases determined as to insure phase continuity at the boundary between adjacent speech segments.
  - 26. The method of claim 17 wherein the step of providing amplitude and phase continuity on the boundary between adjacent synthesized speech segments comprises the steps of:
    - determining the difference between the amplitude A(h) of h-th harmonic in the current segment and the corresponding amplitude A^- (h) of the previous segment, the difference being denoted as ΔA(h); and
      
      providing a linear interpolation of the current segment amplitude between the end points of the segment using the formula:
      
      A(h,m) = A^-(h,0) + m·ΔA(h)/M, for m = 0, . . . , M-1.
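
The voiced synthesis described in claims 19 to 26 can be sketched as follows. The synthesis equation of claim 20 is not reproduced in this text, so the sum-of-sines form used here, s(m) = Σ_h A(h,m)·sin((h+1)·φ(m) + ξ(h)), is an assumption chosen to be consistent with the definitions that do survive: φ(m) = 2π m F₀ / f_s, the linear amplitude interpolation of claim 26, and the phase boundary condition of claim 21. The function names are hypothetical.

```python
import math

def synthesize_voiced(amps, prev_amps, f0, fs, seg_len, init_phase):
    """Sum of harmonics with amplitudes ramped linearly from prev_amps towards amps."""
    out = []
    for m in range(seg_len):
        phi = 2.0 * math.pi * m * f0 / fs  # phi(m) = 2*pi*m*F0/fs
        sample = 0.0
        for h, (a, a_prev) in enumerate(zip(amps, prev_amps)):
            # A(h,m) = A^-(h) + m * dA(h) / M  (linear interpolation of claim 26)
            a_m = a_prev + m * (a - a_prev) / seg_len
            sample += a_m * math.sin((h + 1) * phi + init_phase[h])
        out.append(sample)
    return out

def next_initial_phases(prev_f0, fs, seg_len, prev_phase):
    """Boundary condition of claim 21: xi(h) = (h+1) * phi^-(M) + xi^-(h)."""
    phi_end = 2.0 * math.pi * seg_len * prev_f0 / fs
    return [((h + 1) * phi_end + xi) % (2.0 * math.pi)
            for h, xi in enumerate(prev_phase)]
```

Because each segment's initial phases are derived from the end of the previous one, two consecutive segments with the same F₀ and amplitudes join into a single uninterrupted sinusoid per harmonic, and no phase information has to be carried in the packet.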

27. A system for processing audio signals comprising:
- means for dividing an audio signal into segments, each segment representing one of a succession of time intervals;
  
  means for detecting for each segment the presence of a fundamental frequency;
  
  means for estimating the amplitudes of a set of sinusoids harmonically related to the detected fundamental frequency, the set of sinusoids being representative of the signal in the time segment; and
  
  means for encoding the set of harmonic amplitudes, each amplitude being normalized by the sum of all amplitudes.
- - 28. The system of claim 27 wherein the audio signal is a speech signal and the system further comprises means for determining whether a segment represents voiced or unvoiced speech on the basis of the detected fundamental frequency.
  - 29. The system of claim 28 further comprising:
    - means for computing a set of linear predictive coding (LPC) coefficients corresponding to a speech segment; and
      
      means for encoding the LPC coefficients and the linear prediction error power associated with the computed LPC coefficients.
  - 30. The system of claim 29 wherein the means for encoding the LPC coefficients comprises means for computing the roots of a LPC coefficients polynomial and means for encoding polynomial roots into a codebook entry.
  - 31. The system of claim 30 wherein the means for encoding polynomial roots comprises a neural network providing the capability of vector quantizing the polynomial roots into a vector codebook entry.
  - 32. The system of claim 28 further comprising windowing means providing the capability of multiplying the signal segment with the coefficients of a predetermined window function.
  - 33. The system of claim 28 wherein the means for estimating harmonic amplitudes comprises:
    - means for performing a discrete Fourier transform (DFT) of a digitized signal segment; and
      
      means for computing a root sum square of the samples of the DFT in the neighborhood of a harmonic frequency, said means obtaining an estimate of the amplitude of the harmonic frequency.
  - 34. The system of claim 33 wherein the means for performing a DFT computation comprises means for performing a fast Fourier transform (FFT) of the signal segment.
  - 35. The system of claim 33 further comprising means for padding the input signal with zeros.
  - 36. The system of claim 33 further comprising means for normalizing the computed harmonic amplitudes.
  - 37. The system of claim 36 further comprising:
    - means for forming a data packet corresponding to each unvoiced segment, the packet comprising a flag indicating that the speech segment is unvoiced, the codebook entry for the roots of the LPC coefficients polynomial and the linear prediction error power associated with the computed LPC coefficients; and
      
      means for forming a data packet corresponding to each voiced segment for subsequent transmission or storage, the packet comprising a flag indicating that the speech segment is voiced, the fundamental frequency, a vector of the normalized harmonic amplitudes and the sum of all harmonic amplitudes.
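
The two packet types listed in claim 37 (and in claims 7 and 15) can be pictured as simple records. The sketch below is illustrative only: the class and field names are hypothetical, the claims do not specify a bit layout or field widths, and no quantization of the amplitudes is attempted here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VoicedPacket:
    f0: float               # fundamental frequency of the segment
    norm_amps: List[float]  # harmonic amplitudes, each divided by their sum
    amp_sum: float          # sum of all harmonic amplitudes
    voiced: bool = True     # voiced/unvoiced flag

@dataclass
class UnvoicedPacket:
    codebook_index: int     # codebook entry for the roots of the LPC polynomial
    error_power: float      # linear prediction error power
    voiced: bool = False    # voiced/unvoiced flag

def make_voiced_packet(f0, amps):
    """Normalize the amplitudes by their sum and keep the sum as a separate field."""
    total = sum(amps)
    norm = [a / total for a in amps] if total else [0.0] * len(amps)
    return VoicedPacket(f0=f0, norm_amps=norm, amp_sum=total)

def restore_amplitudes(packet):
    """Undo the normalization on the decoder side."""
    return [a * packet.amp_sum for a in packet.norm_amps]
```

Carrying the normalized vector and the sum separately means the spectral shape and the overall level of a segment are encoded independently of each other.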

38. A system for synthesizing audio signals from data packets, at least one of the data packets representing a time segment of a signal characterized by the presence of a fundamental frequency, said at least one data packet comprising a sequence of encoded amplitudes of harmonic frequencies related to the fundamental frequency, the system comprising:
- means for determining the fundamental frequency of the signal represented by said at least one data packet;
  
  means for synthesizing an audio signal segment in response to the determined fundamental frequency and the sequence of amplitudes of harmonic frequencies in said at least one data packet; and
  
  means for providing amplitude and phase continuity on the boundary between adjacent synthesized audio signal segments.
- - 39. The system of claim 38 wherein the means for synthesizing comprises means for determining the initial phase offsets for each harmonic frequency.
  - 40. The system of claim 39 wherein the means for providing amplitude and phase continuity comprises means for providing a linear interpolation between the values of the amplitude of the signal at the end points of the segment.
  - 41. The system of claim 39 wherein the means for providing amplitude and phase continuity further comprises means for computing conditions for phase continuity between harmonic frequencies in adjacent speech segments in accordance with the formula:
    - ξ(h) = (h+1) φ^-(M) + ξ^-(h),
      
      where ξ(h) is the initial phase of the h-th harmonic of the current segment;
      
      φ(m) = 2π m F₀ / f_s, where F₀ is the fundamental frequency and f_s is the sampling frequency; and
      
      φ^-(M) and ξ^-(h) are the corresponding quantities of the previous segment.
  - 42. The system of claim 41 further comprising means for generating sound effects by changing the fundamental frequency F₀, and the encoded values of the harmonic amplitudes.
  - 43. The system of claim 41 further comprising means for generating sound effects by changing the size of synthesized signal segments.

44. A system for synthesizing speech from data packets, the data packets representing voiced or unvoiced speech segments, comprising:
- means for determining whether a data packet represents a voiced or unvoiced speech segment;
  
  means for synthesizing unvoiced speech in response to encoded information in an unvoiced data packet;
  
  means for synthesizing voiced speech segment signal in response only to a sequence of amplitudes of harmonic frequencies encoded in a voiced data packet; and
  
  means for providing amplitude and phase continuity on the boundary between adjacent synthesized speech segments.
- - 45. The system of claim 44 wherein the means for synthesizing unvoiced speech comprises:
    - means for generating white noise;
      
      a digital synthesis filter;
      
      means for initializing the coefficients of the synthesis filter using a set of parameters representative of an unvoiced speech segment, and means for adjusting the gain of the synthesis filter.
  - 46. The system of claim 44 wherein the means for synthesizing a voiced speech segment comprises means for determining the initial phase offsets for each harmonic frequency.
  - 47. The system of claim 44 wherein the means for providing amplitude and phase continuity comprises means for providing a linear interpolation between the values of the signal amplitude at the end points of the segment.
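
The unvoiced path of claims 18 and 45 (white noise passed through an autoregressive synthesis filter whose gain follows the prediction error power) can be sketched as below. Several details are assumptions rather than anything stated in the claims: the filter is taken as 1/A(z) with A(z) = 1 + Σ a_k z^-k, the gain is taken as the square root of the error power, Gaussian noise from Python's `random` module stands in for the white-noise source, and the function names are hypothetical. The helper `lpc_from_poles` illustrates how coefficients could be rebuilt from decoded polynomial roots.

```python
import math
import random

def lpc_from_poles(poles):
    """Expand prod(1 - p * z^-1) and return the coefficients a_1..a_p of A(z)."""
    coeffs = [1.0 + 0j]
    for p in poles:
        coeffs = [c - p * prev for c, prev in zip(coeffs + [0j], [0j] + coeffs)]
    # Conjugate pole pairs give real coefficients; the imaginary residue is dropped.
    return [c.real for c in coeffs[1:]]

def synthesize_unvoiced(lpc, error_power, seg_len, state=None, seed=0):
    """Filter white noise through the all-pole filter 1/A(z)."""
    rng = random.Random(seed)
    gain = math.sqrt(error_power)  # assumed gain rule
    history = list(state) if state is not None else [0.0] * len(lpc)
    out = []
    for _ in range(seg_len):
        excitation = gain * rng.gauss(0.0, 1.0)
        # s[n] = g * e[n] - sum_k a_k * s[n-k]; history[0] holds s[n-1]
        sample = excitation - sum(a * s for a, s in zip(lpc, history))
        out.append(sample)
        history = [sample] + history[:-1]
    return out, history
```

Returning the filter history lets a caller pass it back in as `state` for the next unvoiced segment, so the filter does not restart from silence at every segment boundary.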

48. A method for processing an audio signal comprising the steps of:
- dividing the signal into segments, each segment representing one of a succession of time intervals;
  
  detecting for each segment the presence of a fundamental frequency;
  
  if such a fundamental frequency is detected, estimating the amplitudes of a set of sinusoids harmonically related to the detected fundamental frequency, the set of sinusoids being representative of the signal in the time segment;
  
  encoding for subsequent storage and transmission the set of the estimated harmonic amplitudes, each amplitude being normalized by the sum of all amplitudes; and
  
  synthesizing an audio signal in response only to the fundamental frequency and the sequence of normalized amplitudes of harmonic frequencies.
- - 49. The method of claim 48 wherein the step of estimating harmonic amplitudes comprises the steps of:
    - performing a discrete Fourier transform (DFT) of the speech signal;
      
      computing a root sum square of the samples of the power DFT of said speech signal in the neighborhood of each harmonic frequency to obtain an estimate of the corresponding harmonic amplitude, wherein prior to the step of performing a DFT the speech signal is windowed by a window function providing reduced spectral leakage.
  - 50. The method of claim 49 wherein the estimates of the harmonic amplitudes A_H (h,F₀) are computed according to the equation:
    - ##EQU15## where A_H (h,F₀) is the estimated amplitude of the h-th harmonic frequency;
      
      F₀ is the fundamental frequency;
      
      B is the half bandwidth of the main lobe of the Fourier transform of the window function; and
      
      y_2N (n) is the windowed input signal padded with N zeros.
  - 51. The method of claim 48 wherein the audio signal is a voice signal and the step of synthesizing the voice signal comprises the steps of:
    - computing the frequencies of the harmonics on the basis of the fundamental frequency of the segment; and
      
      generating voiced speech as a superposition of harmonic frequencies with amplitudes corresponding to the encoded amplitudes and phases determined as to insure phase continuity at the boundary between adjacent speech segments.
  - 52. The method of claim 51 wherein the voiced speech is synthesized using the equation:
    - ##EQU16## where A^- (h) is the amplitude of the signal at the end of the previous segment;
      
      φ(m) = 2π m F₀ / f_s, where F₀ is the fundamental frequency and f_s is the sampling frequency; and
      
      ξ(h) is the initial phase of the h-th harmonic.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Voxware, Inc.
Inventors: Aguilar, Joseph Gerard
Primary Examiner(s): Dorvil, Richemond
Application Number: US08/273,069
Time in Patent Office: 1,478 Days
Field of Search: 395/2.17, 395/2.28, 395/2.29, 395/2.67, 395/2.77, 395/2.71, 704/208, 704/219, 704/220, 704/258, 704/262, 704/268, 704/214, 704/207, 704/205, 704/206
US Class Current: 704/208
CPC Class Codes:
G10L 19/06   Determination or coding of ...
G10L 2019/0001   Codebooks
G10L 25/27   characterised by the analys...
G10L 25/90   Pitch determination of spee...
