Parametric speech codec for representing synthetic speech in the presence of background noise

US 7,092,881 B1
Filed: 07/26/2000
Issued: 08/15/2006
Est. Priority Date: 07/26/1999
Status: Active Grant

Abstract

A system and method are provided for processing audio and speech signals using a pitch- and voicing-dependent spectral estimation algorithm (voicing algorithm) to accurately represent, with a single model, voiced, unvoiced, and mixed speech in the presence of background noise, as well as the background noise itself. The invention also modifies the synthesis model based on an estimate of the current input signal to improve the perceptual quality of the speech and background noise under a variety of input conditions. It further improves the robustness of the voicing-dependent spectral estimation algorithm by using a multi-layer neural network in the estimation process. The algorithm provides an accurate and robust estimate of the voicing probability under a variety of background noise conditions, which is essential to providing high-quality, intelligible speech in the presence of background noise.
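
Illustrative sketch (not part of the patent text): the abstract names a multi-layer neural network as part of the voicing estimate, and the claims list normalized correlation coefficients, a low band energy, and an energy ratio as classifier inputs. The forward pass below shows what such a classifier could look like. The single hidden layer, the tanh and sigmoid activations, the function name `mlp_voicing_score`, and every weight value are assumptions made for illustration; the record does not give the network topology or trained weights.

```python
import math


def mlp_voicing_score(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer perceptron (hypothetical layout).

    features : list of floats, e.g. band correlations, low band energy,
               and an energy ratio (the input ordering is an assumption).
    w_hidden : one weight row per hidden unit.
    Returns a score in (0, 1); a caller could use it to set the voicing
    threshold or to gate the voicing decision.
    """
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder weights only; a real classifier would be trained on labeled speech.
example_score = mlp_voicing_score(
    features=[0.5, 0.1, 2.0],
    w_hidden=[[1.0, 0.0, 0.0]],
    b_hidden=[0.0],
    w_out=[2.0],
    b_out=0.0,
)
```

With all weights and biases set to zero the score is exactly 0.5, which is a convenient sanity check before plugging in trained values.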

37 Claims

1. A system for processing an audio signal comprising:
- means for dividing the audio signal into segments, each segment representing a portion of the audio signal occurring in one of a succession of time intervals;
  
  means for detecting for each segment the presence of a fundamental frequency;
  
  means responsive to the detecting means for determining the voicing probability for each segment by computing a ratio between voiced and unvoiced components of the audio signal, the determining means comprising;
  
  means for windowing each segment of the audio signal;
  
  means for computing the spectrum of the windowed segment;
  
  means for computing correlation coefficients of each segment using at least the spectrum;
  
  means for estimating a voicing threshold for each segment, comprising;
  
  means for dividing the spectrum into a plurality of non-linear bands, wherein the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
  
  means for evaluating at least one voice measurement for each of the plurality of bands; and
  
  means for determining the voicing threshold for each segment using the at least one voice measurement; and
  
  means for comparing the correlation coefficients with the voicing threshold for each segment;
  
  means for separating the signal in each segment into a voiced portion and an unvoiced portion on the basis of the voicing probability, wherein the voiced portion of the signal occupies the low end of the spectrum and the unvoiced portion of the signal occupies the high end of the spectrum for each segment; and
  
  means for separately encoding the voiced portion and the unvoiced portion of the audio signal.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system of claim 1, wherein the audio signal is a speech signal and the means for determining the voicing probability further comprises means for refining the fundamental frequency of each segment using at least the spectrum of the windowed segment.
  - 3. The system of claim 1, wherein the means for computing the spectrum of the windowed segment comprises means for performing a Fast Fourier Transform (FFT) of the windowed segment.
  - 4. The system of claim 1, wherein the means for estimating the voicing threshold for each segment further comprises:
    - means for computing a low band energy of the spectrum;
      
      means for computing an energy ratio between the energy of the high and low bands of the spectrum of a current segment and a previous segment; and
      
      a multi-layer neural network classifier for receiving the at least one voice measurement, the low band energy, and the energy ratio, wherein the at least one voice measurement includes normalized correlation coefficients in the frequency domain.
  - 5. The system of claim 1, further comprising means for spectrally estimating the audio signal comprising:
    - means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency;
      
      means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment.
  - 6. The system of claim 5, wherein the means for calculating the complex spectrum comprises means for applying a Fast Fourier Transform to the windowed segment.
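
Illustrative sketch (not part of the patent text): the voicing analysis recited in claims 1 to 3 can be read roughly as window, spectrum, band-wise correlation, threshold comparison. The code below is one hedged interpretation, not the patented implementation. The periodic Hamming window, the raised-cosine harmonic template, the band edges in `BAND_EDGES_HZ`, the fixed threshold of 0.5 (the claims estimate the threshold per segment), and all function names are assumptions. The voicing probability is taken here as the voiced cutoff frequency divided by the Nyquist frequency, so the voiced portion sits at the low end of the spectrum and the unvoiced portion at the high end.

```python
import cmath
import math

# Non-linear bands: narrower at low frequencies, wider at high frequencies.
# These edges are illustrative assumptions, not values from the patent.
BAND_EDGES_HZ = (125, 500, 1000, 1500, 2000, 3000, 4000)


def magnitude_spectrum(frame):
    """Magnitude of a naive DFT; a practical codec would use an FFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def normalized_correlation(a, b):
    """Mean-removed normalized correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def voicing_probability(frame, fs, f0, threshold=0.5):
    """Fraction of the band, counted from the bottom, judged to be voiced."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / n) for i in range(n)]
    spectrum = magnitude_spectrum([x * w for x, w in zip(frame, window)])
    spacing = f0 * n / fs  # harmonic spacing in DFT bins
    template = [0.5 * (1 + math.cos(2 * math.pi * k / spacing)) for k in range(len(spectrum))]
    cutoff_hz = 0.0
    for lo_hz, hi_hz in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:]):
        lo, hi = round(lo_hz * n / fs), round(hi_hz * n / fs)
        if normalized_correlation(spectrum[lo:hi], template[lo:hi]) < threshold:
            break  # first unvoiced band ends the voiced region
        cutoff_hz = hi_hz
    return cutoff_hz / (fs / 2)


def make_frame(bins, n=256):
    """Test helper: sum of unit cosines centered on the given DFT bins."""
    return [sum(math.cos(2 * math.pi * b * i / n) for b in bins) for i in range(n)]
```

As a usage example under these assumptions, a 256-sample frame at 8 kHz whose partials are harmonics of 250 Hz up to 2 kHz and inharmonic above that, such as `make_frame(list(range(8, 65, 8)) + list(range(68, 125, 8)))`, yields a voicing probability of 0.5 from `voicing_probability(frame, 8000, 250.0)`.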

7. A system for processing an audio signal comprising:
- means for dividing the signal into segments, each segment representing a portion of the audio signal in one of a succession of time intervals;
  
  means for detecting for each segment the presence of a fundamental frequency;
  
  means responsive to the detecting means for determining the voicing probability for each segment by computing a ratio between voiced and unvoiced components of the audio signal, the determining means comprising;
  
  means for windowing each segment of the audio signal;
  
  means for computing the spectrum of the windowed segment;
  
  means for computing correlation coefficients of each segment using at least the spectrum;
  
  means for estimating a voicing threshold for each segment, comprising;
  
  means for dividing the spectrum into a plurality of non-linear bands, wherein the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
  
  means for evaluating at least one voice measurement for each of the plurality of bands; and
  
  means for determining the voicing threshold for each segment using the at least one voice measurement; and
  
  means for comparing the correlation coefficients with the voicing threshold for each segment;
  
  means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency;
  
  means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment;
  
  means for separating the signal in each segment into a voiced portion and an unvoiced portion on the basis of the voicing probability, wherein the voiced portion of the signal occupies the low end of the spectrum and the unvoiced portion of the signal occupies the high end of the spectrum for each segment; and
  
  means for separately encoding the voiced portion and the unvoiced portion of the audio signal, wherein the means for separately encoding further includes means for computing LPC coefficients for a speech segment and means for transforming LPC coefficients into line spectral frequencies (LSF) coefficients corresponding to the LPC coefficients.
- Dependent claims (8, 9, 10, 11):
  - 8. The system of claim 7, wherein the audio signal is a speech signal and the means for determining the voicing probability comprises means for refining the fundamental frequency of each segment using at least the spectrum of the windowed segment.
  - 9. The system of claim 7, wherein the means for computing the spectrum of the windowed segment comprises means for performing a Fast Fourier Transform (FFT) of the windowed segment.
  - 10. The system of claim 7, wherein the means for estimating the voicing threshold for each segment further comprises:
    - means for computing a low band energy of the spectrum;
      
      means for computing an energy ratio between the energy of the high and low bands of the spectrum of a current segment and a previous segment; and
      
      a multi-layer neural network classifier for receiving the at least one voice measurement, the low band energy, and the energy ratio, wherein the at least one voice measurement includes normalized correlation coefficients in the frequency domain.
  - 11. The system of claim 7, wherein the means for calculating the complex spectrum comprises means for applying a Fast Fourier Transform to the windowed segment.
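
Illustrative sketch (not part of the patent text): claim 7 recites computing LPC coefficients and transforming them into line spectral frequencies, without stating how. One common textbook route is shown below as an assumption: a Levinson-Durbin recursion on autocorrelation values, followed by a search for the unit-circle roots of the symmetric and antisymmetric polynomials built from A(z). The function names, the grid size, and the bisection count are hypothetical choices, and the simple grid scan can miss roots that lie closer together than one grid step.

```python
import math


def levinson_durbin(r, order):
    """LPC coefficients [1, a1, ..., ap] from autocorrelation values r[0..p]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_to_lsf(a, grid=512):
    """LSFs in radians (0..pi), sorted ascending, for A(z) = sum a[k] z^-k."""
    p = len(a) - 1
    ext = list(a) + [0.0]
    rev = [0.0] + list(a)[::-1]
    sym = [x + y for x, y in zip(ext, rev)]   # P(z) = A(z) + z^-(p+1) A(1/z)
    anti = [x - y for x, y in zip(ext, rev)]  # Q(z) = A(z) - z^-(p+1) A(1/z)
    half = (p + 1) / 2.0

    def f_sym(w):  # real-valued form of P on the unit circle
        return sum(c * math.cos(w * (half - k)) for k, c in enumerate(sym))

    def f_anti(w):  # real-valued form of Q on the unit circle
        return sum(c * math.sin(w * (half - k)) for k, c in enumerate(anti))

    lsf = []
    for f in (f_sym, f_anti):
        prev_w = math.pi / grid
        prev_v = f(prev_w)
        for i in range(2, grid):
            w = math.pi * i / grid
            v = f(w)
            if prev_v * v < 0:  # sign change brackets a root; refine by bisection
                lo, hi, f_lo = prev_w, w, prev_v
                for _ in range(50):
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                lsf.append(0.5 * (lo + hi))
            prev_w, prev_v = w, v
    return sorted(lsf)
```

For the second-order example A(z) = 1 - z^-1 + 0.5 z^-2, the two LSFs can be checked by hand: they are arccos(0.75) and arccos(0.25).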

12. A system for processing an audio signal having a number of frames, the system comprising:
- an encoder comprising;
  
  first means for determining for each frame a ratio between voiced and unvoiced components of the audio signal on the basis of the fundamental frequency of each frame, the ratio being defined as a voicing probability, the means for determining the voicing probability comprising;
  
  means for windowing each frame of the input signal;
  
  means for computing the spectrum of the windowed frame;
  
  means for computing correlation coefficients of each frame using at least the spectrum; and
  
  means for comparing the correlation coefficients with a voicing threshold for each segment;
  
  second means for determining at least a pitch period, a mid-frame pitch period, and a mid-frame voicing probability of the audio signal; and
  
  means for quantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and the mid-frame voicing probability.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24):
  - 13. The system of claim 12, further comprising means for high-pass filtering the audio signal and buffering the audio signal into the number of frames.
  - 14. The system of claim 12, wherein the encoder further comprises spectral estimation means for computing an estimate of the power spectrum of the audio signal using a pitch adaptive window.
  - 15. The system of claim 14, wherein the length of the pitch adaptive window is based on the fundamental frequency of the audio signal.
  - 16. The system of claim 12, further comprising:
    - means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency; and
      
      means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment.
  - 17. The system of claim 16, wherein the means for calculating the complex spectrum comprises means for applying a Fast Fourier Transform to the windowed segment.
  - 18. The system of claim 12, further comprising means for estimating the voicing threshold for each segment comprising:
    - means for dividing the spectrum into a plurality of non-linear bands, where the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
      
      means for evaluating at least one voice measurement for each of the plurality of bands, where the at least one voice measurement is the normalized correlation coefficients calculated in the frequency domain;
      
      means for computing the low band energy of the spectrum;
      
      means for computing an energy ratio between the energy of the high and low bands of the spectrum of a current segment and a previous segment; and
      
      means for receiving the normalized correlation coefficients of the low bands, the low band energy and the energy ratio.
  - 19. The system of claim 18, wherein the means for receiving is a multi-layer neural network classifier.
  - 20. The system of claim 19, wherein the voicing probability is zero if an output from the means for receiving is less than a predetermined threshold for a predetermined number of frames.
  - 21. The system of claim 12, further comprising a decoder comprising:
    - means for unquantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and/or the mid-frame voicing probability and providing at least one output; and
      
      means for analyzing the at least one output to produce a synthetic speech signal corresponding to the input audio signal.
  - 22. The system of claim 21, wherein the means for unquantizing comprises:
    - means for producing a spectral magnitude envelope and a minimum phase envelope using at least the unquantized pitch period, the unquantized voicing probability, the unquantized mid-frame pitch period, and/or the unquantized mid-frame voicing probability;
      
      means for interpolating and outputting the spectral magnitude envelope and the minimum phase envelope to the means for analyzing;
      
      means for estimating the signal-to-noise ratio of the audio signal using at least the unquantized pitch period, the unquantized voicing probability, the unquantized mid-frame pitch period, and/or the unquantized mid-frame voicing probability; and
      
      means for generating at least one control parameter using at least the signal-to-noise ratio and for outputting the at least one control parameter to the means for analyzing.
  - 23. The system of claim 21, wherein the means for analyzing comprises:
    - first means for processing the at least one output to produce a time-domain signal; and
      
      second means for processing the time-domain signal to produce the synthetic speech signal corresponding to the audio signal.
  - 24. The system of claim 23, wherein the first means for processing the at least one output to produce the time-domain signal comprises:
    - means for filtering a spectral magnitude envelope, wherein the spectral magnitude envelope is outputted by the means for unquantizing;
      
      means for calculating frequencies and amplitudes using at least the filtered spectral magnitude envelope;
      
      means for calculating sine-wave phases using at least the calculated frequencies; and
      
      means for calculating a sum of sinusoids using at least the calculated frequencies and amplitudes and the sine-wave phases to produce the time-domain signal.
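
Illustrative sketch (not part of the patent text): the synthesis steps of claim 24 (frequencies and amplitudes from a spectral magnitude envelope, sine-wave phases, then a sum of sinusoids) might look like the following. The function name `synthesize_frame`, the callable `envelope`, the linear-in-frequency phase rule for voiced harmonics, and the random phases above the voiced cutoff are assumptions made for illustration; the claims themselves refer to a minimum phase envelope and do not fix these details.

```python
import math
import random


def synthesize_frame(f0, pv, envelope, fs=8000, n=160, phase0=0.0, seed=0):
    """Return n samples built as a sum of harmonics of f0.

    envelope(f) gives the spectral magnitude at frequency f in Hz.
    Harmonics at or below the voiced cutoff pv * fs / 2 get a phase that
    is linear in frequency; harmonics above it get random phases, which
    makes that part of the spectrum noise-like.
    """
    rng = random.Random(seed)
    nyquist = fs / 2.0
    cutoff = pv * nyquist
    freqs, amps, phases = [], [], []
    k = 1
    while k * f0 < nyquist:
        f = k * f0
        freqs.append(f)
        amps.append(envelope(f))
        if f <= cutoff:
            phases.append(k * phase0)  # coherent (voiced) harmonic
        else:
            phases.append(rng.uniform(-math.pi, math.pi))  # unvoiced region
        k += 1
    return [
        sum(a * math.cos(2.0 * math.pi * f * t / fs + p)
            for a, f, p in zip(amps, freqs, phases))
        for t in range(n)
    ]
```

With a flat envelope, `synthesize_frame(200.0, 1.0, lambda f: 1.0)` produces a fully voiced frame that repeats every 40 samples at the default 8 kHz rate, while passing `pv=0.0` randomizes every harmonic phase.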

25. A system for processing an audio signal having a number of frames, the system comprising:
- an encoder comprising;
  
  means for determining for each frame a ratio between voiced and unvoiced components of the audio signal on the basis of the fundamental frequency of each frame, the ratio being defined as a voicing probability;
  
  means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency;
  
  means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment;
  
  means for determining at least a pitch period, a mid-frame pitch period, and a mid-frame voicing probability of the audio signal; and
  
  means for quantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and the mid-frame voicing probability.
- Dependent claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37):
  - 26. The system of claim 25, further comprising means for high-pass filtering the audio signal and buffering the audio signal into the number of frames.
  - 27. The system of claim 25, wherein the encoder further comprises spectral estimation means for computing an estimate of the power spectrum of the audio signal using a pitch adaptive window.
  - 28. The system of claim 27, wherein the length of the pitch adaptive window is based on the fundamental frequency of the audio signal.
  - 29. The system of claim 25, further comprising means for estimating the voicing threshold for each segment comprising:
    - means for dividing the spectrum into a plurality of non-linear bands, where the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
      
      means for evaluating at least one voice measurement for each of the plurality of bands, where the at least one voice measurement is the normalized correlation coefficients calculated in the frequency domain;
      
      means for computing the low band energy of the spectrum;
      
      means for computing an energy ratio between the energy of the high and low bands of the spectrum of a current segment and a previous segment; and
      
      means for receiving the normalized correlation coefficients of the low bands, the low band energy and the energy ratio.
  - 30. The system of claim 29, wherein the means for receiving is a multi-layer neural network classifier.
  - 31. The system of claim 30, wherein the voicing probability is zero if an output from the means for receiving is less than a predetermined threshold for a predetermined number of frames.
  - 32. The system of claim 25, wherein the means for determining the voicing probability comprises:
    - means for windowing each frame of the input signal;
      
      means for computing the spectrum of the windowed frame;
      
      means for computing correlation coefficients of each frame using at least the spectrum; and
      
      means for comparing the correlation coefficients with a voicing threshold for each segment.
  - 33. The system of claim 25, wherein the means for calculating the complex spectrum comprises means for applying a Fast Fourier Transform to the windowed segment.
  - 34. The system of claim 25, further comprising a decoder comprising:
    - means for unquantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and/or the mid-frame voicing probability and providing at least one output; and
      
      means for analyzing the at least one output to produce a synthetic speech signal corresponding to the input audio signal.
  - 35. The system of claim 34, wherein the means for unquantizing comprises:
    - means for producing a spectral magnitude envelope and a minimum phase envelope using at least the unquantized pitch period, the unquantized voicing probability, the unquantized mid-frame pitch period, and/or the unquantized mid-frame voicing probability;
      
      means for interpolating and outputting the spectral magnitude envelope and the minimum phase envelope to the means for analyzing;
      
      means for estimating the signal-to-noise ratio of the audio signal using at least the unquantized pitch period, the unquantized voicing probability, the unquantized mid-frame pitch period, and/or the unquantized mid-frame voicing probability; and
      
      means for generating at least one control parameter using at least the signal-to-noise ratio and for outputting the at least one control parameter to the means for analyzing.
  - 36. The system of claim 34, wherein the means for analyzing comprises:
    - first means for processing the at least one output to produce a time-domain signal; and
      
      second means for processing the time-domain signal to produce the synthetic speech signal corresponding to the audio signal.
  - 37. The system of claim 36, wherein the first means for processing the at least one output to produce the time-domain signal comprises:
    - means for filtering a spectral magnitude envelope, wherein the spectral magnitude envelope is outputted by the means for unquantizing;
      
      means for calculating frequencies and amplitudes using at least the filtered spectral magnitude envelope;
      
      means for calculating sine-wave phases using at least the calculated frequencies; and
      
      means for calculating a sum of sinusoids using at least the calculated frequencies and amplitudes and the sine-wave phases to produce the time-domain signal.
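
Illustrative sketch (not part of the patent text): claims 20 and 31 force the voicing probability to zero when the classifier output stays below a predetermined threshold for a predetermined number of frames. A minimal way to express that rule is a run-length counter, shown below. The function name `gate_voicing` and the default values for `threshold` and `hold_frames` are assumptions; the record does not state the actual values.

```python
def gate_voicing(classifier_outputs, voicing_probs, threshold=0.5, hold_frames=3):
    """Zero the voicing probability after a sustained run of low outputs.

    classifier_outputs : per-frame classifier scores.
    voicing_probs      : per-frame voicing probabilities before gating.
    A frame is forced to 0.0 once the score has been below `threshold`
    for at least `hold_frames` consecutive frames, including this one.
    """
    gated = []
    run = 0  # length of the current run of below-threshold frames
    for out, pv in zip(classifier_outputs, voicing_probs):
        run = run + 1 if out < threshold else 0
        gated.append(0.0 if run >= hold_frames else pv)
    return gated
```

Requiring a run of frames rather than a single low score keeps one isolated misclassification from cutting voicing in the middle of a voiced stretch.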

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Wang, Wei; Chen, Juin-Hwey; Zopf, Robert W.; Aguilar, Joseph Gerard
Primary Examiner(s): Young, W. R.
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US09/625,960
Time in Patent Office: 2,211 Days
Field of Search: 704/233, 704/207, 704/208, 704/214, 704/219
US Class Current: 704/233
CPC Class Codes:
- G10L 19/093: using sinusoidal excitation...
- G10L 19/265: Pre-filtering, e.g. high fr...
- G10L 21/0272: Voice signal separating
- G10L 25/18: the extracted parameters be...
- G10L 25/30: using neural networks
- G10L 25/90: Pitch determination of spee...
- G10L 25/93: Discriminating between voic...
