Parametric speech codec for representing synthetic speech in the presence of background noise
First Claim
1. A system for processing an audio signal comprising:
means for dividing the audio signal into segments, each segment representing a portion of the audio signal occurring in one of a succession of time intervals;
means for detecting for each segment the presence of a fundamental frequency;
means responsive to the detecting means for determining the voicing probability for each segment by computing a ratio between voiced and unvoiced components of the audio signal, the determining means comprising;
means for windowing each segment of the audio signal;
means for computing the spectrum of the windowed segment;
means for computing correlation coefficients of each segment using at least the spectrum;
means for estimating a voicing threshold for each segment, comprising;
means for dividing the spectrum into a plurality of non-linear bands, wherein the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
means for evaluating at least one voice measurement for each of the plurality of bands; and
means for determining the voicing threshold for each segment using the at least one voice measurement; and
means for comparing the correlation coefficients with the voicing threshold for each segment;
means for separating the signal in each segment into a voiced portion and an unvoiced portion on the basis of the voicing probability, wherein the voiced portion of the signal occupies the low end of the spectrum and the unvoiced portion of the signal occupies the high end of the spectrum for each segment; and
means for separately encoding the voiced portion and the unvoiced portion of the audio signal.
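The first steps of claim 1 (dividing the signal into segments, windowing each segment, and computing its spectrum) can be illustrated with a minimal Python sketch. The claim does not name a window type, a segment length, or a transform, so the Hamming window, the non-overlapping segmentation, and the naive DFT below are assumptions chosen for brevity, and all function names are hypothetical.

```python
import cmath
import math


def split_into_segments(signal, segment_len):
    """Divide the signal into consecutive, non-overlapping segments.

    Non-overlapping segments are an assumption; trailing samples that do not
    fill a whole segment are dropped.
    """
    return [signal[i:i + segment_len]
            for i in range(0, len(signal) - segment_len + 1, segment_len)]


def hamming_window(n):
    """Symmetric Hamming window of length n (assumed window type)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]


def windowed_spectrum(segment):
    """Magnitude spectrum (bins 0..n/2) of the windowed segment, naive DFT."""
    n = len(segment)
    windowed = [s * w for s, w in zip(segment, hamming_window(n))]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


# Usage: a cosine with exactly 4 cycles per 32-sample segment.
signal = [math.cos(2.0 * math.pi * 4 * t / 32) for t in range(64)]
segments = split_into_segments(signal, 32)
spectrum = windowed_spectrum(segments[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

In this toy input the spectral peak falls on the bin that matches the cosine frequency. A practical codec would use an FFT and, most likely, overlapping segments.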
Abstract
A system and method are provided for processing audio and speech signals using a pitch- and voicing-dependent spectral estimation algorithm (the voicing algorithm) so that a single model accurately represents voiced speech, unvoiced speech, and mixed speech in the presence of background noise, as well as the background noise itself. The invention also modifies the synthesis model based on an estimate of the current input signal to improve the perceptual quality of the speech and background noise under a variety of input conditions. It further improves the robustness of the voicing-dependent spectral estimation algorithm by using a multi-layer neural network in the estimation process. The algorithm provides an accurate and robust estimate of the voicing probability under a variety of background noise conditions, which is essential to providing high-quality, intelligible speech in the presence of background noise.
37 Claims
1. A system for processing an audio signal comprising:
means for dividing the audio signal into segments, each segment representing a portion of the audio signal occurring in one of a succession of time intervals;
means for detecting for each segment the presence of a fundamental frequency;
means responsive to the detecting means for determining the voicing probability for each segment by computing a ratio between voiced and unvoiced components of the audio signal, the determining means comprising;
means for windowing each segment of the audio signal;
means for computing the spectrum of the windowed segment;
means for computing correlation coefficients of each segment using at least the spectrum;
means for estimating a voicing threshold for each segment, comprising;
means for dividing the spectrum into a plurality of non-linear bands, wherein the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
means for evaluating at least one voice measurement for each of the plurality of bands; and
means for determining the voicing threshold for each segment using the at least one voice measurement; and
means for comparing the correlation coefficients with the voicing threshold for each segment;
means for separating the signal in each segment into a voiced portion and an unvoiced portion on the basis of the voicing probability, wherein the voiced portion of the signal occupies the low end of the spectrum and the unvoiced portion of the signal occupies the high end of the spectrum for each segment; and
means for separately encoding the voiced portion and the unvoiced portion of the audio signal.
Dependent claims: 2, 3, 4, 5, 6
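The band division and the voiced/unvoiced separation recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the quadratic band spacing, the rule that the voicing probability is the fraction of the spectrum covered by the leading run of voiced bands, and every function name are hypothetical. How the per-band correlation values and the threshold are actually derived is not covered here.

```python
def nonlinear_band_edges(num_bins, num_bands):
    """Band edges (in bins): narrow bands at low, wide bands at high frequencies.

    Quadratic spacing is an assumption; duplicate edges produced by rounding
    are merged, so fewer than num_bands bands may result for small num_bins.
    """
    edges = {round(num_bins * (b / num_bands) ** 2)
             for b in range(num_bands + 1)}
    return sorted(edges)


def voicing_probability(band_correlations, threshold, edges):
    """Fraction of the spectrum covered by the leading run of voiced bands.

    A band counts as voiced when its correlation exceeds the threshold.
    Expects one correlation value per band, i.e. len(edges) - 1 values.
    """
    voiced_bands = 0
    for corr in band_correlations:
        if corr <= threshold:
            break
        voiced_bands += 1
    return edges[voiced_bands] / edges[-1]


def split_spectrum(spectrum, pv):
    """Return (voiced low-end bins, unvoiced high-end bins) for probability pv."""
    if not 0.0 <= pv <= 1.0:
        raise ValueError("voicing probability must lie in [0, 1]")
    cutoff = int(pv * len(spectrum) + 0.5)
    return spectrum[:cutoff], spectrum[cutoff:]


# Usage with made-up correlation values for four bands.
edges = nonlinear_band_edges(128, 4)                        # [0, 8, 32, 72, 128]
pv = voicing_probability([0.9, 0.8, 0.3, 0.1], 0.5, edges)  # 0.25
voiced, unvoiced = split_spectrum(list(range(128)), pv)
```

The two returned lists correspond to the low-end (voiced) and high-end (unvoiced) portions that the claim encodes separately.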
7. A system for processing an audio signal comprising:
means for dividing the signal into segments, each segment representing a portion of the audio signal in one of a succession of time intervals;
means for detecting for each segment the presence of a fundamental frequency;
means responsive to the detecting means for determining the voicing probability for each segment by computing a ratio between voiced and unvoiced components of the audio signal, the determining means comprising;
means for windowing each segment of the audio signal;
means for computing the spectrum of the windowed segment;
means for computing correlation coefficients of each segment using at least the spectrum;
means for estimating a voicing threshold for each segment, comprising;
means for dividing the spectrum into a plurality of non-linear bands, wherein the low bands of the spectrum have a higher resolution than the high bands of the spectrum;
means for evaluating at least one voice measurement for each of the plurality of bands; and
means for determining the voicing threshold for each segment using the at least one voice measurement; and
means for comparing the correlation coefficients with the voicing threshold for each segment;
means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency;
means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment;
means for separating the signal in each segment into a voiced portion and an unvoiced portion on the basis of the voicing probability, wherein the voiced portion of the signal occupies the low end of the spectrum and the unvoiced portion of the signal occupies the high end of the spectrum for each segment; and
means for separately encoding the voiced portion and the unvoiced portion of the audio signal, wherein the means for separately encoding further includes means for computing LPC coefficients for a speech segment and means for transforming LPC coefficients into line spectral frequencies (LSF) coefficients corresponding to the LPC coefficients.
Dependent claims: 8, 9, 10, 11
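Claim 7 recites transforming LPC coefficients into LSF coefficients without saying how. One textbook route is sketched below: form the symmetric and antisymmetric polynomials P(z) and Q(z) from A(z), then locate their unit-circle zeros by a grid search refined with bisection. The function name, grid size, and iteration count are hypothetical, and the sketch assumes a minimum-phase A(z) whose LSFs are farther apart than one grid step.

```python
import math


def lpc_to_lsf(lpc, grid=512, iterations=60):
    """Convert LPC coefficients [a1, ..., ap] of A(z) = 1 + sum(a_k z^-k) to LSFs.

    Returns the line spectral frequencies in radians, sorted, inside (0, pi).
    """
    a = [1.0] + list(lpc) + [0.0]
    rev = a[::-1]
    p_poly = [x + y for x, y in zip(a, rev)]  # symmetric polynomial P(z)
    q_poly = [x - y for x, y in zip(a, rev)]  # antisymmetric polynomial Q(z)
    mid = (len(a) - 1) / 2.0

    # On the unit circle, P reduces to a cosine sum and Q to a sine sum
    # (up to a common phase factor), so their zeros are real sign changes.
    def p_real(w):
        return sum(c * math.cos(w * (mid - k)) for k, c in enumerate(p_poly))

    def q_real(w):
        return sum(c * math.sin(w * (mid - k)) for k, c in enumerate(q_poly))

    def roots(f):
        found = []
        # Interior grid points only: trivial roots at 0 and pi are skipped.
        points = [math.pi * i / grid for i in range(1, grid)]
        for lo, hi in zip(points, points[1:]):
            f_lo = f(lo)
            if f_lo * f(hi) < 0.0:
                for _ in range(iterations):  # bisection refinement
                    m = 0.5 * (lo + hi)
                    f_m = f(m)
                    if f_lo * f_m <= 0.0:
                        hi = m
                    else:
                        lo, f_lo = m, f_m
                found.append(0.5 * (lo + hi))
        return found

    return sorted(roots(p_real) + roots(q_real))


# Usage: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2 (roots at 0.5 and 0.4, minimum phase).
lsf = lpc_to_lsf([-0.9, 0.2])
```

For this second-order example the two LSFs are arccos(0.85) and arccos(0.05), which follows from factoring P(z) and Q(z) by hand. Production codecs usually use a Chebyshev-polynomial search instead of direct trigonometric sums.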
12. A system for processing an audio signal having a number of frames, the system comprising:
an encoder comprising;
first means for determining for each frame a ratio between voiced and unvoiced components of the audio signal on the basis of the fundamental frequency of each frame, the ratio being defined as a voicing probability, the means for determining the voicing probability comprising;
means for windowing each frame of the input signal;
means for computing the spectrum of the windowed frame;
means for computing correlation coefficients of each frame using at least the spectrum; and
means for comparing the correlation coefficients with a voicing threshold for each segment;
second means for determining at least a pitch period, a mid-frame pitch period, and a mid-frame voicing probability of the audio signal; and
means for quantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and the mid-frame voicing probability.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
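The quantizing step of claim 12 can be pictured with a uniform scalar quantizer applied to the four per-frame parameters. The claim does not specify a quantizer type, value ranges, or bit allocations, so the uniform design, the 20 to 147 sample pitch range, the 7-bit and 3-bit allocations, and all names below are assumptions made for the sake of a small example.

```python
def quantize_uniform(value, lo, hi, bits):
    """Map a value in [lo, hi] to an integer index on a uniform grid."""
    steps = (1 << bits) - 1
    clipped = min(max(value, lo), hi)  # out-of-range inputs are clamped
    return int((clipped - lo) / (hi - lo) * steps + 0.5)


def dequantize_uniform(index, lo, hi, bits):
    """Reconstruct the value represented by a quantizer index."""
    steps = (1 << bits) - 1
    return lo + index * (hi - lo) / steps


# Hypothetical ranges and bit allocations: (lo, hi, bits) per parameter.
PARAMS = {
    "pitch": (20.0, 147.0, 7),
    "voicing": (0.0, 1.0, 3),
    "mid_pitch": (20.0, 147.0, 7),
    "mid_voicing": (0.0, 1.0, 3),
}


def quantize_frame(frame):
    """Quantize every parameter of a frame given as a dict of floats."""
    return {name: quantize_uniform(frame[name], *PARAMS[name])
            for name in PARAMS}


# Usage with made-up parameter values.
frame = {"pitch": 57.0, "voicing": 0.6, "mid_pitch": 55.0, "mid_voicing": 0.7}
indices = quantize_frame(frame)
decoded_pitch = dequantize_uniform(indices["pitch"], *PARAMS["pitch"])
```

With these assumed ranges the pitch grid has a step of one sample, so integer pitch periods survive the round trip, while the voicing probability is reproduced to within half a step of 1/7.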
25. A system for processing an audio signal having a number of frames, the system comprising:
an encoder comprising;
means for determining for each frame a ratio between voiced and unvoiced components of the audio signal on the basis of the fundamental frequency of each frame, the ratio being defined as a voicing probability;
means for calculating a complex spectrum for each segment by using a window based on the fundamental frequency;
means for spectrally modeling each segment using at least the complex spectrum, the fundamental frequency, and the voicing probability to obtain line spectral frequencies (LSF) coefficients and a signal gain of each segment;
means for determining at least a pitch period, a mid-frame pitch period, and a mid-frame voicing probability of the audio signal; and
means for quantizing at least the pitch period, the voicing probability, the mid-frame pitch period, and the mid-frame voicing probability.
Dependent claims: 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37
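Claim 25 calls for a complex spectrum computed with a window based on the fundamental frequency. One plausible reading, sketched below, is a window whose length tracks the pitch period. The rule of 2.5 pitch periods, the Hann shape, the length limits, and the function names are all assumptions; the claim itself fixes none of them.

```python
import cmath
import math


def pitch_adaptive_window(f0_hz, sample_rate_hz, periods=2.5,
                          min_len=32, max_len=512):
    """Hann window spanning `periods` pitch periods, clamped to a length range."""
    length = int(periods * sample_rate_hz / f0_hz + 0.5)
    length = min(max(length, min_len), max_len)
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def complex_spectrum(samples, window):
    """Naive DFT (bins 0..n/2) of the first len(window) samples, windowed."""
    n = len(window)
    if len(samples) < n:
        raise ValueError("segment is shorter than the analysis window")
    x = [s * w for s, w in zip(samples, window)]
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


# Usage: a 400 Hz tone sampled at 8 kHz gives a 50-sample window.
window = pitch_adaptive_window(400.0, 8000)
samples = [math.cos(2.0 * math.pi * 400.0 * t / 8000) for t in range(64)]
spectrum = complex_spectrum(samples, window)
```

Tying the window length to the pitch period keeps a roughly constant number of pitch cycles in each analysis window, whatever the speaker's pitch.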
Specification