Near-toll quality 4.8 kbps speech codec
First Claim
1. An apparatus for encoding an input speech signal into a plurality of coded signal portions, said apparatus including first means responsive to said input speech signal for generating at least a first coded signal portion of said plurality of coded signal portions and second means responsive to said input speech signal and to at least said first coded signal portion for generating at least a second coded signal portion of said plurality of coded signal portions, said first means comprising iterative optimization means for (1) determining an optimum value for said first coded signal portion assuming no excitation signal, and providing a corresponding first output, (2) determining an optimum value for said second coded signal portion based on said first output and providing a corresponding second output, (3) determining a new optimum value for said first coded signal portion assuming said second output as an excitation signal, and providing a corresponding new first output, (4) determining a new optimum value for said second coded value based on said new first output, and providing a corresponding new second output, and (5) repeating steps (3) and (4) until said first and second coded signal portions are optimized.
2 Assignments
0 Petitions
Accused Products
Abstract
A speech codec operating at low data rates uses an iterative method to jointly optimize pitch and gain parameter sets. A 26-bit spectrum filter coding scheme may be used, involving successive subtractions and quantizations. The codec may preferably use a decomposed multipulse excitation model, wherein the multipulse vectors used as the excitation signal are decomposed into position and amplitude codewords. Multipulse vectors are coded by comparing each vector to a reference multipulse vector and quantizing the resulting difference vector. An expanded multipulse excitation codebook and associated fast search method, optionally with a dynamically-weighted distortion measure, allow selection of the best excitation vector without memory or computational overload. In a dynamic bit allocation technique, the number of bits allocated to the pitch and excitation signals depends on whether the signals are "significant" or "insignificant". Silence/speech detection is based on an average signal energy over an interval and a minimum average energy over a predetermined number of intervals. Adaptive post-filter and automatic gain control schemes are also provided. Interpolation is used for spectrum filter smoothing, and an algorithm is provided for ensuring stability of the spectrum filter. Specially designed scalar quantizers are provided for the pitch gain and excitation gain.
158 Citations
42 Claims
-
1. An apparatus for encoding an input speech signal into a plurality of coded signal portions, said apparatus including first means responsive to said input speech signal for generating at least a first coded signal portion of said plurality of coded signal portions and second means responsive to said input speech signal and to at least said first coded signal portion for generating at least a second coded signal portion of said plurality of coded signal portions, said first means comprising iterative optimization means for
(1) determining an optimum value for said first coded signal portion assuming no excitation signal, and providing a corresponding first output, (2) determining an optimum value for said second coded signal portion based on said first output and providing a corresponding second output, (3) determining a new optimum value for said first coded signal portion assuming said second output as an excitation signal, and providing a corresponding new first output, (4) determining a new optimum value for said second coded value based on said new first output, and providing a corresponding new second output, and (5) repeating steps (3) and (4) until said first and second coded signal portions are optimized.
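The iterative optimization of claim 1 can be read as a coordinate-descent loop: each coded portion is re-chosen while the other is held fixed, until neither choice changes. A minimal sketch, assuming squared-error distortion and small candidate lists (`joint_optimize`, `pitch_candidates`, and `exc_candidates` are illustrative names, not from the patent):

```python
import numpy as np

def joint_optimize(target, pitch_candidates, exc_candidates, max_iters=10):
    """Coordinate-descent sketch of the claimed iterative optimization:
    alternately re-pick the pitch and excitation contributions that
    minimize squared error against the target, until both choices stop
    changing."""
    def error(p, e):
        return float(np.sum((target - p - e) ** 2))

    zero = np.zeros_like(target)
    # Step (1): best pitch contribution assuming no excitation signal.
    pitch = min(pitch_candidates, key=lambda p: error(p, zero))
    # Step (2): best excitation given that pitch contribution.
    exc = min(exc_candidates, key=lambda e: error(pitch, e))
    for _ in range(max_iters):
        # Steps (3) and (4): re-optimize each portion given the other.
        new_pitch = min(pitch_candidates, key=lambda p: error(p, exc))
        new_exc = min(exc_candidates, key=lambda e: error(new_pitch, e))
        if np.array_equal(new_pitch, pitch) and np.array_equal(new_exc, exc):
            break  # Step (5): stop once both portions are stable.
        pitch, exc = new_pitch, new_exc
    return pitch, exc
```

Because each step can only lower the error over a finite candidate set, the loop must terminate.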
-
4. A speech analysis and synthesis method comprising the steps of deriving a set of predictor coefficients for each analysis time period from an original input signal having a plurality of successive analysis time periods, coding said predictor coefficients to obtain a coded representation of said coefficients, transmitting the coded representation of said predictor coefficients to a decoder and synthesizing the original input speech signal in accordance with said transmitted coded representation of said predictor coefficients, said coding step comprising:
-
transforming said set of predictor coefficients for one analysis time period into parameters in a parameter set to form a parameter vector; subtracting from said parameter vector a mean vector determined in advance from a large speech data base to obtain an adjusted parameter vector; selecting from a codebook of 2^L entries (where L is an integer), prepared in advance from said large speech data base, a prediction matrix A such that
Fn = A·Fn-1, where n is an integer, Fn is a predicted parameter vector for said one analysis time period and Fn-1 is the adjusted parameter vector for an immediately preceding analysis time period; calculating a predicted parameter vector for said one analysis time period as well as a residual parameter vector comprising the difference between said predicted parameter vector and said adjusted parameter vector; quantizing said residual parameter vector in a first stage vector quantizer by selecting one of 2^M (where M is an integer) first quantization vectors to obtain an intermediate quantized vector; calculating a residual quantized vector comprising the difference between said intermediate quantized vector and said residual parameter vector; quantizing said residual quantized vector in a second stage vector quantizer by selecting one of 2^N (where N is an integer) second quantization vectors to obtain a final quantized vector; and forming said transmitted coded representation of said predictor coefficients by combining an L-bit value representing the prediction matrix A, an M-bit value representing said intermediate quantized vector and an N-bit value representing said final quantized vector. - View Dependent Claims (5, 6)
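The coding step of claim 4 chains a matrix predictor with two VQ stages, each quantizing what the previous step left over. A minimal sketch under those assumptions (the codebook contents and function names are illustrative; the real codebooks would be trained from the large speech data base the claim mentions):

```python
import numpy as np

def code_frame(params, prev_adjusted, mean, matrices, stage1, stage2):
    """Sketch of the claimed predictive two-stage vector quantizer.
    `matrices`, `stage1` and `stage2` stand in for the 2^L prediction
    matrices and the 2^M / 2^N quantization vectors."""
    adjusted = params - mean  # remove the precomputed mean vector
    # Pick the prediction matrix A minimizing the prediction residual
    # for Fn = A Fn-1.
    def resid(A):
        return adjusted - A @ prev_adjusted
    l_idx = min(range(len(matrices)),
                key=lambda i: float(np.sum(resid(matrices[i]) ** 2)))
    residual = resid(matrices[l_idx])
    # First-stage VQ of the residual parameter vector.
    m_idx = min(range(len(stage1)),
                key=lambda i: float(np.sum((residual - stage1[i]) ** 2)))
    # Second-stage VQ of what the first stage missed.
    leftover = residual - stage1[m_idx]
    n_idx = min(range(len(stage2)),
                key=lambda i: float(np.sum((leftover - stage2[i]) ** 2)))
    return l_idx, m_idx, n_idx  # the L-, M- and N-bit values

def decode_frame(l_idx, m_idx, n_idx, prev_adjusted, mean,
                 matrices, stage1, stage2):
    """Inverse path: prediction plus both quantized residuals plus mean."""
    predicted = matrices[l_idx] @ prev_adjusted
    return predicted + stage1[m_idx] + stage2[n_idx] + mean
```

The transmitted code is just the three indices, which is what makes the scheme compact.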
-
-
7. A speech analysis and synthesis method comprising the steps of deriving a set of predictor coefficients for each analysis time period from an original input signal having a plurality of successive analysis time periods, coding said predictor coefficients to obtain a coded representation of said coefficients, transmitting the coded representation of said predictor coefficients to a decoder and synthesizing the original input speech signal in accordance with said transmitted coded representation of said predictor coefficients, said coding step comprising:
-
generating a multi-component input vector corresponding to said set of predictor coefficients for one analysis time period, with each component of said vector corresponding to a frequency; quantizing said input vector by selecting a plurality of multi-component quantization vectors from a quantization vector storage means and calculating for each selected quantization vector a distortion measure in accordance with the difference between each component of said input vector and each corresponding component of the selected quantization vector, and in accordance with a weighting factor associated with each component of said input vector, the weighting factor being determined for each component of said input vector in accordance with the frequency to which said component corresponds; selecting as a quantizer output the one of said plurality of selected quantization vectors resulting in the least distortion measure; and generating said transmitted coded representation in accordance with the selected quantizer output. - View Dependent Claims (8, 9)
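The weighted distortion measure of claim 7 can be sketched as a codebook search where each component's squared difference is scaled by a frequency-dependent weight before summing (the function name and the specific weights are illustrative, not from the patent):

```python
import numpy as np

def weighted_vq(input_vec, codebook, weights):
    """Sketch of the claimed weighted codebook search: the distortion for
    each candidate is the per-component squared difference scaled by a
    frequency-dependent weight, and the least-distortion entry wins."""
    def distortion(cand):
        return float(np.sum(weights * (input_vec - cand) ** 2))
    best = min(range(len(codebook)), key=lambda i: distortion(codebook[i]))
    return best, codebook[best]
```

Weighting lets perceptually important frequency components dominate the selection even when their raw error is small.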
-
-
10. A speech analysis and synthesis system comprising:
-
excitation signal generating means for generating for each of a plurality of analysis time periods of an input speech signal a multipulse excitation signal comprising a sequence of excitation pulses each having an amplitude and a position within said analysis time period, said excitation signal generating means comprising; means for storing a plurality of pulse amplitude codewords; means for storing a plurality of pulse position codewords; and means for reading a pulse amplitude codeword and a pulse position codeword to form said multipulse excitation signal; and means for subsequently regenerating said speech signal in accordance with said multipulse excitation signals.
-
-
11. A speech analysis and synthesis method comprising the steps of:
-
generating for each of a plurality of analysis time periods of an input speech signal a multipulse excitation vector representing a sequence of excitation pulses each having an amplitude and a position within said analysis time period, said generating step comprising; selecting a pulse position codeword from a stored plurality of pulse position codewords; selecting a pulse amplitude codeword from a stored plurality of pulse amplitude codewords; and combining said selected pulse position and pulse amplitude codewords to form said multipulse excitation vector; and subsequently regenerating said speech signal in accordance with said multipulse excitation vector. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18)
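The decomposed excitation of claims 10 and 11 amounts to looking up a position codeword and an amplitude codeword in separate stores and combining them into one pulse train. A minimal sketch with illustrative codebooks:

```python
import numpy as np

def build_excitation(pos_idx, amp_idx, pos_book, amp_book, frame_len):
    """Sketch of the claimed decomposed multipulse excitation: read a
    position codeword and an amplitude codeword from separate stores and
    combine them into one excitation vector. Codebook contents here are
    illustrative."""
    positions = pos_book[pos_idx]    # pulse positions within the frame
    amplitudes = amp_book[amp_idx]   # matching pulse amplitudes
    exc = np.zeros(frame_len)
    for p, g in zip(positions, amplitudes):
        exc[p] = g                   # place each pulse at its position
    return exc
```

Storing positions and amplitudes separately lets each codebook be smaller than one joint codebook of full excitation vectors.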
-
-
19. A speech analysis and synthesis method comprising the steps of:
-
generating for each of a plurality of analysis time periods of an input speech signal a multipulse excitation vector representing a sequence of excitation pulses each having an amplitude and a position within said analysis time period, coding said multipulse excitation vectors, wherein said coding step comprises; generating for each multipulse excitation vector a difference excitation vector which is a function of the difference between said each multipulse excitation vector and a reference multipulse excitation vector; and quantizing said difference excitation vector to obtain said coded multipulse excitation vectors; decoding the coded multipulse excitation vectors; and subsequently regenerating said speech signal in accordance with decoded multipulse excitation vectors. - View Dependent Claims (20, 21, 22, 23)
-
-
24. A speech analysis and synthesis method comprising the steps of:
-
generating for each of a plurality of analysis time periods of an input speech signal a vector representing a sequence of excitation pulses each having an amplitude and a position within said analysis time period, each of said vectors being of the form V=(m1, . . . , mL, g1, . . . , gL), where L is the total number of excitation pulses represented by said vector, mi and gi, 1 ≤ i ≤ L, are position-related and amplitude-related terms, respectively, corresponding to the i-th excitation pulse in said vector; coding said vectors, wherein said coding step comprises separating said vector into a position subvector (m1, . . . , mL) and an amplitude subvector (g1, . . . , gL), and then quantizing said position subvector in a first quantizer and quantizing said amplitude subvector in a second quantizer, with the quantized position subvector and quantized amplitude subvector together comprising said coded vector; decoding the coded vectors; and subsequently regenerating said speech signal in accordance with decoded vectors.
-
-
25. A speech analysis and synthesis method comprising the steps of:
-
generating, for each of a plurality of analysis time periods of an input speech signal, a vector representing a sequence of excitation pulses each having an amplitude and a position within said analysis time period, each said vector being of the form V=(m1, . . . , mL, g1, . . . , gL), where L is the total number of excitation pulses represented by said vector, mi and gi, 1 ≤ i ≤ L, are position-related and amplitude-related terms, respectively, corresponding to the i-th excitation pulse in said vector; coding said vectors, wherein said coding step comprises; generating from a given one of said vectors a position reference subvector Vm and an amplitude reference subvector Vg; selecting from a position codebook a plurality of position codewords in accordance with said position reference subvector; selecting from an amplitude codebook a plurality of amplitude codewords in accordance with said amplitude reference subvector; generating a plurality of position codeword/amplitude codeword pairs from various combinations of said selected position and amplitude codewords; calculating a distortion measure between said given vector and each position codeword/amplitude codeword pair; and selecting a position codeword/amplitude codeword pair resulting in the lowest distortion measure as a coded version of said given vector; decoding the coded vectors; and subsequently regenerating said speech signal in accordance with decoded vectors. - View Dependent Claims (26, 27)
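The pair search of claim 25 can be sketched as an exhaustive distortion comparison over the cross product of two short candidate lists, assumed to have been preselected from the full position and amplitude codebooks (names and the squared-error measure are illustrative):

```python
import numpy as np

def pair_search(target, pos_candidates, amp_candidates, frame_len):
    """Sketch of the claimed search: combine each selected position
    codeword with each selected amplitude codeword, measure the
    distortion of the resulting excitation against the target vector,
    and keep the lowest-distortion pair."""
    best_pair, best_d = None, float("inf")
    for positions in pos_candidates:
        for amplitudes in amp_candidates:
            exc = np.zeros(frame_len)
            exc[np.asarray(positions)] = amplitudes  # place the pulses
            d = float(np.sum((target - exc) ** 2))   # distortion measure
            if d < best_d:
                best_pair, best_d = (positions, amplitudes), d
    return best_pair, best_d
```

Preselecting a handful of candidates from each codebook is what keeps this joint search cheap compared with searching every pair in the full codebooks.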
-
-
28. A speech analysis and synthesis method comprising the steps of:
-
generating a plurality of analysis signals from an input signal, said analysis signals comprising at least a pitch signal portion including a pitch value and a pitch gain value, and an excitation signal portion including an excitation codeword and an excitation gain signal; coding said analysis signals, wherein said coding step includes the steps of; classifying each of said pitch signal portions and excitation signal portions as significant or insignificant; allocating a number of coding bits to each of said pitch signal portions and excitation signal portions in accordance with results of said classifying step; and coding each of said pitch and excitation signals with the number of bits allocated to each; and decoding said analysis signals; and synthesizing said coded speech signal in accordance with the decoded analysis signals. - View Dependent Claims (29, 30)
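The classify-then-allocate step of claim 28 can be sketched as a threshold test followed by a two-way bit budget. All thresholds and bit counts below are illustrative assumptions, not values from the patent:

```python
def classify_significant(gain, threshold=0.1):
    """Classify a signal portion as significant when its gain exceeds a
    threshold (illustrative criterion; the patent's exact test may differ)."""
    return gain > threshold

def allocate_bits(pitch_gain, exc_gain, full_pitch=8, full_exc=13, reduced=1):
    """Sketch of the claimed dynamic bit allocation: a significant portion
    gets its full bit budget, an insignificant one a reduced budget, so
    bits are spent where they audibly matter."""
    pitch_bits = full_pitch if classify_significant(pitch_gain) else reduced
    exc_bits = full_exc if classify_significant(exc_gain) else reduced
    return pitch_bits, exc_bits
```

At a fixed total rate, bits saved on an insignificant portion can be reassigned to the other portion or to spectrum coding.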
-
-
31. A speech activity detector for use in an apparatus for encoding an input signal having speech and non-speech portions, for determining the speech or non-speech character of said input signal over each of a plurality of successive intervals, said speech activity detector comprising monitoring means for monitoring an energy content of said input signal and discriminating means responsive to the monitored energy for discriminating between speech and non-speech input signals, said monitoring means comprising means for determining an average energy of said input signal over one of said intervals and means for determining a minimum value of said average energy over a predetermined number of said intervals;
- and said discriminating means comprising means for determining a threshold value in accordance with said minimum value and means for comparing said average energy of said input signal over said one interval to said threshold value to determine if said input signal during said one interval represents speech or non-speech.
- View Dependent Claims (32, 33)
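The detector of claim 31 tracks the minimum of the per-interval average energy as a noise-floor estimate and derives the decision threshold from it. A minimal sketch, assuming a multiplicative margin (the history length and margin are illustrative):

```python
from collections import deque

import numpy as np

class EnergyVad:
    """Sketch of the claimed speech activity detector: compute the
    interval's average energy, keep the minimum average energy over the
    last `history` intervals as a noise-floor estimate, and call the
    interval speech when its average energy exceeds a threshold derived
    from that floor."""
    def __init__(self, history=100, margin=4.0):
        self.recent = deque(maxlen=history)  # recent average energies
        self.margin = margin                 # illustrative safety factor

    def is_speech(self, frame):
        avg_energy = float(np.mean(np.asarray(frame, dtype=float) ** 2))
        self.recent.append(avg_energy)
        noise_floor = min(self.recent)         # minimum average energy
        threshold = self.margin * noise_floor  # threshold from the minimum
        return avg_energy > threshold
```

Deriving the threshold from a running minimum lets the detector adapt to slowly varying background noise without an absolute energy calibration.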
-
34. A speech detector for discriminating between speech and non-speech intervals of an input signal, said speech detector comprising monitoring means for monitoring at least one characteristic of said input signal and discriminating means responsive to said monitoring means for discriminating between speech and non-speech input signals, wherein said monitoring means comprises first means for determining if said one characteristic of said input signal for a present interval meets at least a first criterion of a signal representing speech and wherein said discriminating means comprises second means responsive to a determination of speech by said first means for setting a predetermined hangover time in accordance with a number of consecutive intervals for which said input signal has been determined to satisfy said first criterion, and third means responsive to a determination by said first means that said input signal does not satisfy said criterion for determining non-speech in accordance with a number of consecutive intervals for which said criterion has not been satisfied and in accordance with the hangover time set by said second means.
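The hangover mechanism of claim 34 can be sketched as two run counters: the hangover grows with the length of the current speech run, and non-speech is only declared once the run of misses outlasts that hangover. The cap of 8 intervals is an illustrative choice:

```python
class HangoverDetector:
    """Sketch of the claimed hangover logic: the longer speech has been
    sustained, the longer the detector waits before declaring non-speech
    after the speech criterion stops being met."""
    def __init__(self, max_hangover=8):
        self.speech_run = 0   # consecutive intervals meeting the criterion
        self.miss_run = 0     # consecutive intervals failing it
        self.hangover = 0     # intervals to wait before declaring non-speech
        self.max_hangover = max_hangover

    def update(self, criterion_met):
        if criterion_met:
            self.speech_run += 1
            self.miss_run = 0
            # Hangover grows with how long speech has been sustained.
            self.hangover = min(self.speech_run, self.max_hangover)
            return True
        self.miss_run += 1
        self.speech_run = 0
        # Stay in "speech" until the misses outlast the hangover.
        return self.miss_run <= self.hangover
```

The hangover prevents the codec from clipping word endings and short pauses inside an utterance.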
-
35. A speech analysis and synthesis method comprising the steps of:
-
deriving a set of synthesis parameters for each frame from an original input signal having a plurality of successive frames including a current frame, a previous frame and a next frame, with each frame having first, second and third portions, said step of deriving said synthesis parameters comprising; generating a set of first parameters corresponding to each frame of said input signal, each set of first parameters for a given frame including first, second and third subsets corresponding to said first, second and third portions of the given frame; generating an interpolated first subset of parameters by interpolating between said first subsets of said current and previous frames; generating an interpolated third subset of parameters by interpolating between said third subsets of said current and next frames; combining said interpolated first subset, said second subset and said interpolated third subset of parameters to form a set of synthesis parameters for said current frame; transmitting the synthesis parameters to a decoder; and synthesizing the original input speech signal in accordance with said transmitted synthesis parameters. - View Dependent Claims (36)
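The interpolation of claim 35 blends a frame's first subframe parameters with the previous frame's and its third with the next frame's, keeping the middle subframe as-is. A minimal sketch assuming equal-weight averaging (the claim does not fix the interpolation weights):

```python
import numpy as np

def smooth_frame(prev, cur, nxt):
    """Sketch of the claimed spectrum-smoothing interpolation. Each frame
    is a list of three parameter subsets (first, second, third subframe
    portions); the outer portions are averaged with the neighboring
    frames' matching portions."""
    first = 0.5 * (np.asarray(cur[0]) + np.asarray(prev[0]))
    third = 0.5 * (np.asarray(cur[2]) + np.asarray(nxt[2]))
    return [first, np.asarray(cur[1]), third]
```

Smoothing across frame boundaries reduces audible parameter jumps when the spectrum filter is updated once per frame.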
-
-
37. A speech analysis and synthesis method, comprising:
-
deriving a set of spectrum filter coefficients for each frame from an original input signal representing speech and having a plurality of successive frames; converting said spectrum filter coefficients to an ordered set of n frequency parameters (f1, f2, . . . , fn), where n is an integer; determining if any magnitude ordering has been violated, i.e., if fi < fi-1, where i is an integer between 1 and n; if any magnitude ordering has been violated, rearranging said frequency parameters by reversing the order of the two frequencies fi and fi-1 which resulted in the violation; converting said frequency parameters, after any rearrangement that has occurred, back to spectrum filter coefficients; and synthesizing said original input signal representing said speech in accordance with the spectrum filter coefficients resulting from said converting step. - View Dependent Claims (38)
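The ordering check of claim 37 is a scan for any fi < fi-1 followed by a swap of the offending pair. A minimal sketch of that per-pair repair (the frequency values are illustrative):

```python
def enforce_ordering(freqs):
    """Sketch of the claimed stability fix: scan the frequency parameters
    and, wherever fi < fi-1, reverse the order of that pair so the
    ascending ordering (and hence filter stability) is restored."""
    f = list(freqs)
    for i in range(1, len(f)):
        if f[i] < f[i - 1]:               # magnitude ordering violated
            f[i - 1], f[i] = f[i], f[i - 1]  # reverse the offending pair
    return f
```

For line-spectrum-style representations, monotonically increasing frequencies are a known sufficient condition for a stable synthesis filter, which is why this cheap repair works.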
-
-
39. A speech analysis and synthesis method comprising the steps of:
-
generating a plurality of analysis signals from an input signal, said analysis signals comprising at least a pitch value, a pitch gain value, an excitation codeword and an excitation gain signal; quantizing said analysis signals, wherein said quantizing step comprises; quantizing said pitch value directly by classifying said pitch value into one of 2^m value ranges, where m is an integer, with m quantization bits representing the classification value; and quantizing said pitch gain by selecting a corresponding codeword from a codebook of 2^n codewords, where n is an integer, with n quantization bits representing the selected codeword; providing the quantized analysis signals to a decoder, and synthesizing said speech signal in accordance with the quantized signals at the decoder. - View Dependent Claims (40, 41, 42)
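The two quantizers of claim 39 can be sketched directly: the pitch value is classified into one of 2^m uniform ranges, while the gain is matched against a 2^n-entry codebook. The range bounds and codebook values below are illustrative assumptions:

```python
import numpy as np

def quantize_pitch(pitch, lo=20, hi=148, m=7):
    """Sketch of the claimed direct pitch quantization: classify the
    pitch into one of 2^m uniform value ranges and return the m-bit
    range index. The 20..148 range is an illustrative assumption."""
    levels = 2 ** m
    step = (hi - lo) / levels
    idx = int(min(max((pitch - lo) / step, 0), levels - 1))
    return idx  # the m-bit classification value

def quantize_gain(gain, codebook):
    """Sketch of the claimed gain quantization: pick the nearest of the
    2^n codebook entries; the n-bit index represents the codeword."""
    return int(np.argmin(np.abs(np.asarray(codebook) - gain)))
```

The pitch lends itself to direct range classification because it is already bounded, while the gain uses a trained codebook whose entries need not be uniformly spaced.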
-
Specification