Adaptive windows for analysis-by-synthesis CELP-type speech coding
Abstract
A speech coder and a method for speech coding wherein the speech signal is represented by an excitation signal applied to a synthesis filter. The speech is partitioned into frames and subframes. A classifier identifies which of several categories the speech frame belongs to, and a different coding method is applied to represent the excitation for each category. For some categories, one or more windows are identified for the frame where all or most of the excitation signal samples are assigned by a coding scheme. Performance is enhanced by coding the important segments of the excitation more accurately. The window locations are determined from a linear prediction residual by identifying peaks of the smoothed residual energy contour. The method adjusts the frame and subframe boundaries so that each window is located entirely within a modified subframe or frame. This eliminates the artificial restriction incurred when coding a frame or subframe in isolation, without regard for the local behavior of the speech signal across frame or subframe boundaries.
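The window-placement step described in the abstract (locating windows at peaks of a smoothed residual energy contour) can be sketched in Python as follows. The smoothing length, minimum peak separation, and relative threshold below are illustrative assumptions, not values disclosed in the patent:

```python
import numpy as np

def window_centers(residual, smooth_len=11, min_sep=40, rel_thresh=0.5):
    """Sketch of window placement from the smoothed residual energy contour.
    smooth_len, min_sep and rel_thresh are illustrative assumptions."""
    energy = np.asarray(residual, dtype=float) ** 2
    # Smooth the energy contour with a short moving average.
    kernel = np.ones(smooth_len) / smooth_len
    contour = np.convolve(energy, kernel, mode="same")
    # Greedily pick the strongest peaks, keeping them min_sep samples
    # apart and ignoring anything below a fraction of the global maximum.
    centers = []
    for i in np.argsort(contour)[::-1]:
        if contour[i] < rel_thresh * contour.max():
            break
        if all(abs(int(i) - c) >= min_sep for c in centers):
            centers.append(int(i))
    return sorted(centers)
```

For a residual dominated by glottal pulses, the returned centers fall near the pulse locations, which is the behavior the abstract relies on when it confines the coded excitation to windows.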
38 Claims
1. A method for coding a speech signal, comprising steps of:
partitioning samples of the speech signal into frames;
classifying the speech signal in each frame into one of a plurality of classes, wherein the step of classifying classifies a frame as being one of an unvoiced frame or a not unvoiced frame and classifies said not unvoiced frame as being one of a voiced frame or a transition frame;
determining the location of at least one window in the frame; and
encoding an excitation for the frame, whereby all or substantially all of non-zero excitation amplitudes lie within the at least one window.
(Dependent claims: 2, 3, 4)
deriving a residual signal for each frame; and
smoothing an energy contour of the residual signal;
wherein the location of the at least one window is determined by examining the smoothed energy contour of the residual signal.
4. A method as in claim 1, wherein the at least one window can be located so as to have an edge that coincides with at least one of a subframe boundary or a frame boundary.
5. A method for coding a speech signal, comprising the steps of:
partitioning samples of the speech signal into frames;
classifying the speech signal in each frame into one of a plurality of classes, wherein the step of classifying classifies a frame as being one of an unvoiced frame or a not unvoiced frame and classifies said not unvoiced frame as being one of a voiced frame or a transition frame;
deriving a residual signal for each frame;
determining a location of at least one window, whose center lies within the frame, by considering the residual signal for the frame; and
encoding an excitation for the frame whereby all or substantially all of non-zero excitation amplitudes lie within the at least one window.
(Dependent claims: 6, 7)
8. A method for coding a speech signal, comprising steps of:
partitioning samples of the speech signal into frames;
deriving a residual signal for each frame;
classifying the speech signal in each frame into one of a plurality of classes, wherein the step of classifying classifies a frame as being one of an unvoiced frame or a not unvoiced frame and classifies said not unvoiced frame as being one of a voiced frame or a transition frame;
identifying the location of at least one window in the frame by examining the residual signal for the frame;
encoding an excitation for the frame using one of a plurality of excitation coding techniques selected according to the class of the frame; and
for at least one of the classes, confining all or substantially all of non-zero excitation amplitudes to lie within the windows.
(Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
forming a smoothed energy contour from the residual signal; and
considering a location of peaks in the smoothed energy contour.
12. A method as in claim 8, wherein one of the plurality of codebooks is comprised of an adaptive codebook.
13. A method as in claim 8, wherein one of the plurality of codebooks is comprised of a fixed ternary pulse coding codebook.
14. A method as in claim 8, wherein the step of classifying uses an open loop classifier followed by a closed loop classifier.
15. A method as in claim 8, wherein the step of classifying uses a first classifier to classify said frame as being one of said unvoiced frame or said not unvoiced frame, and a second classifier for classifying said not unvoiced frame as being one of a voiced frame or a transition frame.
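The two-stage classification recited in claims 1, 14 and 15 (first unvoiced versus not unvoiced, then voiced versus transition) might be realized open-loop with simple features such as the zero-crossing rate and a normalized autocorrelation peak. The features and thresholds below are assumptions for illustration, not the classifier disclosed in this patent:

```python
import numpy as np

def classify_frame(frame, zcr_thresh=0.3, corr_thresh=0.7):
    """Two-stage frame classifier sketch: unvoiced vs. not unvoiced,
    then voiced vs. transition.  Features and thresholds are
    illustrative assumptions."""
    x = np.asarray(frame, dtype=float)
    # Stage 1: a high zero-crossing rate indicates noise-like (unvoiced) speech.
    zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)
    if zcr > zcr_thresh:
        return "unvoiced"
    # Stage 2: a strong normalized autocorrelation peak over pitch-like lags
    # indicates steady voicing; otherwise call the frame a transition.
    best = 0.0
    for lag in range(20, len(x) // 2):
        a, b = x[lag:], x[:-lag]
        r = np.dot(a, b) / (np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12)
        best = max(best, r)
    return "voiced" if best > corr_thresh else "transition"
```

A periodic frame scores a high autocorrelation at its pitch lag and is labeled voiced; an isolated burst has a low zero-crossing rate but no periodic correlation, and falls through to the transition class.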
16. A method as in claim 8, wherein the step of encoding comprises steps of:
partitioning the frame into a plurality of subframes; and
positioning at least one window within each subframe.
17. A method as in claim 16, wherein the step of positioning at least one window positions a first window at a location that is a function of a pitch of the frame, and positions subsequent windows as a function of the pitch of the frame and as a function of the position of the first window.
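The pitch-driven placement of claim 17 (a first window positioned from the pitch, subsequent windows following at the pitch period) can be sketched as below; the function name and its arguments are hypothetical:

```python
def position_windows(first_center, pitch, frame_len, win_len):
    """Place windows one pitch period apart starting from a first center
    (sketch of claim 17's placement rule; names are illustrative).
    Returns (start, end) sample ranges clipped to the frame."""
    centers = []
    c = first_center
    while c < frame_len:
        centers.append(c)
        c += pitch            # subsequent windows follow at the pitch period
    half = win_len // 2
    return [(max(0, c - half), min(frame_len, c + half)) for c in centers]
```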
18. A method as in claim 8, wherein the step of identifying the location of at least one window includes a step of smoothing the residual signal, and wherein the step of identifying considers the presence of energy peaks in the smoothed contour of the residual signal.
19. A method for coding a speech signal, comprising steps of:
partitioning samples of the speech signal into frames;
classifying the speech signal in each frame into one of a plurality of classes, wherein the step of classifying classifies a frame as being one of an unvoiced frame or a not unvoiced frame and classifies said not unvoiced frame as being one of a voiced frame or a transition frame;
modifying the duration and boundaries of a frame or a subframe by considering the speech or residual signal for the frame; and
encoding an excitation for the frame using an analysis-by-synthesis coding technique.
(Dependent claims: 20)
21. Apparatus for coding speech, comprising:
a framing unit for partitioning samples of an input speech signal into frames;
a first classifier for classifying a frame as being one of an unvoiced frame or a not unvoiced frame and a second classifier for classifying said not unvoiced frame as being one of a voiced frame or a transition frame;
a windowing unit for determining the location of at least one window in a frame; and
an encoder for encoding an excitation for the frame such that all or substantially all of non-zero excitation amplitudes lie within the at least one window.
(Dependent claims: 22, 23, 24)
a unit for deriving a residual signal for each frame; and
a unit for smoothing an energy contour of the residual signal;
wherein said windowing unit determines the location of the at least one window by examining the smoothed energy contour of the residual signal.
24. Apparatus as in claim 21, wherein said windowing unit is operative for locating said at least one window so as to have an edge that coincides with at least one of a subframe boundary or a frame boundary.
25. A wireless voice communicator, comprising:
a wireless transceiver comprising a transmitter and a receiver;
an input speech transducer and an output speech transducer; and
a speech processor comprising, a sampling and framing unit having an input coupled to an output of said input speech transducer for partitioning samples of an input speech signal into frames;
a first classifier for classifying a frame as being one of an unvoiced frame or a not unvoiced frame and a second classifier for classifying said not unvoiced frame as being one of a voiced frame or a transition frame;
a windowing unit for determining the location of at least one window in a frame; and
an encoder for providing an encoded speech signal where, in an excitation for the frame, all or substantially all of non-zero excitation amplitudes lie within the at least one window;
said wireless communicator further comprising a modulator for modulating a carrier with the encoded speech signal, said modulator having an output coupled to an input of said transmitter;
a demodulator having an input coupled to an output of said receiver for demodulating a carrier that is encoded with a speech signal and that was transmitted from a remote transmitter; and
said speech processor further comprising a decoder having an input coupled to an output of said demodulator for decoding an excitation from a frame wherein all or substantially all of non-zero excitation amplitudes lie within at least one window, said decoder having an output coupled to an input of said output speech transducer.
(Dependent claims: 26, 27, 28, 29, 30, 31, 32, 33)
a unit for deriving a residual signal for each frame; and
a unit for smoothing an energy contour of the residual signal;
wherein said windowing unit determines the location of the at least one window by examining the smoothed energy contour of the residual signal.
28. A wireless communicator as in claim 25, wherein said windowing unit is operative for locating said at least one window so as to have an edge that coincides with at least one of a subframe boundary or a frame boundary.
29. A wireless communicator as in claim 25, wherein said speech processor further comprises a unit for modifying the duration and boundaries of a frame or a subframe by considering the speech or residual signal for the frame; and wherein said encoder encodes an excitation for the frame using an analysis-by-synthesis coding technique.
30. A wireless communicator as in claim 25, wherein a frame is comprised of at least two subframes, and wherein said windowing unit operates such that a subframe boundary or a frame boundary is modified so that the window lies entirely within the modified subframe or frame, and the boundary is located so as to have an edge of the modified frame or subframe coincide with a window boundary.
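The boundary modification in claim 30, which moves a nominal subframe boundary to a window edge so that the window lies entirely inside one modified subframe, might look like this minimal sketch (the tie-breaking rule, snapping to the nearer window edge, is an assumption):

```python
def adjust_boundary(boundary, win_start, win_end):
    """If a window straddles a nominal subframe boundary, move the
    boundary to a window edge so the window lies entirely inside one
    modified subframe (sketch; the nearest-edge rule is an assumption)."""
    if win_start < boundary < win_end:
        # Snap to whichever window edge is closer to the nominal boundary.
        return win_start if boundary - win_start <= win_end - boundary else win_end
    return boundary
```

After adjustment, the modified boundary coincides with a window edge, as the claim requires, and windows never need to be coded across a subframe boundary.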
31. A wireless communicator as in claim 25, wherein said windowing unit operates such that windows are centered at epochs, wherein epochs of voiced frames are separated by a predetermined distance plus or minus a jitter value, wherein said modulator further modulates said carrier with an indication of the jitter value, and wherein said demodulator further demodulates the received carrier to obtain the jitter value for the received frame.
32. A wireless communicator as in claim 31, wherein the predetermined distance is one pitch period, and wherein the jitter value is an integer between about −8 and about +7.
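Claims 31 and 32 place epochs one pitch period apart, plus or minus a jitter value in the range −8 to +7, a range that fits exactly in a 4-bit two's-complement field. A sketch of such encoding and decoding follows; the bit packing shown is an illustrative assumption, not the patent's bit allocation:

```python
def encode_jitter(prev_epoch, epoch, pitch):
    """Encode an epoch as a jitter around one pitch period.
    A jitter in [-8, +7] fits in a 4-bit two's-complement field."""
    jitter = epoch - (prev_epoch + pitch)
    assert -8 <= jitter <= 7, "jitter outside the 4-bit range"
    return jitter & 0xF                        # 4-bit two's-complement code

def decode_jitter(code, prev_epoch, pitch):
    """Recover the epoch from the 4-bit jitter code."""
    jitter = code - 16 if code >= 8 else code  # sign-extend 4 bits
    return prev_epoch + pitch + jitter
```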
33. A wireless communicator as in claim 25, wherein said encoder and said decoder operate at a data rate of less than about 4 kb/sec.
34. A speech decoder, comprising:
a class decoder having an input coupled to an input node of said speech decoder for extracting from an input bit stream predetermined ones of bits encoding class information for an encoded speech signal frame and for decoding the class information, wherein there are a plurality of predetermined classes;
said plurality of predetermined classes comprises a voiced class, an unvoiced class and a transition class; and
wherein said input bit stream is also coupled to an input of a LSP decoder;
a first multi-position switch element controlled by an output of said class decoder for directing said input bit stream to an input of a selected one of a plurality of excitation generators, an individual one of said excitation generators corresponding to one of said plurality of predetermined classes;
a second multi-position switch element controlled by said output of said class decoder for coupling an output of the selected one of said excitation generators to an input of a synthesizer filter and, via a feedback path, also to said adaptive code book;
an unvoiced class excitation generator and a transition class excitation generator coupled between said first and second multi-position switch elements;
wherein for said transition class, at least one window position is decoded in a window decoder having an input coupled to said input bit stream; and
wherein a codebook vector is retrieved from a transition excitation fixed codebook using information concerning the at least one window location output from said window decoder and by multiplying a retrieved codebook vector by a gain; and
wherein for said voiced class, the input bit stream encodes pitch information for the encoded speech signal frame which is decoded in a pitch decoder block having an output coupled to a window generator block that generates at least one window based on the decoded pitch information, said at least one window being used to retrieve, from an adaptive code book, an adaptive code book vector used for generating an excitation vector which is multiplied by a gain element and added to an adaptive codebook excitation to give a total excitation for a voiced frame.
(Dependent claims: 35, 36, 37, 38)
a second multi-position switch element controlled by said output of said class decoder for coupling an output of the selected one of said excitation generators to an input of a synthesizer filter and, via a feedback path, also to said adaptive code book, an output of said synthesizer filter being coupled to an input of a postfilter having an output coupled to an output node of said decoder;
wherein parameters of said synthesis filter and said postfilter are based on parameters decoded from said input bit stream by said LSP decoder.
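The class-switched decoder of claims 34 and 35 routes the bit stream to one excitation generator per class, feeds the selected excitation back to the adaptive codebook, and passes it through the synthesis filter and then the postfilter. A minimal dispatch sketch, in which all callables are placeholders rather than APIs from the patent:

```python
def decode_frame(frame_class, bits, generators, synth, post, adaptive_cb):
    """Class-switched decoding sketch: the decoded class selects one
    excitation generator (the first switch); its output drives the
    synthesis filter, feeds back into the adaptive codebook (the second
    switch's feedback path), and is then postfiltered.  All callables
    are placeholders."""
    excitation = generators[frame_class](bits)  # route bits by class
    adaptive_cb.append(excitation)              # feedback to adaptive codebook
    return post(synth(excitation))              # synthesis filter, then postfilter
```

A usage example with toy stand-ins: `decode_frame("voiced", bits, {"voiced": gen_v, "unvoiced": gen_u, "transition": gen_t}, synth, post, cb)` selects `gen_v` and returns the postfiltered synthesis output.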
Specification