Method and apparatus for speech reconstruction in a distributed speech recognition system
Abstract
A distributed speech recognition system comprises a first communication device, which receives a speech input (34), encodes data representative of the speech input, and transmits the encoded data, and a second, remotely located communication device, which receives the encoded data and compares the encoded data with a known data set. The second device includes a processor with a program which controls the processor to operate according to a method of reconstructing the speech input. The method includes the step of receiving encoded data including encoded spectral data and encoded energy data. The method further includes the step of decoding the encoded spectral data and encoded energy data to determine the spectral data and energy data. The method also includes the step of combining the spectral data and energy data to reconstruct the speech input.
22 Claims
1. In a distributed speech recognition system comprising a first communication device which receives a speech input and a second communication device remotely located from the first communication device and communicatively coupled to the first communication device, a method of reconstructing the speech input at the second communication device comprising the steps of:
receiving at the second communication device of the distributed speech recognition system encoded data sent by the first communication device of the distributed speech recognition system, the encoded data including encoded spectral data and encoded energy data;
selectively at the second communication device decoding the encoded spectral data and encoded energy data to determine the spectral data and energy data and extracting a speech recognition parameter from the encoded data; and
selectively combining the spectral data and energy data to reconstruct the speech input at the second communication device and matching the speech recognition parameter with a speech recognition data set.
determining harmonic mel-frequencies corresponding to the pitch period;
performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at the harmonic mel-frequencies to determine log-spectral magnitudes of the speech input at the harmonic mel-frequencies; and
exponentiating the log-spectral magnitudes to determine the spectral magnitudes of the speech input.
4. The method of reconstructing the speech input according to claim 3, wherein the step of performing the inverse discrete cosine transform comprises the steps of:
determining a matrix comprising a plurality of column vectors, each column vector corresponding to one of a plurality of mel-frequencies;
selecting a column vector from the matrix corresponding to one of the plurality of mel-frequencies closest in value to one of the harmonic mel-frequencies; and
forming an inner product between a row vector formed from the series of mel-frequency cepstral coefficients and the selected column vector.
5. The method of reconstructing the speech input according to claim 2, wherein the decoding step comprises the steps of:
determining mel-frequencies corresponding to a set of frequencies; and
performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at the mel-frequencies to determine log-spectral magnitudes of the speech input at the mel-frequencies.
6. The method of reconstructing the speech input according to claim 1, wherein:
the receiving step comprises the step of receiving encoded data including encoded additional excitation data;
the decoding step comprises the step of decoding the encoded additional excitation data to determine the additional excitation data; and
the combining step comprises the step of combining the spectral, energy and excitation data to reconstruct the speech input.
7. The method of reconstructing the speech input according to claim 6, wherein the decoding step comprises the step of decoding the encoded additional excitation data to determine a pitch period and a voice class.
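The decoding steps recited in the dependent claims above (harmonic mel-frequencies from the pitch period, an inverse discrete cosine transform evaluated through a matrix of column vectors, then exponentiation) can be sketched as follows. This is a minimal illustration only, not the patented implementation. Everything not stated in the claims is an assumption: the mel mapping `2595 * log10(1 + f/700)`, the DCT-II scaling, the 23-band count, a pitch period expressed in samples, the 256-point grid, and all function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Assumed mel mapping; the claims do not fix a particular formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def harmonic_mel_frequencies(pitch_period, sample_rate):
    """Mel values of the pitch harmonics up to the Nyquist frequency."""
    f0 = sample_rate / pitch_period  # assumes the pitch period is in samples
    count = int((sample_rate / 2.0) // f0)
    return [hz_to_mel(k * f0) for k in range(1, count + 1)]

def build_idct_matrix(num_coeffs, grid_size, mel_max, num_bands):
    """Return a mel grid and one column of IDCT weights per grid point."""
    grid = [mel_max * (j + 0.5) / grid_size for j in range(grid_size)]
    columns = []
    for mel in grid:
        column = [1.0 / num_bands]  # weight applied to coefficient c0
        for k in range(1, num_coeffs):
            column.append(2.0 * math.cos(math.pi * k * mel / mel_max) / num_bands)
        columns.append(column)
    return grid, columns

def spectral_magnitudes(cepstrum, pitch_period, sample_rate,
                        grid_size=256, num_bands=23):
    """Spectral magnitudes at the pitch harmonics from one MFCC vector."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    grid, columns = build_idct_matrix(len(cepstrum), grid_size, mel_max, num_bands)
    magnitudes = []
    for mel in harmonic_mel_frequencies(pitch_period, sample_rate):
        # Select the column whose mel-frequency is closest to this harmonic.
        j = min(range(grid_size), key=lambda i: abs(grid[i] - mel))
        # Inner product of the cepstral row vector with the selected column.
        log_magnitude = sum(c * w for c, w in zip(cepstrum, columns[j]))
        magnitudes.append(math.exp(log_magnitude))
    return magnitudes
```

In this sketch the matrix would normally be built once and reused across frames; it is rebuilt per call only to keep the example short. A grid finer than the encoder's band count trades memory for a smaller nearest-column error.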
8. In a distributed speech recognition system comprising a first communication device which receives a speech input, encodes data representative of the speech input, and transmits the encoded data and a second remotely-located communication device which receives the encoded data and compares the encoded data with a known data set, a method of reconstructing the speech input at the second communication device comprising the steps of:
receiving encoded data including spectral data encoded as a series of mel-frequency cepstral coefficients and encoded energy data;
performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at harmonic mel-frequencies corresponding to a pitch period of the speech input to determine log-spectral magnitudes of the speech input at the harmonic mel-frequencies;
exponentiating the log-spectral magnitudes to determine the spectral magnitudes of the speech input;
decoding the encoded energy data to determine the energy data; and
combining the spectral magnitudes and the energy data to reconstruct the speech input.
determining a matrix comprising a plurality of column vectors, each column vector corresponding to one of a plurality of mel-frequencies;
selecting a column vector from the matrix corresponding to one of the plurality of mel-frequencies closest in value to one of the harmonic mel-frequencies; and
forming an inner product between a row vector formed from the series of mel-frequency cepstral coefficients and the selected column vector.
10. The method of reconstructing the speech input according to claim 8, further comprising the step of comparing the series of mel-frequency cepstral coefficients to a series of mel-frequency cepstral coefficients corresponding to an impulse response.
11. The method of reconstructing the speech input according to claim 10, wherein the step of comparing comprises the step of subtracting a series of mel-frequency cepstral coefficients corresponding to an impulse response of a pre-emphasis filter from the series of mel-frequency cepstral coefficients.
12. The method of reconstructing the speech input according to claim 8, wherein the speech input is divided into a series of frames and:
the step of receiving encoded data comprises the step of receiving encoded energy data including a natural logarithm of an average energy value for each frame in the series of frames; and
the step of decoding the encoded energy data comprises the step of exponentiating the natural logarithm of the average energy value for each frame in the series of frames.
13. The method of reconstructing the speech input according to claim 8, wherein:
the receiving step comprises the step of receiving encoded data including encoded additional excitation data;
the decoding step comprises the step of decoding the encoded additional excitation data to determine the additional excitation data; and
the combining step comprises the step of combining the spectral, energy and excitation data to reconstruct the speech input.
14. The method of reconstructing the speech input according to claim 13, wherein the decoding step comprises the step of decoding the encoded excitation data to determine a pitch period and a voice class.
15. The method of reconstructing the speech input according to claim 14, wherein the decoding step includes the step of decoding the encoded excitation data to determine sub-frame energy data.
16. The method of reconstructing the speech input according to claim 8, wherein the step of performing an inverse discrete cosine transform includes the step of performing an inverse discrete cosine transform of higher resolution than a discrete cosine transform used to encode the spectral data as a series of mel-frequency cepstral coefficients.
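Two of the simpler steps in claims 11 and 12 can be illustrated in a few lines. Filtering multiplies spectra, so log-spectra add, and because the discrete cosine transform is linear the cepstra add as well; subtracting the pre-emphasis filter's cepstrum therefore removes, at least approximately, the filter's contribution. The sketch below assumes the filter's mel-frequency cepstral coefficients have already been computed elsewhere. The function names and sample values are hypothetical and do not come from the patent.

```python
import math

def remove_pre_emphasis(cepstrum, pre_emphasis_cepstrum):
    """Subtract the pre-emphasis filter's cepstrum, coefficient by coefficient."""
    return [c - p for c, p in zip(cepstrum, pre_emphasis_cepstrum)]

def decode_frame_energies(log_energies):
    """Exponentiate the natural-log average energy received for each frame."""
    return [math.exp(value) for value in log_energies]

# Hypothetical values for illustration only.
corrected = remove_pre_emphasis([1.0, 0.5, -0.25], [0.25, 0.5, 0.0])
energies = decode_frame_energies([0.0, math.log(4.0)])
```

How the decoded energy is then applied to the spectral magnitudes is left open here, since the claims only state that the two are combined.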
17. In a distributed speech recognition system comprising a first communication device which receives a speech input, encodes data about the speech input, and transmits the encoded data and a second remotely-located communication device which receives the encoded data and compares the encoded data with a known data set, the second remotely-located communication device comprising:
a processor including a program which controls the processor (i) to receive the encoded data including spectral data encoded as a series of mel-frequency cepstral coefficients and encoded energy data, (ii) to perform an inverse discrete cosine transform on the mel-frequency cepstral coefficients at harmonic mel-frequencies corresponding to a pitch period of the speech input to determine log-spectral magnitudes of the speech input at the harmonic frequencies, (iii) to exponentiate the log-spectral magnitudes to determine the spectral magnitudes of the speech input, and (iv) to decode the encoded energy data to determine the energy data; and
a speech synthesizer which combines the spectral magnitudes and the energy data to reconstruct the speech input.
the program further controls the processor (i) to receive encoded data including encoded additional excitation data, and (ii) to decode the encoded additional excitation data to determine a pitch period and a voice class, and the speech synthesizer combines the spectral magnitudes, energy data, pitch period and voice class to reconstruct the speech input.
22. The communication device according to claim 21, wherein the speech synthesizer comprises a sinusoidal vocoder-synthesizer.
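As a rough illustration of what a sinusoidal synthesizer does with the decoded parameters, the sketch below sums one cosine per pitch harmonic for a single voiced frame. It is a simplified assumption rather than the synthesizer the patent describes: phases are set to zero, the pitch period is taken in samples, and unvoiced frames, phase modelling and frame-to-frame smoothing are all omitted. The function name is hypothetical.

```python
import math

def synthesize_voiced_frame(magnitudes, pitch_period, num_samples):
    """Sum one zero-phase cosine per pitch harmonic (simplifying assumption)."""
    frame = []
    for n in range(num_samples):
        sample = 0.0
        for k, amplitude in enumerate(magnitudes, start=1):
            # Harmonic k completes k cycles per pitch period.
            sample += amplitude * math.cos(2.0 * math.pi * k * n / pitch_period)
        frame.append(sample)
    return frame

# Two harmonics, a pitch period of 8 samples, and 16 output samples.
frame = synthesize_voiced_frame([1.0, 0.5], 8, 16)
```

The output repeats every `pitch_period` samples, which is the property that makes the harmonic magnitudes and the pitch period sufficient to describe a steady voiced frame in this simplified model.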
Specification