Method and apparatus for speech reconstruction in a distributed speech recognition system

US 20020147579A1
Filed: 02/02/2001
Published: 10/10/2002
Est. Priority Date: 02/02/2001
Status: Active Grant

Abstract

In a distributed speech recognition system (20) comprising a first communication device (22) which receives a speech input (34), encodes data representative of the speech input (36, 38), and transmits the encoded data (42) and a second remotely-located communication device (26) which receives the encoded data (44) and compares the encoded data with a known data set, the device (26) including a processor (92) with a program which controls the processor (92) to operate according to a method of reconstructing the speech input including the step (44) of receiving encoded data including encoded spectral data and encoded energy data. The method further includes the step (46, 48) of decoding the encoded spectral data and encoded energy data to determine the spectral data and energy data. The method also includes the step (50, 52) of combining the spectral data and energy data to reconstruct the speech input.

22 Claims

1. In a distributed speech recognition system comprising a first communication device which receives a speech input, encodes data representative of the speech input, and transmits the encoded data and a second remotely-located communication device which receives the encoded data and compares the encoded data with a known data set, a method of reconstructing the speech input at the second communication device comprising the steps of:
- receiving encoded data including encoded spectral data and encoded energy data;
  
  decoding the encoded spectral data and encoded energy data to determine the spectral data and energy data; and
  
  combining the spectral data and energy data to reconstruct the speech input.
- - 2. The method of reconstructing the speech input according to claim 1, wherein the receiving step comprises the step of receiving encoded data including spectral data encoded as a series of mel-frequency cepstral coefficients.
  - 3. The method of reconstructing the speech input according to claim 2, wherein the speech input has a pitch period and the decoding step comprises the steps of:
    - determining harmonic mel-frequencies corresponding to the pitch period;
      
      performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at the harmonic mel-frequencies to determine log-spectral magnitudes of the speech input at the harmonic mel-frequencies; and
      
      exponentiating the log-spectral magnitudes to determine the spectral magnitudes of the speech input.
  - 4. The method of reconstructing the speech input according to claim 3, wherein the step of performing the inverse discrete cosine transform comprises the steps of:
    - determining a matrix comprising a plurality of column vectors, each column vector corresponding to one of a plurality of mel-frequencies;
      
      selecting a column vector from the matrix corresponding to one of the plurality of mel-frequencies closest in value to one of the harmonic mel-frequencies; and
      
      forming an inner product between a row vector formed from the series of mel-frequency cepstral coefficients and the selected column vector.
  - 5. The method of reconstructing the speech input according to claim 2, wherein the decoding step comprises the steps of:
    - determining mel-frequencies corresponding to a set of frequencies; and
      
      performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at the mel-frequencies to determine log-spectral magnitudes of the speech input at the mel-frequencies.
  - 6. The method of reconstructing the speech input according to claim 1, wherein:
    - the receiving step comprises the step of receiving encoded data including encoded additional excitation data;
      
      the decoding step comprises the step of decoding the encoded additional excitation data to determine the additional excitation data; and
      
      the combining step comprises the step of combining the spectral, energy and excitation data to reconstruct the speech input.
  - 7. The method of reconstructing the speech input according to claim 6, wherein the decoding step comprises the step of decoding the encoded additional excitation data to determine a pitch period and a voice class.
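
The decoding steps of claims 3 to 5 can be illustrated with a minimal sketch. It is not taken from the specification: the mel mapping `2595 * log10(1 + f / 700)`, the 23-band filter bank, the uniform mel spacing of band centres, and all function names are assumptions made for illustration. The sketch maps the pitch harmonics to mel-frequencies, builds a matrix with one inverse-DCT column per grid mel-frequency, picks the column closest to each harmonic, forms the inner product with the cepstral coefficients, and exponentiates the result.

```python
import math

NUM_BANDS = 23  # assumed mel filter-bank size; the claims do not fix it


def hz_to_mel(f_hz):
    """Common mel-scale mapping (an assumption, not quoted from the claims)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def harmonic_mels(pitch_period, sample_rate):
    """Mel-frequencies of the pitch harmonics below the Nyquist frequency."""
    f0 = sample_rate / pitch_period  # pitch period is given in samples
    mels, k = [], 1
    while k * f0 < sample_rate / 2.0:
        mels.append(hz_to_mel(k * f0))
        k += 1
    return mels


def idct_column(mel, num_ceps, mel_low, mel_step):
    """Inverse-DCT basis vector at one (possibly fractional) band position."""
    j = (mel - mel_low) / mel_step  # band centres sit at j = 1..NUM_BANDS
    col = [1.0 / NUM_BANDS]
    for i in range(1, num_ceps):
        col.append(2.0 / NUM_BANDS * math.cos(math.pi * i * (j - 0.5) / NUM_BANDS))
    return col


def build_matrix(grid_mels, num_ceps, mel_low, mel_step):
    """Matrix stored as a list of columns, one per grid mel-frequency."""
    return [idct_column(m, num_ceps, mel_low, mel_step) for m in grid_mels]


def spectral_magnitudes(mfcc, target_mels, grid_mels, matrix):
    """Nearest-column lookup, inner product with the MFCCs, then exponentiation."""
    mags = []
    for m in target_mels:
        nearest = min(range(len(grid_mels)), key=lambda g: abs(grid_mels[g] - m))
        log_mag = sum(c * b for c, b in zip(mfcc, matrix[nearest]))
        mags.append(math.exp(log_mag))
    return mags
```

With these assumptions, `harmonic_mels(80, 8000)` corresponds to a 100 Hz fundamental and yields 39 harmonics. Choosing a grid denser than the 23 band centres gives an inverse transform of higher resolution than the forward transform, which is the situation claim 16 describes.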

8. In a distributed speech recognition system comprising a first communication device which receives a speech input, encodes data representative of the speech input, and transmits the encoded data and a second remotely-located communication device which receives the encoded data and compares the encoded data with a known data set, a method of reconstructing the speech input at the second communication device comprising the steps of:
- receiving encoded data including spectral data encoded as a series of mel-frequency cepstral coefficients and encoded energy data;
  
  performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at harmonic mel-frequencies corresponding to a pitch period of the speech input to determine log-spectral magnitudes of the speech input at the harmonic mel-frequencies;
  
  exponentiating the log-spectral magnitudes to determine the spectral magnitudes of the speech input;
  
  decoding the encoded energy data to determine the energy data; and
  
  combining the spectral magnitudes and the energy data to reconstruct the speech input.
- - 9. The method of reconstructing the speech input according to claim 8, wherein the step of performing the inverse discrete cosine transform comprises the steps of:
    - determining a matrix comprising a plurality of column vectors, each column vector corresponding to one of a plurality of mel-frequencies;
      
      selecting a column vector from the matrix corresponding to one of the plurality of mel-frequencies closest in value to one of the harmonic mel-frequencies; and
      
      forming an inner product between a row vector formed from the series of mel-frequency cepstral coefficients and the selected column vector.
  - 10. The method of reconstructing the speech input according to claim 8, further comprising the step of comparing the series of mel-frequency cepstral coefficients to a series of mel-frequency cepstral coefficients corresponding to an impulse response.
  - 11. The method of reconstructing the speech input according to claim 10, wherein the step of comparing comprises the step of subtracting a series of mel-frequency cepstral coefficients corresponding to an impulse response of a pre-emphasis filter from the series of mel-frequency cepstral coefficients.
  - 12. The method of reconstructing the speech input according to claim 8, wherein the speech input is divided into a series of frames and:
    - the step of receiving encoded data comprises the step of receiving encoded energy data including a natural logarithm of an average energy value for each frame in the series of frames; and
      
      the step of decoding the encoded energy data comprises the step of exponentiating the natural logarithm of the average energy value for each frame in the series of frames.
  - 13. The method of reconstructing the speech input according to claim 8, wherein:
    - the receiving step comprises the step of receiving encoded data including encoded additional excitation data;
      
      the decoding step comprises the step of decoding the encoded additional excitation data to determine the additional excitation data; and
      
      the combining step comprises the step of combining the spectral, energy and excitation data to reconstruct the speech input.
  - 14. The method of reconstructing the speech input according to claim 13, wherein the decoding step comprises the step of decoding the encoded excitation data to determine a pitch period and a voice class.
  - 15. The method of reconstructing the speech input according to claim 14, wherein the decoding step includes the step of decoding the encoded excitation data to determine sub-frame energy data.
  - 16. The method of reconstructing the speech input according to claim 8, wherein the step of performing an inverse discrete cosine transform includes the step of performing an inverse discrete cosine transform of higher resolution than a discrete cosine transform used to encode the spectral data as a series of mel-frequency cepstral coefficients.

17. In a distributed speech recognition system comprising a first communication device which receives a speech input, encodes data about the speech input, and transmits the encoded data and a second remotely-located communication device which receives the encoded data and compares the encoded data with a known data set, the second remotely-located communication device comprising:
- a processor including a program which controls the processor (i) to receive the encoded data including spectral data encoded as a series of mel-frequency cepstral coefficients and encoded energy data, (ii) to perform an inverse discrete cosine transform on the mel-frequency cepstral coefficients at harmonic mel-frequencies corresponding to a pitch period of the speech input to determine log-spectral magnitudes of the speech input at the harmonic frequencies, (iii) to exponentiate the log-spectral magnitudes to determine the spectral magnitudes of the speech input, and (iv) to decode the encoded energy data to determine the energy data; and
  
  a speech synthesizer which combines the spectral magnitudes and the energy data to reconstruct the speech input.
- - 18. The communication device according to claim 17, wherein the program further controls the processor (i) to determine a matrix comprising a plurality of column vectors, each column vector corresponding to one of a plurality of mel-frequencies, (ii) to select a column vector from the matrix corresponding to one of the plurality of mel-frequencies closest in value to one of the harmonic mel-frequencies, and (iii) to form an inner product between a row vector formed from the series of mel-frequency cepstral coefficients and the selected column vector so as to perform the inverse discrete cosine transform.
  - 19. The communication device according to claim 18, wherein the program further controls the processor to subtract a series of mel-frequency cepstral coefficients corresponding to an impulse response from the series of mel-frequency cepstral coefficients before performing the inverse discrete cosine transform.
  - 20. The communication device according to claim 17, wherein the speech input is divided into a series of frames and the program further controls the processor (i) to receive encoded energy data including a natural logarithm of an average energy value for each frame in the series of frames, and (ii) to exponentiate the natural logarithm of the average energy value for each frame in the series of frames to determine the energy data.
  - 21. The communication device according to claim 17, wherein:
    - the program further controls the processor (i) to receive encoded data including encoded additional excitation data, and (ii) to decode the encoded additional excitation data to determine a pitch period and a voice class, and the speech synthesizer combines the spectral magnitudes, energy data, pitch period and voice class to reconstruct the speech input.
  - 22. The communication device according to claim 21, wherein the speech synthesizer comprises a sinusoidal vocoder-synthesizer.
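
The remaining operations in the claims (subtracting the pre-emphasis cepstrum in claims 11 and 19, exponentiating the log energy in claims 12 and 20, and the sinusoidal synthesis of claims 21 and 22) can be sketched in the same spirit. This is a heavily simplified illustration, not the patented synthesizer: the function names, the boolean `voiced` flag standing in for the voice class, the zero-phase and random-phase choices, and the energy normalisation are all assumptions.

```python
import math
import random


def remove_pre_emphasis(mfcc, pre_emphasis_mfcc):
    """Subtract the cepstrum of the pre-emphasis filter's impulse response."""
    return [c - p for c, p in zip(mfcc, pre_emphasis_mfcc)]


def decode_energy(log_energy):
    """Undo the natural logarithm applied to the frame's average energy."""
    return math.exp(log_energy)


def synthesize_frame(magnitudes, pitch_period, frame_len, energy, voiced, rng=None):
    """Sum one sinusoid per harmonic, then scale to the decoded average energy."""
    rng = rng or random.Random(0)
    omega0 = 2.0 * math.pi / pitch_period  # fundamental in radians per sample
    # Assumption: zero phase for voiced frames, random phase otherwise.
    phases = [0.0 if voiced else rng.uniform(0.0, 2.0 * math.pi) for _ in magnitudes]
    frame = []
    for n in range(frame_len):
        frame.append(sum(a * math.cos((k + 1) * omega0 * n + p)
                         for k, (a, p) in enumerate(zip(magnitudes, phases))))
    current = sum(s * s for s in frame) / frame_len  # average energy before scaling
    if current == 0.0:
        return frame  # silent frame: nothing to scale
    gain = math.sqrt(energy / current)
    return [gain * s for s in frame]
```

In this sketch the spectral magnitudes set only the relative strength of the harmonics, while the decoded energy fixes the overall level of the frame. A practical sinusoidal vocoder would also interpolate parameters across frame boundaries, which is omitted here.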

Current Assignee: Google Technology Holdings LLC (Alphabet Inc.)
Original Assignee: Motorola, Inc. (Motorola Solutions, Inc.)
Inventors: Meunier, Jeffrey; Jasiuk, Mark A.; Ramabadran, Tenkasi V.; Kushner, William M.
Granted Patent: US 6,633,839 B2
US Class Current: 704/207
CPC Class Codes:
G10L 15/30   Distributed recognition, e....
G10L 19/00   Speech or audio signals ana...
G10L 19/093   using sinusoidal excitation...
G10L 25/18   the extracted parameters be...
