Communication system and method using a speaker dependent time-scaling technique

US 5,920,840 A
Filed: 02/28/1995
Issued: 07/06/1999
Est. Priority Date: 02/28/1995
Status: Expired due to Fees

Abstract

A method and apparatus for time-scale modification of speech using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA) comprises the steps of storing a portion of an input speech signal in a memory, analyzing the portion of the input speech signal to determine at least one filtered pitch value, calculating an estimated pitch value (12) from the at least one filtered pitch value, determining a segment size (14) in response to the estimated pitch value (12), the segment size (14) having a value greater than the estimated pitch value (12), and time-scale compressing (18) the input speech signal in response to the segment size determined.

48 Claims

1. A method for time-scale modification of speech using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA), the method comprising the steps of:
- a) storing a portion of an input speech signal in a memory;
  
  b) analyzing the portion of the input speech signal to determine at least one filtered pitch value;
  
  c) calculating an estimated pitch value from the at least one filtered pitch value;
  
  d) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value; and
  
  e) time-scale compressing the input speech signal in response to the segment size determined.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The method of claim 1, wherein said step of determining a segment size further comprises the step of dynamically adapting the segment size with an estimated pitch value determined directly from the input speech signal over consecutive portions of the input speech signal.
  - 3. The method of claim 1 further comprises a step of providing a degree of overlap equal to or greater than 0.5 optimized for enhanced output speech quality.
  - 4. The method of claim 1 further comprises a step of providing a degree of overlap less than 0.5 optimized for lower computational complexity.
  - 5. The method as recited in claim 1, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.
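
The pitch-determination steps recited in claim 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims do not say how the threshold is derived from the average energy, how a pitch value is calculated, or which filter is applied. Here the threshold is a fixed fraction of the average block energy (`threshold_ratio`), the pitch value is a period in samples found by normalised autocorrelation, the filter is a sliding median, and the estimate is the mean of the filtered values. All function and parameter names are hypothetical.

```python
import math
from statistics import median

def block_energies(signal, block_len):
    """Energy (sum of squares) of each consecutive block of block_len samples."""
    return [sum(x * x for x in signal[i:i + block_len])
            for i in range(0, len(signal) - block_len + 1, block_len)]

def voiced_runs(energies, threshold_ratio=0.5, min_blocks=3):
    """Runs of at least min_blocks contiguous blocks whose energy exceeds
    threshold_ratio times the average energy per block (assumed threshold rule)."""
    if not energies:
        return []
    threshold = threshold_ratio * sum(energies) / len(energies)
    runs, start = [], None
    for i, e in enumerate(list(energies) + [float("-inf")]):  # sentinel closes the last run
        if e > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_blocks:
                runs.append((start, i))
            start = None
    return runs

def pitch_period(segment, min_lag, max_lag):
    """Lag in samples with the highest normalised autocorrelation (assumed estimator)."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, len(segment) - 1) + 1):
        a, b = segment[:-lag], segment[lag:]
        den = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
        score = sum(p * q for p, q in zip(a, b)) / den if den else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def median_filter(values, width=3):
    """Sliding median; the window shrinks at both ends of the list."""
    half = width // 2
    return [median(values[max(0, i - half):i + half + 1]) for i in range(len(values))]

def estimated_pitch(signal, block_len, min_lag, max_lag):
    """Mean of the filtered pitch values, or None when no voiced interval is found."""
    runs = voiced_runs(block_energies(signal, block_len))
    pitches = [pitch_period(signal[s * block_len:e * block_len], min_lag, max_lag)
               for s, e in runs]
    if not pitches:
        return None
    filtered = median_filter(pitches)
    return sum(filtered) / len(filtered)
```

The lag limits `min_lag` and `max_lag` bound the search to plausible pitch periods; keeping `max_lag` below twice the expected period avoids picking a multiple of the true period.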

6. A method for time-scale modification of speech using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA), the method comprising the steps of:
- a) storing a portion of an input speech signal in a memory;
  
  b) determining at least one filtered pitch value from the portion of the input speech signal;
  
  c) calculating an estimated pitch value from the at least one filtered pitch value;
  
  d) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value;
  
  e) time-scale compressing the input speech signal in response to the segment size determined; and
  
  f) time-scale expanding the input speech signal in response to the segment size determined.
- View Dependent Claims (7, 8, 9, 10)
- - 7. The method of claim 6, wherein said step of determining a segment size further comprises the step of dynamically adapting the segment size with estimated pitch values determined directly from consecutive portions of the input speech signal.
  - 8. The method of claim 6 further comprises a step of providing a degree of overlap equal to or greater than 0.5 optimized for enhanced output speech quality.
  - 9. The method of claim 6 further comprises a step of providing a degree of overlap less than 0.5 optimized for lower computational complexity.
  - 10. The method as recited in claim 6, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.
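
Steps d) through f) of claim 6 tie the segment size to the estimated pitch value and then compress or expand the signal. The sketch below is a generic waveform-similarity overlap-add loop written under assumptions, not the modified WSOLA of the patent: the segment size is taken as twice the pitch value (which satisfies the "greater than" condition), `rate` is defined as input length divided by output length, the similarity measure is a plain cross-correlation, the search tolerance is half a pitch period, and the cross-fade is linear. The name `wsola` and all parameters are hypothetical.

```python
def wsola(x, pitch, rate, overlap_degree=0.5):
    """Time-scale x by rate (input length / output length; rate > 1 compresses,
    rate < 1 expands). pitch is the estimated pitch period in samples."""
    if not 0 < overlap_degree <= 0.5:
        raise ValueError("this sketch supports 0 < overlap_degree <= 0.5")
    seg = 2 * pitch                          # segment size, greater than the pitch value
    ov = max(1, int(seg * overlap_degree))   # samples shared by adjacent segments
    hop_out = seg - ov                       # output advance per segment
    hop_in = rate * hop_out                  # nominal input advance per segment
    search = max(1, pitch // 2)              # assumed tolerance around the nominal position
    out = list(x[:seg])
    prev, k = 0, 1
    while True:
        nominal = int(round(k * hop_in))
        if nominal + search + seg > len(x):
            break
        # Natural continuation of the previously copied segment in the input.
        ref = x[prev + hop_out:prev + hop_out + ov]
        best, best_score = nominal, float("-inf")
        for cand in range(max(0, nominal - search), nominal + search + 1):
            score = sum(a * b for a, b in zip(ref, x[cand:cand + ov]))
            if score > best_score:
                best, best_score = cand, score
        # Cross-fade the output tail into the start of the chosen segment.
        base = len(out) - ov
        for i in range(ov):
            w = (i + 1) / (ov + 1)
            out[base + i] = (1 - w) * out[base + i] + w * x[best + i]
        out.extend(x[best + ov:best + seg])
        prev, k = best, k + 1
    return out
```

For a perfectly periodic input the selected segments line up in phase, so the output keeps the original period while its length changes by roughly `1 / rate`.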

11. A method for use in a voice capable device for time-scale modification of speech using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA) to form an output signal, comprising the steps of:
- at an output device;
  
  a) determining at least one filtered pitch value from a portion of an input speech signal;
  
  b) calculating an estimated pitch value from the at least one filtered pitch value;
  
  c) determining an analysis segment size in response to the estimated pitch value, the analysis segment size having a value greater than the estimated pitch value; and
  
  d) time-scale expanding the input speech signal to provide a resultant output speech signal.
- View Dependent Claims (12)
- - 12. The method as recited in claim 11, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

13. A method for time-scale modification of speech dependent upon a pitch period of a speaker using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA), comprising the steps of:
- a) determining at least one filtered pitch value from a portion of an input speech signal;
  
  b) calculating an estimated pitch value from the at least one filtered pitch value;
  
  c) determining an analysis segment size being approximately twice the estimated pitch value;
  
  d) increasing a time-scaling factor above an average time-scaling factor if the estimated pitch value is below a predetermined threshold; and
  
  e) decreasing the time-scaling factor below an average time-scaling factor if the estimated pitch value is above the predetermined threshold.
- View Dependent Claims (14, 15, 16)
- - 14. The method for time-scale modification of speech of claim 13 further includes the step of:
    - f) assigning a degree of overlap during speech compression which is dependent upon the time-scaling factor used in either step d or e.
  - 15. The method for time-scale modification of speech of claim 13 further includes the step of:
    - f) expanding the speech by approximately 10 percent less than the time-scaling factor used in either step d or e.
  - 16. The method as recited in claim 13, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.
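
The speaker-dependent rules of claims 13 and 15 reduce to a few lines of arithmetic. The sketch below assumes a fixed adjustment `delta` around the average factor; the claims state only the direction of the adjustment, not its size, so `delta` and all function names are hypothetical.

```python
def analysis_segment_size(pitch):
    """Claim 13 c): segment size approximately twice the estimated pitch value."""
    return 2 * pitch

def time_scaling_factor(pitch, threshold, average_factor, delta):
    """Claim 13 d) and e): raise the factor above the average when the pitch value
    is below the threshold, lower it when the pitch value is above the threshold.
    delta is an assumed step size; the claims do not specify one."""
    if pitch < threshold:
        return average_factor + delta
    if pitch > threshold:
        return average_factor - delta
    return average_factor

def expansion_factor(compression_factor):
    """Claim 15: expand by approximately 10 percent less than the factor used."""
    return 0.9 * compression_factor
```

With a threshold of 60 samples, an average factor of 2.0, and a delta of 0.25, a pitch value of 40 gives 2.25 and a pitch value of 80 gives 1.75.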

17. A method for compressing a plurality of voice signals within a voice communication resource having a given bandwidth within a voice communication system, comprising the steps of:
- (a) subchanneling the voice communication resource and simultaneously placing at least one voice signal of the plurality of voice signals on a subchannel of a plurality of subchannels;
  
  (b) compressing a time of the at least one voice signal within the subchannel, wherein the step of compressing the time of the at least one voice signal includes the steps of;
  
  c) determining at least one filtered pitch value from a portion of the at least one voice signal;
  
  d) calculating an estimated pitch value from the at least one filtered pitch value for the at least one voice signal;
  
  e) determining a segment size for analysis approximately twice the estimated pitch value;
  
  f) increasing a time-scaling factor above an average time-scaling factor if the estimated pitch value is below a predetermined threshold; and
  
  g) decreasing the time-scaling factor below an average time-scaling factor if the estimated pitch value is above the predetermined threshold, wherein the result of steps (a) through (g) provide a plurality of compressed voice signals.
- View Dependent Claims (18, 19, 20)
- - 18. The method for time-scale modification of speech of claim 17, wherein the method further includes the step of:
    - h) assigning a degree of overlap during the compressing a time of the at least one voice signal which is dependent upon the time-scaling factor used in either step f or g.
  - 19. The method for time-scale modification of speech of claim 18 further includes the step of:
    - h) expanding each of the plurality of compressed voice signals by approximately 10 percent less than the time-scaling factor used in either step f or g.
  - 20. The method as recited in claim 17, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the at least one voice signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

21. A communication system using voice compression having at least one transmitter base station and a plurality of selective call receivers, comprising:
- at the at least one transmitter base station;
  
  an input device for receiving an audio signal,
  
  a processing device which compresses the audio signal to produce a compressed audio signal and which modulates the compressed audio signal using quadrature amplitude modulation to provide a processed signal, said processing device compresses the audio signal in accordance with the steps of:
  
  a) analyzing a portion of the audio signal to determine at least one filtered pitch value,
  
  b) calculating an estimated pitch value from the at least one filtered pitch value,
  
  c) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value, and
  
  d) time-scale compressing the audio signal in response to the segment size determined, and
  
  a quadrature amplitude modulation transmitter for transmitting the processed signal; and
  
  at each of the plurality of selective call receivers;
  
  a selective call receiver for receiving the processed signal which is transmitted,
  
  a processing device for demodulating the processed signal which is received using a quadrature amplitude demodulation technique and for time-scale expanding the processed signal which is demodulated to provide a reconstructed signal, and
  
  an amplifier for amplifying the reconstructed signal into a reconstructed audio signal.
- View Dependent Claims (22, 23, 24, 25, 26)
- - 22. The communication system of claim 21, wherein the quadrature amplitude modulation is single sideband modulation.
  - 23. The communication system of claim 21, wherein the quadrature amplitude modulation is in-phase (I) and quadrature (Q) modulation.
  - 24. The communication system of claim 21, wherein the communication system includes a plurality of transmitter base stations and the processed signal includes a control signal that requests information from at least one of the plurality of selective call receivers in a form of an acknowledgment signal that allows the communication system to target future messages to the at least one of the plurality of selective call receivers through the plurality of transmitter base stations.
  - 25. The communication system of claim 21, wherein the system further comprises:
    - at the at least one transmitter base station;
      
      a pilot carrier signal generator to serve as an amplitude and phase reference for distortion that occurs as a result of channel aberrations; and
      
      at the selective call receiver;
      
      a receiver circuit for detecting, filtering and responding to the amplitude and phase reference generated by the pilot carrier signal generator.
  - 26. The communication system as recited in claim 21, wherein the process of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the audio signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.
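
Claim 23 names in-phase (I) and quadrature (Q) modulation as one form of the quadrature amplitude modulation in claim 21. As a textbook illustration only, and not the transmitter or receiver design of the patent, the sketch below modulates two baseband sequences onto one carrier and recovers them by mixing with the carrier and averaging over each carrier cycle. The pilot carrier, filtering, and gain control of the claims are omitted, and all names are hypothetical.

```python
import math

def iq_modulate(i_vals, q_vals, samples_per_cycle):
    """s[n] = I[n] * cos(w n) - Q[n] * sin(w n), with w = 2 pi / samples_per_cycle."""
    w = 2 * math.pi / samples_per_cycle
    return [i * math.cos(w * n) - q * math.sin(w * n)
            for n, (i, q) in enumerate(zip(i_vals, q_vals))]

def iq_demodulate(s, samples_per_cycle):
    """Mix with the carrier and average over each full cycle (a crude low-pass).
    Returns one (I, Q) estimate per carrier cycle."""
    w = 2 * math.pi / samples_per_cycle
    i_out, q_out = [], []
    for start in range(0, len(s) - samples_per_cycle + 1, samples_per_cycle):
        idx = range(start, start + samples_per_cycle)
        i_out.append(sum(2 * s[n] * math.cos(w * n) for n in idx) / samples_per_cycle)
        q_out.append(sum(-2 * s[n] * math.sin(w * n) for n in idx) / samples_per_cycle)
    return i_out, q_out
```

The recovery is exact only when I and Q are constant over a carrier cycle; slowly varying signals are recovered approximately.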

27. A selective call receiver for receiving compressed voice signals, comprising:
- a selective call receiver for receiving a processed signal which is transmitted, the processed signal being processed in accordance with the steps of:
  
  a) analyzing a portion of an input speech signal to determine at least one filtered pitch value,
  
  b) calculating an estimated pitch value from the at least one filtered pitch value,
  
  c) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value, and
  
  d) time-scale expanding the input speech signal in response to the segment size determined;
  
  a processing device for demodulating the processed signal which is received using a single side band demodulation technique and a time-scale expansion technique to provide a reconstructed signal; and
  
  an amplifier for amplifying the reconstructed signal into a reconstructed audio signal.
- View Dependent Claims (28, 29)
- - 28. The selective call receiver of claim 27, wherein the selective call receiver further comprises:
    - a receiver circuit for detecting, filtering and responding to an amplitude and phase reference generated by a pilot carrier signal generator in a transmitter at a base station.
  - 29. The selective call receiver as recited in claim 27, wherein the process of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

30. A selective call paging base station for transmitting selective call signals on a communication resource having a predetermined bandwidth, comprising:
- an input device for receiving a plurality of audio signals;
  
  a means for subchanneling the communication resource into a predetermined number of subchannels;
  
  an amplitude compression and filtering module, for each subchannel of the predetermined number of subchannels, for compressing an amplitude of a respective audio signal and for filtering the respective audio signal;
  
  a time-scale compression module which provides compression of the respective audio signal for each of the predetermined number of subchannels, said time-scale compression module operating to generate a processed signal in accordance with the steps of:
  
  a) analyzing a portion of an input speech signal to determine at least one filtered pitch value,
  
  b) calculating an estimated pitch value from the at least one filtered pitch value,
  
  c) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value, and
  
  d) time-scale compressing the input speech signal in response to the segment size determined; and
  
  a quadrature amplitude modulation transmitter for transmitting the processed signal.
- View Dependent Claims (31, 32, 33)
- - 31. The selective call paging base station of claim 30, wherein the input device for receiving a plurality of audio signals comprises a paging terminal for receiving phone messages or data messages from a computing device.
  - 32. The selective call paging base station of claim 30, wherein the amplitude compression and filtering module comprises an anti-alias filter coupled to an analog-to-digital converter coupled to a band-pass filter coupled to an automatic gain controller.
  - 33. The selective call paging base station as recited in claim 30, wherein the process of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

34. A selective call receiver, comprising:
- a receiver having an analog to digital converter for receiving a compressed voice signal that has been compressed using a modified version of the Waveform Similarity based Overlap-Add (WSOLA) compression technique that uses a compression factor that is dependent upon a pitch period of a voice signal which is input in accordance with the steps of:
  
  a) analyzing a portion of the voice signal which is input to determine at least one filtered pitch value,
  
  b) calculating an estimated pitch value from the at least one filtered pitch value,
  
  c) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value, and
  
  d) time-scale compressing the voice signal in response to the segment size determined to generate the compressed voice signal,
  
  and providing therefrom a digitized received signal, wherein the compressed voice signal further contains data for determining an expansion factor from the compression factor used in compressing the voice signal; and
  
  a signal processor for processing the digitized received signal and for expanding the digitized received signal in accordance with the expansion factor to generate a processed signal.
- View Dependent Claims (35, 36, 37, 38, 39)
- - 35. The selective call receiver of claim 34, wherein the expansion factor is estimated to be about 10 percent less than the compression factor used in compressing the voice signal.
  - 36. The selective call receiver of claim 34, wherein the signal processor further filters a pilot carrier, performs automatic gain control using a feedforward loop, single sideband demodulation, and decompanding of the digitized received signal to provide a processed signal.
  - 37. The selective call receiver of claim 34, wherein the signal processor further filters a pilot carrier, performs automatic gain control using a feedforward loop, I and Q demodulation, and decompanding of the digitized received signal to provide a processed signal.
  - 38. The selective call receiver of claim 34, wherein the selective call receiver further comprises a digital to analog converter, a reconstruction filter for converting the processed signal into a digitized audio signal, and an amplifier for amplifying the digitized audio signal.
  - 39. The selective call receiver as recited in claim 34, wherein the process of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the voice signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

40. An electronic device that uses a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA) for time-scale modification of speech, comprising:
- memory for storing a portion of an input speech signal;
  
  a processor for analyzing a portion of an input speech signal to determine at least one filtered pitch value, for calculating an estimated pitch value from the at least one filtered pitch value, and for further determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value; and
  
  a means for time-scaling the input speech signal in response to the segment size determined.
- View Dependent Claims (41, 42, 43, 44, 45)
- - 41. The electronic device of claim 40, wherein the means for time-scaling is further in response to a predetermined degree of overlap ranging from 0 to 1.
  - 42. The electronic device of claim 40, wherein the electronic device comprises a dictation device.
  - 43. The electronic device of claim 40, wherein the electronic device comprises an answering machine.
  - 44. The electronic device of claim 40, wherein the electronic device comprises a voice mail system.
  - 45. The electronic device as recited in claim 40, wherein the process of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.

46. A method for time-scale and frequency-scale modification of speech using a modified version of the Waveform Similarity based Overlap-Add technique (WSOLA), the method comprising the steps of:
- a) storing a portion of an input speech signal in a memory;
  
  b) analyzing the portion of the input speech signal to determine at least one filtered pitch value;
  
  c) calculating an estimated pitch value from the at least one filtered pitch value;
  
  d) determining a segment size in response to the estimated pitch value, the segment size having a value greater than the estimated pitch value;
  
  e) time-scaling the input speech signal in response to the segment size determined and a predetermined time-scaling factor, wherein time-scaling provides a time-scaled signal; and
  
  f) frequency-scaling of the time-scaled signal.
- View Dependent Claims (47, 48)
- - 47. The method of claim 46, wherein said step of frequency-scaling includes the step of interpolating by a factor equal to the predetermined time-scaling factor if the time-scaling factor is greater than 1.
  - 48. The method as recited in claim 46, wherein the step of determining the at least one filtered pitch value comprises the steps of:
    - subdividing the portion of the input speech signal into a plurality of blocks, each of the plurality of blocks having a predetermined time interval;
      
      computing an energy for each of the plurality of blocks;
      
      averaging the energy of each of the plurality of blocks, thereby providing an average energy per block;
      
      computing a threshold from the average energy per block;
      
      using the threshold to determine from the plurality of blocks at least one interval of voiced speech comprising at least a predetermined number of contiguous blocks from the plurality of blocks;
      
      calculating at least one pitch value from the at least one interval of voiced speech; and
      
      filtering the at least one pitch value.
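
Claim 47 describes frequency-scaling as interpolating the time-scaled signal by a factor equal to the time-scaling factor when that factor is greater than 1. The claim does not name an interpolation method; the sketch below assumes linear interpolation between neighbouring samples, and the function name is hypothetical.

```python
def interpolate(signal, factor):
    """Resample signal to roughly factor times as many samples using
    linear interpolation (an assumed method). Requires len(signal) >= 2."""
    if len(signal) < 2:
        raise ValueError("need at least two samples")
    last = len(signal) - 1
    n_out = int(round(last * factor)) + 1
    out = []
    for j in range(n_out):
        pos = min(j / factor, last)     # position in the input, clamped to the end
        i = min(int(pos), last - 1)     # index of the left neighbour
        frac = pos - i
        out.append((1 - frac) * signal[i] + frac * signal[i + 1])
    return out
```

For example, `interpolate([0.0, 2.0, 4.0], 2)` returns `[0.0, 1.0, 2.0, 3.0, 4.0]`.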

Current Assignee
Motorola, Inc. (Motorola Solutions, Inc.)
Original Assignee
Motorola, Inc. (Motorola Solutions, Inc.)
Inventors
Siwiak, Kazimierz; Satyamurti, Sunil; Schwendeman, Robert John; Leitch, Clifford Dana; Kuznicki, William Joseph
Primary Examiner(s)
Dorvil, Richemond

Application Number

US08/395,739
Time in Patent Office

1,589 Days
Field of Search

395/2.14, 395/2.76, 704/205, 704/268, 704/267, 704/211, 704/207, 704/500, 704/501, 704/503, 704/504
US Class Current

704/267
CPC Class Codes

G10L 21/04 Time compression or expansion

G10L 25/90 Pitch determination of spee...
