Speech synthesis using complex spectral modeling

US 20050131680A1
Filed: 01/31/2005
Published: 06/16/2005
Est. Priority Date: 09/13/2002
Status: Active Grant

Abstract

A method for processing a speech signal includes dividing the speech signal into a succession of frames, identifying one or more of the frames as click frames, and extracting phase information from the click frames. The speech signal is encoded using the phase information. Methods are also provided for modeling phase spectra of voiced frames and click frames.

67 Citations

24 Claims

1. A method for processing a speech signal, comprising:
- dividing the speech signal into a succession of frames;
  
  identifying one or more of the frames as click frames;
  
  extracting phase information from the click frames; and
  
  encoding the speech signal using the phase information.
- Dependent claims (2, 3, 4, 5):
  - 2. The method according to claim 1, wherein encoding the speech signal comprises creating a database of speech segments, and comprising synthesizing a speech output using the database.
  - 3. The method according to claim 2, wherein synthesizing the speech output comprises aligning a phase of the click frames in the speech output using the phase information.
  - 4. The method according to claim 1, wherein identifying the one or more of the frames as click frames comprises analyzing a probability distribution of the frames, and identifying the click frames based on a property of the probability distribution.
  - 5. The method according to claim 4, wherein analyzing the probability distribution comprises computing an entropy of the frames.
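As an illustration of claims 1, 4 and 5, the following minimal sketch divides a signal into frames and flags click candidates by entropy. The claims do not say which probability distribution is analyzed; this sketch assumes the normalized per-sample energy within each frame, on the reasoning that an impulsive click concentrates its energy in a few samples and therefore has low entropy. The function names, the frame length, and the `max_entropy_bits` threshold are all hypothetical choices, not values from the patent.

```python
import math

def split_frames(signal, frame_len):
    """Divide a signal into consecutive, non-overlapping frames.
    A trailing partial frame is dropped (an assumption of this sketch)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def energy_entropy(frame):
    """Shannon entropy (bits) of the frame's normalized sample-energy distribution."""
    energies = [x * x for x in frame]
    total = sum(energies)
    if total == 0.0:
        return 0.0  # silent frame: there is no distribution to analyze
    return -sum((e / total) * math.log2(e / total) for e in energies if e > 0.0)

def find_click_frames(signal, frame_len=64, max_entropy_bits=2.0):
    """Return indices of non-silent frames whose energy entropy is at or below
    the (hypothetical) threshold, i.e. frames with impulsive energy."""
    clicks = []
    for index, frame in enumerate(split_frames(signal, frame_len)):
        if sum(x * x for x in frame) == 0.0:
            continue  # skip silence so it is not mistaken for a click
        if energy_entropy(frame) <= max_entropy_bits:
            clicks.append(index)
    return clicks
```

A frame holding a steady tone spreads its energy over most samples and yields an entropy near the maximum of log2(frame_len) bits, while a frame holding a single impulse yields an entropy of zero, so a threshold between the two separates them in this simplified setting.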

6. A method for processing a speech signal, comprising:
- dividing the speech signal into a succession of frames;
  
  identifying some of the frames as unvoiced frames;
  
  processing the unvoiced frames to identify one or more click frames among the unvoiced frames; and
  
  encoding the speech signal by applying a first modeling method to the click frames and a second modeling method, different from the first modeling method, to the unvoiced frames that are not click frames.
- Dependent claims (7, 8):
  - 7. The method according to claim 6, wherein the first modeling method comprises extracting phase information from the click frames.
  - 8. The method according to claim 7, wherein identifying some of the frames as unvoiced frames comprises identifying other frames as voiced frames, and wherein encoding the speech signal comprises extracting the phase information from the voiced frames, as well as the click frames.
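The distinction drawn in claims 6 and 7 can be sketched as two encoding paths for unvoiced frames: click frames keep both amplitude and phase, while the remaining unvoiced frames keep amplitude only. This is a simplified illustration, not the patented encoder. It assumes the click decision is already available as a flag, uses a plain DFT in place of whatever spectral analysis the specification describes, and the dictionary layout and function names are hypothetical.

```python
import cmath

def dft(frame):
    """Naive one-sided DFT of a real frame (bins 0 .. N/2)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]

def encode_unvoiced_frame(frame, is_click):
    """Apply one of two modeling methods to an unvoiced frame.

    Click frames: amplitude and phase spectra are both retained.
    Other unvoiced frames: only the amplitude spectrum is retained.
    """
    spectrum = dft(frame)
    encoded = {
        "kind": "click" if is_click else "noise",
        "amplitude": [abs(c) for c in spectrum],
    }
    if is_click:
        encoded["phase"] = [cmath.phase(c) for c in spectrum]
    return encoded
```

Discarding phase for noise-like frames is a common design choice because their phase is perceptually close to random, whereas the sharp onset of a click depends on the phase relationship between its frequency components.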

9. A method for processing a speech signal, comprising:
- dividing the speech signal into a succession of frames;
  
  identifying some of the frames as voiced frames;
  
  modeling a phase spectrum of each of at least some of the voiced frames as a linear combination of basis functions covering different, respective frequency channels, wherein the model parameters correspond to respective coefficients of the basis functions; and
  
  encoding the speech signal using the modeled phase spectrum.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 18):
  - 10. The method according to claim 9, and comprising modeling an amplitude spectrum of each of the at least some of the voiced frames, wherein encoding the speech signal comprises encoding the modeled phase and amplitude spectra.
  - 11. The method according to claim 10, and comprising identifying other frames as unvoiced frames, and modeling the amplitude spectrum of each of at least some of the unvoiced frames, wherein encoding the speech signal comprises encoding the modeled amplitude spectra of the at least some of the unvoiced frames.
  - 12. The method according to claim 11, wherein identifying the other frames as unvoiced frames comprises identifying a subset of the unvoiced frames as click frames, and comprising modeling the phase spectrum of each of at least some of the click frames, wherein encoding the speech signal comprises encoding the modeled phase spectra of the at least some of the click frames.
  - 13. The method according to claim 9, wherein modeling the phase spectrum comprises differentially adjusting the respective frequency channels of the basis functions responsively to an amplitude spectrum of the at least some of the voiced frames.
  - 14. The method according to claim 9, wherein modeling the phase spectrum comprises aligning and unwrapping respective phases of frequency components of the phase spectrum before computing the model parameters.
  - 15. The method according to claim 9, wherein encoding the speech signal comprises creating a database of speech segments, and comprising synthesizing a speech output using the database.
  - 16. The method according to claim 15, wherein generating the speech output comprises aligning phases of the voiced frames in the speech output using the modeled phase spectrum.
  - 18. The method according to claim 15, wherein computing the time-domain model comprises computing a vector of model parameters representing time-domain components of the phase spectrum of a first voiced frame in a segment of the speech signal, and determining one or more elements of the vector to update so as to represent the phase spectrum of at least a second voiced frame, subsequent to the first voiced frame in the segment.
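The phase model of claims 9 and 14 can be illustrated with a small least-squares fit: the wrapped phase is first unwrapped, then approximated as a linear combination of basis functions, each covering its own frequency channel. The sketch below assumes triangular basis functions on uniformly spaced channels, which is only one possible choice; claim 13 describes adjusting the channels according to the amplitude spectrum, and the alignment step of claim 14 is not shown. All function names are hypothetical.

```python
import math

def unwrap(phases):
    """Remove 2*pi jumps so the phase varies smoothly across frequency bins."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2.0 * math.pi * round(delta / (2.0 * math.pi))
        out.append(out[-1] + delta)
    return out

def basis_matrix(num_bins, num_channels):
    """Rows are frequency bins, columns are triangular basis functions
    centered on uniformly spaced channels (an assumption of this sketch)."""
    width = (num_bins - 1) / (num_channels - 1)
    return [[max(0.0, 1.0 - abs(k - j * width) / width) for j in range(num_channels)]
            for k in range(num_bins)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_phase_model(wrapped_phases, num_channels):
    """Return basis-function coefficients (the model parameters) that best
    fit the unwrapped phase spectrum in the least-squares sense."""
    phi = unwrap(wrapped_phases)
    basis = basis_matrix(len(phi), num_channels)
    # Normal equations: (B^T B) c = B^T phi
    btb = [[sum(row[i] * row[j] for row in basis) for j in range(num_channels)]
           for i in range(num_channels)]
    btphi = [sum(row[i] * p for row, p in zip(basis, phi)) for i in range(num_channels)]
    return solve(btb, btphi)

def eval_phase_model(coeffs, num_bins):
    """Reconstruct the modeled (unwrapped) phase spectrum from the coefficients."""
    basis = basis_matrix(num_bins, len(coeffs))
    return [sum(c * b for c, b in zip(coeffs, row)) for row in basis]
```

Unwrapping before the fit matters because a smooth basis expansion cannot follow the artificial 2π discontinuities of a wrapped phase; once they are removed, a handful of coefficients can stand in for the full set of per-bin phase values.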

17. A method for processing a speech signal, comprising:
- dividing the speech signal into a succession of frames;
  
  identifying some of the frames as voiced frames;
  
  computing a time-domain model of a phase spectrum of each of at least some of the voiced frames; and
  
  encoding the speech signal using the modeled phase spectrum.

19. A method for synthesizing speech, comprising:
- receiving spectral model parameters with respect to a voiced frame of the speech to be synthesized, the parameters comprising high-frequency parameters and low-frequency parameters;
  
  determining a pitch frequency of the voiced frame;
  
  applying the low-frequency parameters to one or more low harmonics of the pitch frequency in order to generate a low-frequency speech component;
  
  applying the high-frequency parameters to one or more high harmonics of the pitch frequency while applying a frequency jitter to the high harmonics in order to generate a high-frequency speech component; and
  
  combining the low- and high-frequency components of the voiced frame into a sequence of frames of the speech in order to generate an output speech signal.
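The synthesis steps of claim 19 can be sketched as a sum of sinusoids at harmonics of the pitch frequency, with a small random frequency perturbation applied only to the high harmonics. This is a hedged illustration rather than the patented synthesizer: it assumes each spectral parameter is an (amplitude, phase) pair for one harmonic, uses a uniform random jitter of hypothetical magnitude, and omits the final step of combining successive frames (for example by overlap-add). The function name and default values are assumptions.

```python
import math
import random

def synthesize_voiced_frame(pitch_hz, low_params, high_params,
                            sample_rate=16000, frame_len=160,
                            jitter=0.01, rng=None):
    """Generate one voiced frame from per-harmonic (amplitude, phase) pairs.

    low_params apply to harmonics 1..len(low_params) at exact multiples of
    the pitch; high_params apply to the following harmonics, each shifted by
    a random relative frequency offset in [-jitter, +jitter].
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible

    def component(params, first_harmonic, jittered):
        out = [0.0] * frame_len
        for i, (amp, phase) in enumerate(params):
            freq = (first_harmonic + i) * pitch_hz
            if jittered:
                freq *= 1.0 + rng.uniform(-jitter, jitter)
            for n in range(frame_len):
                out[n] += amp * math.cos(2.0 * math.pi * freq * n / sample_rate + phase)
        return out

    low = component(low_params, 1, jittered=False)
    high = component(high_params, 1 + len(low_params), jittered=True)
    # Combine the low- and high-frequency components of the frame.
    return [l + h for l, h in zip(low, high)]
```

Jittering only the upper harmonics breaks their strict periodicity, which is one way to make the high band sound less buzzy while the low harmonics keep a stable pitch.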

20. Apparatus for processing a speech signal, comprising a speech processor, which is arranged to divide the speech signal into a succession of frames, to identify one or more of the frames as click frames, to extract phase information from the click frames, and to encode the speech signal using the phase information.

21. Apparatus for processing a speech signal, comprising a speech processor, which is arranged to divide the speech signal into a succession of frames, to identify some of the frames as voiced frames, to model a phase spectrum of each of at least some of the voiced frames as a linear combination of basis functions covering different, respective frequency channels, wherein the model parameters correspond to respective coefficients of the basis functions, and to encode the speech signal using the modeled phase spectrum.

22. Apparatus for processing a speech signal, comprising a speech processor, which is arranged to divide the speech signal into a succession of frames, to identify some of the frames as voiced frames, to compute a time-domain model of a phase spectrum of each of at least some of the voiced frames, and to encode the speech signal using the modeled phase spectrum.

23. Apparatus for synthesizing a speech signal, comprising:
- a memory, which is arranged to store a database of speech segments, each segment comprising a succession of frames, such that at least some of the frames are identified as voiced frames, and the database comprises an encoded model of a phase spectrum of each of at least some of the voiced frames; and
  
  a speech synthesizer, which is arranged to synthesize a speech output comprising one or more of the voiced frames using the encoded model of the phase spectrum in the database.

24. A computer software product for processing a speech signal, the product comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to divide the speech signal into a succession of frames, to identify one or more of the frames as click frames, to extract phase information from the click frames, and to encode the speech signal using the phase information.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: International Business Machines Corporation
Inventors: Kons, Zvi; Sorin, Alexander; Hoory, Ron; Shechtman, Slava; Chazan, Dan

Granted Patent: US 8,280,724 B2
US Class Current: 704/205
CPC Class Codes:
G10L 13/08 Text analysis or generation...
G10L 19/02 using spectral analysis, e....