Compressing & using a concatenative speech database in text-to-speech systems
Abstract
A method and apparatus are provided for compressing and using a concatenative speech database in text-to-speech (TTS) systems to improve the quality of speech output generated by handheld TTS systems by allowing synthesis to occur on the client. According to one embodiment of the present invention, a G.723 encoder receives diphone waveforms and compresses them into diphone residuals. While compressing the diphone waveforms, the encoder generates Linear Predictive Coding (LPC) coefficients. The diphone residuals and the encoder-generated LPC coefficients are then stored in an encoder-generated compressed packet.
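The relationship between a waveform, its LPC coefficients, and its residual can be illustrated with a minimal sketch. This is not the G.723 encoder named in the abstract: it is plain, unquantized LPC analysis (autocorrelation method with the Levinson-Durbin recursion), and every function name, the frame length, and the predictor order are assumptions chosen for illustration only.

```python
import math
import random


def example_frame(n=240, seed=0):
    """Hypothetical test signal: two sinusoids plus a little noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * 0.05 * i)
            + 0.5 * math.sin(2 * math.pi * 0.12 * i)
            + 0.05 * rng.uniform(-1.0, 1.0)
            for i in range(n)]


def autocorrelation(frame, max_lag):
    """r[lag] = sum over i of frame[i] * frame[i - lag]."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns a[1..order] for the
    predictor x_hat[n] = sum_k a[k] * x[n - k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # degenerate (e.g. all-zero) frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(frame, coeffs):
    """Analysis filter: residual = sample minus its linear prediction."""
    out = []
    for n in range(len(frame)):
        pred = sum(c * frame[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(frame[n] - pred)
    return out


def lpc_synthesize(residual, coeffs):
    """Synthesis filter: rebuild samples from residual and coefficients."""
    out = []
    for n in range(len(residual)):
        pred = sum(c * out[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(residual[n] + pred)
    return out
```

In this unquantized sketch, passing the residual back through the synthesis filter with the same coefficients reproduces the input up to floating-point rounding. A real speech codec quantizes both the residual and the coefficients, so its reconstruction is only approximate.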
27 Claims
1. A method comprising:
receiving diphone waveforms;
compressing the diphone waveforms into diphone residuals, wherein the compressing is performed using an encoder;
generating linear predictive coding (LPC) coefficients, wherein the LPC coefficients are generated by the encoder; and
storing the diphone residuals and the encoder-generated LPC coefficients in a compressed packet, wherein the compressed packet is generated by the encoder.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method comprising:
receiving diphone waveforms;
compressing the diphone waveforms into diphone residuals, wherein the compressing is performed using an encoder;
generating linear predictive coding (LPC) coefficients, wherein the LPC coefficients are generated by the encoder;
storing the diphone residuals and the coder-generated LPC coefficients in a compressed packet, wherein the compressed packet is generated by the encoder;
a waveform synthesizer requesting the diphone residuals;
locating the requested diphone residuals in the compressed packet;
extracting the located diphone residuals from the compressed packet; and
decompressing the extracted diphone residuals, wherein the decompressing is performed using a decoder; and
supplying the diphone residuals and the encoder-generated LPC coefficients to the waveform synthesizer.
Dependent claims: 9, 10, 11, 13, 14, 15, 16, 17, 19
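The request, locate, extract, decompress, and supply steps recited in claim 8 can be sketched as a small lookup over a packed byte buffer. This is an illustrative assumption, not the claimed implementation: the toy codec below (16-bit residual samples plus 32-bit float coefficients, packed with Python's struct module) merely stands in for the encoder and decoder, and the names DiphoneStore, encode_entry, and decode_entry are hypothetical.

```python
import struct


def encode_entry(residual, lpc):
    """Toy stand-in for the encoder: int16 residual + float32 LPC values."""
    q = [max(-32768, min(32767, round(s * 32767))) for s in residual]
    header = struct.pack("<HH", len(q), len(lpc))
    return (header
            + struct.pack(f"<{len(q)}h", *q)
            + struct.pack(f"<{len(lpc)}f", *lpc))


def decode_entry(blob):
    """Toy stand-in for the decoder: inverse of encode_entry."""
    n_res, n_lpc = struct.unpack_from("<HH", blob, 0)
    q = struct.unpack_from(f"<{n_res}h", blob, 4)
    lpc = struct.unpack_from(f"<{n_lpc}f", blob, 4 + 2 * n_res)
    return [s / 32767 for s in q], list(lpc)


class DiphoneStore:
    """Hypothetical compressed packet with an index of diphone names."""

    def __init__(self):
        self._packet = bytearray()
        self._index = {}  # diphone name -> (offset, length) in the packet

    def add(self, name, residual, lpc):
        blob = encode_entry(residual, lpc)
        self._index[name] = (len(self._packet), len(blob))
        self._packet += blob

    def fetch(self, name):
        """Serve a synthesizer request for one diphone."""
        offset, length = self._index[name]                   # locate
        blob = bytes(self._packet[offset:offset + length])   # extract
        return decode_entry(blob)                            # decompress
```

A waveform synthesizer would call fetch("h-e") and receive the decoded residual together with the stored LPC coefficients. Only the requested entry is decoded, which is the point of locating it inside the packet first.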
12. A system for compressing and using concatenative speech databases in text-to-speech systems comprising:
a text-to-speech system;
a concatenative speech database; and
a coder.
18. A method of producing a compressed concatenative diphone database comprising:
compressing diphone waveforms and generating linear predictive coding (LPC) coefficients by applying an audio encoder to the diphone waveforms; and
storing compressed packets produced by the audio encoder and uncompressed pitch mark values as a compressed concatenative diphone database.
20. The method for a handheld device with a text-to-speech system using a compressed concatenative diphone database comprising:
compressing diphone waveforms into diphone residuals and generating linear predictive coding (LPC) coefficients by applying an audio encoder to the diphone waveforms;
storing compressed packets produced by the audio encoder and uncompressed pitch mark values as a compressed concatenative diphone database;
decompressing the compressed concatenative diphone database by applying an audio decoder to the diphone residuals and the LPC coefficients; and
synthesizing the decompressed concatenative diphone database including the uncompressed pitch mark values to produce an output by applying a waveform synthesizer.
Dependent claims: 21, 22, 24, 25, 26, 27
23. A concatenative speech database structure comprising:
diphone waveforms indicating smallest units of speech for efficient text-to-speech conversion that are derived from phonemes;
linear predictive coefficients of a difference equation for characterizing formants; and
pitch mark values marking positions in an utterance indicating varying pitch.
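The three elements recited in claim 23 can be pictured as one record per diphone. The sketch below is an assumption made for illustration: the class names, the field names, and the choice to store pitch marks as sample indices are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class DiphoneEntry:
    """Hypothetical record holding the three recited elements."""
    name: str                 # e.g. "h-e" for the transition from h to e
    waveform: list            # diphone samples
    lpc_coefficients: list    # coefficients of the prediction equation
    pitch_marks: list         # sample indices marking pitch positions

    def pitch_periods(self):
        """Distances between consecutive pitch marks, in samples."""
        return [b - a for a, b in zip(self.pitch_marks, self.pitch_marks[1:])]


@dataclass
class ConcatenativeSpeechDatabase:
    """Hypothetical container keyed by diphone name."""
    entries: dict = field(default_factory=dict)

    def add(self, entry):
        self.entries[entry.name] = entry

    def lookup(self, name):
        return self.entries[name]
```

Because the pitch marks sit in their own field, a synthesizer can read them directly without decoding the waveform data, which is consistent with claims 18 and 20 storing them uncompressed.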
Specification