Memory usage in a text-to-speech system
Abstract
In a concatenative text-to-speech system, a high compression rate for the duration data in the prosodic template is achieved by extracting statistical parameters that describe the behavior of the actual duration values of the instances of each given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, and storing only the extracted statistical parameters instead of the original duration values. The entries of each given basic unit in the prosodic template are sorted and indexed in order of increasing duration value. Consequently, the amount of duration data can be significantly reduced while keeping the error statistically within an acceptable range.
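The compression idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the choice of statistical parameters (count, mean, standard deviation), the normal-quantile estimator, and the names `compress_durations` and `estimate_duration` are all assumptions made for illustration, and the duration values are invented sample data.

```python
from statistics import NormalDist, mean, pstdev

def compress_durations(durations_ms):
    """Replace one unit's duration values with (count, mean, std).

    Assumption: these three parameters stand in for the "statistical
    parameters" of the abstract; the source does not name them.
    """
    values = sorted(durations_ms)  # entries are indexed by increasing duration
    return len(values), mean(values), pstdev(values)

def estimate_duration(stats, index):
    """Estimate the duration of the entry at `index` in the sorted order.

    Assumption: a normal quantile function plays the role of the
    statistical function used for decompression.
    """
    count, mu, sigma = stats
    if sigma == 0:
        return mu  # all instances had the same duration
    quantile = (index + 0.5) / count  # midpoint rank of the sorted entry
    return NormalDist(mu, sigma).inv_cdf(quantile)

# Invented durations (in milliseconds) for the instances of one basic unit.
durations = [112.0, 95.0, 130.0, 101.0, 120.0, 108.0, 141.0, 99.0]
stats = compress_durations(durations)

# Decompress: one estimate per sorted index, then compare with the originals.
estimates = [estimate_duration(stats, i) for i in range(stats[0])]
errors = [abs(e - d) for e, d in zip(estimates, sorted(durations))]
print(stats)
print(max(errors))
```

In this sketch three numbers are stored per unit regardless of how many instances it has, and the sorted index alone is enough to recover an approximate duration. Whether the resulting error is acceptable depends on how closely the real durations follow the assumed distribution.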
27 Claims
1. A method of creating prosodic information for a concatenative text-to-speech synthesis system, comprising
analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, compressing the first duration information by producing statistical data describing the behavior of the first duration information, storing said prosodic information wherein the first duration information is replaced by said statistical data, thereby reducing a memory capacity required for storing said prosodic information.
6. A method for concatenative text-to-speech synthesis, comprising
inputting a text, analyzing the text and producing phonetic presentation of the text, selecting from a memory, based on said phonetic presentation, prestored prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, decompressing said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, selecting, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
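The synthesis-side steps of claim 6 (look up compressed duration data, estimate durations, select an acoustic unit) can be sketched as follows. This is a hedged illustration only: the tables `TEMPLATE` and `ACOUSTIC_DB`, the unit names, the target durations, and the closest-match selection rule are hypothetical, and the normal-quantile estimator is an assumed stand-in for the claimed statistical function.

```python
from statistics import NormalDist

# Hypothetical compressed prosodic template: unit -> (count, mean_ms, std_ms).
TEMPLATE = {"ka": (4, 110.0, 12.0), "to": (3, 90.0, 8.0)}

# Hypothetical acoustic database: unit -> stored unit ids, one per sorted index.
ACOUSTIC_DB = {
    "ka": ["ka_0", "ka_1", "ka_2", "ka_3"],
    "to": ["to_0", "to_1", "to_2"],
}

def estimate_duration(stats, index):
    """Decompress: estimate the duration of the entry at a sorted index."""
    count, mu, sigma = stats
    if sigma == 0:
        return mu
    return NormalDist(mu, sigma).inv_cdf((index + 0.5) / count)

def select_unit(unit, target_ms):
    """Pick the stored unit whose estimated duration is closest to the target.

    Assumption: a target duration and a closest-match rule are used here;
    the claim only says selection is based on the estimate.
    """
    stats = TEMPLATE[unit]
    best = min(
        range(stats[0]),
        key=lambda i: abs(estimate_duration(stats, i) - target_ms),
    )
    return ACOUSTIC_DB[unit][best]

def synthesize_plan(phonetic_units, targets_ms):
    """Return the ids of the units to concatenate, in order."""
    return [select_unit(u, t) for u, t in zip(phonetic_units, targets_ms)]

plan = synthesize_plan(["ka", "to"], [125.0, 90.0])
print(plan)
```

Only the statistical parameters are read from the template at synthesis time; the per-instance durations are reconstructed on demand, which is where the memory saving comes from.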
11. A device for a concatenative text-to-speech synthesis, comprising
a text analyzer producing phonetic presentation of a text input;
a memory storing a lexicon for the text analyzer, voice data including acoustic units, and associated prosodic information for selection of said acoustic units, said prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of each syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, decompressor decompressing said compressed duration information by a predetermined statistical function producing an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed based on the statistical data;
a selector selecting, based on the estimation of said first duration information and other prosodic information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
16. A mobile communication device, comprising
a data processing unit;
a memory storing a lexicon for text analysis, voice data including acoustic units, and associated prosodic information for selection of said acoustic units, said prosodic information including compressed duration information in form of statistical data that describes behavior of first duration information of each syllable, and a program code that causes the data processing unit to analyze the text and producing phonetic presentation of a text input, to select from said memory, based on said phonetic presentation, compressed duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, to decompress said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, and to select, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
20. A data storage encoded with an executable program that, when run on a computing device, cause the device
to analyze the text and producing phonetic presentation of a text input, to select from said memory, based on said phonetic presentation, compressed duration information of a given syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, to decompress said compressed duration information by producing from said statistical data an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed by means of a statistical function, and to select, based on the estimation of said first duration information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
22. A device for creating prosodic information for a concatenative text-to-speech synthesis system, comprising
analyzer analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, compressor compressing the first duration information by producing statistical data describing the behavior of the first duration information, memory storing said prosodic information wherein the first duration information is replaced by said statistical data, thereby reducing a memory capacity required for storing said prosodic information.
27. A concatenative text-to-speech synthesis system, comprising
means analysing training speech samples and generating acoustic units and associated prosodic information for selection of said acoustic units, said prosodic information including first duration information, means compressing the first duration information by producing statistical data describing the behavior of the first duration information of each syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed, means storing a lexicon for the text analyzer, voice data including said acoustic units, and said associated prosodic information containing said compressed duration information, means producing phonetic presentation of a text input;
means decompressing said compressed duration information by a predetermined statistical function producing an estimation of said first duration information of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed based on the statistical data;
means selecting, based on the estimation of said first duration information and other prosodic information, a stored acoustic unit of the syllable, phoneme, half-phoneme, diphone, triphone or any other basic speech unit employed from an acoustic data database to be concatenated to form synthetic speech.
Specification