SPEECH PROCESSING APPARATUS, METHOD, AND COMPUTER PROGRAM PRODUCT
Abstract
A method for generating a pitch contour for speech synthesis is proposed. The method finds the pitch contour that maximizes a total likelihood function formed by combining the statistical models of all the pitch contour segments of an utterance, at one or more linguistic levels. These statistical models are trained from a database of recorded speech by means of a decision tree that, for each linguistic level, clusters the parametric representations of the pitch segments extracted from the speech data using features obtained from the text associated with that speech data. The pitch segments are parameterized in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, which allows the maximization to be carried out with respect to the parameters of that level. Moreover, the parameterization of that main level must be invertible, so that the final pitch contour can be obtained from the parameters of that level by means of an inverse transformation.
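The maximization described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pitch segment model is a Gaussian with diagonal covariance and that the parameters of every linguistic level are a linear function of the main-level parameters; neither assumption is stated in the abstract. Under those assumptions the maximizer has a closed form, obtained by solving one linear system. The names `solve`, `max_likelihood_params`, `syllable_level`, and `word_level` are hypothetical, and the final inverse transformation to a pitch contour is omitted.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def max_likelihood_params(levels, n):
    """Maximize the summed Gaussian log-likelihood over all levels.

    levels: list of (A, mean, var), where A maps the n main-level
    parameters to that level's parameters (assumed linear), and mean/var
    describe the assumed diagonal Gaussian model selected for each segment.
    """
    lhs = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for A, mean, var in levels:
        for row, mu, v in zip(A, mean, var):
            for i in range(n):
                rhs[i] += row[i] * mu / v
                for j in range(n):
                    lhs[i][j] += row[i] * row[j] / v
    return solve(lhs, rhs)


# Toy utterance: one word made of two syllables, one scalar parameter each.
# All numbers are made up for illustration.
syllable_level = ([[1.0, 0.0], [0.0, 1.0]], [120.0, 100.0], [4.0, 4.0])
word_level = ([[0.5, 0.5]], [114.0], [1.0])  # word parameter = syllable average
params = max_likelihood_params([syllable_level, word_level], n=2)
print(params)
```

In this toy case the syllable models alone would give an average of 110, while the word model prefers 114; the combined solution shifts both syllable parameters upward by the same amount, so the result balances the two levels according to their variances.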
14 Claims
1. A speech processing apparatus, comprising:
a segmenting unit configured to divide a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
a parameterizing unit configured to generate a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and to generate a group of first parameters in correspondence with the linguistic level;
a descriptor generating unit configured to generate a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
a model learning unit configured to classify the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and to learn for each of the clusters a pitch segment model for the linguistic level; and
a storage unit configured to store the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A speech processing method, comprising:
dividing a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with the linguistic level;
generating a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
classifying the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and learning for each of the clusters a pitch segment model for the linguistic level; and
storing, in a storage unit, the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models.
14. A computer program product having a computer readable medium including programmed instructions for processing speech, wherein the instructions, when executed by a computer, cause the computer to perform:
dividing a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with the linguistic level;
generating a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
classifying the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and learning for each of the clusters a pitch segment model for the linguistic level; and
storing, in a storage unit, the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models.
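The dividing and parameterizing steps recited in the claims above can be sketched as follows. This is only an illustration under stated assumptions: the claims require "a predetermined invertible operator such as a linear transform" but do not name one, so an orthonormal DCT-II is used here as one possible choice. The frame boundaries stand in for the text-to-speech alignment and, like the F0 values and all function names, are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix (n x n); its inverse is its transpose."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows


def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def segment_f0(f0, boundaries):
    """Split an F0 track into pitch segments at [start, end) frame boundaries."""
    return [f0[start:end] for start, end in boundaries]


def parameterize(segment):
    """Forward transform: pitch segment -> parameter vector."""
    return matvec(dct_matrix(len(segment)), segment)


def reconstruct(params):
    """Inverse transform: parameter vector -> pitch segment."""
    return matvec(transpose(dct_matrix(len(params))), params)


# Made-up F0 values (Hz) and a hypothetical syllable-level alignment.
f0 = [120.0, 125.0, 131.0, 128.0, 110.0, 105.0, 102.0]
syllable_bounds = [(0, 4), (4, 7)]

segments = segment_f0(f0, syllable_bounds)
params = [parameterize(s) for s in segments]
restored = [reconstruct(p) for p in params]
```

Because the transform is invertible, `restored` matches `segments` up to floating-point rounding, which is the property that lets a contour be recovered from main-level parameters. Descriptor generation and decision-tree clustering are not covered by this sketch.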
Specification