Speech processing apparatus, method, and computer program product for synthesizing speech
Abstract
A speech processing apparatus includes a segmenting unit that divides a fundamental frequency signal of a speech signal corresponding to an input text into pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal. Character strings of the input text are divided into the samples based on each linguistic level. A parameterizing unit generates a parametric representation of the pitch segments using a predetermined invertible operator and generates a group of first parameters in correspondence with each linguistic level. A descriptor generating unit generates, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text. A model learning unit classifies the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level.
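The segmenting and parameterizing steps summarized above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the linguistic level is the syllable, that the alignment is given as frame index ranges, and that the invertible operator is an orthonormal discrete cosine transform (one possible linear transform; the abstract does not name a specific operator). The function names `segment_f0`, `dct`, and `idct` and the F0 values are hypothetical.

```python
import math

def dct(x):
    """Orthonormal DCT-II: one example of an invertible linear transform."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs

def idct(c):
    """Inverse of the orthonormal DCT-II above (recovers the pitch segment)."""
    n = len(c)
    samples = []
    for t in range(n):
        s = math.sqrt(1.0 / n) * c[0]
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        samples.append(s)
    return samples

def segment_f0(f0, alignment):
    """Cut the F0 contour into pitch segments at the aligned frame ranges."""
    return [f0[start:end] for start, end in alignment]

# Hypothetical F0 contour in Hz, one value per frame.
f0 = [120.0, 125.0, 130.0, 128.0, 110.0, 105.0, 100.0, 98.0]
# Assumed alignment: two syllables covering frames 0-3 and 4-7.
alignment = [(0, 4), (4, 8)]

segments = segment_f0(f0, alignment)
# "First parameters" for the syllable level: one coefficient vector per segment.
params = [dct(seg) for seg in segments]
# Because the operator is invertible, each segment can be recovered.
restored = [idct(p) for p in params]
```

Repeating the same procedure with word-level or phrase-level alignments would yield one group of parameters per linguistic level.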
14 Claims
1. A speech processing apparatus, comprising:
a segmenting unit configured to divide a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
a parameterizing unit configured to generate a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generate a group of first parameters in correspondence with each linguistic level;
a descriptor generating unit configured to generate, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
a model learning unit configured to classify the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learn, for each of the clusters, a pitch segment model for the linguistic level; and
a storage unit configured to store the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the sample, for the linguistic level and the pitch segment models.
13. A speech processing method, comprising:
dividing a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with each linguistic level;
generating, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
classifying the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learning, for each of the clusters, a pitch segment model for the linguistic level;
storing the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the samples for the linguistic level and the pitch segment models in a storage unit.
14. A non-transitory computer-readable medium including programmed instructions for processing speech, wherein the instructions, when executed by a computer, cause the computer to perform:
dividing a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with each linguistic level;
generating, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
classifying the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learning, for each of the clusters, a pitch segment model for the linguistic level;
storing the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the samples for the linguistic level and the pitch segment models in a storage unit.
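The classifying, learning, and storing steps recited in the claims can be illustrated with a minimal sketch for a single linguistic level. Several simplifications are assumed here and are not taken from the claims: clusters are formed by exact match on selected descriptor features (a stand-in for whatever clustering the apparatus actually uses), each pitch segment model is a per-dimension mean and variance, all parameter vectors have the same fixed length, and the mapping rules are a plain dictionary. The names `learn_pitch_segment_models`, `select_model`, and the feature names `stressed` and `position` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance

def learn_pitch_segment_models(samples, key_features):
    """Cluster parameter vectors by descriptor and fit one model per cluster.

    samples: list of (descriptor, params) pairs, where descriptor is a dict of
    features and params is a fixed-length list of first parameters.
    """
    clusters = defaultdict(list)
    for descriptor, params in samples:
        # Mapping rule (assumed): the cluster key is the tuple of chosen features.
        key = tuple(descriptor[f] for f in key_features)
        clusters[key].append(params)

    models = {}
    for key, vectors in clusters.items():
        dims = list(zip(*vectors))  # regroup values by parameter dimension
        models[key] = {
            "mean": [fmean(d) for d in dims],
            "var": [pvariance(d) for d in dims],
            "count": len(vectors),
        }
    return models

def select_model(models, descriptor, key_features):
    """Apply the stored mapping rule to find the model for a new descriptor."""
    return models.get(tuple(descriptor[f] for f in key_features))

# Hypothetical syllable-level training data: descriptor plus two parameters.
key_features = ("stressed", "position")
samples = [
    ({"stressed": True, "position": "initial"}, [250.0, 4.0]),
    ({"stressed": True, "position": "initial"}, [260.0, 6.0]),
    ({"stressed": False, "position": "final"}, [200.0, -8.0]),
]

# The returned dictionary plays the role of the storage unit's contents.
models = learn_pitch_segment_models(samples, key_features)
model = select_model(models, {"stressed": True, "position": "initial"}, key_features)
```

In this sketch the dictionary keys serve as the mapping rules between descriptors and pitch segment models; a separate dictionary would be learned and stored for each linguistic level.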
Specification