SPEECH PROCESSING APPARATUS, METHOD, AND COMPUTER PROGRAM PRODUCT

US 20090248417A1
Filed: 03/17/2009
Published: 10/01/2009
Est. Priority Date: 04/01/2008
Status: Active Grant

Abstract

A method to generate a pitch contour for speech synthesis is proposed. The method is based on finding the pitch contour that maximizes a total likelihood function created by the combination of all the statistical models of the pitch contour segments of an utterance, at one or multiple linguistic levels. These statistical models are trained from a database of spoken speech, by means of a decision tree that for each linguistic level clusters the parametric representation of the pitch segments extracted from the spoken speech data with some features obtained from the text associated with that speech data. The parameterization of the pitch segments is performed in such a way that the likelihood function of any linguistic level can be expressed in terms of the parameters of one of the levels, thus allowing the maximization to be calculated with respect to the parameters of that level. Moreover, the parameterization of that main level has to be invertible so that the final pitch contour is obtained from the parameters of that level by means of an inverse transformation.
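
The invertibility requirement can be illustrated with a discrete cosine transform, one of the transforms the claims list as an option. The following is a minimal sketch, not the patented implementation: the function names `dct` and `idct`, the orthonormal scaling, and the sample log-F0 values are all assumptions made for illustration.

```python
import math

def dct(segment):
    """Orthonormal DCT-II of a list of pitch samples (assumed parameterization)."""
    n = len(segment)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(segment))
        coeffs.append(scale * total)
    return coeffs

def idct(coeffs):
    """Inverse of dct(): rebuilds the pitch samples from the coefficients."""
    n = len(coeffs)
    samples = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        samples.append(total)
    return samples

# Hypothetical log-F0 values for one pitch segment.
segment = [5.0, 5.1, 5.3, 5.2, 5.0, 4.8]
restored = idct(dct(segment))
```

Because the transform is invertible, `restored` matches `segment` up to floating-point error, which is the property that lets a pitch contour be recovered from optimized parameters.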

14 Claims

1. A speech processing apparatus, comprising:
- a segmenting unit configured to divide a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
  
  a parameterizing unit configured to generate a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generates a group of first parameters in correspondence with the linguistic level;
  
  a descriptor generating unit configured to generate a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
  
  a model learning unit configured to classify the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and learns for each of the clusters a pitch segment model for the linguistic level; and
  
  a storage unit configured to store the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models.
  - 2. The apparatus according to claim 1, wherein the segmenting unit further includes a re-sampling unit configured to extract, from the fundamental frequency, a plurality of pitch frequencies that match a predetermined condition, an interpolating unit configured to perform an interpolation of the pitch frequencies extracted by the re-sampling unit and smooth the fundamental frequency, and the segmenting unit divides the interpolated pitch contour into the segments that correspond to the linguistic level.
  - 3. The apparatus according to claim 1, wherein in addition to the invertible parametric representation, the parameterizing unit further includes an additional description-parameter calculating unit configured to calculate a set of description parameters representing further characteristics of the first set of parameters such as their variance, in such a way that the model learning unit conducts learning with respect to an expanded parameter obtained by combining for each unit of the linguistic level, the first parameter set with its associated description parameter set.
  - 4. The apparatus according to claim 1, wherein in addition to the invertible parametric representation, the parameterizing unit further comprises an additional concatenation parameter calculating unit configured to calculate a set of concatenation parameters representing the relationship between adjacent pitch segments of the linguistic level such as the primary derivative of the average of the fundamental frequency of current and adjacent pitch segments, or the gradient of the fundamental frequency at the connection point of the pitch segments for the linguistic level, wherein the model learning unit conducts learning with respect to an expanded parameter obtained by combining for each unit of the linguistic level, the first parameter set with its associated concatenation parameter set.
  - 5. The apparatus according to claim 1, wherein the model learning unit classifies the parametric representation of the pitch segments of the linguistic level into groups by means of a decision tree that uses the set of features contained in the descriptor generated by the descriptor generating unit.
  - 6. The apparatus according to claim 5, wherein the decision tree classifies the parametric representation of the pitch segments in such a way as to minimize the total mean square error in the non-transformed pitch contour space, the error being calculated from the first set of parameter of the pitch segments and their associated duration.
  - 7. The apparatus according to claim 5, wherein the decision tree classifies the parametric representation of the pitch segments in such a way as to maximize the total logarithmic likelihood (log-likelihood), the log-likelihood being calculated from the parametric representation of the pitch segments and their associated duration.
  - 8. The apparatus according to claim 1, wherein the linguistic level relates to any one of a frame, a phoneme, a syllable, a word, a phrase, a breath group, an utterance, or any combination thereof.
  - 9. The apparatus according to claim 1, wherein the transform is any one of invertible linear transforms including a discrete cosine transform, a Fourier transform, a wavelet transform, a Taylor expansion, and a polynomial expansion.
  - 10. The apparatus according to claim 1, further comprising:
    - a selecting unit configured to select from the storage unit a pitch segment model corresponding to each descriptor, for a single linguistic level or a plurality of linguistic levels;
      
      an objective function generating unit configured to generate an objective function from a group of pitch segment models selected for each linguistic level;
      
      an objective function maximizing unit configured to generate a set of first parameters corresponding to character strings of the reference linguistic level that maximize a weighted sum of the objective functions of each linguistic level with respect to the first parameter set of a reference linguistic level; and
      
      an inverse transform performing unit configured to perform an inverse transform on the first parameter set generated from the maximization of the objective function by the maximizing unit, and generates a pitch contour.
  - 11. The apparatus according to claim 10, wherein the objective functions generated by the objective function generating unit are defined in terms of the first parameter set of the reference linguistic level.
  - 12. The apparatus according to claim 11, wherein the objective function generating unit generates the objective function of the linguistic level as a likelihood function of the first parameters of the reference linguistic level.
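
The maximization recited in claims 10 to 12 can be pictured with a simplified case. The sketch below assumes that each linguistic level contributes a Gaussian model with diagonal covariance, already expressed in the parameter space of the reference level; under that assumption the weighted sum of log-likelihoods has a closed-form maximum, a precision-weighted average of the means. The function name `maximize_weighted_loglik` and all numbers are hypothetical, and the claims do not restrict the models to this form.

```python
def maximize_weighted_loglik(models, weights):
    """Return the parameter vector maximizing a weighted sum of Gaussian
    log-likelihoods.

    models:  list of (means, variances) pairs, one per linguistic level,
             all assumed to live in the reference-level parameter space.
    weights: one non-negative weight per level.
    """
    dim = len(models[0][0])
    best = []
    for d in range(dim):
        # Setting the derivative of sum_l w_l * -(c - m_l)^2 / (2 v_l)
        # to zero gives c = sum(w m / v) / sum(w / v).
        num = sum(w * m[d] / v[d] for (m, v), w in zip(models, weights))
        den = sum(w / v[d] for (m, v), w in zip(models, weights))
        best.append(num / den)
    return best

# Hypothetical models: (means, variances) for two levels.
syllable_model = ([5.0, 0.2], [0.04, 0.01])
word_model = ([5.2, 0.0], [0.04, 0.04])
params = maximize_weighted_loglik([syllable_model, word_model], [1.0, 1.0])
```

In the first dimension both levels are equally confident, so the result lies halfway between the two means; in the second, the lower-variance syllable model pulls the result toward its own mean. An inverse transform would then turn `params` back into a pitch contour.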

13. A speech processing method, comprising:
- dividing a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
  
  generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with the linguistic level;
  
  generating a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
  
  classifying the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and learning for each of the clusters a pitch segment model for the linguistic level;
  
  storing the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models in a storage unit.
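
The dividing step can be sketched as slicing a frame-level fundamental frequency track at the boundaries supplied by a text-to-speech alignment. This is a minimal illustration under stated assumptions: the function name `segment_pitch`, the fixed frame shift, the alignment given as (start, end) times in seconds, and the F0 values are all hypothetical.

```python
def segment_pitch(f0, frame_shift, boundaries):
    """Split a frame-level F0 track into pitch segments.

    f0:          list of F0 values, one per analysis frame.
    frame_shift: time between frames in seconds (assumed constant).
    boundaries:  list of (start_sec, end_sec) pairs, one per unit of the
                 chosen linguistic level, taken from an alignment.
    """
    segments = []
    for start, end in boundaries:
        first = int(round(start / frame_shift))
        last = int(round(end / frame_shift))
        segments.append(f0[first:last])
    return segments

# Hypothetical F0 track in Hz, 10 ms frames, two aligned syllables.
f0 = [120, 122, 125, 127, 130, 128, 126, 123, 121, 119]
syllables = [(0.00, 0.04), (0.04, 0.10)]
pitch_segments = segment_pitch(f0, 0.01, syllables)
```

Each resulting segment would then be passed to the parameterization step separately.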

14. A computer program product having a computer readable medium including programmed instructions for processing speech, wherein the instructions, when executed by a computer, cause the computer to perform:
- dividing a fundamental frequency of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between character strings of each linguistic level included in the input text and the speech signal;
  
  generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with the linguistic level;
  
  generating a descriptor which consists of a set of features describing the character strings, for each of the character strings in the linguistic level included in the input text;
  
  classifying the first parameters of the linguistic level of all the speech signal in the database into clusters based on the descriptor corresponding to the linguistic level, and learning for each of the clusters a pitch segment model for the linguistic level;
  
  storing the pitch segment models for each linguistic level together with the mapping rules between the descriptors describing the features of the character strings for the linguistic level, and the pitch segment models in a storage unit.
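
The classifying step can be illustrated by a single split of a decision tree that asks a yes/no question about the descriptor. The sketch below is a simplification: it uses one scalar parameter per segment and plain squared error, whereas the claims describe criteria that also involve segment duration. The names `sse` and `best_question`, the descriptor features, and the data are assumptions for illustration only.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_question(samples, questions):
    """Pick the descriptor question whose split has the lowest squared error.

    samples:   list of (descriptor dict, scalar parameter) pairs.
    questions: list of (feature name, value) pairs, each read as
               "does descriptor[feature] equal value?".
    Returns (cost, feature, value), or None if no question splits the data.
    """
    best = None
    for feature, value in questions:
        yes = [p for d, p in samples if d.get(feature) == value]
        no = [p for d, p in samples if d.get(feature) != value]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        cost = sse(yes) + sse(no)
        if best is None or cost < best[0]:
            best = (cost, feature, value)
    return best

# Hypothetical syllable descriptors paired with a mean log-F0 parameter.
samples = [
    ({"stressed": True, "position": "initial"}, 5.4),
    ({"stressed": True, "position": "final"}, 5.3),
    ({"stressed": False, "position": "initial"}, 5.0),
    ({"stressed": False, "position": "final"}, 4.9),
]
questions = [("stressed", True), ("position", "final")]
choice = best_question(samples, questions)
```

With this data the stress question separates high from low values far better than the position question, so it is selected. A full tree would repeat the split on each resulting cluster and fit a pitch segment model per leaf.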

Current Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee
Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors
Latorre, Javier; Akamine, Masami

Granted Patent

US 8,407,053 B2
US Class Current

704/260
CPC Class Codes

G10L 13/0335 Pitch control

G10L 13/10 Prosody rules derived from ...
