Speech processing apparatus, method, and computer program product for synthesizing speech

US 8,407,053 B2
Filed: 03/17/2009
Issued: 03/26/2013
Est. Priority Date: 04/01/2008
Status: Expired due to Fees


Abstract

A speech processing apparatus includes a segmenting unit that divides a fundamental frequency signal of a speech signal corresponding to an input text into pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal. Character strings of the input text are divided into the samples based on each linguistic level. A parameterizing unit generates a parametric representation of the pitch segments using a predetermined invertible operator and generates a group of first parameters in correspondence with each linguistic level. A descriptor generating unit generates, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text, and a model learning unit classifies the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level.
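The segmenting and parameterizing steps described in the abstract can be illustrated with a minimal sketch. It assumes the linguistic level is the syllable, that the alignment is already available as frame-index pairs, and that the invertible operator is an orthonormal discrete cosine transform (one of the transforms named in claim 9). The function names, the F0 values, and the boundaries are hypothetical and are not taken from the patent.

```python
import math

def dct(segment):
    """Orthonormal DCT-II of one pitch segment (a list of F0 values in Hz)."""
    n = len(segment)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(segment))
        coeffs.append(scale * total)
    return coeffs

def idct(coeffs):
    """Inverse of dct(): recovers the F0 samples from the coefficients."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

def split_by_alignment(f0, boundaries):
    """Cut the F0 contour at the (start, end) frame indices of each sample."""
    return [f0[start:end] for start, end in boundaries]

# Hypothetical F0 contour (Hz) and syllable alignment (frame indices).
f0 = [120.0, 125.0, 130.0, 128.0, 110.0, 105.0, 100.0]
syllable_frames = [(0, 4), (4, 7)]

segments = split_by_alignment(f0, syllable_frames)
params = [dct(seg) for seg in segments]   # "first parameters" per segment
restored = [idct(c) for c in params]      # invertibility check
```

Because the transform is invertible, `restored` matches `segments` up to floating-point rounding, which is the property that would later allow a pitch contour to be regenerated from modeled parameters.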

14 Claims

1. A speech processing apparatus, comprising:
- a segmenting unit configured to divide a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
  
  a parameterizing unit configured to generate a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generate a group of first parameters in correspondence with each linguistic level;
  
  a descriptor generating unit configured to generate, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
  
  a model learning unit configured to classify the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learn, for each of the clusters, a pitch segment model for the linguistic level; and
  
  a storage unit configured to store the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the sample, for the linguistic level and the pitch segment models.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The apparatus according to claim 1, wherein the segmenting unit further comprises:
    - a re-sampling unit configured to extract, from the fundamental frequency, a plurality of pitch frequencies that match a predetermined condition,
      
      an interpolating unit configured to perform an interpolation of the pitch frequencies extracted by the re-sampling unit and smooth the fundamental frequency to obtain an interpolated pitch contour, wherein
      
      the segmenting unit divides the interpolated pitch contour into the pitch segments that correspond to the linguistic level.
  - 3. The apparatus according to claim 1, wherein in addition to the invertible parametric representation, the parameterizing unit further includes an additional description-parameter calculating unit configured to calculate a set of description parameters representing further characteristics of the first parameters such as their variance, so that the model learning unit conducts learning with respect to an expanded parameter obtained by combining, for each linguistic level, the first parameters, with its associated description parameter set.
  - 4. The apparatus according to claim 1, wherein in addition to the invertible parametric representation, the parameterizing unit further comprises an additional concatenation parameter calculating unit configured to calculate a set of concatenation parameters representing a relationship between adjacent pitch segments of the linguistic level including a primary derivative of the average of the fundamental frequency of current and adjacent pitch segments, or a gradient of the fundamental frequency at a connection point of the pitch segments for the linguistic level, wherein the model learning unit conducts learning with respect to an expanded parameter obtained by combining, for each linguistic level, the first parameters with its associated concatenation parameter set.
  - 5. The apparatus according to claim 1, wherein the model learning unit classifies the parametric representation of the pitch segments of each linguistic level into groups by means of a decision tree that uses the set of features contained in the descriptor generated by the descriptor generating unit.
  - 6. The apparatus according to claim 5, wherein the decision tree classifies the parametric representation of the pitch segments so as to minimize a total mean square error in a non-transformed pitch contour space, the error being calculated from the first parameters of the pitch segments and their associated duration.
  - 7. The apparatus according to claim 5, wherein the decision tree classifies the parametric representation of the pitch segments so as to maximize a total logarithmic likelihood (log-likelihood), the log-likelihood being calculated from the parametric representation of the pitch segments and their associated duration.
  - 8. The apparatus according to claim 1, wherein the linguistic level relates to any one of a frame, a phoneme, a syllable, a word, a phrase, a breath group, an utterance, or any combination thereof.
  - 9. The apparatus according to claim 1, wherein the transform is any one of invertible linear transforms including a discrete cosine transform, a Fourier transform, a wavelet transform, a Taylor expansion, and a polynomial expansion.
  - 10. The apparatus according to claim 1, further comprising:
    - a selecting unit configured to select from the storage unit a pitch segment model corresponding to each descriptor, for a single linguistic level or a plurality of linguistic levels;
      
      an objective function generating unit configured to generate an objective function from a group of pitch segment models selected for each linguistic level;
      
      an objective function maximizing unit configured to generate the first parameters corresponding to character strings of the reference linguistic level that maximize a weighted sum of the objective functions of each linguistic level with respect to the first parameters of a reference linguistic level; and
      
      an inverse transform performing unit configured to perform an inverse transform on the first parameters generated from the maximization of the objective function by the maximizing unit, and generate a pitch contour.
  - 11. The apparatus according to claim 10, wherein the objective functions generated by the objective function generating unit are defined in terms of the first parameters of the reference linguistic level.
  - 12. The apparatus according to claim 11, wherein the objective function generating unit is configured to generate the objective function of the linguistic level as a likelihood function of the first parameters of the reference linguistic level.
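The maximization described in claims 10 to 12 can be illustrated with a small sketch under strong simplifying assumptions: each selected pitch segment model is taken to be a Gaussian with diagonal covariance, and every level's model is already expressed over the same first parameters of the reference linguistic level. Under those assumptions the weighted sum of log-likelihoods has a closed-form maximizer, the precision-weighted mean. The function name, the model values, and the weights are hypothetical; the claims do not specify this particular form.

```python
def maximize_weighted_gaussians(models, weights):
    """Return the parameter vector c that maximizes
    sum_l w_l * log N(c; mean_l, diag(var_l)).

    models  : list of (mean_vector, variance_vector), one per linguistic level
    weights : list of non-negative weights, one per linguistic level
    """
    dim = len(models[0][0])
    best = []
    for d in range(dim):
        # Setting the derivative of the weighted quadratic to zero gives
        # c_d = sum(w * mean / var) / sum(w / var).
        num = sum(w * mean[d] / var[d] for (mean, var), w in zip(models, weights))
        den = sum(w / var[d] for (mean, var), w in zip(models, weights))
        best.append(num / den)
    return best

# Hypothetical models for two levels (for example syllable and word),
# both defined over two reference-level parameters.
level_models = [
    ([100.0, 2.0], [1.0, 4.0]),
    ([110.0, 0.0], [1.0, 4.0]),
]
equal = maximize_weighted_gaussians(level_models, [1.0, 1.0])
biased = maximize_weighted_gaussians(level_models, [3.0, 1.0])
```

With equal weights the result lies midway between the two means; raising the first weight pulls it toward the first model. In the claimed apparatus the resulting parameters would then pass through the inverse transform to produce a pitch contour.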

13. A speech processing method, comprising:
- dividing a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
  
  generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with each linguistic level;
  
  generating, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
  
  classifying the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learning, for each of the clusters, a pitch segment model for the linguistic level;
  
  storing the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the samples for the linguistic level and the pitch segment models in a storage unit.
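The classifying step of the method can be sketched as one decision-tree split, in the spirit of claim 5. This is a minimal illustration that assumes descriptors are dictionaries of yes/no features and that the split criterion is a plain sum of squared errors in parameter space, which simplifies the criteria of claims 6 and 7 (those also take segment duration into account). All names and data below are hypothetical.

```python
def sse(vectors):
    """Sum of squared distances of the vectors to their mean."""
    if not vectors:
        return 0.0
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return sum((v[d] - mean[d]) ** 2 for v in vectors for d in range(dim))

def best_question(samples, features):
    """Pick the yes/no feature whose split gives the lowest total error.

    samples  : list of (descriptor dict, parameter vector)
    features : names of boolean features to try as questions
    Returns (feature_name, total_error), or None if no feature splits the data.
    """
    best = None
    for name in features:
        yes = [p for desc, p in samples if desc[name]]
        no = [p for desc, p in samples if not desc[name]]
        if not yes or not no:
            continue  # this question does not separate the samples
        err = sse(yes) + sse(no)
        if best is None or err < best[1]:
            best = (name, err)
    return best

# Hypothetical syllable-level samples: descriptor features and parameters.
samples = [
    ({"stressed": True, "final": False}, [2.0, 1.0]),
    ({"stressed": True, "final": True}, [2.2, 0.8]),
    ({"stressed": False, "final": False}, [-1.0, 0.0]),
    ({"stressed": False, "final": True}, [-1.2, 0.1]),
]
chosen = best_question(samples, ["stressed", "final"])
```

Here the `stressed` question separates the parameter vectors far better than `final`, so it is chosen. Applying the same search recursively to each side would grow a tree whose leaves play the role of clusters, with one pitch segment model learned per leaf.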

14. A non-transitory computer-readable medium including programmed instructions for processing speech, wherein the instructions, when executed by a computer, cause the computer to perform:
- dividing a fundamental frequency signal of a speech signal corresponding to an input text into a plurality of pitch segments, based on an alignment between samples of at least one given linguistic level included in the input text and the speech signal, wherein character strings of the input text are divided into the samples based on each linguistic level;
  
  generating a parametric representation of the pitch segments by means of a predetermined invertible operator such as a linear transform, and generating a group of first parameters in correspondence with each linguistic level;
  
  generating, for each linguistic level, a descriptor that includes a set of features describing each sample in the input text;
  
  classifying the first parameters of each linguistic level of all speech signals in a memory into clusters based on the descriptor corresponding to the linguistic level, and learning, for each of the clusters, a pitch segment model for the linguistic level;
  
  storing the pitch segment models for each linguistic level together with mapping rules between the descriptors describing the features of the samples for the linguistic level and the pitch segment models in a storage unit.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Latorre, Javier; Akamine, Masami
Primary Examiner(s): Smits, Talivaldis Ivars
Application Number: US12/405,587
Publication Number: US 20090248417A1
Time in Patent Office: 1,470 Days
Field of Search: 704/260
US Class Current: 704/260
CPC Class Codes:
G10L 13/0335 Pitch control
G10L 13/10 Prosody rules derived from ...
