Speech synthesizer, speech synthesizing method and program product
Abstract
According to one embodiment, a speech synthesizer includes an analyzer, a first estimator, a selector, a generator, a second estimator, and a synthesizer. The analyzer analyzes text and extracts a linguistic feature. The first estimator selects a first prosody model adapted to the linguistic feature and estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model. The selector selects speech units that minimize a cost function determined in accordance with the prosody information. The generator generates a second prosody model that is a model of the prosody information of the speech units. The second estimator estimates prosody information that maximizes a third likelihood calculated on the basis of the first likelihood and a second likelihood representing probability of the second prosody model. The synthesizer generates synthetic speech by concatenating the speech units on the basis of the prosody information estimated by the second estimator.
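The second estimation step can be illustrated with a minimal sketch. The code below is not taken from the patent: it assumes each prosody model is a one-dimensional Gaussian over a single prosody value (for example, a unit duration in milliseconds), it assumes the "third likelihood" is a weighted sum of the two log-likelihoods, and it differentiates with respect to the prosody value itself, which is a simplification of the claim wording. The function names and the weights `w1` and `w2` are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-density of a 1-D Gaussian (assumed model form, not from the source)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def coupled_log_likelihood(x, first, second, w1, w2):
    """Linear coupling of the first and second log-likelihoods.

    `first` and `second` are (mean, variance) pairs; w1 and w2 are
    hypothetical coupling weights.
    """
    return (w1 * gaussian_log_likelihood(x, *first)
            + w2 * gaussian_log_likelihood(x, *second))


def reestimate(first, second, w1, w2):
    """Prosody value that maximizes the coupled log-likelihood.

    Setting the derivative with respect to x to zero gives a
    precision-weighted average of the two model means.
    """
    (m1, v1), (m2, v2) = first, second
    p1, p2 = w1 / v1, w2 / v2
    return (p1 * m1 + p2 * m2) / (p1 + p2)


first_model = (100.0, 400.0)   # broad model selected from linguistic features
second_model = (120.0, 100.0)  # narrower model built from selected speech units
x_star = reestimate(first_model, second_model, 0.5, 0.5)
```

Under these assumptions the re-estimated value is pulled toward whichever model has the smaller variance, so prosody that the selected speech units agree on closely dominates the result.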
5 Claims
1. A speech synthesizer comprising:
a processor;
an analyzer that performs a text analysis of an input document and extracts a linguistic feature used for prosody control;
a first estimator that selects a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information and that estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model;
a selector that selects, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated by the first estimator;
a generator that generates a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
a second estimator that re-estimates prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model; and
a synthesizer that generates synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated by the second estimator,
wherein the processor executes at least one of the analyzer, the first estimator, the selector, the generator, the second estimator, and the synthesizer.
Dependent claims: 2, 3
4. A speech synthesis method comprising:
performing a text analysis of an input document and extracting a linguistic feature used for prosody control;
selecting a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information, and first estimating in which prosody information that maximizes a first likelihood representing probability of the selected first prosody model is estimated;
selecting, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated in the first estimating;
generating a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
second estimating in which prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model is estimated; and
generating synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated in the second estimating.
5. A non-transitory computer readable medium including programmed instructions, wherein the instructions, when executed by a computer, cause the computer to perform:
performing a text analysis of an input document and extracting a linguistic feature used for prosody control;
selecting a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information, and first estimating in which prosody information that maximizes a first likelihood representing probability of the selected first prosody model is estimated;
selecting, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated in the first estimating;
generating a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
second estimating in which prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model is estimated; and
generating synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated in the second estimating.
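The candidate selection and second-model generation steps recited in the claims above can be sketched as follows. This is an illustrative toy under stated assumptions, not the claimed implementation: each speech unit is reduced to a single prosody value, the cost function (squared distance to the first-pass estimate plus a placeholder join penalty) is invented for the example, candidates are enumerated exhaustively rather than searched with dynamic programming, and the per-unit statistical model is assumed to be a mean and a floored variance. All function and parameter names are hypothetical.

```python
import heapq
from itertools import product
from statistics import fmean, pvariance


def string_cost(units, targets, join_weight=1.0):
    """Toy cost of one speech unit string (assumed form, not from the source)."""
    # Distance between each unit's prosody value and the first-pass estimate.
    target_cost = sum((u - t) ** 2 for u, t in zip(units, targets))
    # Placeholder join penalty between adjacent units.
    join_cost = sum(abs(a - b) for a, b in zip(units, units[1:]))
    return target_cost + join_weight * join_cost


def select_candidates(inventory, targets, n_best):
    """Return the n_best lowest-cost unit strings by exhaustive enumeration."""
    return heapq.nsmallest(
        n_best, product(*inventory), key=lambda s: string_cost(s, targets)
    )


def second_prosody_models(candidates, var_floor=1e-6):
    """Per-position (mean, variance) over the selected candidates."""
    return [
        (fmean(values), max(pvariance(values), var_floor))
        for values in zip(*candidates)
    ]


# Two positions, two stored units each; values are illustrative durations.
inventory = [[90.0, 110.0], [200.0, 230.0]]
targets = [100.0, 210.0]
candidates = select_candidates(inventory, targets, n_best=2)
models = second_prosody_models(candidates)
```

The variance floor keeps the model usable when every selected candidate shares the same unit at a position, which would otherwise give a zero variance.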
Specification