Method and system of runtime acoustic unit selection for speech synthesis
First Claim
1. A computer readable medium having stored thereon a speech synthesizer, comprising:
- a speech unit store generated according to the steps of:
obtaining an estimate of hidden Markov models (HMMs) for a plurality of speech units;
receiving training data as a plurality of speech waveforms;
segmenting the speech waveforms by performing the steps of:
obtaining text associated with the speech waveforms; and
converting the text into a speech unit string formed of a plurality of training speech units;
re-estimating the HMMs based on the training speech units, each HMM having a plurality of states, each state having a corresponding senone; and
repeating the steps of segmenting and re-estimating until a probability of the parameters of the HMMs generating the plurality of speech waveforms reaches a threshold level; and
mapping each waveform to one or more states and corresponding senones of the HMMs to form a plurality of instances corresponding to each training speech unit and storing the plurality of instances in the speech unit store; and
a speech synthesizer component configured to synthesize an input linguistic expression by performing the steps of:
converting the input linguistic expression into a sequence of input speech units;
generating a plurality of sequences of instances corresponding to the sequence of input speech units based on the plurality of instances in the speech unit store; and
generating speech based on one of the sequences of instances having a lowest dissimilarity between adjacent instances in the sequence of instances.
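The final two synthesis steps of the claim (forming candidate sequences of instances and choosing the one with the lowest dissimilarity between adjacent instances) can be illustrated with a minimal sketch. The claim does not prescribe a search procedure or a distance measure, so everything specific here is an assumption: each instance is reduced to a hypothetical pair of feature vectors (its first and last frame), dissimilarity is taken as Euclidean distance across each boundary, the per-boundary values are summed, and dynamic programming is used in place of enumerating every sequence. The names `Instance`, `join_cost`, and `select_instances` are illustrative only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Assumed representation: (features of first frame, features of last frame).
Instance = Tuple[Vector, Vector]


def euclidean(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def join_cost(left: Instance, right: Instance) -> float:
    """Assumed dissimilarity: end of the left instance vs. start of the right."""
    return euclidean(left[1], right[0])


def select_instances(candidates: List[List[Instance]]) -> Tuple[List[int], float]:
    """Pick one stored instance per input speech unit.

    candidates[n] holds the stored instances for the n-th input unit.
    Returns the chosen index for each unit and the summed boundary cost.
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every input speech unit needs at least one instance")

    # best[j]: lowest summed cost of any sequence ending in instance j.
    best = [0.0] * len(candidates[0])
    back: List[List[int]] = []

    for prev, cur in zip(candidates, candidates[1:]):
        new_best, pointers = [], []
        for inst in cur:
            costs = [best[i] + join_cost(p, inst) for i, p in enumerate(prev)]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min])
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final instance.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total


# Three input units with two stored instances each (1-D features for brevity).
example = [
    [([0.0], [1.0]), ([0.0], [5.0])],
    [([4.0], [2.0]), ([1.5], [9.0])],
    [([2.0], [0.0]), ([8.0], [0.0])],
]
chosen, cost = select_instances(example)
```

Because the summed cost decomposes over adjacent pairs, this search returns the same result as scoring every possible sequence, while doing work proportional to the number of units times the square of the instances per unit.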
Abstract
The present invention pertains to a concatenative speech synthesis system and method which produces more natural-sounding speech. The system provides for multiple instances of each acoustic unit which can be used to generate a speech waveform representing a linguistic expression. The multiple instances are formed during an analysis or training phase of the synthesis process and are limited to a robust representation of the highest probability instances. The provision of multiple instances enables the synthesizer to select the instance which closely resembles the desired instance, thereby eliminating the need to alter the stored instance to match the desired instance. This in essence minimizes the spectral distortion between the boundaries of adjacent instances, thereby producing more natural-sounding speech.
424 Citations
19 Claims
11. A method of performing speech synthesis, comprising:
obtaining an estimate of hidden Markov models (HMMs) for a plurality of speech units;
receiving training data as a plurality of speech waveforms;
segmenting the speech waveforms by performing the steps of:
obtaining text associated with the speech waveforms; and
converting the text into a speech unit string formed of a plurality of training speech units;
re-estimating the HMMs based on the training speech units, each HMM having a plurality of states, each state having a corresponding senone;
repeating the steps of segmenting and re-estimating until a probability of the parameters of the HMMs generating the plurality of speech waveforms reaches a threshold level;
mapping each waveform to one or more states and corresponding senones of the HMMs to form a plurality of speech unit instances corresponding to each training speech unit, and storing the plurality of speech unit instances;
receiving an input linguistic expression;
converting the input linguistic expression into a sequence of input speech units;
generating a plurality of sequences of instances corresponding to the sequence of input speech units based on the plurality of speech unit instances stored; and
generating speech based on one of the sequences of instances having a lowest dissimilarity between adjacent instances in the sequence of instances.
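The training portion of the method (segment, re-estimate, repeat until a probability threshold is reached, then store instances per unit) can be sketched as a loop. This is a heavily simplified illustration and not the claimed HMM training: each speech unit is modelled by a single mean with unit variance rather than a multi-state HMM with senones, frames are scalar values, and the negative summed squared error stands in for the log probability. The function names `segment`, `train`, and `collect_instances` and the `max_iters` safeguard are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Utterance = Tuple[List[float], List[str]]  # (frames, unit string from the text)


def segment(frames: List[float], units: List[str],
            means: Dict[str, float]) -> Tuple[List[int], float]:
    """Forced alignment: map each frame to a position in `units`, in order,
    with at least one frame per position. Returns (alignment, score)."""
    T, K = len(frames), len(units)
    if T < K:
        raise ValueError("need at least one frame per unit")
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]

    def d(t: int, k: int) -> float:
        return (frames[t] - means[units[k]]) ** 2

    cost[0][0] = d(0, 0)
    for t in range(1, T):
        for k in range(min(t + 1, K)):
            stay = cost[t - 1][k]
            move = cost[t - 1][k - 1] if k > 0 else INF
            prev_k, prev_cost = (k - 1, move) if move < stay else (k, stay)
            cost[t][k] = prev_cost + d(t, k)
            back[t][k] = prev_k

    align = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        align[t] = k
        k = back[t][k]
    return align, -cost[T - 1][K - 1]


def train(data: List[Utterance], means: Dict[str, float],
          threshold: float, max_iters: int = 20):
    """Alternate segmentation and re-estimation until the score reaches
    `threshold`. Alignments come from the last segmentation pass."""
    means = dict(means)
    alignments: List[List[int]] = []
    total = float("-inf")
    for _ in range(max_iters):
        total, alignments = 0.0, []
        sums: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for frames, units in data:
            align, score = segment(frames, units, means)
            alignments.append(align)
            total += score
            for x, k in zip(frames, align):
                sums[units[k]] = sums.get(units[k], 0.0) + x
                counts[units[k]] = counts.get(units[k], 0) + 1
        if total >= threshold:
            break
        for u in sums:  # re-estimate each unit's model from its frames
            means[u] = sums[u] / counts[u]
    return means, alignments, total


def collect_instances(data: List[Utterance],
                      alignments: List[List[int]]) -> Dict[str, List[List[float]]]:
    """Store every aligned segment as one instance of its speech unit."""
    store: Dict[str, List[List[float]]] = {}
    for (frames, units), align in zip(data, alignments):
        for k, u in enumerate(units):
            store.setdefault(u, []).append(
                [x for x, a in zip(frames, align) if a == k])
    return store


data = [([0.1, 0.0, 5.2, 4.8], ["a", "b"])]
means, alignments, score = train(data, {"a": 1.0, "b": 3.0}, threshold=-0.1)
store = collect_instances(data, alignments)
```

In this toy run the first pass scores below the threshold, the means are re-estimated from the aligned frames, and the second pass meets the threshold, after which each aligned segment is stored as an instance of its unit.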
Specification