Methods and apparatus for formant-based voice synthesis
Abstract
In one aspect, a method of processing a voice signal to extract information to facilitate training a speech synthesis model is provided. The method comprises acts of detecting a plurality of candidate features in the voice signal, performing at least one comparison between one or more combinations of the plurality of candidate features and the voice signal, and selecting a set of features from the plurality of candidate features based, at least in part, on the at least one comparison. In another aspect, the method is performed by executing a program encoded on a computer readable medium. In another aspect, a speech synthesis model is provided by, at least in part, performing the method.
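The selection step described above can be sketched as a small search over combinations of candidate features. This is a minimal illustration under stated assumptions, not the patented method: each candidate feature is treated as a frequency in Hz, `synthesize` is a hypothetical stand-in synthesizer that sums one sinusoid per feature, and the comparison is a plain sum of squared sample differences. None of these names or choices come from the source.

```python
from itertools import combinations
import math


def synthesize(feature_set, n_samples, sample_rate):
    """Hypothetical stand-in synthesizer (assumption): one unit-amplitude
    sinusoid per feature, where each feature is a frequency in Hz."""
    return [
        sum(math.sin(2.0 * math.pi * f * n / sample_rate) for f in feature_set)
        for n in range(n_samples)
    ]


def squared_error(a, b):
    """Comparison metric (assumption): sum of squared sample differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_feature_set(voice_signal, candidates, set_size, sample_rate):
    """Try every combination of `set_size` candidates, form a waveform from
    each, and keep the set whose waveform is closest to the voice signal."""
    best_set, best_err = None, math.inf
    for feature_set in combinations(candidates, set_size):
        waveform = synthesize(feature_set, len(voice_signal), sample_rate)
        err = squared_error(voice_signal, waveform)
        if err < best_err:
            best_set, best_err = feature_set, err
    return best_set, best_err
```

In this sketch the selected set is the one that would be passed on to model training. An exhaustive search is used only because it is easy to read; the number of combinations grows quickly with the number of candidates.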
27 Claims
1. A method of processing a voice signal to extract information to facilitate training a speech synthesis model for use with a formant-based text-to-speech synthesizer, the method comprising acts of:
detecting a plurality of candidate features in the voice signal;
grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
training the speech synthesis model based, at least in part, on the selected at least one of the plurality of candidate feature sets.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
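The act of forming a voice waveform from a candidate feature set is not spelled out in the claim. One conventional way to do it in a formant synthesizer is to pass an impulse train through a cascade of second-order digital resonators, one per formant. The sketch below assumes that approach; the function names, the (frequency, bandwidth) representation of a feature set, and the impulse-train excitation are all illustrative assumptions rather than details from the source.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2].
    Coefficients are chosen so the gain at 0 Hz is 1."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(
        2.0 * math.pi * freq_hz * t
    )
    a = 1.0 - b - c
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def form_waveform(formants, pitch_hz, n_samples, sample_rate):
    """Assumed excitation: one unit impulse per pitch period, filtered by
    each (frequency_hz, bandwidth_hz) formant in cascade."""
    period = round(sample_rate / pitch_hz)
    wave = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    for freq_hz, bandwidth_hz in formants:
        wave = resonator(wave, freq_hz, bandwidth_hz, sample_rate)
    return wave
```

Calling `form_waveform([(500, 60), (1500, 90)], 100, 800, 8000)` would produce 100 ms of a two-formant waveform at 8 kHz, which could then be compared against the recorded voice signal.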
10. A computer readable medium encoded with a program for execution on at least one processor, the program, when executed on the at least one processor, performing a method of processing a voice signal to extract information to facilitate training a speech synthesis model for use with a formant-based text-to-speech synthesizer, the method comprising acts of:
detecting a plurality of candidate features in the voice signal;
grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
training the speech synthesis model based, at least in part, on the selected at least one of the plurality of candidate feature sets.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A computer readable medium encoded with a speech synthesis model for use with a formant-based text-to-speech synthesizer adapted to, when operating, generate human recognizable speech, the speech synthesis model trained to generate the human recognizable speech, at least in part, by performing acts of:
detecting a plurality of candidate features in the voice signal;
grouping different combinations of the plurality of candidate features into a plurality of candidate feature sets;
forming a plurality of voice waveforms, each of the plurality of voice waveforms formed, at least in part, by processing a respective one of the plurality of candidate feature sets;
performing at least one comparison between the voice signal and each of the plurality of voice waveforms;
selecting at least one of the plurality of candidate feature sets based, at least in part, on the at least one comparison with the voice signal; and
training the speech synthesis model based, at least in part, on the selected at least one of the plurality of candidate feature sets.
Dependent claims: 20, 21, 22, 23, 24, 25, 26, 27
Specification