Method and System for Non-Parametric Voice Conversion
Abstract
A method and system are disclosed for non-parametric speech conversion. A text-to-speech (TTS) synthesis system may include hidden Markov model (HMM) based speech modeling for synthesizing output speech. A converted HMM may be initially set to a source HMM trained with a voice of a source speaker. A parametric representation of speech may be extracted from speech of a target speaker to generate a set of target-speaker vectors. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each HMM state of the source HMM to a target-speaker vector. The HMM states of the converted HMM may be replaced with the matched target-speaker vectors. Transforms may be applied to further adapt the converted HMM to the voice of the target speaker. The converted HMM may be used to synthesize speech with voice characteristics of the target speaker.
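The matching and replacement steps summarized in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the patent: each source HMM state is reduced to a single mean vector, the speaker-compensating transform is a simple global mean offset, and closeness is measured by Euclidean distance. The function names `match_states` and `build_converted_means` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def match_states(state_means: Sequence[Vector],
                 target_vectors: Sequence[Vector]) -> List[int]:
    """For each source state mean, return the index of the closest target vector.

    Assumption: speaker differences are compensated by shifting every source
    mean by the difference between the global target and source averages.
    """
    src_mu = mean_vector(state_means)
    tgt_mu = mean_vector(target_vectors)
    offset = [t - s for s, t in zip(src_mu, tgt_mu)]

    matches = []
    for state_mean in state_means:
        shifted = [m + o for m, o in zip(state_mean, offset)]
        # Nearest neighbour search over the target set (Euclidean distance).
        best = min(range(len(target_vectors)),
                   key=lambda i: math.dist(shifted, target_vectors[i]))
        matches.append(best)
    return matches


def build_converted_means(state_means: Sequence[Vector],
                          target_vectors: Sequence[Vector]) -> List[List[float]]:
    """Replace each source state mean with its matched target-speaker vector."""
    return [list(target_vectors[i])
            for i in match_states(state_means, target_vectors)]


source_state_means = [[0.0, 0.0], [2.0, 2.0]]
target_set = [[10.0, 10.0], [12.1, 11.9], [9.9, 10.2]]
converted = build_converted_means(source_state_means, target_set)
```

Because the replacement values are actual target-speaker vectors rather than parameters re-estimated from them, the converted model keeps the source model's state configuration while its spectral content comes directly from the target speaker, which is the sense in which the conversion is non-parametric.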
30 Claims
1. A method comprising:
training a source hidden Markov model (HMM) based speech features generator implemented by one or more processors of a system using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions;
extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors;
for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM;
determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker;
constructing a converted HMM based speech features generator implemented by one or more processors of the system to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set; and
speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A method comprising:
implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker;
providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker;
implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker;
receiving an enriched transcription of a run-time text string by an input device of the system;
using the converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features; and
generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.
12. A system comprising:
one or more processors;
memory; and
machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out functions including:
implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker;
providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker;
implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. An article of manufacture including a computer-readable storage medium having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker;
providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker;
implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29
30. An article of manufacture including a computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
training a source hidden Markov model (HMM) based speech features generator using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions;
extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors;
for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM;
determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker;
constructing a converted HMM based speech features generator to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set; and
speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.
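The F0 transform recited in the independent claims is not given a specific form in the claim language. One common choice, shown here purely as an illustrative assumption, is a linear transform in the log-F0 domain that matches the mean and standard deviation of the source speaker's voiced frames to those of the target speaker. The function names `fit_log_f0_transform` and `adapt_state_f0` are hypothetical, and each state's F0 model is assumed to be a single Gaussian over log-F0.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple


def fit_log_f0_transform(source_f0_hz: Sequence[float],
                         target_f0_hz: Sequence[float]) -> Tuple[float, float]:
    """Fit (scale, shift) so that scale * log_f0 + shift maps source statistics
    onto target statistics. Frames with F0 <= 0 are treated as unvoiced and
    ignored (an assumption about how the F0 tracks are encoded)."""
    src = [math.log(f) for f in source_f0_hz if f > 0]
    tgt = [math.log(f) for f in target_f0_hz if f > 0]
    src_std = pstdev(src)
    if src_std == 0:
        raise ValueError("source log-F0 has zero variance; cannot fit a scale")
    scale = pstdev(tgt) / src_std
    shift = mean(tgt) - scale * mean(src)
    return scale, shift


def adapt_state_f0(log_f0_mean: float, log_f0_var: float,
                   scale: float, shift: float) -> Tuple[float, float]:
    """Apply the linear transform to one state's Gaussian log-F0 statistics.

    A linear map y = scale * x + shift moves the mean the same way and
    multiplies the variance by scale squared.
    """
    return scale * log_f0_mean + shift, (scale ** 2) * log_f0_var


# Toy F0 tracks in Hz; 0.0 marks an unvoiced frame.
source_track = [100.0, 0.0, 200.0]
target_track = [200.0, 400.0, 0.0]
scale, shift = fit_log_f0_transform(source_track, target_track)
new_mean, new_var = adapt_state_f0(math.log(150.0), 0.04, scale, shift)
```

In this toy example the target speaks one octave above the source with the same spread, so the fitted transform leaves the variance essentially unchanged and doubles a state's mean F0 when converted back to Hz.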
Specification