Phonetic posteriorgrams for many-to-one voice conversion
First Claim
1. A computer-implemented method comprising:
obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.
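The claimed steps amount to a two-model pipeline: a shared, speaker-independent first model turns acoustic features into a PPG, and a second model trained on the target speaker maps PPG frames to MCEP frames. The sketch below illustrates that flow with numpy using illustrative stand-ins (a random projection plus softmax for the first model, a least-squares linear map for the second); the dimensions, function names, and models are assumptions for illustration, not the patent's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 39-dim acoustic features (e.g. MFCCs),
# 64 phonetic classes, 25-dim MCEP vectors.
N_FEAT, N_CLASS, N_MCEP = 39, 64, 25

def ppg_from_features(feats, proj):
    """Stand-in for the first model (an SI-ASR acoustic model):
    map each frame to a posterior over phonetic classes."""
    logits = feats @ proj
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # each row sums to 1

def train_mapping(target_ppg, target_mcep):
    """Stand-in for the second model: a least-squares linear map
    from PPG frames to the target speaker's MCEP frames."""
    W, *_ = np.linalg.lstsq(target_ppg, target_mcep, rcond=None)
    return W

proj = rng.standard_normal((N_FEAT, N_CLASS))       # shared first model
target_feats = rng.standard_normal((200, N_FEAT))   # target speech frames
target_mcep = rng.standard_normal((200, N_MCEP))    # extracted MCEP features

target_ppg = ppg_from_features(target_feats, proj)  # target PPG
W = train_mapping(target_ppg, target_mcep)          # PPG -> MCEP mapping

source_feats = rng.standard_normal((50, N_FEAT))    # source speech frames
source_ppg = ppg_from_features(source_feats, proj)  # same first model
converted_mcep = source_ppg @ W                     # frames for synthesis
```

Because the first model is shared and speaker-independent, the source speaker's identity is discarded at the PPG stage, and the learned mapping re-renders the content in the target speaker's MCEP space.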
Abstract
A method for converting speech using phonetic posteriorgrams (PPGs). A target speech is obtained and a PPG is generated based on acoustic features of the target speech. Generating the PPG may include using a speaker-independent automatic speech recognition (SI-ASR) system for equalizing different speakers. The PPG includes a set of values corresponding to a range of times and a range of phonetic classes, the phonetic classes corresponding to senones. A mapping between the PPG and one or more segments of the target speech is generated. A source speech is obtained, and the source speech is converted into a converted speech based on the PPG and the mapping.
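As the abstract states, a PPG is a set of values over a range of times and a range of phonetic classes: concretely, a matrix with one row per time frame and one column per class (senone), where each row is a posterior probability distribution. A minimal illustration, with made-up numbers and only three classes for brevity:

```python
import numpy as np

# Rows = time frames, columns = phonetic classes (senones).
# Each row is the posterior distribution for that frame.
ppg = np.array([
    [0.7, 0.2, 0.1],   # frame 0: class 0 most likely
    [0.1, 0.8, 0.1],   # frame 1: class 1 most likely
    [0.2, 0.2, 0.6],   # frame 2: class 2 most likely
])

most_likely = ppg.argmax(axis=1)  # per-frame phonetic class
```

Because an SI-ASR system produces these posteriors regardless of who is speaking, PPGs from different speakers saying the same content are comparable, which is what "equalizing different speakers" refers to.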
18 Citations
20 Claims
1. A computer-implemented method comprising:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations including:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 10, 11, 12, 13, 14, 15
16. A system comprising:

a processor; and
a computer-readable medium in data communication with the processor, the computer-readable medium comprising instructions that, when executed by the processor, cause the processor to perform operations including:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 17, 18, 19, 20
Specification