Phonetic posteriorgrams for many-to-one voice conversion
First Claim
1. A computer-implemented method comprising:
obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.
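The claimed steps amount to a two-model pipeline: a shared, speaker-independent first model turns acoustic features into a PPG, and a second model trained on the target speaker maps PPG frames to MCEP frames. The sketch below illustrates that flow with numpy using illustrative stand-ins (a random projection plus softmax for the first model, a least-squares linear map for the second); the dimensions, function names, and models are assumptions for illustration, not the patent's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 39-dim acoustic features (e.g. MFCCs),
# 64 phonetic classes, 25-dim MCEP vectors.
N_FEAT, N_CLASS, N_MCEP = 39, 64, 25

def ppg_from_features(feats, proj):
    """Stand-in for the first model (an SI-ASR acoustic model):
    map each frame to a posterior over phonetic classes."""
    logits = feats @ proj
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # each row sums to 1

def train_mapping(target_ppg, target_mcep):
    """Stand-in for the second model: a least-squares linear map
    from PPG frames to the target speaker's MCEP frames."""
    W, *_ = np.linalg.lstsq(target_ppg, target_mcep, rcond=None)
    return W

proj = rng.standard_normal((N_FEAT, N_CLASS))       # shared first model
target_feats = rng.standard_normal((200, N_FEAT))   # target speech frames
target_mcep = rng.standard_normal((200, N_MCEP))    # extracted MCEP features

target_ppg = ppg_from_features(target_feats, proj)  # target PPG
W = train_mapping(target_ppg, target_mcep)          # PPG -> MCEP mapping

source_feats = rng.standard_normal((50, N_FEAT))    # source speech frames
source_ppg = ppg_from_features(source_feats, proj)  # same first model
converted_mcep = source_ppg @ W                     # frames for synthesis
```

Because the first model is shared and speaker-independent, the source speaker's identity is discarded at the PPG stage, and the learned mapping re-renders the content in the target speaker's MCEP space.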
Abstract
A method for converting speech using phonetic posteriorgrams (PPGs). A target speech is obtained and a PPG is generated based on acoustic features of the target speech. Generating the PPG may include using a speaker-independent automatic speech recognition (SI-ASR) system for equalizing different speakers. The PPG includes a set of values corresponding to a range of times and a range of phonetic classes, the phonetic classes corresponding to senones. A mapping between the PPG and one or more segments of the target speech is generated. A source speech is obtained, and the source speech is converted into a converted speech based on the PPG and the mapping.
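As the abstract states, a PPG is a set of values over a range of times and a range of phonetic classes: concretely, a matrix with one row per time frame and one column per class (senone), where each row is a posterior probability distribution. A minimal illustration, with made-up numbers and only three classes for brevity:

```python
import numpy as np

# Rows = time frames, columns = phonetic classes (senones).
# Each row is the posterior distribution for that frame.
ppg = np.array([
    [0.7, 0.2, 0.1],   # frame 0: class 0 most likely
    [0.1, 0.8, 0.1],   # frame 1: class 1 most likely
    [0.2, 0.2, 0.6],   # frame 2: class 2 most likely
])

most_likely = ppg.argmax(axis=1)  # per-frame phonetic class
```

Because an SI-ASR system produces these posteriors regardless of who is speaking, PPGs from different speakers saying the same content are comparable, which is what "equalizing different speakers" refers to.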
18 Citations
20 Claims
1. A computer-implemented method comprising:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to perform operations including:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 10, 11, 12, 13, 14, 15
16. A system comprising:

a processor; and
a computer-readable medium in data communication with the processor, the computer-readable medium comprising instructions that, when executed by the processor, cause the processor to perform operations including:

obtaining a target speech;
obtaining a source speech;
generating a target phonetic posteriorgram (PPG) of the target speech by driving a first model with acoustic features of the target speech, the target PPG including a set of values corresponding to a range of times and a range of phonetic classes;
extracting target mel-cepstral coefficients (MCEP) features from the target speech;
training a second model using the target MCEP features and the target PPG to obtain a mapping between the target MCEP features and the target PPG;
generating a source PPG of the source speech by driving the first model with acoustic features of the source speech; and
converting the source speech into a converted speech using the source PPG and the trained second model.

Dependent claims: 17, 18, 19, 20
Specification