Frame mapping approach for cross-lingual voice transformation

US 8,594,993 B2
Filed: 04/04/2011
Issued: 11/26/2013
Est. Priority Date: 04/04/2011
Status: Active Grant


2 Assignments

0 Petitions

Abstract

Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable and has the voice characteristics of the target speaker who provided the target speech corpus. Formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums. The transformed fundamental frequencies and the transformed LPC spectrums are then used to generate warped parameter trajectories. The warped parameter trajectories are further used to transform target speech waveforms in a second language to produce transformed target speech waveforms with voice characteristics of the first language that nevertheless retain at least some voice characteristics of the target speaker.

51 Citations

20 Claims

1. A computer-readable memory storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
- performing formant-based frequency warping on fundamental frequencies and linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums;
  
  generating warped parameter trajectories based at least on the transformed fundamental frequencies and the transformed LPC spectrums; and
  
  producing transformed target speech waveforms with voice characteristics of the first language that retain at least some voice characteristics of a target speaker using the warped parameter trajectories and features from target speech waveforms of the target speaker in a second language.
  - 2. The computer-readable memory of claim 1, further comprising instructions that, when executed, cause the one or more processors to perform an act of generating synthesized speech for an input text using the transformed target speech waveforms.
  - 3. The computer-readable memory of claim 2, further comprising instructions that, when executed, cause the one or more processors to perform an act of estimating the LPC spectrums of the source speech waveforms using a Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum (STRAIGHT) speech analysis.
  - 4. The computer-readable memory of claim 1, further comprising instructions that, when executed, cause the one or more processors to perform an act of extracting the fundamental frequencies of the source speech waveforms using pitch extraction.
  - 5. The computer-readable memory of claim 1, further comprising instructions that, when executed, cause the one or more processors to perform an act of obtaining linear spectrum pairs (LSPs) from the transformed LPC spectrums, wherein the generating further includes generating the warped parameter trajectories based at least on the transformed LPC spectrums and the LSPs that encapsulate the transformed LPC spectrums.
  - 6. The computer-readable memory of claim 1, further comprising instructions that, when executed, cause the one or more processors to perform an act of extracting the features that include fundamental frequencies, LSPs, and gains from the target speech waveforms.
  - 7. The computer-readable memory of claim 1, wherein the performing includes performing the formant-based frequency warping by:
    - aligning vowel segments embedded in a pair of speech utterances from a source speaker and a target speaker;
      
      selecting stationary portions of a predefined length from the aligned vowel segments; and
      
      defining a piece-wise linear interpolation function to warp the LPC spectrums based at least on a plurality of mapped formant pairs in the stationary portions, each mapped formant pair including a frequency anchor point for the source speaker and a frequency anchor point for the target speaker.
  - 8. The computer-readable memory of claim 1, wherein each frame of the transformed target speech waveforms is represented by a corresponding fundamental frequency, a corresponding LSP, and a corresponding gain, and wherein the producing the transformed target speech waveforms further includes:
    - selecting candidate frames of the target speech waveforms for a warped parameter trajectory based at least on distances between target frames in the warped parameter trajectory and the candidate frames; and
      
      concatenating the selected candidate frames to form a target speech waveform.
  - 9. The computer-readable memory of claim 1, wherein the source speech waveforms are stored in a source speaker speech corpus, further comprising instructions that, when executed, cause the one or more processors to perform an act of storing the transformed target speech waveforms in a transformed target speaker speech corpus.
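
The piece-wise linear interpolation function recited in claim 7 can be illustrated with a short sketch. This is only one plausible reading of the claim language, not the patented implementation: the function name `make_warp`, the `(source_hz, target_hz)` pair format, the example formant values, and the choice to pin 0 Hz and the Nyquist frequency to themselves are all assumptions made for illustration.

```python
from bisect import bisect_right


def make_warp(anchor_pairs, nyquist_hz):
    """Build a piece-wise linear frequency-warping function.

    anchor_pairs: iterable of (source_hz, target_hz) mapped formant pairs
                  (hypothetical input format).
    nyquist_hz:   upper edge of the spectrum, assumed to map to itself.
    """
    # Assumption: pin both spectrum edges so the warp covers the full band.
    points = [(0.0, 0.0)] + sorted(anchor_pairs) + [(nyquist_hz, nyquist_hz)]
    src = [p[0] for p in points]
    tgt = [p[1] for p in points]
    if any(b <= a for a, b in zip(src, src[1:])):
        raise ValueError("source anchor frequencies must be strictly increasing")

    def warp(freq_hz):
        # Clamp the input to the band, then find the segment that contains it.
        f = min(max(freq_hz, 0.0), nyquist_hz)
        i = max(1, min(bisect_right(src, f), len(src) - 1))
        x0, x1 = src[i - 1], src[i]
        y0, y1 = tgt[i - 1], tgt[i]
        # Linear interpolation between the two surrounding anchor points.
        return y0 + (y1 - y0) * (f - x0) / (x1 - x0)

    return warp


# Made-up formant pairs: source-speaker F1..F3 mapped to target-speaker F1..F3.
warp = make_warp([(500.0, 600.0), (1500.0, 1700.0), (2500.0, 2600.0)], 8000.0)
midpoint = warp(1000.0)  # halfway between the first two source anchors
```

Each source anchor maps exactly onto its paired target anchor, and frequencies between anchors are interpolated linearly, so the warped spectrum stays monotonic as long as the target anchors are also increasing.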

10. A computer-implemented method, comprising:
- under control of one or more computing systems configured with executable instructions, performing formant-based frequency warping on fundamental frequencies and coding spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed coding spectrums;
  
  generating warped parameter trajectories based at least on the transformed fundamental frequencies and the transformed coding spectrums; and
  
  producing transformed target speech waveforms with voice characteristics of the first language that retain at least some voice characteristics of a target speaker using the warped parameter trajectories and features from target speech waveforms of the target speaker in the second language;
  
  training models based at least on the transformed speech target waveforms; and
  
  generating synthesized speech for an input text using the trained models.
  - 11. The computer-implemented method of claim 10, further comprising receiving input text from a text-to-speech application or a language translation application.
  - 12. The computer-implemented method of claim 10, further comprising:
    - estimating the coding spectrums of the source speech waveforms using a Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum (STRAIGHT) speech analysis;
      
      extracting the fundamental frequencies of the source speech waveforms using pitch extraction; and
      
      obtaining linear spectrum pairs (LSPs) from the transformed coding spectrums, wherein the generating further includes generating the warped parameter trajectories based at least on the transformed coding spectrums and the LSPs.
  - 13. The computer-implemented method of claim 10, wherein the performing includes performing the formant-based frequency warping by:
    - aligning vowel segments embedded in a pair of speech utterances from a source speaker and a target speaker;
      
      selecting stationary portions of a predefined length from the aligned vowel segments; and
      
      defining a piece-wise linear interpolation function to warp the coding spectrums based at least on a plurality of mapped formant pairs in the stationary portions, each mapped formant pair including a frequency anchor point for the source speaker and a frequency anchor point for the target speaker.
  - 14. The computer-implemented method of claim 10, further comprising extracting the features that include fundamental frequencies, LSPs, and gains from the target speech waveforms.
  - 15. The computer-implemented method of claim 14, wherein each frame of the transformed target speech waveforms is represented by a corresponding fundamental frequency, a corresponding LSP, and a corresponding gain, and wherein the producing the transformed target speech waveforms further includes:
    - selecting candidate frames of the target speech waveforms for a warped parameter trajectory based at least on distances between target frames in the warped parameter trajectory and the candidate frames; and
      
      concatenating the selected candidate frames to form a target speech waveform.
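
The frame selection and concatenation steps of claim 15 can be sketched as a nearest-neighbor search. The claims only say that candidates are chosen "based at least on distances", so the weighted squared-Euclidean distance below, the `Frame` record, and the function names are assumptions for illustration; a practical system would likely normalize the features and may add other costs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Frame:
    """One analysis frame (hypothetical record layout)."""
    f0: float                       # fundamental frequency
    lsp: Sequence[float]            # LSP coefficients
    gain: float
    samples: Sequence[float] = ()   # waveform samples, empty for trajectory frames


def frame_distance(a, b, w_f0=1.0, w_lsp=1.0, w_gain=1.0):
    # Assumed metric: weighted sum of squared differences over F0, LSPs, and gain.
    lsp_term = sum((x - y) ** 2 for x, y in zip(a.lsp, b.lsp))
    return (w_f0 * (a.f0 - b.f0) ** 2
            + w_lsp * lsp_term
            + w_gain * (a.gain - b.gain) ** 2)


def tile_trajectory(trajectory, candidates):
    """Pick the closest candidate for each trajectory frame, then concatenate."""
    selected = [min(candidates, key=lambda c: frame_distance(t, c))
                for t in trajectory]
    waveform = [s for frame in selected for s in frame.samples]
    return selected, waveform


# Made-up data: two candidate frames from the target speaker's recordings.
candidates = [
    Frame(100.0, (0.1, 0.2), 1.0, (0.0, 0.1)),
    Frame(200.0, (0.3, 0.4), 1.0, (0.5, 0.6)),
]
# Two frames of a warped parameter trajectory (no samples of their own).
trajectory = [Frame(190.0, (0.3, 0.4), 1.0), Frame(110.0, (0.1, 0.2), 1.0)]
selected, waveform = tile_trajectory(trajectory, candidates)
```

With unnormalized features, F0 in Hz dominates the distance; the per-feature weights are the hook for rebalancing it.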

16. A system, comprising:
- one or more processors; and
  
  a memory that includes a plurality of computer-executable components, the plurality of computer-executable components comprising:
  
  a frequency warping component to perform formant-based frequency warping on fundamental frequencies and coding spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed coding spectrums;
  
  a trajectory generation component to generate warped parameter trajectories based at least on the transformed fundamental frequencies and the transformed coding spectrums; and
  
  a trajectory tiling component to produce transformed target speech waveforms with voice characteristics of the first language that retain at least some voice characteristics of a target speaker using the warped parameter trajectories and features from target speech waveforms of the target speaker in the second language.
  - 17. The system of claim 16, further comprising:
    - a Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum (STRAIGHT) analysis component to estimate the coding spectrums of the source speech waveforms;
      
      a pitch extraction component to extract fundamental frequencies of the source speech waveforms using pitch extraction; and
      
      a feature extraction component to extract the features that include fundamental frequencies, LSPs, and gains from the target speech waveforms.
  - 18. The system of claim 16, further comprising a speech synthesis component to generate synthesized speech for an input text using hidden Markov models (HMMs) trained with the transformed target speech waveforms.
  - 19. The system of claim 16, further comprising an LPC analysis component to obtain linear spectrum pairs (LSPs) from the transformed LPC spectrums, wherein the frequency warping component is to perform the formant-based frequency warping by:
    - aligning vowel segments embedded in a pair of speech utterances from a source speaker and a target speaker;
      
      selecting stationary portions of a predefined length from the aligned vowel segments; and
      
      defining a piece-wise linear interpolation function to warp the LPC spectrums based at least on a plurality of mapped formant pairs in the stationary portions, each mapped formant pair including a frequency anchor point for the source speaker and a frequency anchor point for the target speaker.
  - 20. The system of claim 16, wherein each frame of the transformed target speech waveforms is represented by a corresponding fundamental frequency, a corresponding LSP, and a corresponding gain, and wherein the trajectory tiling component is to produce the transformed target speech waveforms by:
    - selecting candidate frames of the target speech waveforms for a warped parameter trajectory based at least on distances between target frames in the warped parameter trajectory and the candidate frames; and
      
      concatenating the selected candidate frames to form a target speech waveform.
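
Claim 17 recites a pitch extraction component without naming an algorithm. As a rough illustration only, the sketch below uses a plain autocorrelation peak search, which is a common textbook method and not necessarily what the patent uses. The function name `estimate_f0`, the 50 to 400 Hz search range, and the synthetic test tone are assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Returns 0.0 when no positive autocorrelation peak is found, which is
    used here as a stand-in for an unvoiced decision.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


# Synthetic 100 Hz tone, 50 ms long, sampled at 8 kHz (made-up test signal).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(tone, sample_rate)
```

Real speech would need voicing detection and protection against octave errors, which this sketch leaves out.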

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Qian, Yao; Soong, Frank Kao-Ping
Primary Examiner(s): Godbold, Douglas
Application Number: US13/079,760
Publication Number: US 20120253781A1
Time in Patent Office: 967 Days
Field of Search: 704/2, 704/9, 704/258, 704/269
US Class Current: 704/2
CPC Class Codes:
G10L 2021/0135 Voice conversion or morphing
G10L 21/003 Changing voice quality, e.g...