Voice conversion method and system

US 8,234,110 B2
Filed: 09/29/2008
Issued: 07/31/2012
Est. Priority Date: 09/29/2007
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A method, system and computer program product for voice conversion. The method includes performing speech analysis on the speech of a source speaker to achieve speech information; performing spectral conversion based on said speech information, to at least achieve a first spectrum similar to the speech of a target speaker; performing unit selection on the speech of said target speaker at least using said first spectrum as a target; replacing at least part of said first spectrum with the spectrum of the selected target speaker's speech unit; and performing speech reconstruction at least based on the replaced spectrum.
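
The unit selection step named in the abstract can be illustrated with a minimal sketch. It assumes each speech unit is a single fixed-length spectral envelope and that selection uses only a Euclidean target cost; the patent text quoted here does not specify the cost function, and practical unit selection usually also weighs a concatenation cost. The names `spectral_distance` and `select_units` are hypothetical.

```python
import math
from typing import List, Sequence

Spectrum = Sequence[float]


def spectral_distance(a: Spectrum, b: Spectrum) -> float:
    """Euclidean distance between two equal-length spectral envelopes."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bins")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_units(converted: Sequence[Spectrum],
                 corpus: Sequence[Spectrum]) -> List[int]:
    """For each converted frame, return the index of the closest corpus unit.

    Assumption: target cost only (no concatenation cost between neighbours).
    """
    if not corpus:
        raise ValueError("corpus must contain at least one unit")
    selected = []
    for frame in converted:
        best = min(range(len(corpus)),
                   key=lambda i: spectral_distance(frame, corpus[i]))
        selected.append(best)
    return selected
```

Here the converted spectrum acts as the search target, and the returned indices identify the target-speaker units whose spectra would supply the replacement material.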

11 Citations

31 Claims

1. A voice conversion method comprising:
- performing speech analysis on speech of a source speaker to attain speech information comprising a first spectrum;
  
  converting the first spectrum to a second spectrum, wherein converting the first spectrum to the second spectrum comprises compensating for at least one spectral difference between the speech of the source speaker and speech of a target speaker;
  
  in response to converting the first spectrum to the second spectrum, generating a third spectrum, wherein generating the third spectrum comprises selecting, based on at least the second spectrum, at least one speech unit from a corpus comprising a plurality of speech units of the target speaker;
  
  generating a replaced spectrum by replacing at least part of the second spectrum with at least part of the third spectrum; and
  
  performing speech reconstruction based at least on the replaced spectrum.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method according to claim 1, wherein converting the first spectrum to the second spectrum comprises frequency warping.
  - 3. The method according to claim 1, wherein the speech information further comprises a first pitch contour, wherein the method further comprises:
    - converting the first pitch contour to a second pitch contour, wherein converting the first pitch contour comprises compensating for at least one pitch difference between the speech of the source speaker and the speech of the target speaker;
      
      wherein selecting the at least one speech unit from the corpus is based on at least the second spectrum and the second pitch contour; and
      
      wherein performing the speech reconstruction is based at least on the replaced spectrum and the second pitch contour.
  - 4. The method according to claim 1, wherein generating the replaced spectrum comprises:
    - replacing a part of said second spectrum higher than a specific frequency with the at least part of the third spectrum; and
      
      keeping a part of said second spectrum lower than said specific frequency unchanged.
  - 5. The method according to claim 4, wherein said specific frequency is between 500 Hz and 2000 Hz.
  - 6. The method according to claim 1, further comprising:
    - smoothing the replaced spectrum before performing the speech reconstruction.
  - 7. The method according to claim 1, wherein said speech information comprises pitch contour information.
  - 8. The method of claim 1, wherein generating the replaced spectrum involves replacing only part of the second spectrum with the at least part of the third spectrum.
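
The partial replacement described in claims 4 and 5 can be sketched as follows. This is an illustration only: it assumes spectra are lists of magnitudes on a uniform frequency grid, that bin `i` sits at `i * bin_hz`, and it uses a default cutoff of 1000 Hz, which is just one value inside the claimed 500 Hz to 2000 Hz range. The function name `replace_above_cutoff` is hypothetical.

```python
from typing import List, Sequence


def replace_above_cutoff(second: Sequence[float],
                         third: Sequence[float],
                         bin_hz: float,
                         cutoff_hz: float = 1000.0) -> List[float]:
    """Build a replaced spectrum from the converted and selected spectra.

    Bins whose centre frequency is above cutoff_hz are taken from `third`
    (the selected target-speaker unit); bins at or below it are kept from
    `second` (the converted spectrum) unchanged.
    """
    if len(second) != len(third):
        raise ValueError("spectra must have the same number of bins")
    if bin_hz <= 0:
        raise ValueError("bin_hz must be positive")
    return [t if i * bin_hz > cutoff_hz else s
            for i, (s, t) in enumerate(zip(second, third))]
```

With five bins spaced 500 Hz apart and the default cutoff, only the bins at 1500 Hz and 2000 Hz would come from the third spectrum.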

9. A voice conversion system comprising:
- speech analysis means for performing speech analysis on speech of a source speaker to attain speech information comprising a first spectrum;
  
  spectral conversion means for converting the first spectrum to a second spectrum, wherein converting the first spectrum to the second spectrum comprises compensating for at least one spectral difference between the speech of the source speaker and speech of a target speaker;
  
  unit selection means for, in response to the converting of the first spectrum to the second spectrum, generating a third spectrum, wherein generating the third spectrum comprises selecting, based on at least the second spectrum, at least one speech unit from a corpus comprising a plurality of speech units of the target speaker;
  
  spectrum replacement means for generating a replaced spectrum by replacing at least part of said second spectrum with at least part of the third spectrum; and
  
  speech reconstruction means for performing speech reconstruction based at least on the replaced spectrum.
- Dependent claims (10, 11, 12, 13, 14, 15, 16)
  - 10. The system according to claim 9, wherein said spectral conversion means converts the first spectrum to the second spectrum using at least frequency warping.
  - 11. The system according to claim 9, wherein the speech information further comprises a first pitch contour, the system further comprising:
    - prosodic conversion means for converting the first pitch contour to a second pitch contour, wherein converting the first pitch contour comprises compensating for at least one pitch difference between the speech of the source speaker and the speech of the target speaker;
      
      wherein said unit selection means selects the at least one speech unit from the corpus based at least on the second spectrum and the second pitch contour; and
      
      wherein said speech reconstruction means performs speech reconstruction based at least on the replaced spectrum and the second pitch contour.
  - 12. The system according to claim 9, wherein said spectrum replacement means:
    - replaces a part of said second spectrum higher than a specific frequency with the at least part of the third spectrum; and
      
      keeps a part of said second spectrum lower than said specific frequency unchanged.
  - 13. The system according to claim 12, wherein said specific frequency is between 500 Hz and 2000 Hz.
  - 14. The system according to claim 9, further comprising:
    - spectrum smoothing means for smoothing the replaced spectrum to generate a smoothed replaced spectrum; and
      
      wherein said speech reconstruction means performs speech reconstruction based on the smoothed replaced spectrum.
  - 15. The system according to claim 9, wherein said speech information comprises pitch contour information.
  - 16. The system according to claim 9, wherein the spectrum replacement means replaces only part of the second spectrum with the at least part of the third spectrum.
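
Claims 2 and 10 name frequency warping as one way to convert the first spectrum to the second. The claims do not say which warping function is used, so the sketch below assumes the simplest case: a single linear factor `alpha` applied to a uniformly sampled envelope, with linear interpolation between bins. The function name `warp_spectrum` and the parameter `alpha` are hypothetical.

```python
from typing import List, Sequence


def warp_spectrum(spectrum: Sequence[float], alpha: float) -> List[float]:
    """Linearly warp the frequency axis of a spectral envelope.

    Output bin i takes the input value at position i / alpha, using linear
    interpolation and clamping to the last bin. With alpha > 1, spectral
    features such as formant peaks move to higher frequencies.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(spectrum)
    warped = []
    for i in range(n):
        pos = min(i / alpha, n - 1)      # source position on the input axis
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped
```

A single global factor is a strong simplification; a piecewise-linear map with several breakpoints would fit the same interface.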

17. A computer readable storage device comprising computer readable instructions which, when executed by at least one processor, cause performance of a voice conversion method comprising:
- performing speech analysis on speech of a source speaker to attain speech information comprising a first spectrum;
  
  converting the first spectrum to a second spectrum, wherein converting the first spectrum to the second spectrum comprises compensating for at least one spectral difference between the speech of the source speaker and speech of a target speaker;
  
  in response to converting the first spectrum to the second spectrum, generating a third spectrum, wherein generating the third spectrum comprises selecting, based on at least the second spectrum, at least one speech unit from a corpus comprising a plurality of speech units of the target speaker;
  
  generating a replaced spectrum by replacing at least part of the second spectrum with at least part of the third spectrum; and
  
  performing speech reconstruction based at least on the replaced spectrum.
- Dependent claims (18, 19, 20, 21, 22, 23)
  - 18. The computer readable storage device of claim 17, wherein converting the first spectrum to the second spectrum comprises frequency warping.
  - 19. The computer readable storage device of claim 17, wherein the speech information further comprises a first pitch contour, wherein the method further comprises:
    - converting the first pitch contour to a second pitch contour, wherein converting the first pitch contour comprises compensating for at least one pitch difference between the speech of the source speaker and the speech of the target speaker;
      
      wherein selecting the at least one speech unit from the corpus is based on at least the second spectrum and the second pitch contour; and
      
      wherein performing the speech reconstruction is based at least on the replaced spectrum and the second pitch contour.
  - 20. The computer readable storage device of claim 17, wherein generating the replaced spectrum comprises:
    - replacing a part of said second spectrum higher than a specific frequency with the at least part of the third spectrum; and
      
      keeping a part of said second spectrum lower than said specific frequency unchanged.
  - 21. The computer readable storage device of claim 20, wherein said specific frequency is between 500 Hz and 2000 Hz.
  - 22. The computer readable storage device of claim 17, wherein the method further comprises:
    - smoothing the replaced spectrum before performing the speech reconstruction.
  - 23. The computer readable storage device of claim 17, wherein said speech information comprises pitch contour information.
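
Claims 3, 11 and 19 convert a first pitch contour to a second one by compensating for a pitch difference between the speakers. The claims leave the mapping open. The sketch below assumes one common choice, matching the mean and standard deviation of log-F0, and assumes unvoiced frames are marked with 0. The function name `convert_pitch` and all of its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def convert_pitch(f0_hz: Sequence[float],
                  src_mean: float, src_std: float,
                  tgt_mean: float, tgt_std: float) -> List[float]:
    """Map a source pitch contour towards the target speaker's pitch range.

    The four statistics are the mean and standard deviation of log-F0 for
    the source and target speakers. Frames with F0 <= 0 are treated as
    unvoiced and returned as 0.0.
    """
    if src_std <= 0:
        raise ValueError("src_std must be positive")
    converted = []
    for f0 in f0_hz:
        if f0 <= 0:
            converted.append(0.0)        # unvoiced frame: nothing to convert
        else:
            z = (math.log(f0) - src_mean) / src_std
            converted.append(math.exp(tgt_mean + z * tgt_std))
    return converted
```

The converted contour could then serve both as an extra selection target and as the pitch used during reconstruction, as those claims describe.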

24. A voice conversion system comprising:
- a speech analyzer configured to perform speech analysis on speech of a source speaker to attain speech information comprising a first spectrum;
  
  a spectral converter configured to convert the first spectrum to a second spectrum, wherein converting the first spectrum to the second spectrum comprises compensating for at least one spectral difference between the speech of the source speaker and speech of a target speaker;
  
  a unit selector configured to, in response to conversion of the first spectrum to the second spectrum, generate a third spectrum, wherein generating the third spectrum comprises selecting, based on at least the second spectrum, at least one speech unit from a corpus comprising a plurality of speech units of the target speaker;
  
  a spectrum generator configured to generate a replaced spectrum by replacing at least part of said second spectrum with at least part of the third spectrum; and
  
  a speech reconstructor configured to perform speech reconstruction based at least on the replaced spectrum.
- Dependent claims (25, 26, 27, 28, 29, 30, 31)
  - 25. The system according to claim 24, wherein said spectral converter is configured to convert the first spectrum to the second spectrum using at least frequency warping.
  - 26. The system according to claim 24, wherein the speech information further comprises a first pitch contour, the system further comprising:
    - a prosodic converter configured to convert the first pitch contour to a second pitch contour, wherein converting the first pitch contour comprises compensating for at least one pitch difference between the speech of the source speaker and the speech of the target speaker;
      
      wherein said unit selector selects the at least one speech unit from the corpus based at least on the second spectrum and the second pitch contour; and
      
      wherein said speech reconstructor performs speech reconstruction based at least on the replaced spectrum and the second pitch contour.
  - 27. The system according to claim 24, wherein said spectrum generator is configured to:
    - replace a part of said second spectrum higher than a specific frequency with the at least part of the third spectrum; and
      
      keep a part of said second spectrum lower than said specific frequency unchanged.
  - 28. The system according to claim 27, wherein said specific frequency is between 500 Hz and 2000 Hz.
  - 29. The system according to claim 24, further comprising:
    - a spectrum smoother configured to smooth the replaced spectrum to create a smoothed replaced spectrum; and
      
      wherein said speech reconstructor performs speech reconstruction based on the smoothed replaced spectrum.
  - 30. The system according to claim 24, wherein said speech information comprises pitch contour information.
  - 31. The voice conversion system of claim 24, wherein the spectrum generator is configured to replace only part of the second spectrum with the at least part of the third spectrum.
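
Claims 6, 14, 22 and 29 smooth the replaced spectrum before reconstruction, presumably to soften the discontinuity where converted and selected bands meet. The claims do not name a smoothing method; the sketch below assumes a centred moving average across frequency bins. The function name `smooth_spectrum` and the `radius` parameter are hypothetical.

```python
from typing import List, Sequence


def smooth_spectrum(spectrum: Sequence[float], radius: int = 1) -> List[float]:
    """Smooth a spectrum with a centred moving average over frequency bins.

    Each output bin is the mean of the input bins within `radius` of it;
    the window is truncated at both edges of the spectrum.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    n = len(spectrum)
    smoothed = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = spectrum[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A radius of 0 leaves the spectrum unchanged, which makes it easy to compare reconstruction with and without smoothing.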

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Meng, Fan Ping; Qin, Yong; Shi, Qin; Shuang, Zhi Wei
Primary Examiner(s): Azad, Abul
Application Number: US12/240,148
Publication Number: US 20090089063A1
Time in Patent Office: 1,401 Days
Field of Search: 704/205-209
US Class Current: 704/209
CPC Class Codes:
G10L 2021/0135 Voice conversion or morphing
G10L 21/00 Speech or voice signal proc...
