Voice conversion system and methodology
Abstract
A voice conversion system employs a codebook mapping approach to transform a source voice so that it sounds like a target voice. Each speech frame is represented by a weighted average of codebook entries. The weights are derived from the perceptual distance between the speech frame and each codebook entry and may be refined by gradient descent. The vocal tract characteristics (represented by a line spectral frequency vector), the excitation characteristics (represented by a linear predictive coding residual), the duration, and the amplitude of the speech frame are all transformed in the same weighted-average framework.
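The weighted-average idea in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation: the function names, the Euclidean distance (standing in for the perceptual distance), the exponential weighting, the `gamma` parameter, and the toy codebooks are all assumptions made for this example.

```python
import math

def codebook_weights(frame, source_codebook, gamma=1.0):
    """Map distances to source codebook entries onto weights that sum to 1.

    Assumption: Euclidean distance stands in for the perceptual distance,
    and weights fall off exponentially with distance (gamma sets how fast).
    """
    dists = [math.dist(frame, entry) for entry in source_codebook]
    d_min = min(dists)  # shift by the minimum so exp() cannot underflow to all zeros
    raw = [math.exp(-gamma * (d - d_min)) for d in dists]
    total = sum(raw)
    return [r / total for r in raw]

def convert_frame(frame, source_codebook, target_codebook, gamma=1.0):
    """Approximate the target frame as a weighted average of target entries."""
    weights = codebook_weights(frame, source_codebook, gamma)
    dim = len(target_codebook[0])
    return [
        sum(w * entry[k] for w, entry in zip(weights, target_codebook))
        for k in range(dim)
    ]

# Hypothetical two-entry codebooks of 2-dimensional feature vectors.
source_codebook = [[0.2, 0.5], [0.8, 1.4]]
target_codebook = [[0.3, 0.7], [1.0, 1.6]]
converted = convert_frame([0.2, 0.5], source_codebook, target_codebook, gamma=50.0)
```

Because the input frame coincides with the first source entry and `gamma` is large, nearly all of the weight falls on that entry, so `converted` lands very close to the first target entry. Smaller values of `gamma` blend more entries together.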
30 Claims
1. A method of transforming a source signal representing a source voice into a target signal representing a target voice, said method comprising the machine-implemented steps of:
preprocessing said source signal to produce a source signal segment;
comparing the source signal segment with a plurality of source codebook entries representing speech units in said source voice to produce therefrom a plurality of corresponding weights;
transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries representing speech units in said target voice, said target codebook entries corresponding to the plurality of source codebook entries; and
post processing the target signal segment to generate said target signal.
converting the source signal segment into a plurality of line spectral frequencies; and
comparing the plurality of line spectral frequencies with the plurality of the source code entries to produce therefrom the plurality of the respective weights, wherein each of the source code entries include a respective plurality of line spectral frequencies.
6. A method as in claim 5, wherein the step of converting the source signal segment includes the steps of:
determining a plurality of coefficients for the source signal segment; and
converting the plurality of coefficients into the plurality of line spectral frequencies.
7. A method as in claim 6, wherein the step of determining a plurality of coefficients includes the step of determining a plurality of linear prediction coefficients or PARCOR coefficients.
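As an illustration of the coefficient-determination step in claims 6 and 7, the sketch below derives linear prediction coefficients and reflection (PARCOR) coefficients from a frame's autocorrelation using the Levinson-Durbin recursion. The claims do not name a particular analysis method, so the choice of the autocorrelation method, the function names, and the sign convention for the reflection coefficients (conventions differ between texts) are assumptions. The subsequent conversion to line spectral frequencies is not shown.

```python
def autocorrelation(samples, max_lag):
    """Autocorrelation of a frame for lags 0..max_lag."""
    n = len(samples)
    return [
        sum(samples[i] * samples[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, k, err) where a[0] == 1 and a[1..order] define the
    prediction-error filter A(z) = 1 + sum_j a[j] z^-j, k holds the
    reflection (PARCOR) coefficients, and err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        k.append(ki)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + ki * a[i - j]
        updated[i] = ki
        a = updated
        err *= 1.0 - ki * ki
    return a, k, err

# Autocorrelation of an idealized first-order process with correlation 0.5.
lpc, parcor, residual_energy = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this idealized input a single predictor tap explains the signal, so the second reflection coefficient comes out as zero. In practice the frame would typically be windowed before calling `autocorrelation`.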
8. A method as in claim 5, wherein the step of comparing the plurality of line spectral frequencies includes the steps of:
computing a plurality of distances between the source signal segment, represented by the plurality of line spectral frequencies, and each of the plurality of the respective source code entries, represented by a respective plurality of line spectral frequencies; and
producing the plurality of the weights based on the plurality of respective distances.
9. A method as in claim 8, further including the step of refining the plurality of weights by a gradient descent method.
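A minimal sketch of what a gradient descent refinement of the weights might look like follows. The claim does not state the objective being minimized, so this example assumes it is the squared error between the source frame and its reconstruction as a weighted sum of source codebook entries. The clip-and-renormalize step that keeps the weights non-negative and summing to one is a simple heuristic chosen for the example, and all names and parameter values are hypothetical.

```python
def reconstruct(weights, codebook):
    """Weighted sum of codebook entries."""
    dim = len(codebook[0])
    return [sum(w * e[k] for w, e in zip(weights, codebook)) for k in range(dim)]

def squared_error(frame, weights, codebook):
    """Squared error between the frame and its weighted reconstruction."""
    approx = reconstruct(weights, codebook)
    return sum((f - a) ** 2 for f, a in zip(frame, approx))

def refine_weights(frame, codebook, weights, step=0.1, iterations=200):
    """Refine initial weights by gradient descent on the squared error."""
    w = list(weights)
    for _ in range(iterations):
        approx = reconstruct(w, codebook)
        residual = [a - f for a, f in zip(approx, frame)]
        # Gradient of the squared error with respect to each weight.
        grad = [2.0 * sum(r * e[k] for k, r in enumerate(residual)) for e in codebook]
        # Heuristic: clip negatives, then renormalize so the weights sum to 1.
        w = [max(0.0, wi - step * gi) for wi, gi in zip(w, grad)]
        total = sum(w)
        if total == 0.0:
            return list(weights)  # degenerate step; fall back to the initial weights
        w = [wi / total for wi in w]
    return w

codebook = [[0.0, 0.0], [1.0, 1.0]]
frame = [0.25, 0.25]
initial = [0.5, 0.5]
refined = refine_weights(frame, codebook, initial)
```

In this toy case the frame lies a quarter of the way from the first entry to the second, and the refined weights move from the even split toward that proportion.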
10. A method as in claim 1, wherein the step of transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries includes the step of transforming vocal tract characteristics of the source signal segment into the target signal segment based on the plurality of weights and a plurality of target codebook entries.
11. A method as in claim 10, wherein the step of transforming vocal tract characteristics includes the step of reducing formant bandwidths in the target signal segment.
12. A method as in claim 10, wherein the step of transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries includes the step of transforming excitation characteristics of the source signal segment into the target signal segment based on the plurality of weights.
13. A method as in claim 1, further including the step of modifying the prosody of the target signal segment based on the plurality of weights.
14. A method as in claim 13, wherein the step of modifying the prosody of the target signal segment based on the plurality of weights includes the step of modifying the duration of the target signal segment.
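One way to picture a weight-based duration modification is to blend per-entry duration ratios with the same weights used for the spectral mapping. The sketch below assumes that each codebook entry carries an average duration for the source and target speakers; that data layout, the function name, and the numbers are hypothetical and are not taken from the claims.

```python
def duration_scale(weights, source_durations, target_durations):
    """Weighted average of per-entry target/source duration ratios."""
    ratios = [t / s for s, t in zip(source_durations, target_durations)]
    return sum(w * r for w, r in zip(weights, ratios))

# Hypothetical per-entry average durations in seconds.
weights = [0.75, 0.25]
source_durations = [0.10, 0.20]
target_durations = [0.12, 0.20]

scale = duration_scale(weights, source_durations, target_durations)
new_duration = 0.08 * scale  # stretch a source segment lasting 0.08 s
```

A scale above 1 lengthens the segment and a scale below 1 shortens it. Applying the change to the waveform would need a separate time-scale modification step, which is not shown.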
15. A method as in claim 13, wherein the step of modifying the prosody of the target signal segment based on the plurality of weights includes the step of modifying the stress of the target signal segment.
16. A computer-readable medium bearing instructions for transforming a source signal representing a source voice into a target signal representing a target voice, said instructions arranged, when executed, to cause one or more processors to perform the steps of:
preprocessing said source signal to produce a source signal segment;
comparing the source signal segment with a plurality of source codebook entries representing speech units in said source voice to produce therefrom a plurality of corresponding weights;
transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries representing speech units in said target voice, said target codebook entries corresponding to the plurality of source codebook entries; and
post processing the target signal segment to generate said target signal.
converting the source signal segment into a plurality of line spectral frequencies; and
comparing the plurality of line spectral frequencies with the plurality of the source code entries to produce therefrom the plurality of the respective weights, wherein each of the source code entries include a respective plurality of line spectral frequencies.
21. A computer-readable medium as in claim 20, wherein the step of converting the source signal segment includes the steps of:
determining a plurality of coefficients for the source signal segment; and
converting the plurality of coefficients into the plurality of line spectral frequencies.
22. A computer-readable medium as in claim 21, wherein the step of determining a plurality of coefficients includes the step of determining a plurality of linear prediction coefficients or PARCOR coefficients.
23. A computer-readable medium as in claim 20, wherein the step of comparing the plurality of line spectral frequencies includes the steps of:
computing a plurality of distances between the source signal segment, represented by the plurality of line spectral frequencies, and each of the plurality of the respective source code entries, represented by a respective plurality of line spectral frequencies; and
producing the plurality of the weights based on the plurality of respective distances.
24. A computer-readable medium as in claim 23, further including the step of refining the plurality of weights by a gradient descent method.
25. A computer-readable medium as in claim 16, wherein the step of transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries includes the step of transforming vocal tract characteristics of the source signal segment into the target signal segment based on the plurality of weights and a plurality of target codebook entries.
26. A computer-readable medium as in claim 25, wherein the step of transforming vocal tract characteristics includes the step of reducing formant bandwidths in the target signal segment.
27. A computer-readable medium as in claim 25, wherein the step of transforming the source signal segment into a target signal segment based on the plurality of weights and a plurality of target codebook entries includes the step of transforming excitation characteristics of the source signal segment into the target signal segment based on the plurality of weights.
28. A computer-readable medium as in claim 16, wherein the instructions, when executed, are further arranged to perform the step of modifying the prosody of the target signal segment based on the plurality of weights.
29. A computer-readable medium as in claim 28, wherein the step of modifying the prosody of the target signal segment based on the plurality of weights includes the step of modifying the duration of the target signal segment.
30. A computer-readable medium as in claim 28, wherein the step of modifying the prosody of the target signal segment based on the plurality of weights includes the step of modifying the stress of the target signal segment.
Specification