Voice quality conversion device and voice quality conversion method for converting voice quality of an input speech using target vocal tract information and received vocal tract information corresponding to the input speech
Abstract
A voice quality conversion device including: a target vowel vocal tract information hold unit holding target vowel vocal tract information of each vowel indicating target voice quality; a vowel conversion unit (i) receiving vocal tract information with phoneme boundary information of the speech including information of phonemes and phoneme durations, (ii) approximating a temporal change of vocal tract information of a vowel in the vocal tract information with phoneme boundary information applying a first function, (iii) approximating a temporal change of vocal tract information of the same vowel held in the target vowel vocal tract information hold unit applying a second function, (iv) calculating a third function by combining the first function with the second function, and (v) converting the vocal tract information of the vowel applying the third function; and a synthesis unit synthesizing a speech using the converted information.
21 Claims
1. A voice quality conversion device that converts voice quality of an input speech using information corresponding to the input speech, said voice quality conversion device comprising:
a target vowel vocal tract information hold unit configured to hold target vowel vocal tract information of each vowel, the target vowel vocal tract information indicating target voice quality;
a vowel conversion unit configured to
  (i) receive vocal tract information with phoneme boundary information which is vocal tract information that corresponds to the input speech and that is added with information of (1) a phoneme in the input speech and (2) a duration of the phoneme,
  (ii) approximate, as a first polynomial expression, a temporal change of received vocal tract information of a vowel included in the received vocal tract information with phoneme boundary information,
  (iii) approximate, as a second polynomial expression, a temporal change of target vocal tract information of the vowel, the target vocal tract information being included in the target vowel vocal tract information held in said target vowel vocal tract information hold unit,
  (iv) approximate, as a third polynomial expression, interpolated vocal tract information of the vowel by combining (1) the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel with (2) the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel, and
  (v) convert the received vocal tract information of the vowel using the third polynomial expression approximating the interpolated vocal tract information of the vowel; and
a synthesis unit configured to synthesize a speech using the converted vocal tract information of the vowel converted by said vowel conversion unit,
wherein (i) the first polynomial expression approximates a change in the received vocal tract information of the vowel over time, (ii) the second polynomial expression approximates a change in the target vocal tract information of the vowel over time, and (iii) the third polynomial expression approximates a change in the interpolated vocal tract information of the vowel over time,
wherein the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel and the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel have a same time period that overlaps over the entire time period of the vowel, and
wherein said vowel conversion unit is configured to generate the third polynomial expression by adding the first polynomial expression with the second polynomial expression based on a predetermined conversion ratio.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
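The final wherein clause of claim 1 says only that the third polynomial expression is generated by adding the first and second polynomial expressions based on a predetermined conversion ratio. One plausible reading, given here as an assumption rather than as the claimed formula, is a coefficient-wise linear interpolation. The symbols a_i, b_i, c_i, the degree n, the ratio r, and the normalized time t are all introduced for illustration and do not appear in the claim text.

```latex
% Assumed notation: t is time normalized so that both polynomials span the
% entire vowel (the "same time period" of claim 1); r is the conversion ratio.
\begin{align}
  y_{\mathrm{in}}(t)     &= \sum_{i=0}^{n} a_i\, t^{i}
      && \text{(first polynomial, received vocal tract information)} \\
  y_{\mathrm{target}}(t) &= \sum_{i=0}^{n} b_i\, t^{i}
      && \text{(second polynomial, target vocal tract information)} \\
  c_i                    &= a_i + (b_i - a_i)\, r
      && \text{(assumed combination rule)} \\
  y_{\mathrm{conv}}(t)   &= \sum_{i=0}^{n} c_i\, t^{i}
      = (1 - r)\, y_{\mathrm{in}}(t) + r\, y_{\mathrm{target}}(t)
      && \text{(third polynomial)}
\end{align}
```

Under this reading, r = 0 leaves the received vowel unchanged and r = 1 reproduces the target vowel. Because both polynomials are defined over the same normalized time span, combining them coefficient by coefficient is equivalent to combining their values at every instant of the vowel.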
18. A voice quality conversion method of converting voice quality of an input speech using information corresponding to the input speech, said voice quality conversion method comprising:
receiving vocal tract information with phoneme boundary information which is vocal tract information that corresponds to the input speech and that is added with information of (i) a phoneme in the input speech and (ii) a duration of the phoneme;
approximating, as a first polynomial expression, a temporal change of received vocal tract information of a vowel included in the received vocal tract information with phoneme boundary information;
approximating, as a second polynomial expression, a temporal change of target vocal tract information of the vowel, the target vocal tract information of the vowel indicating target voice quality;
approximating, as a third polynomial expression, interpolated vocal tract information of the vowel by combining (i) the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel with (ii) the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel;
converting the received vocal tract information of the vowel using the third polynomial expression approximating the interpolated vocal tract information of the vowel; and
synthesizing a speech using the converted vocal tract information of the vowel converted in said converting,
wherein (i) the first polynomial expression approximates a change in the received vocal tract information of the vowel over time, (ii) the second polynomial expression approximates a change in the target vocal tract information of the vowel over time, and (iii) the third polynomial expression approximates a change in the interpolated vocal tract information of the vowel over time,
wherein the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel and the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel have a same time period that overlaps over the entire time period of the vowel, and
wherein said approximating, as the third polynomial expression, the interpolated vocal tract information of the vowel includes generating the third polynomial expression by adding the first polynomial expression with the second polynomial expression based on a predetermined conversion ratio.
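The approximating and converting steps of the method can be illustrated with a minimal sketch for a single scalar vocal tract parameter over one vowel. It is not the patented implementation. The function names, the least-squares fit, the default polynomial degree, the normalization of time to [0, 1], and the interpolation rule c = a + (b - a) * ratio are all assumptions made for illustration; the claim itself specifies none of them. The receiving and synthesizing steps are omitted.

```python
from typing import List, Sequence


def fit_polynomial(values: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial fit over normalized time t in [0, 1].

    Returns coefficients lowest order first: c[0] + c[1]*t + c[2]*t**2 + ...
    Normalizing time makes polynomials fitted to vowels of different
    durations cover the same span (an assumed reading of "same time period").
    """
    n = len(values)
    if n < max(2, degree + 1):
        raise ValueError("need at least max(2, degree + 1) frames")
    times = [i / (n - 1) for i in range(n)]
    size = degree + 1
    # Normal equations (A^T A) c = A^T y, where A is the Vandermonde matrix.
    ata = [[sum(t ** (r + c) for t in times) for c in range(size)]
           for r in range(size)]
    aty = [sum((t ** r) * y for t, y in zip(times, values))
           for r in range(size)]
    # Forward elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, size):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, size):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(ata[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (aty[row] - tail) / ata[row][row]
    return coeffs


def blend(first: Sequence[float], second: Sequence[float],
          ratio: float) -> List[float]:
    """Third polynomial: coefficient-wise interpolation (assumed rule)."""
    return [a + (b - a) * ratio for a, b in zip(first, second)]


def evaluate(coeffs: Sequence[float], t: float) -> float:
    """Evaluate a lowest-order-first polynomial at t using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def convert_vowel(received: Sequence[float], target: Sequence[float],
                  ratio: float, degree: int = 2) -> List[float]:
    """Fit both trajectories, blend them, and resample on the input frames."""
    first = fit_polynomial(received, degree)   # first polynomial expression
    second = fit_polynomial(target, degree)    # second polynomial expression
    third = blend(first, second, ratio)        # third polynomial expression
    n = len(received)
    return [evaluate(third, i / (n - 1)) for i in range(n)]
```

In this sketch the converted trajectory always has as many frames as the received vowel, so the duration of the input phoneme is preserved even when the target vowel was recorded with a different number of frames. A real system would repeat the procedure for every dimension of the vocal tract parameter vector and would probably use a library least-squares routine instead of the hand-written solver.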
19. A non-transitory computer readable recording medium having stored thereon a program for converting voice quality of an input speech using information corresponding to the input speech, wherein, when executed by a computer, said program causes the computer to perform a method comprising:
receiving vocal tract information with phoneme boundary information which is vocal tract information that corresponds to the input speech and that is added with information of (i) a phoneme in the input speech and (ii) a duration of the phoneme;
approximating, as a first polynomial expression, a temporal change of received vocal tract information of a vowel included in the received vocal tract information with phoneme boundary information;
approximating, as a second polynomial expression, a temporal change of target vocal tract information of the vowel, the target vocal tract information of the vowel indicating target voice quality;
approximating, as a third polynomial expression, interpolated vocal tract information by combining (i) the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel with (ii) the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel;
converting the received vocal tract information of the vowel using the third polynomial expression approximating the interpolated vocal tract information of the vowel; and
synthesizing a speech using the converted vocal tract information of the vowel converted in said converting,
wherein (i) the first polynomial expression approximates a change in the received vocal tract information of the vowel over time, (ii) the second polynomial expression approximates a change in the target vocal tract information of the vowel over time, and (iii) the third polynomial expression approximates a change in the interpolated vocal tract information of the vowel over time,
wherein the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel and the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel have a same time period that overlaps over the entire time period of the vowel, and
wherein said approximating, as the third polynomial expression, the interpolated vocal tract information of the vowel includes generating the third polynomial expression by adding the first polynomial expression with the second polynomial expression based on a predetermined conversion ratio.
20. A voice quality conversion system that converts voice quality of an original speech to be converted using information corresponding to the original speech, said voice quality conversion system comprising:
a server; and
a terminal connected to said server via a network,
wherein said server includes:
  a target vowel vocal tract information hold unit configured to hold target vowel vocal tract information of each vowel, the target vowel vocal tract information indicating target voice quality;
  a target vowel vocal tract information sending unit configured to send the target vowel vocal tract information held in said target vowel vocal tract information hold unit to said terminal via the network;
  an original speech hold unit configured to hold original speech information that is information corresponding to the original speech; and
  an original speech information sending unit configured to send the original speech information held in said original speech hold unit to said terminal via the network,
wherein said terminal includes:
  a target vowel vocal tract information receiving unit configured to receive the target vowel vocal tract information from said target vowel vocal tract information sending unit;
  an original speech information receiving unit configured to receive the original speech information from said original speech information sending unit;
  a vowel conversion unit configured to (i) approximate, as a first polynomial expression, a temporal change of received vocal tract information of a vowel included in the received original speech information received by said original speech information receiving unit, (ii) approximate, as a second polynomial expression, a temporal change of target vocal tract information for the vowel, the target vocal tract information for the vowel being included in the target vowel vocal tract information received by said target vowel vocal tract information receiving unit, (iii) approximate, as a third polynomial expression, interpolated vocal tract information by combining (i) the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel with (ii) the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel, and (iv) convert the vocal tract information of the vowel using the third polynomial expression approximating the interpolated vocal tract information; and
  a synthesis unit configured to synthesize a speech using the converted vocal tract information of the vowel converted by said vowel conversion unit,
wherein (i) the first polynomial expression approximates a change in the received vocal tract information of the vowel over time, (ii) the second polynomial expression approximates a change in the target vocal tract information of the vowel over time, and (iii) the third polynomial expression approximates a change in the interpolated vocal tract information of the vowel over time,
wherein the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel and the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel have a same time period that overlaps over the entire time period of the vowel, and
wherein said vowel conversion unit is configured to generate the third polynomial expression by adding the first polynomial expression with the second polynomial expression based on a predetermined conversion ratio.
21. A voice quality conversion system that converts voice quality of an original speech to be converted using information corresponding to the original speech, said voice quality conversion system comprising:
a terminal; and
a server connected to said terminal via a network,
wherein said terminal includes:
  a target vowel vocal tract information generation unit configured to generate target vowel vocal tract information of each vowel, the target vowel vocal tract information indicating target voice quality;
  a target vowel vocal tract information sending unit configured to send the target vowel vocal tract information generated by said target vowel vocal tract information generation unit to said server via the network;
  a voice quality conversion speech receiving unit configured to receive a speech with converted voice quality; and
  a reproduction unit configured to reproduce the speech with the converted voice quality received by said voice quality conversion speech receiving unit,
wherein said server includes:
  an original speech hold unit configured to hold original speech information that is information corresponding to the original speech;
  a target vowel vocal tract information receiving unit configured to receive the target vowel vocal tract information from said target vowel vocal tract information sending unit;
  a vowel conversion unit configured to (i) approximate, as a first polynomial expression, a temporal change of received vocal tract information of a vowel included in the original speech information held in said original speech information hold unit, (ii) approximate, as a second polynomial expression, a temporal change of target vocal tract information of the vowel, the target vocal tract information being included in the target vowel vocal tract information received by said target vowel vocal tract information receiving unit, (iii) approximate, as a third polynomial expression, interpolated vocal tract information by combining (i) the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel with (ii) the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel, and (iv) convert the received vocal tract information of the vowel using the third polynomial expression approximating the interpolated vocal tract information;
  a synthesis unit configured to synthesize a speech using the converted vocal tract information for the vowel converted by said vowel conversion unit; and
  a synthetic speech sending unit configured to send, as the speech with the converted voice quality, the speech synthesized by said synthesis unit to said voice quality conversion speech receiving unit via the network,
wherein (i) the first polynomial expression approximates a change in the received vocal tract information of the vowel over time, (ii) the second polynomial expression approximates a change in the target vocal tract information of the vowel over time, and (iii) the third polynomial expression approximates a change in the interpolated vocal tract information of the vowel over time,
wherein the first polynomial expression approximating the temporal change of the received vocal tract information of the vowel and the second polynomial expression approximating the temporal change of the target vocal tract information of the vowel have a same time period that overlaps over the entire time period of the vowel, and
wherein said vowel conversion unit is configured to generate the third polynomial expression by adding the first polynomial expression with the second polynomial expression based on a predetermined conversion ratio.
Specification