System and method for voice-to-voice conversion
First Claim
1. A method of building a speech conversion system using target voice information from a target voice, and speech data that represents a speech segment of a source voice, the method comprising:
receiving source speech data that represents a first speech segment of a source voice;
receiving target timbre data relating to the target voice, the target timbre data being within a timbre space;
using a generative machine learning system to produce first candidate speech data that represents a first candidate speech segment in a first candidate voice as a function of the source speech data and the target timbre data;
using a discriminative machine learning system to compare the first candidate speech data to the target timbre data with reference to timbre data of a plurality of different voices, said using the discriminative machine learning system comprising determining at least one inconsistency between the first candidate speech data and the target timbre data with reference to the timbre data of the plurality of different voices, the discriminative machine learning system producing an inconsistency message having information relating to the inconsistency between the first candidate speech data and the target timbre data;
feeding back the inconsistency message to the generative machine learning system;
using the generative machine learning system to produce second candidate speech data, that represents a second candidate speech segment in a second candidate voice, as a function of the inconsistency message; and
refining the target timbre data in the timbre space using information produced by the generative machine learning system and/or discriminative machine learning system as a result of said feeding back.
2 Assignments
0 Petitions
Abstract
A method of building a speech conversion system uses target information from a target voice and source speech data. The method receives the source speech data and the target timbre data, which is within a timbre space. A generator produces first candidate data as a function of the source speech data and the target timbre data. A discriminator compares the first candidate data to the target timbre data with reference to timbre data of a plurality of different voices. The discriminator determines inconsistencies between the first candidate data and the target timbre data and produces an inconsistency message containing information relating to those inconsistencies. The inconsistency message is fed back to the generator, and the generator produces second candidate data. The target timbre data in the timbre space is refined using information produced by the generator and/or discriminator as a result of the feeding back.
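The feedback loop described in the abstract can be pictured with a toy sketch. Everything below is an illustrative assumption rather than the patented implementation: speech data and timbre data are reduced to short lists of floats, the names `InconsistencyMessage`, `generate`, `discriminate`, and `train` are hypothetical, the message format (a per-dimension difference plus the nearest reference voice) is invented for the example, and the final step of refining the target timbre data in the timbre space is not modeled.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class InconsistencyMessage:
    """Hypothetical message format; the claims do not specify one."""
    delta: Vector        # target timbre minus candidate, per dimension
    nearest_voice: str   # reference voice the candidate currently sits closest to


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance in the toy timbre space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def generate(source: Vector, target: Vector,
             previous: Optional[Vector] = None,
             message: Optional[InconsistencyMessage] = None,
             rate: float = 0.5) -> Vector:
    """Toy generator standing in for the generative machine learning system."""
    if previous is None or message is None:
        # First candidate: a function of the source data and the target timbre data.
        return [s + rate * (t - s) for s, t in zip(source, target)]
    # Later candidates: a function of the inconsistency message.
    return [p + rate * d for p, d in zip(previous, message.delta)]


def discriminate(candidate: Vector, target: Vector,
                 voices: Dict[str, Vector],
                 tolerance: float = 0.05) -> Optional[InconsistencyMessage]:
    """Toy discriminator: compares the candidate to the target with reference
    to several other voices; returns None when it finds no inconsistency."""
    # The target is added under the key "target" (overrides any same-named voice).
    all_voices = dict(voices, target=target)
    nearest = min(all_voices, key=lambda name: distance(candidate, all_voices[name]))
    if nearest == "target" and distance(candidate, target) <= tolerance:
        return None
    return InconsistencyMessage([t - c for t, c in zip(target, candidate)], nearest)


def train(source: Vector, target: Vector, voices: Dict[str, Vector],
          max_rounds: int = 20) -> List[Vector]:
    """Run the generate / discriminate / feed-back loop; return every candidate."""
    candidate = generate(source, target)
    history = [candidate]
    for _ in range(max_rounds):
        message = discriminate(candidate, target, voices)
        if message is None:
            break
        candidate = generate(source, target, candidate, message)
        history.append(candidate)
    return history
```

Calling `train([0.0, 0.0], [1.0, 1.0], {"voice_a": [0.2, 0.1], "voice_b": [0.9, 0.2]})` returns a list of candidates whose distance to the target shrinks each round. In a real system the generator and discriminator would be trained neural networks operating on audio features, not the closed-form updates used here.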
75 Citations
37 Claims
1. A method of building a speech conversion system using target voice information from a target voice, and speech data that represents a speech segment of a source voice, the method comprising:
receiving source speech data that represents a first speech segment of a source voice;
receiving target timbre data relating to the target voice, the target timbre data being within a timbre space;
using a generative machine learning system to produce first candidate speech data that represents a first candidate speech segment in a first candidate voice as a function of the source speech data and the target timbre data;
using a discriminative machine learning system to compare the first candidate speech data to the target timbre data with reference to timbre data of a plurality of different voices, said using the discriminative machine learning system comprising determining at least one inconsistency between the first candidate speech data and the target timbre data with reference to the timbre data of the plurality of different voices, the discriminative machine learning system producing an inconsistency message having information relating to the inconsistency between the first candidate speech data and the target timbre data;
feeding back the inconsistency message to the generative machine learning system;
using the generative machine learning system to produce second candidate speech data, that represents a second candidate speech segment in a second candidate voice, as a function of the inconsistency message; and
refining the target timbre data in the timbre space using information produced by the generative machine learning system and/or discriminative machine learning system as a result of said feeding back.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
16. A system for training a speech conversion system, the system comprising:
source speech data that represents a first speech segment of a source voice;
target timbre data that relates to a target voice;
a generative machine learning system configured to produce first candidate speech data that represents a first candidate speech segment in a first candidate voice as a function of the source speech data and the target timbre data;
a discriminative machine learning system configured to:
compare the first candidate speech data to the target timbre data with reference to timbre data of a plurality of different voices, and
determine whether there is at least one inconsistency between the first candidate speech data and the target timbre data with reference to the timbre data of the plurality of different voices, and when the at least one inconsistency exists:
produce an inconsistency message having information relating to the inconsistency between the first candidate speech data and the target timbre data, and
provide the inconsistency message back to the generative machine learning system.
View Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
28. A computer program product for use on a computer system for training a speech conversion system using source speech data that represents a speech segment from a source voice for conversion into an output voice having a target voice timbre, the computer program product comprising a tangible, non-transient computer usable medium having computer readable program code thereon, the computer readable program code comprising:
program code for causing a generative machine learning system to produce first candidate speech data that represents a first candidate speech segment in a first candidate voice as a function of the source speech data and target timbre data;
program code for causing a discriminative machine learning system to compare the first candidate speech data to the target timbre data with reference to the timbre data of the plurality of different voices;
program code for causing the discriminative machine learning system to determine at least one inconsistency between the first candidate speech data and the target timbre data with reference to the timbre data of the plurality of different voices;
program code for causing the discriminative machine learning system to produce an inconsistency message having information relating to the inconsistency between the first candidate speech data and the target timbre data with reference to the timbre data of the plurality of different voices;
program code for causing the discriminative machine learning system to feed the inconsistency message back to the generative machine learning system; and
program code for causing the generative machine learning system to produce second candidate speech data representing a second candidate speech segment in a second candidate voice as a function of the inconsistency message.
View Dependent Claims (29, 30, 31, 32, 33, 34, 35, 36, 37)
Specification