Soft alignment based on a probability of time alignment
First Claim
1. A method comprising:
receiving a first sequence of feature vectors associated with a source speaker for processing based on operations controlled by a processor;
receiving a second sequence of feature vectors associated with a target speaker;
generating a third sequence of joint feature vectors, wherein the generation of each joint feature vector is based on:
a first vector from the first sequence;
a first vector from the second sequence; and
a first probability value representing the probability that the first vector from the first sequence and the first vector from the second sequence are time aligned to the same feature in their respective sequences; and
applying the third sequence of joint feature vectors as a part of a voice conversion process.
Abstract
Systems and methods are provided for performing soft alignment in Gaussian mixture model (GMM) based and other vector transformations. Soft alignment may assign alignment probabilities to source and target feature vector pairs. The vector pairs and associated probabilities may then be used to calculate a conversion function, for example, by computing GMM training parameters from the joint vectors and alignment probabilities to create a voice conversion function for converting speech sounds from a source speaker to a target speaker.
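As an illustration of the idea in the abstract, the following is a minimal sketch, not the patented implementation. It assumes the alignment probabilities are already available as a matrix (how they are obtained is not shown here), pairs source and target frames into joint vectors, and computes probability-weighted mean and covariance statistics of the kind a single GMM component would use during training. All function names, the `min_prob` threshold, and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
JointSample = Tuple[List[float], float]  # (joint vector, alignment probability)


def build_joint_vectors(
    source: Sequence[Vector],
    target: Sequence[Vector],
    align_prob: Sequence[Sequence[float]],
    min_prob: float = 0.0,
) -> List[JointSample]:
    """Pair source frame i with target frame j whenever align_prob[i][j]
    exceeds min_prob (an assumed pruning threshold). Each joint vector is
    the concatenation [x_i, y_j] and keeps its probability as a weight."""
    joint: List[JointSample] = []
    for i, x in enumerate(source):
        for j, y in enumerate(target):
            p = align_prob[i][j]
            if p > min_prob:
                joint.append((list(x) + list(y), p))
    return joint


def weighted_mean(joint: Sequence[JointSample]) -> List[float]:
    """Probability-weighted mean of the joint vectors."""
    total = sum(p for _, p in joint)
    dim = len(joint[0][0])
    return [sum(p * z[d] for z, p in joint) / total for d in range(dim)]


def weighted_covariance(
    joint: Sequence[JointSample], mean: Sequence[float]
) -> List[List[float]]:
    """Probability-weighted covariance of the joint vectors around mean."""
    total = sum(p for _, p in joint)
    dim = len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for z, p in joint:
        diff = [z[d] - mean[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += p * diff[a] * diff[b]
    return [[c / total for c in row] for row in cov]


# Toy data: two one-dimensional frames per speaker (made up for illustration).
source_frames = [[1.0], [2.0]]
target_frames = [[10.0], [20.0]]
# Row i, column j: assumed probability that source frame i aligns with target frame j.
alignment = [[0.9, 0.1], [0.0, 1.0]]

joint_samples = build_joint_vectors(source_frames, target_frames, alignment)
joint_mean = weighted_mean(joint_samples)
joint_cov = weighted_covariance(joint_samples, joint_mean)
```

In contrast to a hard, one-to-one alignment, a source frame here may contribute to several joint vectors, each weighted by how likely the pairing is.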
39 Claims
1. A method comprising:
receiving a first sequence of feature vectors associated with a source speaker for processing based on operations controlled by a processor;
receiving a second sequence of feature vectors associated with a target speaker;
generating a third sequence of joint feature vectors, wherein the generation of each joint feature vector is based on:
a first vector from the first sequence;
a first vector from the second sequence; and
a first probability value representing the probability that the first vector from the first sequence and the first vector from the second sequence are time aligned to the same feature in their respective sequences; and
applying the third sequence of joint feature vectors as a part of a voice conversion process.
Dependent claims: 2, 3, 4, 5, 6, 7
8. One or more computer readable media storing computer-executable instructions which, when executed by a processor, cause the processor to perform a method comprising:
receiving a first sequence of feature vectors associated with a source speaker;
receiving a second sequence of feature vectors associated with a target speaker;
generating a third sequence of joint feature vectors, wherein each joint feature vector is based on:
a first vector from the first sequence;
a second vector from the second sequence; and
a probability value representing the probability that the first vector and the second vector are time aligned to the same feature in their respective sequences; and
applying the third sequence feature vectors as a part of a voice conversion process.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A method comprising:
receiving a first data sequence associated with a first source speaker for processing based on operations controlled by a processor;
receiving a second data sequence associated with a second source speaker;
identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is time aligned with the item from the second data sequence;
determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
applying the data transformation function as a part of a voice conversion process.
Dependent claims: 16, 17, 18, 19, 20
21. An apparatus comprising:
a memory configured to store instructions; and
a processor configured to process the instructions to perform a method comprising:
receiving a first sequence of feature vectors associated with a source speaker;
receiving a second sequence of feature vectors associated with a target speaker;
generating a third sequence of joint feature vectors, wherein the generation of each joint feature vector is based on:
a first vector from the first sequence;
a first vector from the second sequence; and
a first probability value representing the probability that the first vector from the first sequence and the first vector from the second sequence are time aligned to the same feature in their respective sequences; and
applying the third sequence of joint feature vectors as a part of a voice conversion process.
Dependent claims: 22, 23, 24, 25, 26, 27
28. One or more computer readable media storing computer-executable instructions which, when executed by a processor, cause the processor to perform a method comprising:
receiving a first data sequence associated with a first source speaker;
receiving a second data sequence associated with a second source speaker;
identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is time aligned with the item from the second data sequence;
determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
applying the data transformation function as a part of a voice conversion process.
Dependent claims: 29, 30, 31, 32, 33
34. An apparatus comprising:
a memory configured to store instructions; and
a processor configured to process the instructions to perform a method comprising:
receiving a first data sequence associated with a first source speaker;
receiving a second data sequence associated with a second source speaker;
identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is aligned with the item from the second data sequence;
determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
applying the data transformation function as a part of a voice conversion process.
Dependent claims: 35, 36, 37, 38, 39
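The data-pair claims (15, 28, and 34) describe deriving a transformation function from data pairs and their alignment probabilities. The sketch below illustrates that general idea with a deliberately simpler technique than the GMM named in the abstract: a probability-weighted least-squares fit of a scalar linear map y = a*x + b. The function names and the toy data are hypothetical and are not taken from the patent.

```python
from typing import Callable, Sequence, Tuple


def fit_weighted_linear(
    pairs: Sequence[Tuple[float, float]],
    probs: Sequence[float],
) -> Tuple[float, float]:
    """Fit y = a*x + b by least squares, weighting each (x, y) pair by its
    alignment probability. Pairs with probability 0 have no influence."""
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("alignment probabilities must sum to a positive value")
    mean_x = sum(p * x for (x, _), p in zip(pairs, probs)) / total
    mean_y = sum(p * y for (_, y), p in zip(pairs, probs)) / total
    sxx = sum(p * (x - mean_x) ** 2 for (x, _), p in zip(pairs, probs))
    sxy = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in zip(pairs, probs))
    if sxx == 0.0:
        raise ValueError("source items have no weighted variance; slope is undefined")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def make_transform(a: float, b: float) -> Callable[[float], float]:
    """Return the fitted transformation as a callable."""
    return lambda x: a * x + b


# Toy data pairs (source item, target item); values are made up.
data_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (1.0, 100.0)]
# Assumed alignment probabilities; the last pair is judged not aligned.
alignment_probs = [1.0, 1.0, 1.0, 0.0]

slope, intercept = fit_weighted_linear(data_pairs, alignment_probs)
convert = make_transform(slope, intercept)
converted = convert(3.0)
```

Because the outlying pair carries probability 0, it does not distort the fit, which is the practical benefit of weighting pairs by alignment probability instead of treating every pairing as certain.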
Specification