Soft alignment based on a probability of time alignment

US 7,505,950 B2
Filed: 04/26/2006
Issued: 03/17/2009
Est. Priority Date: 04/26/2006
Status: Active Grant

Abstract

Systems and methods are provided for performing soft alignment in Gaussian mixture model (GMM) based and other vector transformations. Soft alignment may assign alignment probabilities to source and target feature vector pairs. The vector pairs and associated probabilities may then be used to calculate a conversion function, for example, by computing GMM training parameters from the joint vectors and alignment probabilities to create a voice conversion function for converting speech sounds from a source speaker to a target speaker.
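
As a rough illustration of the idea in the abstract, the following Python sketch pairs source and target feature vectors and carries each pair's alignment probability along as a weight. The function name `soft_joint_vectors`, the `min_prob` threshold, and the example probability matrix are assumptions made for illustration; none of them come from the patent.

```python
from typing import List, Tuple

Vector = List[float]


def soft_joint_vectors(
    source: List[Vector],
    target: List[Vector],
    align_prob: List[List[float]],
    min_prob: float = 0.0,
) -> List[Tuple[Vector, float]]:
    """Build weighted joint vectors from two sequences.

    align_prob[i][j] is assumed to hold the probability that source[i]
    and target[j] are time aligned. Pairs whose probability does not
    exceed min_prob are dropped.
    """
    joint = []
    for i, x in enumerate(source):
        for j, y in enumerate(target):
            p = align_prob[i][j]
            if p > min_prob:
                # Concatenate the two vectors and keep p as a weight.
                joint.append((x + y, p))
    return joint


# Hypothetical data: 2 source frames, 3 target frames (lengths may differ).
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[10.0, 11.0], [20.0, 21.0], [30.0, 31.0]]
align_prob = [
    [0.7, 0.3, 0.0],
    [0.0, 0.4, 0.6],
]
pairs = soft_joint_vectors(source, target, align_prob)
```

A hard (one-to-one) alignment would correspond to a matrix containing only 0 and 1; here a source frame can contribute to several joint vectors with fractional weights.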

39 Claims

1. A method comprising:
- receiving a first sequence of feature vectors associated with a source speaker for processing based on operations controlled by a processor;
  
  receiving a second sequence of feature vectors associated with a target speaker;
  
  generating a third sequence of joint feature vectors, wherein the generation of each joint feature vector is based on:
  
  a first vector from the first sequence;
  
  a first vector from the second sequence; and
  
  a first probability value representing the probability that the first vector from the first sequence and the first vector from the second sequence are time aligned to the same feature in their respective sequences; and
  
  applying the third sequence of joint feature vectors as a part of a voice conversion process.
- Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The method of claim 1, wherein the first sequence contains a different number of feature vectors than the second sequence.
  - 3. The method of claim 1, wherein the first sequence corresponds to a plurality of utterances produced by a first speaker, and the second sequence corresponds to the same plurality of utterances produced by a second speaker, and wherein each of the feature vectors represents a basic speech sound in a larger voice segment.
  - 4. The method of claim 1, wherein a Hidden Markov Model is applied to estimate the first probability value.
  - 5. The method of claim 1, wherein the probability is a non-Boolean value.
  - 6. The method of claim 1, wherein for the generation of the third sequence of joint feature vectors, the vector from the first sequence and the vector from the second sequence are different vectors for each joint feature vector in the third sequence.
  - 7. The method of claim 1, wherein the generation of at least one of the joint feature vectors is further based on:
    - a second vector from the first sequence;
      
      a second vector from the second sequence; and
      
      a second probability value representing the probability that the second vector from the first sequence and the second vector from the second sequence are aligned to the same feature in their respective sequences.
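
Claims 2 and 5 allow sequences of different lengths and non-Boolean probabilities, and claim 4 names a Hidden Markov Model as one way to estimate them. The sketch below does not implement an HMM. It swaps in a much simpler stand-in, a softmax over negative squared distances that ignores time order, only to show what a non-Boolean probability matrix between sequences of unequal length might look like. The function name `soft_alignment` and the `temperature` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def soft_alignment(
    source: List[Vector],
    target: List[Vector],
    temperature: float = 1.0,
) -> List[List[float]]:
    """Return a len(source) x len(target) matrix of pseudo-probabilities.

    Stand-in only: each row is a softmax over negative squared Euclidean
    distances. It is not the HMM-based estimate mentioned in the claims.
    """
    probs = []
    for x in source:
        dists = [sum((a - b) ** 2 for a, b in zip(x, y)) for y in target]
        nearest = min(dists)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - nearest) / temperature) for d in dists]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


# Hypothetical 1-D frames: 2 source frames against 3 target frames.
source_frames = [[0.0], [1.0]]
target_frames = [[0.0], [0.5], [1.0]]
probs = soft_alignment(source_frames, target_frames)
```

Each row sums to 1 and every entry lies strictly between 0 and 1, so no pair is forced into a yes-or-no alignment decision.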

8. One or more computer readable media storing computer-executable instructions which, when executed by a processor, cause the processor to perform a method comprising:
- receiving a first sequence of feature vectors associated with a source speaker;
  
  receiving a second sequence of feature vectors associated with a target speaker;
  
  generating a third sequence of joint feature vectors, wherein each joint feature vector is based on:
  
  a first vector from the first sequence;
  
  a second vector from the second sequence; and
  
  a probability value representing the probability that the first vector and the second vector are time aligned to the same feature in their respective sequences; and
  
  applying the third sequence of joint feature vectors as a part of a voice conversion process.
- Dependent Claims (9, 10, 11, 12, 13, 14)
- - 9. The computer readable media of claim 8, wherein the first sequence contains a different number of feature vectors than the second sequence.
  - 10. The computer readable media of claim 8, wherein the first sequence corresponds to a plurality of utterances produced by a first speaker, and the second sequence corresponds to the same plurality of utterances produced by a second speaker, and wherein each of the feature vectors represents a basic speech sound in a larger voice segment.
  - 11. The computer readable media of claim 8, wherein a Hidden Markov Model is applied to estimate the probability value.
  - 12. The computer readable media of claim 8, wherein the probability is a non-Boolean value.
  - 13. The computer readable media of claim 8, wherein for the generation of the third sequence of joint feature vectors, the vector from the first sequence and the vector from the second sequence are different vectors for each joint feature vector in the third sequence.
  - 14. The computer readable media of claim 8, wherein the generation of at least one of the joint feature vectors is further based on:
    - a second vector from the first sequence;
      
      a second vector from the second sequence; and
      
      a second probability value representing the probability that the second vector from the first sequence and the second vector from the second sequence are aligned to the same feature in their respective sequences.

15. A method comprising:
- receiving a first data sequence associated with a first source speaker for processing based on operations controlled by a processor;
  
  receiving a second data sequence associated with a second source speaker;
  
  identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
  
  determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is time aligned with the item from the second data sequence;
  
  determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
  
  applying the data transformation function as a part of a voice conversion process.
- Dependent Claims (16, 17, 18, 19, 20)
- - 16. The method of claim 15, wherein determining the data transformation function comprises calculating parameters according to one of Gaussian Mixture Model (GMM) techniques and codebook-based techniques, said parameters associated with the data transformation.
  - 17. The method of claim 16, wherein calculation of the parameters comprises execution of an Expectation-Maximization algorithm.
  - 18. The method of claim 15, wherein at least one of the plurality of alignment probabilities is a non-Boolean value.
  - 19. The method of claim 15, wherein the first data sequence corresponds to a plurality of utterances produced by the first source speaker, the second data sequence corresponds to a plurality of utterances produced by the second source speaker, and the data transformation function comprises a voice conversion function and wherein each of the feature vectors represents a basic speech sound in a larger voice segment.
  - 20. The method of claim 19, further comprising:
    - receiving a third data sequence associated with the first source speaker, said third data sequence corresponding to speech vectors produced based on sound provided by the first source speaker; and
      
      applying the voice conversion function to the third data sequence.
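
The steps of claims 15 and 20 (weighted data pairs, a transformation function, and its application to a third sequence) can be sketched as follows. This is a minimal illustration under strong assumptions: features are scalars, and the model is a single weighted joint Gaussian rather than the multi-component GMM or codebook of claim 16, so no Expectation-Maximization iterations are needed. The name `fit_conversion` and all data values are hypothetical.

```python
from typing import Callable, List, Tuple

# Each pair is (source item x, target item y, alignment probability p).
WeightedPair = Tuple[float, float, float]


def fit_conversion(pairs: List[WeightedPair]) -> Callable[[float], float]:
    """Fit f(x) = E[y | x] under one weighted joint Gaussian over (x, y).

    The alignment probabilities act as weights in the mean and
    covariance estimates, so unlikely pairs influence the fit less.
    """
    total = sum(p for _, _, p in pairs)
    mu_x = sum(p * x for x, _, p in pairs) / total
    mu_y = sum(p * y for _, y, p in pairs) / total
    var_x = sum(p * (x - mu_x) ** 2 for x, _, p in pairs) / total
    cov_yx = sum(p * (x - mu_x) * (y - mu_y) for x, y, p in pairs) / total
    if var_x == 0.0:
        raise ValueError("source items have zero weighted variance")
    slope = cov_yx / var_x
    # Conditional mean of y given x for a joint Gaussian.
    return lambda x: mu_y + slope * (x - mu_x)


# Hypothetical training pairs that happen to lie on y = 2x + 1.
training_pairs = [(0.0, 1.0, 0.9), (1.0, 3.0, 0.5), (2.0, 5.0, 0.8)]
convert = fit_conversion(training_pairs)

# A further sequence from the first speaker, converted item by item.
third_sequence = [3.0, 4.0]
converted = [convert(x) for x in third_sequence]
```

With several mixture components, the same weighted statistics would be accumulated per component inside each EM iteration; the single-component case is shown only because it keeps the arithmetic visible.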

21. An apparatus comprising:
- a memory configured to store instructions; and
  
  a processor configured to process the instructions to perform a method comprising:
  
  receiving a first sequence of feature vectors associated with a source speaker;
  
  receiving a second sequence of feature vectors associated with a target speaker;
  
  generating a third sequence of joint feature vectors, wherein the generation of each joint feature vector is based on:
  
  a first vector from the first sequence;
  
  a first vector from the second sequence; and
  
  a first probability value representing the probability that the first vector from the first sequence and the first vector from the second sequence are time aligned to the same feature in their respective sequences; and
  
  applying the third sequence of joint feature vectors as a part of a voice conversion process.
- Dependent Claims (22, 23, 24, 25, 26, 27)
- - 22. The apparatus of claim 21, wherein the first sequence contains a different number of feature vectors than the second sequence.
  - 23. The apparatus of claim 21, wherein the first sequence corresponds to a plurality of utterances produced by a first speaker, and the second sequence corresponds to the same plurality of utterances produced by a second speaker, and wherein each of the vectors represents a basic speech sound in a larger voice segment.
  - 24. The apparatus of claim 21, wherein a Hidden Markov Model is applied to estimate the first probability value.
  - 25. The apparatus of claim 21, wherein the probability is a non-Boolean value.
  - 26. The apparatus of claim 21, wherein for the generation of the third sequence of joint feature vectors, the vector from the first sequence and the vector from the second sequence are different vectors for each joint feature vector in the third sequence.
  - 27. The apparatus of claim 21, wherein the generation of at least one of the joint feature vectors is further based on:
    - a second vector from the first sequence;
      
      a second vector from the second sequence; and
      
      a second probability value representing the probability that the second vector from the first sequence and the second vector from the second sequence are time aligned to the same feature in their respective sequences.

28. One or more computer readable media storing computer-executable instructions which, when executed by a processor, cause the processor to perform a method comprising:
- receiving a first data sequence associated with a first source speaker;
  
  receiving a second data sequence associated with a second source speaker;
  
  identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
  
  determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is time aligned with the item from the second data sequence;
  
  determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
  
  applying the data transformation function as a part of a voice conversion process.
- Dependent Claims (29, 30, 31, 32, 33)
- - 29. The one or more computer readable media of claim 28, wherein determining the data transformation function comprises calculating parameters according to one of Gaussian Mixture Model (GMM) techniques and codebook-based techniques, said parameters associated with the data transformation.
  - 30. The one or more computer readable media of claim 29, wherein calculation of the parameters comprises execution of an Expectation-Maximization algorithm.
  - 31. The one or more computer readable media of claim 28, wherein at least one of the plurality of alignment probabilities is a non-Boolean value.
  - 32. The one or more computer readable media of claim 28, wherein the first data sequence corresponds to a plurality of utterances produced by the first source speaker, the second data sequence corresponds to a plurality of utterances produced by the second source speaker, and the data transformation function comprises a voice conversion function, and wherein each of the feature vectors represents a basic speech sound in a larger voice segment.
  - 33. The one or more computer readable media of claim 32, further comprising:
    - receiving a third data sequence associated with the first source speaker, said third data sequence corresponding to speech vectors produced based on sound provided by the first source speaker; and
      
      applying the voice conversion function to the third data sequence.

34. An apparatus comprising:
- a memory configured to store instructions; and
  
  a processor configured to process the instructions to perform a method comprising:
  
  receiving a first data sequence associated with a first source speaker;
  
  receiving a second data sequence associated with a second source speaker;
  
  identifying a plurality of data pairs, each data pair comprising an item from the first data sequence and an item from the second data sequence;
  
  determining a plurality of alignment probabilities, each alignment probability associated with one of the plurality of data pairs and comprising a probability value that the item from the first data sequence is aligned with the item from the second data sequence;
  
  determining a data transformation function based on the plurality of data pairs and the associated plurality of alignment probabilities; and
  
  applying the data transformation function as a part of a voice conversion process.
- Dependent Claims (35, 36, 37, 38, 39)
- - 35. The apparatus of claim 34, wherein determining the data transformation function comprises calculating parameters according to one of Gaussian Mixture Model (GMM) techniques and codebook-based techniques, said parameters associated with the data transformation.
  - 36. The apparatus of claim 35, wherein calculation of the parameters comprises execution of an Expectation-Maximization algorithm.
  - 37. The apparatus of claim 34, wherein at least one of the plurality of alignment probabilities is a non-Boolean value.
  - 38. The apparatus of claim 34, wherein the first data sequence corresponds to a plurality of utterances produced by a first source speaker, the second data sequence corresponds to a plurality of utterances produced by a second source speaker, and the data transformation function comprises a voice conversion function, and wherein each of the feature vectors represents a basic speech sound in a larger voice segment.
  - 39. The apparatus of claim 38, wherein the processor is configured to process the instructions to:
    - receive a third data sequence associated with the first source speaker, said third data sequence corresponding to speech vectors produced based on sound provided by the first source speaker; and
      
      apply the voice conversion function to the third data sequence.

Current Assignee: HMD Global Oy
Original Assignee: Nokia Corporation
Inventors: Tian, Jilei; Popa, Victor; Nurminen, Jani
Primary Examiner(s): Vincent, David R
Assistant Examiner(s): Brown, Jr., Nathan H
Application Number: US11/380,289
Publication Number: US 20070256189A1
Time in Patent Office: 1,056 Days
Field of Search: 706/45, 704/203, 704/208, 704/256.1, 704/256.7
US Class Current: 706/45
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 2021/0135 Voice conversion or morphing
