Automatic donor ranking and selection system and method for voice conversion

US 20070027687A1
Filed: 03/14/2006
Published: 02/01/2007
Est. Priority Date: 03/14/2005
Status: Abandoned Application


Abstract

An automatic donor selection algorithm estimates the subjective voice conversion output quality from a set of objective distance measures between the source and target speakers' acoustical features. The algorithm learns the relationship between the subjective scores and the objective distance measures through nonlinear regression with a multilayer perceptron (MLP). Once the MLP is trained, the algorithm can be used to select or rank a set of source speakers in terms of the expected output quality for transformations to a specific target voice.
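The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes a predictor is already available; `toy_predictor`, the donor names, and the distance values are hypothetical stand-ins, not taken from the application, and a trained MLP would take the predictor's place.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_donors(
    distances_by_donor: Dict[str, Sequence[float]],
    predict_quality: Callable[[Sequence[float]], float],
) -> List[Tuple[str, float]]:
    """Return (donor, predicted score) pairs, best predicted quality first."""
    scored = [(donor, predict_quality(d)) for donor, d in distances_by_donor.items()]
    # Sort by descending predicted score; ties are broken by donor name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical stand-in for a trained MLP: a smaller total distance
# to the target voice yields a higher predicted score.
def toy_predictor(distances: Sequence[float]) -> float:
    return 5.0 - sum(distances)


# Hypothetical objective distance measures for three candidate donors.
candidates = {
    "donor_a": [0.5, 1.0],
    "donor_b": [0.25, 0.5],
    "donor_c": [1.0, 2.0],
}
ranking = rank_donors(candidates, toy_predictor)
best_donor = ranking[0][0]
```

Selecting a donor then amounts to taking the first entry of the ranking; the full list gives the ordering by expected output quality.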

38 Citations


22 Claims

1. A donor ranking system comprising:
- an acoustical feature extractor which extracts one or more acoustical features from a donor speech sample and a target speaker speech sample; and
  
  an adaptive system which generates a prediction for a voice conversion quality value based on the acoustical features.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The system of claim 1, wherein the adaptive system is trained on a set of training data comprising a donor speech sample, a target speaker speech sample, and an actual voice conversion quality value.
  - 3. The system of claim 1, wherein the voice conversion quality value comprises a subjective ranking of the similarity of a transformed speech sample derived from the donor speech sample and the target speaker speech sample.
  - 4. The system of claim 1, wherein the voice conversion quality value comprises a MOS quality value.
  - 5. The system of claim 1, wherein the one or more acoustical features are selected from a group consisting of LSF distance, the rank-sum of a duration distribution, the rank-sum of a pitch distribution, the rank-sum of an energy distribution comprising a plurality of frame-by-frame energy values, the rank-sum of a distribution of spectral tilt values, the rank-sum of a distribution of per-period open quotient values of an EGG signal period, the rank-sum of a distribution of period-to-period jitter value, the rank-sum of a distribution of period-to-period shimmer value, the rank-sum of a distribution of soft phonation indices, the rank-sum of a distribution of frame-by-frame amplitude differences between first and second harmonics, the rank-sum of a distribution of a period-by-period EGG shape value, and a combination thereof.
  - 6. The system of claim 5, wherein the duration distribution comprises a duration feature from a group consisting of phoneme duration, word duration, utterance duration, and inter-word silence duration.
  - 7. The system of claim 5, wherein the EGG shape value for a period is a slope of a least-squares fitted line from a group consisting of the segment between a glottal closure instant to a maximum value of the period, the segment of the EGG signal when the vocal folds are open, and the segment when the vocal folds are closing.
  - 8. A donor selection system comprising the donor ranking system of claim 1, wherein a plurality of speech samples from a plurality of donors is paired with the target speech sample and a donor is selected from the plurality of donors based on the prediction for each of the plurality of speech samples.
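Several of the features in claim 5 are rank-sums of distributions. The sketch below assumes this refers to a Wilcoxon-style rank-sum statistic computed between the donor's and the target speaker's values of one feature, which the claims do not state explicitly. The function names and the pitch values are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum(donor_values: Sequence[float], target_values: Sequence[float]) -> float:
    """Sum of the donor sample's ranks within the pooled donor+target sample."""
    pooled = list(donor_values) + list(target_values)
    ranks = average_ranks(pooled)
    return sum(ranks[: len(donor_values)])


# Hypothetical frame-level pitch values in Hz.
donor_pitch = [110.0, 120.0, 130.0]
target_pitch = [125.0, 210.0, 220.0, 230.0]
w = rank_sum(donor_pitch, target_pitch)
```

A rank-sum near its minimum or maximum indicates that the two distributions barely overlap, so the statistic can serve as one objective distance measure between a donor and a target speaker.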

9. A method for ranking donors comprising:
- extracting one or more acoustical features from a donor speech sample and a target speaker speech sample; and
  
  predicting a voice conversion quality value based on the acoustical features using a trained adaptive system.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The method of claim 9, wherein the adaptive system is trained on a set of training data comprising a donor speech sample, a target speaker speech sample, and an actual voice conversion quality value.
  - 11. The method of claim 9, wherein the voice conversion quality value comprises a subjective ranking of the similarity of a transformed speech sample derived from the donor speech sample and the target speaker speech sample.
  - 12. The method of claim 9, wherein the voice conversion quality value comprises a MOS quality value.
  - 13. The method of claim 9, wherein the one or more acoustical features are selected from a group consisting of LSF distance, the rank-sum of a duration distribution, the rank-sum of a pitch distribution, the rank-sum of an energy distribution comprising a plurality of frame-by-frame energy values, the rank-sum of a distribution of spectral tilt values, the rank-sum of a distribution of per-period open quotient values of an EGG signal period, the rank-sum of a distribution of period-to-period jitter value, the rank-sum of a distribution of period-to-period shimmer value, the rank-sum of a distribution of soft phonation indices, the rank-sum of a distribution of frame-by-frame amplitude differences between first and second harmonics, the rank-sum of a distribution of a period-by-period EGG shape value, and a combination thereof.
  - 14. The method of claim 13, wherein the duration distribution comprises a duration feature from a group consisting of phoneme duration, word duration, utterance duration, and inter-word silence duration.
  - 15. The method of claim 13, wherein the EGG shape value for a period is a slope of a least-squares fitted line from a group consisting of the segment between a glottal closure instant to a maximum value of the period, the segment of the EGG signal when the vocal folds are open, and the segment when the vocal folds are closing.
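The EGG shape value in claims 7 and 15 is the slope of a least-squares fitted line over a segment of one EGG period. The sketch below covers only the first listed segment, from the glottal closure instant to the period maximum. It assumes evenly spaced samples, an already known closure index, and a signal polarity in which the maximum follows closure; none of these details are given in the claims, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def least_squares_slope(samples: Sequence[float], sample_rate_hz: float = 1.0) -> float:
    """Slope of the least-squares line through evenly spaced samples.

    The slope is per unit time (per sample when sample_rate_hz is 1).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    times = [i / sample_rate_hz for i in range(n)]
    t_mean = sum(times) / n
    y_mean = sum(samples) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, samples))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


def egg_shape_value(period: Sequence[float], closure_index: int) -> float:
    """Slope from the glottal closure instant up to the period's maximum."""
    peak_index = max(range(closure_index, len(period)), key=lambda i: period[i])
    return least_squares_slope(period[closure_index : peak_index + 1])


# Hypothetical samples of one EGG period; closure is assumed at index 1.
period = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
shape = egg_shape_value(period, closure_index=1)
```

Computing one such slope per period gives the period-by-period distribution whose rank-sum appears in the feature list.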

16. A method for training a donor ranking system comprising:
- selecting a donor and a target speaker, having vocal characteristics, from a training database of speech samples;
  
  deriving an actual subjective quality value;
  
  extracting one or more acoustical features from a donor voice speech sample and a target speaker voice speech sample;
  
  supplying the one or more acoustical features to an adaptive system;
  
  predicting a predicted subjective quality value using the adaptive system;
  
  calculating an error value between the predicted subjective quality value and the actual subjective quality value; and
  
  adjusting the adaptive system based on the error value.
- Dependent claims (17, 18, 19, 20, 21, 22):
  - 17. The method of claim 16, wherein the deriving an actual subjective quality value comprises:
    - converting the donor voice speech sample to a converted voice speech sample having the vocal characteristics of the target speaker;
      
      providing the converted voice speech sample and the target speaker voice speech sample to a subjective listener; and
      
      receiving the actual subjective quality value from the subjective listener.
  - 18. The method of claim 17, wherein the subjective listener comprises a plurality of constituent listeners and the actual subjective quality value is a statistical combination of constituent quality values received from each of the constituent listeners.
  - 19. The method of claim 18, wherein the statistical combination is an average.
  - 20. The method of claim 17, wherein the one or more acoustical features are selected from a group consisting of LSF distance, the rank-sum of a duration distribution, the rank-sum of a pitch distribution, the rank-sum of an energy distribution comprising a plurality of frame-by-frame energy values, the rank-sum of a distribution of spectral tilt values, the rank-sum of a distribution of per-period open quotient values of an EGG signal period, the rank-sum of a distribution of period-to-period jitter value, the rank-sum of a distribution of period-to-period shimmer value, the rank-sum of a distribution of soft phonation indices, the rank-sum of a distribution of frame-by-frame amplitude differences between first and second harmonics, the rank-sum of a distribution of a period-by-period EGG shape value, and a combination thereof.
  - 21. The method of claim 20, wherein the duration distribution comprises a duration feature from a group consisting of phoneme duration, word duration, utterance duration, and inter-word silence duration.
  - 22. The method of claim 20, wherein the EGG shape value for a period is a slope of a least-squares fitted line from a group consisting of the segment between a glottal closure instant to a maximum value of the period, the segment of the EGG signal when the vocal folds are open, and the segment when the vocal folds are closing.
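The training steps of claim 16 (predict, compute an error, adjust) together with the listener averaging of claims 18 and 19 can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: the network size, learning rate, squared-error objective, and all feature values and listener scores are hypothetical.

```python
import math
import random
from statistics import mean
from typing import List, Sequence, Tuple


class TinyMLP:
    """One hidden tanh layer and a linear output, trained by gradient descent."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x: Sequence[float]) -> List[float]:
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x: Sequence[float]) -> float:
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def adjust(self, x: Sequence[float], actual: float, lr: float = 0.02) -> float:
        """One update step; returns the error (predicted minus actual) before it."""
        h = self._hidden(x)
        predicted = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        error = predicted - actual
        for j, hj in enumerate(h):
            # Gradient through tanh, computed with w2[j] before it is updated.
            grad_hidden = error * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * error * hj
            self.b1[j] -= lr * grad_hidden
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_hidden * xi
        self.b2 -= lr * error
        return error


def mean_squared_error(model: TinyMLP, data: Sequence[Tuple[Sequence[float], float]]) -> float:
    return mean((model.predict(x) - y) ** 2 for x, y in data)


# Hypothetical data: (acoustical distance features, scores from several listeners).
raw = [
    ([0.1, 0.2], [4.5, 4.0, 4.5]),
    ([0.4, 0.5], [3.5, 3.0, 3.5]),
    ([0.7, 0.6], [2.5, 3.0, 2.0]),
    ([0.9, 1.0], [1.5, 2.0, 1.0]),
]
# Combine the constituent listener scores by averaging.
training = [(x, mean(scores)) for x, scores in raw]

model = TinyMLP(n_inputs=2, n_hidden=4)
mse_before = mean_squared_error(model, training)
for _ in range(2000):
    for x, y in training:
        model.adjust(x, y)
mse_after = mean_squared_error(model, training)
```

Comparing `mse_before` with `mse_after` shows whether the repeated adjustments reduced the gap between predicted and actual subjective quality values on the training pairs.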

Current Assignee
Voxonic Incorporated
Original Assignee
Voxonic Incorporated
Inventors
Deutsch, Fred; Turk, Oytun; Arslan, Levent

Application Number

US11/376,377
Publication Number

US 20070027687A1
Field of Search
US Class Current

704/246
CPC Class Codes

G10L 2021/0135   Voice conversion or morphing

G10L 21/00   Speech or voice signal proc...

G10L 25/69   for evaluating synthetic or...
