Voice font speaker and prosody interpolation

US 9,472,182 B2
Filed: 02/26/2014
Issued: 10/18/2016
Est. Priority Date: 02/26/2014
Status: Active Grant

Abstract

Multi-voice font interpolation is provided. A multi-voice font interpolation engine allows the production of computer-generated speech with a wide variety of speaker characteristics and/or prosody by interpolating speaker characteristics and prosody from existing fonts. Using prediction models from multiple voice fonts, the multi-voice font interpolation engine predicts values for the parameters that influence speaker characteristics and/or prosody for the phoneme sequence obtained from the text to be spoken. For each parameter, additional parameter values are generated by a weighted interpolation from the predicted values. Modifying an existing voice font with the interpolated parameters changes the style and/or emotion of the speech while retaining the base sound qualities of the original voice. The multi-voice font interpolation engine allows the speaker characteristics and/or prosody to be transplanted from one voice font to another, or entirely new speaker characteristics and/or prosody to be generated for an existing voice font.
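
The weighted interpolation described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function name `interpolate`, the font names, and the duration values are hypothetical, and each font is assumed to have already predicted one value per phoneme for the same phoneme sequence.

```python
from typing import Dict, List


def interpolate(predicted: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted sum, across source fonts, of one parameter over a sequence.

    `predicted` maps a (hypothetical) font name to its predicted values;
    `weights` maps the same font names to that font's weight.
    """
    fonts = list(predicted)
    length = len(predicted[fonts[0]])
    if any(len(predicted[f]) != length for f in fonts):
        raise ValueError("all fonts must predict the same number of values")
    return [sum(weights[f] * predicted[f][i] for f in fonts)
            for i in range(length)]


# Hypothetical per-phoneme durations (ms) predicted by two source fonts.
predicted_durations = {
    "neutral": [80.0, 120.0, 100.0],
    "excited": [60.0, 100.0, 80.0],
}
duration_weights = {"neutral": 0.25, "excited": 0.75}

blended = interpolate(predicted_durations, duration_weights)
print(blended)  # [65.0, 105.0, 85.0]
```

With a weight of 0.75 on the second font, each blended duration sits three quarters of the way from the first font's prediction to the second's.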

20 Claims

1. A method allowing computer-generated speech to be rendered with a multi-voice font that is different than source voice fonts used to generate the multi-voice font, the method comprising the acts of:
- loading the source voice fonts;
  
  assigning weights to characteristics of each source voice font;
  
  obtaining text to be rendered as the computer-generated speech;
  
  predicting characteristic values for the text for each source voice font using at least one characteristic prediction model associated with each source voice font;
  
  merging the predicted characteristic values with the corresponding weights to produce interpolated characteristic values; and
  
  rendering the text as computer-generated speech having the interpolated characteristic values.
  - 2. The method of claim 1 wherein the act of merging the predicted characteristic values with the corresponding weights further comprises the acts of:
    - multiplying the predicted characteristic values by the weight for the characteristic given to the source voice font used to predict the predicted characteristic values; and
      
      summing the weighted characteristic values to produce the interpolated characteristic values.
  - 3. The method of claim 1 wherein the act of assigning weights to characteristics of each source voice font further comprises the acts of:
    - assigning a duration weight to each source voice font;
      
      assigning a f0 weight to each source voice font; and
      
      assigning a spectrum weight to each source voice font.
  - 4. The method of claim 3 wherein the act of predicting characteristic values for the text using each source voice font further comprises the act of predicting duration values, voiced/unvoiced probability values, f0 values, and spectral trajectory values for the text using each source voice font.
  - 5. The method of claim 3 wherein the act of obtaining text to be rendered as the computer-generated speech further comprises the act of parsing the text into a sequence of phonemes dividable into frames.
  - 6. The method of claim 5 wherein the act of predicting characteristic values for the text using each source voice font further comprises the acts of:
    - predicting a duration value and a voiced/unvoiced probability value for each phoneme using each source voice font; and
      
      predicting a f0 value and a spectral trajectory value for each frame of each phoneme using each source voice font.
  - 7. The method of claim 5 wherein the act of merging the predicted characteristic values with the corresponding weights further comprises the acts of:
    - interpolating a duration value for each phoneme using the duration weights;
      
      interpolating a voiced/unvoiced probability value for each phoneme using the spectrum weights;
      
      making a voiced/unvoiced decision for each phoneme based on the voiced/unvoiced probability value for that phoneme;
      
      interpolating a f0 value for each phoneme using the f0 weight;
      
      normalizing the f0 values; and
      
      interpolating a spectral trajectory value for each phoneme using the spectrum weights.
  - 8. The method of claim 7 wherein:
    - the act of interpolating a duration value for each phoneme using the duration weights further comprises the acts of:
      - multiplying the predicted duration values predicted using each source voice font by the corresponding duration weight to produce weighted duration values; and
      - summing the weighted duration values from each source voice font for each phoneme to produce an interpolated duration value for that phoneme;
    - the act of interpolating a voiced/unvoiced probability value for each phoneme using the spectrum weights further comprises the acts of:
      - multiplying the predicted voiced/unvoiced probability values predicted using each source voice font by the corresponding spectrum weight to produce weighted voiced/unvoiced probability values; and
      - summing the weighted voiced/unvoiced probability values from each source voice font for each phoneme to produce an interpolated voiced/unvoiced probability value for that phoneme;
    - the act of interpolating a f0 value for each phoneme using the f0 weights further comprises the acts of:
      - multiplying the predicted f0 values predicted using each source voice font by the corresponding f0 weight to produce weighted f0 values; and
      - summing the corresponding weighted f0 values from each source voice font for each frame to produce an interpolated f0 value for that frame; and
    - the act of interpolating a spectral trajectory value for each phoneme using the spectrum weights further comprises the acts of:
      - multiplying the predicted spectral trajectory values predicted using each source voice font by the corresponding spectrum weight to produce weighted spectral trajectory values; and
      - summing the corresponding weighted spectral trajectory values from each source voice font for each frame to produce an interpolated spectral trajectory value for that frame.
  - 9. The method of claim 1 further comprising the act of synthesizing the text as computer-generated speech using the interpolated characteristic values.
  - 10. The method of claim 1 further comprising the act of storing a multi-voice font definition specifying the source voice fonts used to generate the multi-voice font and linking each source voice font with the characteristic weights assigned to that source voice font.
  - 18. The method of claim 1 wherein the act of assigning weights to characteristics of each source voice font further comprises the act of proportioning the weights for each characteristic such that the sum of the weights for each characteristic is substantially equal to one.
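
Claims 3 through 8 and 18 can be read together as a small per-phoneme procedure: proportion the weights so each characteristic sums to one, then blend duration with the duration weights, the voiced/unvoiced probability with the spectrum weights, and f0 with the f0 weights. The sketch below follows that reading. The class names, the 0.5 threshold, and the assumption that every font predicts the same number of frames for a phoneme are illustrative choices that the claims do not specify. Spectral trajectory vectors would follow the same per-frame pattern as f0, using the spectrum weights instead.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Weights:
    """Hypothetical per-font weights, one per characteristic (claim 3)."""
    duration: float
    f0: float
    spectrum: float


@dataclass
class PhonemePrediction:
    """Hypothetical predictions from one source font for one phoneme."""
    duration: float     # one value per phoneme
    voiced_prob: float  # one value per phoneme
    f0: List[float]     # one value per frame (frame counts assumed equal)


def proportion(weights: Sequence[Weights]) -> List[Weights]:
    """Scale each characteristic's weights so they sum to one (claim 18)."""
    d = sum(w.duration for w in weights)
    f = sum(w.f0 for w in weights)
    s = sum(w.spectrum for w in weights)
    if 0.0 in (d, f, s):
        raise ValueError("each characteristic needs a non-zero total weight")
    return [Weights(w.duration / d, w.f0 / f, w.spectrum / s) for w in weights]


def blend(preds: Sequence[PhonemePrediction],
          weights: Sequence[Weights],
          threshold: float = 0.5) -> Tuple[float, bool, List[float]]:
    """Multiply each prediction by its font's weight, then sum across fonts."""
    pairs = list(zip(preds, weights))
    duration = sum(w.duration * p.duration for p, w in pairs)
    # The voiced/unvoiced probability uses the spectrum weights (claim 7).
    voiced_prob = sum(w.spectrum * p.voiced_prob for p, w in pairs)
    voiced = voiced_prob >= threshold  # assumed decision rule
    frames = len(preds[0].f0)
    f0 = [sum(w.f0 * p.f0[i] for p, w in pairs) for i in range(frames)]
    return duration, voiced, f0


raw = [Weights(duration=1.0, f0=3.0, spectrum=1.0),
       Weights(duration=1.0, f0=1.0, spectrum=3.0)]
weights = proportion(raw)
preds = [PhonemePrediction(100.0, 0.25, [200.0, 220.0]),
         PhonemePrediction(80.0, 0.75, [100.0, 120.0])]
duration, voiced, f0 = blend(preds, weights)
print(duration, voiced, f0)  # 90.0 True [175.0, 195.0]
```

Keeping the three weights separate is what lets one font dominate pitch (f0 weight 0.75 here) while another dominates the voiced/unvoiced decision and, by extension, the spectrum.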

11. A system for generating a multi-voice font from a plurality of source voice fonts, the system comprising:
- a phoneme sequencer for parsing input text into a sequence of phonemes;
  
  a predictor operable to predict values of voice font characteristics for the phonemes for each source voice font of the plurality of source voice fonts using at least one characteristic model associated with each source voice font;
  
  a weight selector operable to assign a duration weight, a f0 weight, and a spectrum weight to each source voice font, the duration weight, the f0 weight, and the spectrum weight determining the relative contribution of the voice font characteristics predicted for the corresponding source voice font to the multi-voice font;
  
  an interpolator operable to merge the predicted voice font characteristics with the weights to produce the multi-voice font having voice font characteristics derived from the source voice fonts; and
  
  a voice encoder operable to render the input text as computer-generated speech using the multi-voice font, the computer-generated speech having the voice font characteristics derived from the source voice fonts.
  - 12. The system of claim 11 wherein the predictor further comprises:
    - a duration predictor operable to predict duration values for each phoneme using a duration prediction model provided by each source voice font;
      
      a f0 predictor operable to predict f0 values for each phoneme using a f0 prediction model provided by each source voice font;
      
      a spectral trajectory predictor operable to predict spectral trajectory values for each phoneme using a spectrum prediction model provided by each source voice font; and
      
      a voiced/unvoiced probability predictor operable to predict voiced/unvoiced probability values for each phoneme using a voiced/unvoiced decision model provided by each source voice font.
  - 13. The system of claim 11 wherein the interpolator further comprises:
    - a duration interpolator operable to merge the predicted duration values for each phoneme with the duration weight for the predicting source voice font to produce an interpolated duration value for each phoneme;
      
      a f0 interpolator operable to merge the predicted f0 values for each phoneme with the f0 weight for the predicting source voice font to produce an interpolated f0 value for each frame of the phoneme;
      
      a spectral trajectory interpolator operable to merge the predicted spectral trajectory values for each phoneme with the spectrum weight for the predicting source voice font to produce an interpolated spectrum trajectory value for each frame of the phoneme; and
      
      a voiced/unvoiced decision interpolator operable to merge the predicted voiced/unvoiced probability values for each phoneme with the spectrum weight for the predicting source voice font to produce an interpolated voiced/unvoiced probability value for each phoneme and compare the interpolated voiced/unvoiced probability value to a threshold to determine an interpolated voiced/unvoiced decision value for each phoneme.
  - 14. The system of claim 11 further comprising a normalizer operable to merge the spectrum weight with an estimated f0 upper limit and an estimated f0 lower limit for the predicted f0 range for each frame of each phoneme to produce a first f0 limit pair, merge the f0 weight with an estimated f0 upper limit and an estimated f0 lower limit for the interpolated f0 range for each frame of each phoneme to produce a second f0 limit pair, and normalize the interpolated f0 values using the first f0 limit pair and the second f0 limit pair.
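
Claim 14 names two f0 limit pairs but does not say how they are combined. One plausible reading, sketched below, treats the pair merged with the f0 weights as the range the interpolated f0 currently occupies and the pair merged with the spectrum weights as the range it should occupy, then rescales linearly between them. The linear mapping, the function names, and the per-font limits are all assumptions made for illustration. For brevity the limits are given once per font rather than per frame.

```python
from typing import List, Sequence, Tuple

Limits = Tuple[float, float]  # (lower, upper) f0 limit in Hz


def merge_limits(limits: Sequence[Limits], weights: Sequence[float]) -> Limits:
    """Weighted sum of each font's estimated f0 lower and upper limits."""
    lower = sum(w * lo for (lo, _), w in zip(limits, weights))
    upper = sum(w * hi for (_, hi), w in zip(limits, weights))
    return lower, upper


def normalize_f0(f0: Sequence[float], source: Limits, target: Limits) -> List[float]:
    """Linearly map f0 values from the source range onto the target range.

    The linear mapping is an assumption; the claim only says both limit
    pairs are used to normalize the interpolated f0 values.
    """
    src_lo, src_hi = source
    tgt_lo, tgt_hi = target
    if src_hi == src_lo:
        raise ValueError("source f0 range must not be empty")
    scale = (tgt_hi - tgt_lo) / (src_hi - src_lo)
    return [tgt_lo + (value - src_lo) * scale for value in f0]


# Hypothetical estimated f0 limits for two source fonts, A and B.
font_limits = [(100.0, 200.0), (200.0, 400.0)]
spectrum_weights = [1.0, 0.0]  # speaker identity taken from font A
f0_weights = [0.0, 1.0]        # pitch contour taken from font B

first_pair = merge_limits(font_limits, spectrum_weights)  # (100.0, 200.0)
second_pair = merge_limits(font_limits, f0_weights)       # (200.0, 400.0)

interpolated_f0 = [200.0, 300.0, 400.0]  # equals font B's contour here
normalized = normalize_f0(interpolated_f0, second_pair, first_pair)
print(normalized)  # [100.0, 150.0, 200.0]
```

Under this reading, the contour keeps font B's shape while landing in font A's pitch range, which matches the abstract's description of transplanting prosody from one voice font to another.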

15. A tangible computer storage medium containing computer executable instructions which, when executed by a computer, perform a method of generating a multi-voice font for rendering text as computer-generated speech, the method comprising the acts of:
- obtaining the text to be rendered as the computer-generated speech;
  
  loading the source voice fonts;
  
  predicting duration values, voiced/unvoiced probability values, f0 values, and spectral trajectory values for the text for each source voice font using at least one characteristic prediction model associated with each source voice font;
  
  assigning a duration weight, a f0 weight, and a spectrum weight to each source voice font;
  
  merging the duration values predicted with each source voice font with the duration weight to produce interpolated duration values, the duration weight for each source voice font representing the percentage that the source voice font contributes to the interpolated duration values;
  
  merging the f0 values predicted with each source voice font with the f0 weight given to that source voice font to produce interpolated f0 values, the f0 weight for each source voice font representing the percentage that the source voice font contributes to the interpolated f0 values;
  
  merging the voiced/unvoiced decision values and the spectral trajectory values predicted with each source voice font with the spectrum weight given to that source voice font to produce interpolated voiced/unvoiced probability values and interpolated spectral trajectory values, the spectrum weight for each source voice font representing the percentage that the source voice font contributes to the interpolated voiced/unvoiced probability values and interpolated spectral trajectory values; and
  
  rendering the text as computer-generated speech having the interpolated duration values, interpolated f0 values, interpolated voiced/unvoiced probability values, and interpolated spectral trajectory values.
  - 16. The tangible computer storage medium of claim 15 wherein the method performed by the computer executable instructions further comprises the act of using the interpolated duration values, the interpolated voiced/unvoiced decision values, the interpolated f0 values, and the interpolated spectral trajectory values to render the text as computer-generated speech.
  - 17. The tangible computer storage medium of claim 15 wherein the method performed by the computer executable instructions further comprises the act of parsing the text into a sequence of phonemes with each phoneme dividable into frames and:
    - the act of merging the duration values predicted with each source voice font with the duration weight given to that source voice font to produce interpolated duration values further comprises the act of, for each phoneme, summing the products of the duration value predicted by each source voice font and the corresponding duration weight;
    - the act of merging the f0 values predicted with each source voice font with the f0 weight given to that source voice font to produce interpolated f0 values further comprises the acts of:
      - for each frame, summing the products of the f0 values predicted by each source voice font and the corresponding f0 weight; and
      - normalizing the interpolated f0 values; and
    - the act of merging the voiced/unvoiced probability values and the spectral trajectory values predicted with each source voice font with the spectrum weight given to that source voice font to produce interpolated voiced/unvoiced probability values and interpolated spectral trajectory values further comprises the acts of:
      - for each phoneme, summing the products of the voiced/unvoiced probability values predicted by each source voice font and the corresponding spectrum weight;
      - determining whether each phoneme is voiced using the interpolated voiced/unvoiced probability value for each phoneme; and
      - for each frame, summing the products of the spectral trajectory values predicted by each source voice font and the corresponding spectrum weight.
  - 19. The tangible computer storage medium of claim 15 wherein the method performed by the computer executable instructions further comprises the act of synthesizing the text as computer-generated speech using the interpolated characteristic values.
  - 20. The tangible computer storage medium of claim 15 wherein the method performed by the computer executable instructions further comprises the act of storing a multi-voice font definition specifying the source voice fonts used to generate the multi-voice font and linking each source voice font with the characteristic weights assigned to that source voice font.
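
Claims 10 and 20 describe storing a multi-voice font definition that names the source voice fonts and links each one to its weights. The claims do not prescribe a storage format, so the sketch below simply assumes a JSON document. The class names, field names, and font identifiers are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SourceFontEntry:
    """Links one source voice font to the weights assigned to it."""
    font_id: str
    duration_weight: float
    f0_weight: float
    spectrum_weight: float


@dataclass
class MultiVoiceFontDefinition:
    """Hypothetical stored definition of a multi-voice font."""
    name: str
    sources: List[SourceFontEntry]

    def to_json(self) -> str:
        # asdict() converts the nested dataclasses to plain dicts and lists.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "MultiVoiceFontDefinition":
        data = json.loads(text)
        sources = [SourceFontEntry(**entry) for entry in data["sources"]]
        return cls(name=data["name"], sources=sources)


definition = MultiVoiceFontDefinition(
    name="calm_narrator",
    sources=[
        SourceFontEntry("font_a", duration_weight=0.5, f0_weight=0.25, spectrum_weight=1.0),
        SourceFontEntry("font_b", duration_weight=0.5, f0_weight=0.75, spectrum_weight=0.0),
    ],
)

stored = definition.to_json()
restored = MultiVoiceFontDefinition.from_json(stored)
```

Because the definition stores only references and weights, the multi-voice font can be regenerated from the source fonts at load time rather than saved as a separate set of models.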

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Luan, Jian; He, Lei; Leung, Max
Primary Examiner(s)
Pham, Thierry L

Application Number
US14/190,875
Publication Number
US 20150243275A1
Time in Patent Office
965 Days
Field of Search
704/260, 704/235, 704/208, 704/214
US Class Current
1/1
CPC Class Codes
G06F 3/0482   Interaction with lists of s...
G06F 3/04847   Interaction techniques to c...
G10L 13/02   Methods for producing synth...
G10L 13/033   Voice editing, e.g. manipul...
G10L 13/0335   Pitch control
G10L 13/08   Text analysis or generation...
