Prosody generation for text-to-speech synthesis based on micro-prosodic data

US 20060074678A1
Filed: 09/29/2004
Published: 04/06/2006
Est. Priority Date: 09/29/2004
Status: Abandoned Application


Abstract

A prosody modification system for use in text-to-speech includes an input receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform. A prosody data warping module directly derives new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform. The smoothness and simplicity of the function ensure that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn. The errors are thus reversed during re-synthesis and therefore eliminated, resulting in micro-prosodic perturbations being preserved during re-synthesis.
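
The preservation property described above can be illustrated with a minimal sketch. It assumes the log-domain quadratic warp recited in claim 32 as one possible warping function. The period contour, the jitter pattern, and the parameter values are hypothetical and chosen only for illustration. Because the warp is a smooth function of time, the pulse-to-pulse jitter in the input periods reappears unchanged (as a ratio) in the output.

```python
import math

def warp_periods(periods, times, a0, a1, a2):
    """Log-domain quadratic warp: Q = exp(log(P) + a2*t*t + a1*t + a0)."""
    return [math.exp(math.log(p) + a2 * t * t + a1 * t + a0)
            for p, t in zip(periods, times)]

# Hypothetical data: a slowly rising period contour near 8 ms,
# multiplied by a +/-1% alternating jitter standing in for micro-prosody.
n = 20
times = [0.008 * i for i in range(n)]
smooth = [0.008 + 0.01 * t for t in times]
jitter = [1.0 + 0.01 * (-1) ** i for i in range(n)]
periods = [s * j for s, j in zip(smooth, jitter)]

# Warp both the jittered contour and its smooth component with the same parameters.
warped = warp_periods(periods, times, a0=-0.2, a1=0.5, a2=0.0)
warped_smooth = warp_periods(smooth, times, a0=-0.2, a1=0.5, a2=0.0)

# The ratio recovers the original jitter: the warp moved the contour
# but passed the perturbation straight through.
residual = [w / ws for w, ws in zip(warped, warped_smooth)]
```

In this sketch the intentional contour is shortened (pitch raised) while `residual` matches `jitter` up to floating-point error.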


50 Claims

1. A prosody modification system for use in text-to-speech, comprising:
- an input receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform; and
  
  a prosody data warping module directly deriving new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform, thereby ensuring that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn, causing the errors to be reversed during re-synthesis and therefore eliminated, and resulting in micro-prosodic perturbations being preserved during re-synthesis.
  - 2. The system of claim 1, wherein said data warping module uses a function that incorporates a polynomial of time Tn or incorporates a polynomial in n.
  - 3. The system of claim 2, wherein said data warping module warps a pitch curve of one sound unit (represented as a sequence of pulse periods {Pn}) into another pitch curve (represented by a corresponding sequence of new pulse periods {Qn}) by adjusting coefficients of the polynomial, said coefficients being the pitch warping parameters, while retaining inherent micro-prosodic information.
  - 4. The system of claim 1, wherein said prosodic data vectors include, as a component, a sequence of periods between adjacent pulses in the sound waveform according to:
    - Pn = T(n) − T(n−1), where T(n) is time at an nth pulse, and Qn is a corresponding new period derived by applying a pitch warping function.
  - 5. The system of claim 1, wherein said prosodic data vectors include, as a component, a sequence of amplitudes measured in the sound waveform, where Pn is amplitude at time Tn, and Qn is a new amplitude for the time Tn that is derived by applying an amplitude warping function.
  - 6. The system of claim 1, wherein said prosodic data vectors include, as a component, a sequence of speech-rate values measured from the sound waveform, and corresponding output includes new speech rate values derived by applying a speech-rate warping function.
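
The period representation of claim 4 and the polynomial warp of claim 2 can be sketched as follows. This is an illustrative reading, not the applicants' implementation: the function names, the additive form of the warp, and the pulse times are assumptions. The sketch also shows why measurement errors in the pulse times cancel on re-synthesis: periods are differences of pulse times, so summing them again rebuilds the original pulse positions.

```python
def periods_from_pulse_times(pulse_times):
    """P_n = T(n) - T(n-1) for each pair of adjacent pulses."""
    return [t1 - t0 for t0, t1 in zip(pulse_times, pulse_times[1:])]

def warp_additive(periods, times, coeffs):
    """Q_n = P_n + sum_k A_k * T_n**k, a polynomial in time with coefficients A_k."""
    return [p + sum(a * t ** k for k, a in enumerate(coeffs))
            for p, t in zip(periods, times)]

def pulse_times_from_periods(start, periods):
    """Re-synthesis step: accumulate periods back into pulse times."""
    times = [start]
    for q in periods:
        times.append(times[-1] + q)
    return times

# Hypothetical pulse times in seconds, with small irregularities.
pulse_times = [0.0, 0.0101, 0.0199, 0.0302, 0.0398, 0.0501]
periods = periods_from_pulse_times(pulse_times)
times = pulse_times[1:]  # T_n: time of the pulse that closes each period

# A zero warp leaves the periods untouched, so re-synthesis
# reproduces the measured pulse times, irregularities included.
identity = warp_additive(periods, times, [0.0])
rebuilt = pulse_times_from_periods(pulse_times[0], identity)

# A constant warp (A0 = 1 ms) lengthens every period by the same amount,
# so the differences between neighbouring periods are unchanged.
shifted = warp_additive(periods, times, [0.001])
```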

7. A prosody generation system for use in text-to-speech synthesis, comprising:
- an input receiving a sequence of original sound units {Uj}, which when concatenated yield a desired synthetic phrase or sentence;
  
  a prosody data warping module, which directly derives new prosodic data vectors {Qjn} from original prosodic data vectors {Pjn} sampled from an original sound unit Uj, and thus modifies perceived prosody of the sound unit;
  
  a controlling module, which determines an amount of prosodic modification for sound units in the input sequence, and presents this information as warping parameters per sound unit, along with prosodic data of the sound units, to the prosody data warping module; and
  
  a prosody concatenation module, which concatenates prosodic data of the prosodically modified sound units with adjacent sound units, performs a smoothing of prosodic attributes between adjacent sound units, and outputs a single and final sequence of prosodic data vectors, which are synchronized with the entire phrase or sentence.
  - 8. The system of claim 7, wherein said controlling module adjusts the warping parameters for each sound unit by minimizing a cost function, which is in part, a function of the warping parameters, and whose design is based on desired results pertaining to output speech sound.
  - 9. The system of claim 8, wherein said controlling module achieves minimization of the cost function by iteratively searching through a space of the warping parameters to find an optimal solution.
  - 10. The system of claim 9, wherein said controlling module observes different freedom of movement criteria for sound units, wherein the freedom of movement criteria govern how rapidly sound units can move in prosodic space during iterative search, and wherein motion in searching the warping parameter space corresponds to simultaneous motion of all modified sound units in prosodic space.
  - 11. The system of claim 10, wherein said controlling module causes relatively longer sound units to move less rapidly in prosodic space than relatively shorter sound units.
  - 12. The system of claim 10, wherein said controlling module causes a sound unit from a relatively stressed word to move less rapidly in prosodic space than sound units from relatively unstressed words.
  - 13. The system of claim 10, wherein said controlling module causes a sound unit from a word of relatively more importance in sentence function to move less rapidly in prosodic space than a sound unit from a word of relatively less importance in sentence function.
  - 14. The system of claim 10, wherein said controlling module causes a sound unit from a final syllable of a sentence to move less rapidly in prosodic space than a sound unit from a non-final syllable of the sentence.
  - 15. The system of claim 10, wherein said controlling module causes a sound unit from a final syllable of a clause to move less rapidly in prosodic space than a sound unit from a non-final syllable of the clause.
  - 16. The system of claim 8, wherein said controlling module iteratively searches through the space of the warping parameters by iteratively searching over a sentence, including starting sound units of the sentence at chosen positions in prosodic space, and adjusting warping parameters of the sound units iteratively over the sentence to yield a global minimum in cost function, and hence a minimum of prosodic discontinuity for the sentence.
  - 17. The system of claim 16, wherein said controlling module starts a sound unit at its original position in prosodic space, thus minimizing overall motion in prosodic space while still yielding a desired level of prosodic continuity for the sentence.
  - 18. The system of claim 16, wherein said controlling module starts each sound unit at a rule-based prosody target.
  - 19. The system of claim 16, wherein said controlling module initially positions the sound units according to larger prosody units selected from a prosody corpus.
  - 20. The system of claim 8, wherein said controlling module achieves minimization of the cost function by analytically solving a system of linear equations.
  - 21. The system of claim 8, wherein said controlling module computes a component part of the cost function by measuring an absolute difference in prosodic data values occurring in cross-fade regions of adjacent sound units, and thus computes prosody warping parameters which improve prosodic continuity between adjacent sound units.
  - 22. The system of claim 8, wherein said controlling module computes a component part of the cost function by measuring a difference in prosodic data values between an original prosodic value of a sound unit and a warped prosodic value of the sound unit, and thus computes prosody warping parameters which minimize the overall amount of distortion caused by prosodic modification of sound units.
  - 23. The system of claim 8, wherein said input is further receptive of a target prosodic function of time, which is derived independently of the sound unit data, and said controlling module computes a component part of the cost function by measuring an absolute difference in prosodic data values between an inherent prosodic value of a sound unit and the target prosodic function, and thus by minimizing the cost function, computes prosody warping parameters which yield an output prosody approximating the target prosody function.
  - 24. The system of claim 7, wherein said prosody concatenation module determines what period to use for pulses in an overlapping region occurring between two overlapping sound units to be concatenated.
  - 25. The system of claim 24, wherein said prosody concatenation module calculates a cross-fade of periods for two overlapping sound units that is synchronous with a waveform cross-fade between glottal pulses of the two overlapping sound units.
  - 26. The system of claim 24, wherein said prosody concatenation module calculates a cross-faded period P according to:
    - P = (1 − F)*P1 + F*P2 for two adjacent sound units respectively having original period P1 and original period P2, wherein a cross-fade factor F is going from 0 to 1.
  - 27. The system of claim 24, wherein said prosody concatenation module calculates a cross-faded period P according to:
    - P = exp((1 − F)*log(P1) + F*log(P2)) for two adjacent sound units respectively having original period P1 and original period P2 if a log domain pitch representation is desired.
  - 28. The system of claim 7, wherein said input is further receptive of a target prosodic function of time, which is derived independently of the sound unit data, and said controlling module uses the target prosodic function of time in its determination of warping parameters for each sound unit.
  - 29. The system of claim 7, wherein said controlling module adjusts the warping parameters for each sound unit according to rules, which respond to features derived from input text to a TTS system.
  - 30. The system of claim 7, wherein said input receives a sequence of diphones from a diphone database.
  - 31. The system of claim 7, wherein said prosody data warping module employs segment boundaries of sound units as time origins for computing time Tn for the sound units.
  - 32. The system of claim 7, wherein said prosody data warping module derives a new period sequence Qjn for each sound unit Uj according to:
    - Qjn=exp(log(Pjn)+Aj2*Tjn*Tjn+Aj1*Tjn+Aj0), where Aj0, Aj1, and Aj2 are warping parameters that are determined for sound unit Uj, Pjn is an original period sequence for sound unit Uj, and Tjn is a time at which an nth pulse of Uj is placed respective of a time origin for Uj.
  - 33. The system of claim 7, wherein said prosody data warping module derives a new period sequence Qjn for each sound unit Uj according to:
    - Qjn=Pjn+Aj2*Tjn*Tjn+Aj1*Tjn+Aj0, where Aj0, Aj1, and Aj2 are warping parameters that are determined for sound unit Uj, Pjn is the original period sequence for sound unit Uj, and Tjn is a time at which an nth pulse of Uj is placed respective of a time origin for Uj.
  - 34. The system of claim 7, wherein said prosodic data warping module derives Qn according to:
    - Qn=F(n,T0,T1, . . . Tm,P1,P2, . . . Pm,A0,A1, . . . Ak) where F is a family of functions determined by the “warping parameters” A0, . . . Ak.
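
The two cross-fade formulas of claims 26 and 27 can be sketched as below. The function names, the even spacing of the cross-fade factor F, and the use of a single fixed period per unit are assumptions made for illustration. In practice P1 and P2 may change from pulse to pulse across the overlap.

```python
import math

def crossfade_linear(p1, p2, f):
    """Claim 26 form: P = (1 - F)*P1 + F*P2."""
    return (1.0 - f) * p1 + f * p2

def crossfade_log(p1, p2, f):
    """Claim 27 form: P = exp((1 - F)*log(P1) + F*log(P2))."""
    return math.exp((1.0 - f) * math.log(p1) + f * math.log(p2))

def crossfade_periods(p1, p2, steps, log_domain=False):
    """Periods across an overlap region, with F running from 0 to 1.

    Requires steps >= 2 so that both endpoints are included.
    """
    fade = crossfade_log if log_domain else crossfade_linear
    return [fade(p1, p2, i / (steps - 1)) for i in range(steps)]

# Hypothetical overlap: unit 1 ends at an 8 ms period, unit 2 starts at 10 ms.
linear_fade = crossfade_periods(0.008, 0.010, steps=5)
log_fade = crossfade_periods(0.008, 0.010, steps=5, log_domain=True)
```

At F = 0.5 the linear form gives the arithmetic mean of the two periods, while the log-domain form gives their geometric mean.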

35. A prosody modification method for use in text-to-speech, comprising:
- receiving a sequence of prosodic data vectors Pn, measured at time Tn, which samples a sound waveform; and
  
  directly deriving new prosodic data vectors Qn from the original data vectors Pn using a function, which is controlled by warping parameters A0, . . . Ak, which avoids round-off errors in deriving quantized values, which has derivatives with respect to A0, . . . Ak, Pn, and Tn that are continuous, and which has sufficiently high complexity to model intentional prosody of the sound waveform, and sufficiently low complexity to avoid modeling micro-prosody of the sound waveform, thereby ensuring that micro-prosodic perturbations and errors in measurement of Tn are transferred directly to the output Qn, causing the errors to be reversed during re-synthesis and therefore eliminated, and resulting in micro-prosodic perturbations being preserved during re-synthesis.
  - 36. The method of claim 35, wherein directly deriving new prosodic data vectors includes using a function that incorporates a polynomial of time Tn or incorporates a polynomial in n.
  - 37. The method of claim 36, wherein directly deriving new pitch synchronous prosodic data vectors includes warping a pitch curve of one sound unit (represented as a sequence of pulse periods {Pn}) into another pitch curve (represented by a corresponding sequence of new pulse periods {Qn}) by adjusting coefficients of the polynomial, said coefficients being the pitch warping parameters, while retaining inherent micro-prosodic information.
  - 38. The method of claim 35, wherein receiving the sequence includes receiving a sequence of periods between adjacent pulses in the sound waveform according to:
    - Pn = T(n) − T(n−1), where T(n) is time at an nth pulse, and Qn is a corresponding new period derived by applying a pitch warping function.
  - 39. The method of claim 35, wherein receiving the sequence includes receiving a sequence of amplitudes measured in the sound waveform, where Pn is amplitude at time Tn, and Qn is a new amplitude for the time Tn that is derived by applying an amplitude warping function.
  - 40. The method of claim 35, wherein receiving the sequence includes receiving a sequence of speech-rate values measured from the sound waveform, the method further comprising outputting new speech rate values derived by applying a speech-rate warping function.

41. A prosody generation method for use in text-to-speech synthesis, comprising:
- receiving a sequence of original sound units {Uj}, which when concatenated yield a desired synthetic phrase or sentence;
  
  directly deriving new prosodic data vectors {Qjn} from original prosodic data vectors {Pjn} sampled from an original sound unit Uj, thus modifying perceived prosody of the sound unit;
  
  determining an amount of prosodic modification for sound units in the input sequence;
  
  presenting the amount of prosodic modification as warping parameters per sound unit, along with prosodic data of the sound units;
  
  concatenating prosodic data of the prosodically modified sound units with adjacent sound units;
  
  performing a smoothing of prosodic attributes between adjacent sound units; and
  
  outputting a single and final sequence of prosodic data vectors, which are synchronized with the entire phrase or sentence.
  - 42. The method of claim 41, further comprising adjusting the warping parameters for each sound unit by minimizing a cost function, which is in part, a function of the warping parameters, and whose design is based on desired results pertaining to output speech sound.
  - 43. The method of claim 42, further comprising:
    - receiving a target prosodic function of time, which is derived independently of the sound unit data; and
      
      computing a component part of the cost function by measuring an absolute difference in prosodic data values between an inherent prosodic value of a sound unit and the target prosodic function, and thus by minimizing the cost function, computing prosody warping parameters which yield an output prosody approximating the target prosody function.
  - 44. The method of claim 43, further comprising observing different freedom of movement criteria for sound units, wherein the freedom of movement criteria govern how rapidly sound units can move in prosodic space during iterative search, and wherein motion in searching the warping parameter space corresponds to simultaneous motion of all modified sound units in prosodic space.
  - 45. The method of claim 42, further comprising minimizing the cost function by iteratively searching through a space of the warping parameters to find an optimal solution.
  - 46. The method of claim 42, further comprising minimizing the cost function by analytically solving a system of linear equations.
  - 47. The method of claim 42, further comprising computing a component part of the cost function by measuring an absolute difference in prosodic data values occurring in cross-fade regions of adjacent sound units, and thus computing prosody warping parameters which improve prosodic continuity between adjacent sound units.
  - 48. The method of claim 42, further comprising computing a component part of the cost function by measuring a difference in prosodic data values between an original prosodic value of a sound unit and a warped prosodic value of the sound unit, and thus computing prosody warping parameters which minimize the overall amount of distortion caused by prosodic modification of sound units.
  - 49. The method of claim 41, further comprising:
    - receiving a target prosodic function of time, which is derived independently of the sound unit data; and
      
      determining the warping parameters for each sound unit based on the target prosodic function of time.
  - 50. The method of claim 41, further comprising adjusting the warping parameters for sound units according to rules, which respond to features derived from input text to a TTS system.
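
Claims 46 to 48 describe a cost function with a continuity term and a distortion term, minimized by solving linear equations. The following is a deliberately reduced sketch of that idea, not the method of the application. It assumes two adjacent sound units, a single constant log-domain offset per unit (only A0), and squared differences in place of the absolute differences recited in the claims, so that the cost is quadratic and its minimum follows from a 2x2 linear system. The function names and the weight `distortion_weight` are hypothetical.

```python
import math

def cost(a, b, d, w):
    """Continuity term (boundary mismatch) plus weighted distortion term."""
    return (d + a - b) ** 2 + w * (a * a + b * b)

def solve_offsets(end_period_1, start_period_2, distortion_weight):
    """Return log-domain offsets (a, b) for unit 1 and unit 2 that minimize cost.

    Setting both partial derivatives of the cost to zero gives
        (1 + w)*a - b       = -d
        -a      + (1 + w)*b =  d
    which is solved here with Cramer's rule (requires w > 0).
    """
    d = math.log(end_period_1) - math.log(start_period_2)
    w = distortion_weight
    m11, m12, r1 = 1.0 + w, -1.0, -d
    m21, m22, r2 = -1.0, 1.0 + w, d
    det = m11 * m22 - m12 * m21
    a = (r1 * m22 - m12 * r2) / det
    b = (m11 * r2 - m21 * r1) / det
    return a, b

# Hypothetical boundary: unit 1 ends at 8 ms, unit 2 starts at 10 ms.
w = 0.5
d = math.log(0.008) - math.log(0.010)
a, b = solve_offsets(0.008, 0.010, w)

# Periods at the boundary after applying each unit's offset.
new_end_1 = math.exp(math.log(0.008) + a)
new_start_2 = math.exp(math.log(0.010) + b)
```

With these assumptions the two units move toward each other by equal and opposite amounts, and a larger `distortion_weight` leaves more of the original mismatch in place.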


Current Assignee: Panasonic Corporation (Panasonic Holdings Corporation)
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Pearson, Steven; Meron, Joram
Application Number: US10/953,878
Publication Number: US 20060074678A1
US Class Current: 704/267
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
