Statistical enhancement of speech output from a statistical text-to-speech synthesis system

US 8,682,670 B2
Filed: 07/07/2011
Issued: 03/25/2014
Est. Priority Date: 07/07/2011
Status: Active Grant

Abstract

A method, system and computer program product are provided for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of speech in a space of acoustic feature vectors. The method includes: defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters; and defining a distortion indicator of a feature vector or a plurality of feature vectors. The method further includes: receiving a feature vector output by the system; and generating an instance of the corrective transformation by: calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector; calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector; calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation; and deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations. The instance of the corrective transformation may be applied to the feature vector to provide an enhanced feature vector.
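
The steps in the abstract can be illustrated with a deliberately small toy sketch. It is not the patented implementation. It assumes a one-parameter family of transformations (a single gain `g` applied to the whole vector), a distortion indicator equal to the vector's energy, and a diagonal Gaussian emission model given by `mean` and `var`. All function names are hypothetical and chosen only for this illustration.

```python
import math


def reference_energy(mean, var):
    """Reference indicator: expected energy E[||x||^2] under a diagonal
    Gaussian model, which equals sum(mu_n^2 + sigma_n^2)."""
    return sum(m * m + v for m, v in zip(mean, var))


def actual_energy(vec):
    """Actual indicator: energy of the vector emitted by the synthesizer."""
    return sum(x * x for x in vec)


def fit_gain(ref, act):
    """Enhancing parameter: the gain g for which the predicted indicator
    g^2 * act equals the reference value (assumed toy family T_g(x) = g*x)."""
    return math.sqrt(ref / act) if act > 0.0 else 1.0


def enhance(vec, mean, var):
    """Derive the transformation instance and apply it to the vector."""
    g = fit_gain(reference_energy(mean, var), actual_energy(vec))
    return [g * x for x in vec]


if __name__ == "__main__":
    mean = [1.0, -2.0, 0.5]
    var = [0.5, 0.25, 0.25]
    # A synthesizer that emits the model mean under-represents the variance.
    enhanced = enhance(mean, mean, var)
    print(actual_energy(mean), actual_energy(enhanced))
```

The family here has one parameter while the vectors have three components, so the sketch respects the constraint in claim 1 that the number of enhancing parameters is smaller than the dimension of the feature space.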

25 Claims

1. A method for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of short-time spectral envelope of speech in a space of acoustic feature vectors, comprising:
- defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters, wherein number of the enhancing parameters in the set of enhancing parameters is less than a dimension of the space of the acoustic feature vectors;
  
  defining a distortion indicator of a feature vector or a plurality of feature vectors, wherein the distortion indicator is not modelled directly by the statistical TTS system;
  
  receiving a feature vector output by the system;
  
  generating an instance of the corrective transformation by:
  
  calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector;
  
  calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector;
  
  calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation;
  
  deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations; and
  
  applying the instance of the corrective transformation to the feature vector to provide an enhanced feature vector.
- Dependent claims (2 to 14):
  - 2. The method as claimed in claim 1, wherein the acoustic feature vector is a cepstral vector, the distortion indicator is an attenuation indicator, the parametric corrective transformation is a parametric corrective function of quefrency and applying the instance of the corrective transformation is a component-wise multiplication of the feature vector by the corrective function.
  - 3. The method as claimed in claim 2, wherein generating an instance of the corrective transformation is carried out for each emitted cepstral vector, or each phonetic unit.
  - 4. The method as claimed in claim 2, wherein calculating a reference value of an attenuation indicator averages over the emission probability distribution specified by the phonetic unit.
  - 5. The method as claimed in claim 2, wherein calculating an actual value of an attenuation indicator is based on said synthetic cepstral vector output from the system.
  - 6. The method as claimed in claim 2, wherein generating an instance of the corrective transformation is carried out off-line prior to receiving said cepstral vector output from the system, and calculating an actual value of the attenuation indicator is based on a plurality of cepstral vectors generated by the system off-line and emitted from the phonetic unit.
  - 7. The method as claimed in claim 2, wherein calculating the set of enhancing parameter values includes minimization of an enhancement criterion depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective function, and representing a dissimilarity between the reference distortion indicator and a predicted value of the distortion indicator attributed to an enhanced synthetic vector.
  - 8. The method as claimed in claim 7, further including altering the set of enhancing parameter values depending on external attributes associated with the statistical model emitting said cepstral vector.
  - 9. The method as claimed in claim 8, wherein the external attributes include a phone category which the statistical model is attributed to and voicing class of the majority of speech frames used for the statistical model training.
  - 10. The method as claimed in claim 2, wherein the parametric corrective function is an exponential function and the set of enhancing parameters is comprised of the exponent base.
  - 11. The method as claimed in claim 2, wherein the parametric corrective function is a piece-wise exponential function that comprises at least two pieces, wherein at least one of the pieces spans two or more quefrency points, and wherein the set of enhancing parameters is comprised of the base values of the individual exponents and of the concatenation points.
  - 12. The method as claimed in claim 2, wherein the attenuation indicator is a component-wise squared cepstral vector.
  - 13. The method as claimed in claim 12, including smoothing of the attenuation indicator components by a symmetric positive filter.
  - 14. The method as claimed in claim 1, wherein the statistical TTS system is a hidden Markov model (HMM) based TTS system employing Gaussian mixture emission probability distribution.
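
Claims 2, 4, 5, 7, 10, 12 and 13 can be read together as one concrete configuration, sketched below under several assumptions that the claims do not fix. The emission model is taken to be a single diagonal Gaussian (`mean`, `var`), the list index is taken to be the quefrency, the smoothing kernel `KERNEL` is an arbitrary symmetric positive filter, and the enhancement criterion is assumed to be a squared difference of log indicators, which has a closed-form minimizer. The predicted indicator after correction is approximated as `alpha**(2*n)` times the actual indicator, which is exact only without smoothing. All names are hypothetical.

```python
import math

KERNEL = (0.25, 0.5, 0.25)  # assumed symmetric positive smoothing filter


def smooth(values, kernel=KERNEL):
    """Filter the indicator components, replicating the edge values."""
    half = len(kernel) // 2
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), last)
            acc += w * values[j]
        out.append(acc)
    return out


def reference_indicator(mean, var, kernel=KERNEL):
    """E[c_n^2] averaged over a diagonal Gaussian emission distribution."""
    return smooth([m * m + v for m, v in zip(mean, var)], kernel)


def actual_indicator(cepstrum, kernel=KERNEL):
    """Component-wise squared synthetic cepstral vector, smoothed."""
    return smooth([c * c for c in cepstrum], kernel)


def fit_base(ref, act):
    """Exponent base alpha minimizing the assumed criterion
    sum_n (log ref_n - log act_n - 2*n*log alpha)^2."""
    num = 0.0
    den = 0.0
    for n, (r, a) in enumerate(zip(ref, act)):
        if n > 0 and r > 0.0 and a > 0.0:
            num += n * (math.log(r) - math.log(a))
            den += 2.0 * n * n
    return math.exp(num / den) if den > 0.0 else 1.0


def lifter(cepstrum, alpha):
    """Component-wise multiplication by the corrective function alpha**n."""
    return [(alpha ** n) * c for n, c in enumerate(cepstrum)]


def enhance_cepstrum(cepstrum, mean, var, kernel=KERNEL):
    """Fit the base for one emitted vector, then apply the correction."""
    ref = reference_indicator(mean, var, kernel)
    act = actual_indicator(cepstrum, kernel)
    alpha = fit_base(ref, act)
    return alpha, lifter(cepstrum, alpha)
```

In this sketch a synthetic vector whose higher quefrencies are attenuated relative to the model statistics yields a base above 1, so the correction boosts those components. Component 0 is left unchanged because `alpha**0` is 1.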

15. A computer program product for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of short-time spectral envelope of speech in a space of acoustic feature vectors, the computer program product comprising:
- a computer readable non-transitory storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
  
  computer readable program code configured to:
  
  define a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters, wherein number of the enhancing parameters in the set of enhancing parameters is less than a dimension of the space of the acoustic feature vectors;
  
  define a distortion indicator of a feature vector or a plurality of feature vectors, wherein the distortion indicator is not modelled directly by the statistical TTS system;
  
  receive a feature vector output by the system;
  
  generate an instance of the corrective transformation by:
  
  calculating a reference value of the distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector;
  
  calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector;
  
  calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation;
  
  deriving an instance of the corrective transformation corresponding to the enhancing parameter values from the parametric family of the corrective transformations; and
  
  applying the instance of the corrective transformation to the feature vector to provide an enhanced feature vector.

16. A system for enhancement of speech synthesized by a statistical text-to-speech (TTS) system employing a parametric representation of short-time spectral envelope of speech in a space of acoustic feature vectors, comprising:
- a processor;
  
  an acoustic feature vector input component for receiving an acoustic feature vector emitted by a phonetic unit;
  
  a corrective transformation defining component for defining a parametric family of corrective transformations operating in the space of the acoustic feature vectors and dependent on a set of enhancing parameters, wherein number of the enhancing parameters in the set of enhancing parameters is less than a dimension of the space of the acoustic feature vectors;
  
  an enhancing parametric set component including:
  
  a distortion indicator reference component for calculating a reference value of a distortion indicator attributed to a statistical model of the phonetic unit emitting the feature vector;
  
  a distortion indicator actual value component for calculating an actual value of the distortion indicator attributed to feature vectors emitted by the statistical model of the phonetic unit emitting the feature vector, wherein the distortion indicator is not modelled directly by the statistical TTS system; and
  
  wherein the enhancing parameter set component calculating the enhancing parameter values depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation;
  
  a corrective transformation applying component for applying an instance of the corrective transformation to the feature vector to provide an enhanced feature vector.
- Dependent claims (17 to 25):
  - 17. The system as claimed in claim 16, wherein the acoustic feature vector is a cepstral vector and the distortion indicator is an attenuation indicator, the parametric corrective transformation is a parametric corrective function of quefrency and applying the instance of the corrective transformation is the component-wise multiplication of the feature vector by the corrective function.
  - 18. The system as claimed in claim 17, wherein a distortion indicator reference component is an attenuation indicator component for calculating a reference value of the attenuation indicator averaged over the emission probability distribution specified by the phonetic unit.
  - 19. The system as claimed in claim 17, wherein a distortion indicator actual value component is an attenuation indicator actual value component for calculating an actual value of the attenuation indicator based on said synthetic cepstral vector output from the system.
  - 20. The system as claimed in claim 17, including:
    - an off-line enhancement calculation mechanism for deriving the enhancing parameters off-line prior to receiving cepstral vectors emitted from the phonetic unit, and wherein a distortion indicator actual value component is an attenuation indicator actual value component for calculating an actual value of an attenuation indicator based on a plurality of synthetic vectors generated off-line from a statistical model.
  - 21. The system as claimed in claim 17, wherein the parametric corrective function is an exponential function and the set of enhancing parameters set is comprised of the exponent base.
  - 22. The system as claimed in claim 17, wherein the parametric corrective function is a piece-wise exponential function and the set of enhancing parameters set is comprised of the base values of the individual exponents and of the concatenation points.
  - 23. The system as claimed in claim 16, wherein the enhancing parameter set component includes an enhancement criterion applying component for calculating the enhancing parameter values for minimization of an enhancement criterion depending on the reference value of the distortion indicator, the actual value of the distortion indicator and the parametric corrective transformation, and representing a dissimilarity between the reference distortion indicator and a predicted value of the distortion indicator attributed to an enhanced synthetic vector.
  - 24. The system as claimed in claim 16, wherein the statistical TTS system is a hidden Markov model (HMM) based TTS system employing Gaussian mixture emission probability distribution.
  - 25. The system as claimed in claim 16, further including a customization component for altering the set of enhancing parameter values depending on attributes of the statistical model emitting said feature vector.
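
The off-line variant in claim 20 (and claim 6) can be sketched as a precomputed lookup table, again only as an illustration under stated assumptions. Each statistical model is assumed to be a diagonal Gaussian keyed by a hypothetical `unit_id`, the attenuation indicator is the component-wise squared cepstrum averaged over vectors generated off-line, and the corrective function is a single exponential `alpha**n` fitted with an assumed log-domain least-squares criterion. None of these names come from the patent.

```python
import math


def fit_base(ref, act):
    """Base alpha minimizing sum_n (log ref_n - log act_n - 2*n*log alpha)^2
    (an assumed enhancement criterion with a closed-form solution)."""
    num = 0.0
    den = 0.0
    for n, (r, a) in enumerate(zip(ref, act)):
        if n > 0 and r > 0.0 and a > 0.0:
            num += n * (math.log(r) - math.log(a))
            den += 2.0 * n * n
    return math.exp(num / den) if den > 0.0 else 1.0


def build_table(models, offline_vectors):
    """Off-line stage: one enhancing parameter per statistical model.

    models: {unit_id: (mean, var)} for each model.
    offline_vectors: {unit_id: [cepstral vectors generated off-line]}.
    """
    table = {}
    for unit_id, (mean, var) in models.items():
        vecs = offline_vectors[unit_id]
        dim = len(mean)
        # Actual indicator: squared cepstrum averaged over off-line vectors.
        act = [sum(v[n] ** 2 for v in vecs) / len(vecs) for n in range(dim)]
        # Reference indicator: E[c_n^2] under the model's Gaussian.
        ref = [m * m + s for m, s in zip(mean, var)]
        table[unit_id] = fit_base(ref, act)
    return table


def enhance(vec, unit_id, table):
    """Run-time stage: look up the stored base and apply alpha**n."""
    alpha = table.get(unit_id, 1.0)  # unknown models are left unchanged
    return [(alpha ** n) * x for n, x in enumerate(vec)]
```

Compared with fitting a parameter for every emitted vector, this arrangement moves the fitting cost out of synthesis time, at the price of using one shared correction for all vectors emitted by the same model.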

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Shechtman, Slava; Sorin, Alexander
Primary Examiner(s): Shah, Paras D
Assistant Examiner(s): Shan, Jie
Application Number: US13/177,577
Publication Number: US 20130013313A1
Time in Patent Office: 992 Days
Field of Search: 704/260
US Class Current: 704/260
CPC Class Codes:
G10L 13/033 Voice editing, e.g. manipul...
G10L 13/06 Elementary speech units use...
