Speech Enhancement Techniques on the Power Spectrum

US 20120265534A1
Filed: 09/04/2009
Published: 10/18/2012
Est. Priority Date: 09/04/2009
Status: Active Grant

Abstract

The method provides a spectral speech description to be used for synthesis of a speech utterance, where at least one spectral envelope input representation is received. In one solution the improvement is made by manipulating an extremum, i.e. a peak or a valley, in the rapidly varying component of the spectral envelope representation. The rapidly varying component of the spectral envelope representation is manipulated to sharpen and/or accentuate extrema, after which it is merged back with the slowly varying component or the spectral envelope input representation to create an enhanced spectral envelope final representation. In other solutions a complex spectrum envelope final representation is created with phase information derived from one of the group delay representation of a real spectral envelope input representation corresponding to a short-time speech signal and a transformed phase component of the discrete complex frequency domain input representation corresponding to the speech utterance.
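The first solution in the abstract can be illustrated with a minimal sketch. It assumes the envelope is a plain list of real values, uses a centered moving average as the smoother, and applies one constant gain to the rapidly varying component. The function names, the window half-width, and the gain value are illustrative assumptions, not taken from the patent.

```python
def moving_average(values, half_width):
    """Centered moving average; the window shrinks near both ends of the list."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def enhance_envelope(envelope, half_width=2, gain=1.5):
    """Split the envelope into slow and rapid parts, scale the rapid part, recombine.

    gain > 1 accentuates peaks and valleys; gain == 1 returns the input unchanged.
    Both parameters are assumed values chosen for illustration only.
    """
    slow = moving_average(envelope, half_width)           # coarse, slowly varying shape
    rapid = [e - s for e, s in zip(envelope, slow)]       # fine detail: peaks and valleys
    rapid_final = [gain * r for r in rapid]               # manipulate the extrema
    return [s + r for s, r in zip(slow, rapid_final)]     # merge back with the slow part


# Hypothetical envelope with one peak (index 2) and one valley (index 5).
envelope = [0.0, 1.0, 4.0, 1.0, 0.0, -3.0, 0.0, 1.0, 0.5]
enhanced = enhance_envelope(envelope)
print(enhanced)
```

In this sketch the peak grows and the valley deepens while the coarse shape stays the same, because only the difference from the smoothed curve is scaled.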

28 Claims

1. A method for providing spectral speech descriptions to be used for synthesis of a speech utterance comprising the steps of
   receiving at least one spectral envelope input representation corresponding to the speech utterance, where the at least one spectral envelope input representation includes at least one of at least one formant and at least one spectral trough in the form of at least one of a local peak and a local valley in the spectral envelope input representation,
   extracting from the at least one spectral envelope input representation a rapidly varying input component, where the rapidly varying input component is generated, at least in part, by removing from the at least one spectral envelope input representation a slowly varying input component in the form of a non-constant coarse shape of the at least one spectral envelope input representation and by keeping the fine details of the at least one spectral envelope input representation, where the details contain at least one of a peak or a valley,
   creating a rapidly varying final component, where the rapidly varying final component is derived from the rapidly varying input component by manipulating at least one of at least one peak and at least one valley,
   combining the rapidly varying final component with one of the slowly varying final component and the spectral envelope input representation to form a spectral envelope final representation, and
   providing a spectral speech description output vector to be used for synthesis of a speech utterance, where at least a part of the spectral speech description output vector is derived from the spectral envelope final representation.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 28)
  - 2. Method as claimed in claim 1, where extracting a rapidly varying input component includes generating the slowly varying input component, at least in part, through smoothing of the spectral envelope input representation, where the smoothing attenuates the magnitude of at least one of the formant and the spectral trough and preserves a non-constant coarse shape of the spectral envelope input representation and deriving the rapidly varying input component by subtracting the slowly varying input component from the spectral envelope input representation.
  - 3. Method as claimed in claim 2, where the step of generating the slowly varying input component includes low-pass (LP) filtering the spectral envelope input representation.
  - 4. Method as claimed in claim 2, where the step of generating the slowly varying input component includes deriving the average of a first interpolation function E_max(n) interpolating the maxima of the spectral envelope input representation and a second interpolation function E_min(n) interpolating the minima of the spectral envelope input representation.
  - 5. Method as claimed in claim 4, where maxima and minima are found by determining extrema of the spectral envelope input representation and by classifying them as minima or maxima and where for interpolating the interpolation functions shape-preserving piecewise cubic Hermite interpolating polynomials are used as interpolation kernels and where both interpolation functions are almost identical in the neighbourhood of at least one of the Nyquist frequency and the zero frequency and therefore the rapidly varying input component is small preferably zero at at least one of Nyquist frequency and zero frequency.
  - 6. Method as claimed in claim 1, where the step of extracting from the at least one spectral envelope input representation a rapidly varying input component includes generating the rapidly varying input component at least in part by filtering the spectral envelope input representation with a high pass (HP) filter.
  - 7. Method as claimed in claim 1, where the step of creating a rapidly varying final component includes modifying the rapidly varying input component with a transformation that attenuates the excursion of at least one of a first local minimum and a first local maximum of the rapidly varying input component and preserves the excursion of at least one of a second local maximum and a second local minimum of the rapidly varying component.
  - 8. Method as claimed in claim 7, where the transformation performs at least one of sharpening the peaks and deepening the valleys in the spectral envelope input representation, preferably by multiplying the rapidly varying input component with a positive function that varies as a function of the frequency.
  - 28. A computer program comprising program code means for performing all the steps of any one of the claims 1 to 27 when said program is run on a computer.
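The envelope-averaging idea of claims 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: piecewise-linear interpolation is swapped in for the shape-preserving piecewise cubic Hermite polynomials named in claim 5, and both interpolation functions are made to share the first and last samples as knots so that the rapidly varying component is zero at zero frequency and at the Nyquist frequency. All function names are hypothetical.

```python
def find_extrema(values):
    """Return indices of strict interior local maxima and minima."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            maxima.append(i)
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            minima.append(i)
    return maxima, minima


def interpolate_linear(knots, knot_values, n):
    """Piecewise-linear curve through the knots, sampled at indices 0..n-1.

    Assumes knots are strictly increasing, start at 0 and end at n-1.
    Linear interpolation is a stand-in for the PCHIP kernels of claim 5.
    """
    out = []
    seg = 0
    for x in range(n):
        while seg < len(knots) - 2 and x > knots[seg + 1]:
            seg += 1
        x0, x1 = knots[seg], knots[seg + 1]
        t = (x - x0) / (x1 - x0)
        out.append((1 - t) * knot_values[seg] + t * knot_values[seg + 1])
    return out


def split_envelope(envelope):
    """Slow part = average of E_max(n) and E_min(n); rapid part = envelope - slow."""
    n = len(envelope)
    maxima, minima = find_extrema(envelope)
    # Assumption: both curves share the end points, so they coincide there.
    upper_knots = [0] + maxima + [n - 1]
    lower_knots = [0] + minima + [n - 1]
    e_max = interpolate_linear(upper_knots, [envelope[k] for k in upper_knots], n)
    e_min = interpolate_linear(lower_knots, [envelope[k] for k in lower_knots], n)
    slow = [(hi + lo) / 2 for hi, lo in zip(e_max, e_min)]
    rapid = [e - s for e, s in zip(envelope, slow)]
    return slow, rapid


# Hypothetical envelope samples from zero frequency (index 0) to Nyquist (last index).
envelope = [1.0, 3.0, 1.0, 4.0, 2.0, 5.0, 1.0]
slow, rapid = split_envelope(envelope)
print(slow)
print(rapid)
```

In this sketch the rapid component is positive at the maxima, negative at the minima, and zero at both ends of the frequency axis. A frequency-dependent positive gain, as in claim 8, could then be applied to it before recombining.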

9. A method for providing a spectral speech description output vector to be used for synthesis of a short-time speech signal comprising the steps of
   receiving at least one real spectral envelope input representation corresponding to the short-time speech signal,
   deriving a group delay representation that is the output of a non-constant function of the at least one real spectral envelope input representation,
   deriving a phase representation from the group delay representation by inverting the sign of the group delay representation and integrating the inverted group delay representation,
   deriving from the at least one real spectral envelope input representation at least one real spectral envelope final representation,
   combining the real spectral envelope final representation and the phase representation to form a complex spectrum envelope final representation, and
   providing a spectral speech description output vector to be used for synthesis of a short-time speech signal, where at least a part of the spectral speech description output vector is derived from the complex spectral envelope final representation.
- Dependent Claims (10, 11, 12, 13, 14)
  - 10. Method as claimed in claim 9, where the group delay representation is the output of a linear function applied to the spectral envelope input representation.
  - 11. Method as claimed in claim 9, where a part of the phase representation is merged with weighted noise.
  - 12. Method as claimed in claim 9, where phase jitter is reduced by at least one of smoothing the phase representations corresponding to successive short-time speech signals and correcting the group delay representations corresponding to successive short-time speech signals by an offset.
  - 13. Method as claimed in claim 9 where receiving at least one real spectral envelope input representation corresponding to the short-time speech signal comprises the steps of receiving at least one short-time speech signal, determining at least one pitch anchor point in at least one voiced region of the at least one short-time speech signal, selecting at least one window whose size is approximately twice the local pitch period, multiplying the at least one short-time speech signal with the at least one window, positioned in such way that the distance between the center of the window and the at least one pitch anchor point is constant, to form at least one windowed short-time speech signal, deriving the at least one real spectral envelope input representation from the at least one windowed short-time speech signal.
  - 14. Method as claimed in claim 9, where the at least one real spectral envelope input representation includes at least one of at least one formant and at least one spectral trough in the form of at least one of a local peak and a local valley in the real spectral envelope input representation and where deriving from the real spectral envelope input representation a real spectral envelope final representation comprises the steps of extracting from the at least one real spectral envelope input representation a rapidly varying input component, where the rapidly varying input component is generated, at least in part, by removing from the at least one real spectral envelope input representation a slowly varying input component in the form of a non-constant coarse shape of the at least one real spectral envelope input representation and by keeping the fine details of the at least one real spectral envelope input representation, where the details contain at least one of a peak and a valley, creating a rapidly varying final component, where the rapidly varying final component is derived from the rapidly varying input component by manipulating at least one of at least one peak and at least one valley, and combining the rapidly varying final component with one of the slowly varying final component and the real spectral envelope input representation to form a real spectral envelope final representation.
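The phase construction of claims 9 and 10 can be sketched as below. The sketch assumes a linear map from envelope value to group delay (the `scale` and `offset` parameters are hypothetical), approximates the integration by a running sum with an integration constant of zero, and passes the input envelope through unchanged as the final real envelope. None of these choices is stated in the claims.

```python
import cmath


def phase_from_envelope(envelope, scale=0.1, offset=0.0, bin_spacing=1.0):
    """Derive group delay as a linear function of the envelope, then integrate.

    scale, offset and bin_spacing are assumed illustration parameters.
    """
    group_delay = [scale * e + offset for e in envelope]
    phase = []
    running = 0.0
    for tau in group_delay:
        running += -tau * bin_spacing   # invert the sign, then accumulate (integrate)
        phase.append(running)
    return group_delay, phase


def complex_envelope(magnitude, phase):
    """Combine a real envelope and a phase curve into complex spectrum samples."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]


# Hypothetical real spectral envelope with a single peak.
envelope = [1.0, 2.0, 4.0, 2.0, 1.0]
group_delay, phase = phase_from_envelope(envelope)
spectrum = complex_envelope(envelope, phase)
print(phase)
print(spectrum)
```

Because the group delay follows the envelope, the phase changes fastest around the spectral peak in this sketch, while the magnitude of each complex sample still equals the real envelope value.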

15. A method for providing a speech description vector to be used for synthesis of a speech utterance comprising the steps of
   receiving at least one discrete complex frequency domain input representation corresponding to the speech utterance,
   decomposing the complex frequency domain input representation into a magnitude and a phase component defined at a set of input frequencies,
   transforming the phase component to a transformed phase component having less discontinuities,
   compressing the magnitude component with a compression function to form a compressed magnitude component,
   interpolating the compressed magnitude and transformed phase components at a set of output frequencies to form a frequency warped compressed magnitude and a frequency warped transformed phase component, the output frequencies being obtained by transforming the input frequencies by means of a frequency warping function that maps at least one input frequency to a different output frequency,
   rotating the frequency warped phase component in the complex plane by 90 degrees to obtain a purely imaginary frequency warped phase component,
   adding the frequency warped compressed magnitude component to the purely imaginary frequency warped phase component to form a complex frequency warped compressed spectrum representation,
   projecting the complex frequency warped compressed spectrum representation onto a non-empty ordered set of complex basis functions to form a complex frequency warped cepstrum representation to be used for synthesis of a speech utterance.
- Dependent Claims (16, 17, 18, 19, 20, 21)
  - 16. Method as claimed in claim 15, where the discrete complex frequency domain input representation is derived from a pitch synchronously windowed short-time speech signal.
  - 17. Method as claimed in claim 16, where the pitch synchronously windowed short-time speech signal is one of a succession of pitch synchronously windowed short-time speech signals and successive pitch synchronously windowed short-time speech signals are centred at consistent anchor points, such that the phase component of the corresponding successive discrete complex frequency domain input representations is consistent.
  - 18. Method as claimed in claim 15, where the compression function is a logarithmic function.
  - 19. Method as claimed in claim 15, where the frequency warping function approximates the Mel-frequency scaling function.
  - 20. Method as claimed in claim 15, where the at least one discrete complex frequency domain representation is derived from at least one short-time speech signal by padding at least one zero value to form an expanded short-time speech signal and transforming the expanded short-time speech signal to a discrete complex frequency domain representation.
  - 21. Method as claimed in claim 15, where the frequency warped complex cepstrum representation is truncated by preserving the M_I+1 initial values and M_O final values.
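The analysis chain of claim 15 can be sketched in a few lines. The sketch makes several assumptions that are not in the claims: the input holds complex samples with non-zero magnitude on a uniform grid from zero frequency to Nyquist, phase unwrapping is used as the phase transform, the natural logarithm is the compression function (as in claim 18), linear interpolation is used, the warping function is a bilinear all-pass map with an arbitrary parameter `alpha`, and the projection is a plain discrete Fourier sum over the sampled bins. All function names are hypothetical.

```python
import cmath
import math


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive samples differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


def warp(omega, alpha):
    """Assumed bilinear warping of a frequency in [0, pi]; maps 0 to 0 and pi to pi."""
    return omega + 2 * math.atan2(alpha * math.sin(omega), 1 - alpha * math.cos(omega))


def interp_uniform(values, omega):
    """Linear interpolation of samples on a uniform grid over [0, pi]."""
    last = len(values) - 1
    pos = min(max(omega / math.pi * last, 0.0), float(last))
    i = min(int(pos), last - 1)
    t = pos - i
    return (1 - t) * values[i] + t * values[i + 1]


def warped_complex_cepstrum(spectrum, alpha=0.3):
    """Compress, unwrap, warp, combine and project a sampled complex spectrum."""
    k_bins = len(spectrum)
    omegas = [math.pi * k / (k_bins - 1) for k in range(k_bins)]
    log_mag = [math.log(abs(s)) for s in spectrum]          # compressed magnitude
    phase = unwrap([cmath.phase(s) for s in spectrum])      # fewer discontinuities
    out_omegas = [warp(w, alpha) for w in omegas]           # output frequencies
    # Real part: warped log magnitude. Imaginary part: warped phase rotated by 90 degrees.
    warped = [complex(interp_uniform(log_mag, w), interp_uniform(phase, w))
              for w in out_omegas]
    # Projection onto the complex exponentials exp(2j*pi*k*m/K), m = 0..K-1.
    return [sum(warped[k] * cmath.exp(-2j * math.pi * k * m / k_bins)
                for k in range(k_bins)) / k_bins
            for m in range(k_bins)]


# Hypothetical spectrum: smooth magnitude with a linear phase term.
spectrum = [(1.5 + math.cos(math.pi * k / 15)) * cmath.exp(-3j * math.pi * k / 15)
            for k in range(16)]
cepstrum = warped_complex_cepstrum(spectrum)
print(cepstrum[:4])
```

Truncating the result as in claim 21 would then amount to keeping a few initial and final entries of `cepstrum`; how many to keep is not specified here.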

22. A method for providing an output magnitude and phase representation to be used for speech synthesis comprising the steps of
   receiving at least one speech description input vector, preferably a frequency warped complex cepstrum vector,
   projecting the speech description input vector onto an ordered non-empty set of complex basis vectors to form a vector of spectral speech description coefficients defined at equidistant input points, the N-th coefficient being equal to the inner product between the speech description input vector and the N-th basis vector,
   transforming the imaginary component of the spectral speech description vector to form a transformed spectral speech description vector,
   interpolating the set of transformed spectral speech description coefficients at a number of output points to form a vector of warped spectral speech description coefficients, where at least one output point enclosed by at least two points is not centred in the middle between its left and right neighbouring points,
   extracting the imaginary components of an ordered set of warped spectral speech description coefficients to form a real output phase representation,
   expanding the real components of the warped spectral speech description coefficients with a magnitude expansion function to form an output magnitude representation.
- Dependent Claims (23, 24, 25, 26, 27)
  - 23. Method as claimed in claim 22, where the dimension of the speech description input vector is increased by adding one or more zeroes.
  - 24. Method as claimed in claim 22, where the magnitude expansion function is an exponential function.
  - 25. Method as claimed in claim 22, where the set of complex basis functions are a set of orthogonal complex exponentials.
  - 26. Method as claimed in claim 22, where the output abscissa used for interpolation are derived from warping the input abscissa from a Mel-like scale to a linear scale.
  - 27. Method as claimed in claim 22, where a phase unwrapping algorithm is used for transforming the imaginary component of the spectral speech description points.
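The synthesis-side steps of claim 22 can be sketched in the same spirit. The sketch assumes complex exponentials as basis vectors (claim 25), phase unwrapping as the transform of the imaginary component (claim 27), linear interpolation at non-uniformly spaced output positions, and the exponential function for magnitude expansion (claim 24). The function names and the quadratic output positions in the example are hypothetical.

```python
import cmath
import math


def unwrap(phases):
    """Remove 2*pi jumps so that consecutive samples differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


def interp_at(values, pos):
    """Linear interpolation of a list at a fractional index position."""
    last = len(values) - 1
    pos = min(max(pos, 0.0), float(last))
    i = min(int(pos), last - 1)
    t = pos - i
    return (1 - t) * values[i] + t * values[i + 1]


def spectrum_from_cepstrum(cepstrum, out_positions):
    """Return (magnitude, phase) lists sampled at the given fractional positions."""
    k_bins = len(cepstrum)
    # Projection onto complex exponentials, giving coefficients at equidistant points.
    coeffs = [sum(c * cmath.exp(2j * math.pi * k * m / k_bins)
                  for m, c in enumerate(cepstrum))
              for k in range(k_bins)]
    real = [z.real for z in coeffs]
    imag = unwrap([z.imag for z in coeffs])           # transform the imaginary component
    warped_real = [interp_at(real, p) for p in out_positions]
    warped_imag = [interp_at(imag, p) for p in out_positions]
    magnitude = [math.exp(r) for r in warped_real]    # magnitude expansion
    return magnitude, warped_imag                     # imaginary part is the phase


# Hypothetical cepstrum vector, zero-padded to eight entries.
cepstrum = [0.5 + 0.2j, 0.1 - 0.05j, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Assumed non-uniform output positions on the index axis 0..7.
positions = [7 * (k / 7) ** 2 for k in range(8)]
magnitude, phase = spectrum_from_cepstrum(cepstrum, positions)
print(magnitude)
print(phase)
```

Appending zeros to the input vector, as in claim 23, increases the number of equidistant points at which the coefficients are evaluated in this sketch, since the number of points equals the vector length.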

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
SVOX AG (Microsoft Corporation)
Inventors
Coorman, Geert; Wouters, Johan

Granted Patent

US 9,031,834 B2
US Class Current

704/265
CPC Class Codes

G10L 13/033   Voice editing, e.g. manipul...

G10L 21/003   Changing voice quality, e.g...

G10L 21/0232   Processing in the frequency...

G10L 21/0364   for improving intelligibility
