Speech synthesis with dynamic constraints
Abstract
A method is disclosed for providing speech parameters to be used for synthesis of a speech utterance. In at least one embodiment, the method includes receiving an input time series of first speech parameter vectors; preparing at least one input time series of second speech parameter vectors consisting of dynamic speech parameters; extracting from the input time series of first and second speech parameter vectors partial time series of first speech parameter vectors and corresponding partial time series of second speech parameter vectors; and converting the corresponding partial time series of first and second speech parameter vectors into partial time series of third speech parameter vectors, wherein the conversion is done independently for each set of partial time series and can be started as soon as the corresponding vectors of the input time series have been received. The speech parameter vectors of the partial time series of third speech parameter vectors are combined to form a time series of output speech parameter vectors to be used for synthesis of the speech utterance. At least one embodiment of the method allows speech parameter vectors to be provided continuously for synthesis of the speech utterance, reducing both the latency and the memory requirements of the synthesis.
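The chunked flow the abstract describes can be sketched as a small streaming loop. This is a minimal sketch, not the patented implementation: `stream_chunks`, `chunk_len`, and the pluggable `convert` function are hypothetical names, and fixed-length chunks are an assumption (the claims only require partial series with indices p to q).

```python
import numpy as np

def stream_chunks(x_series, delta_series, chunk_len, convert):
    """Sketch of the chunked processing described in the abstract: as soon
    as vectors p..q of the static series {x_i} and the dynamic series
    {Delta_i} are available, that chunk is converted independently and its
    output vectors are appended, so synthesis can start before the whole
    utterance has been processed. `convert` is a hypothetical per-chunk
    conversion function taking (x_chunk, delta_chunk)."""
    output = []
    for p in range(0, len(x_series), chunk_len):
        q = min(p + chunk_len, len(x_series))        # last index of this chunk
        y_chunk = convert(x_series[p:q], delta_series[p:q])
        output.extend(y_chunk)                       # combine into {y_hat_i}1..m
    return np.array(output)
```

For example, with an identity `convert` the output series simply reproduces the input, which makes the chunk boundaries easy to verify; the point of the structure is that each `convert` call only ever sees vectors p to q, never the full utterance.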
19 Claims
1. A method for providing speech parameters to be used for synthesis of a speech utterance, comprising:
- receiving an input time series of first speech parameter vectors {xi}1 . . . m allocated to synchronisation points 1 to m indexed by i, wherein each synchronisation point is defining a point in time or a time interval of the speech utterance and each first speech parameter vector xi consists of a number of n1 static speech parameters of a time interval of the speech utterance,
- preparing at least one input time series of second speech parameter vectors {Δi}1 . . . m allocated to the synchronisation points 1 to m, wherein each second speech parameter vector Δi consists of a number of n2 dynamic speech parameters of a time interval of the speech utterance,
- extracting from the input time series of first and second speech parameter vectors {xi}1 . . . m and {Δi}1 . . . m partial time series of first speech parameter vectors {xi}p . . . q and corresponding partial time series of second speech parameter vectors {Δi}p . . . q, wherein p is the index of the first and q is the index of the last extracted speech parameter vector,
- converting the corresponding partial time series of first and second speech parameter vectors {xi}p . . . q and {Δi}p . . . q into partial time series of third speech parameter vectors {yi}p . . . q, wherein the partial time series of third speech parameter vectors {yi}p . . . q minimises differences to the partial time series of first speech parameter vectors {xi}p . . . q, the dynamic characteristics of {yi}p . . . q minimise differences to the partial time series of second speech parameter vectors {Δi}p . . . q, and the conversion is done independently for each partial time series of third speech parameter vectors {yi}p . . . q and can be started as soon as the vectors p to q of the input time series of the first speech parameter vectors {xi}1 . . . m have been received and corresponding vectors p to q of second speech parameter vectors {Δi}1 . . . m have been prepared, and
- combining the speech parameter vectors of the partial time series of third speech parameter vectors {yi}p . . . q to form a time series of output speech parameter vectors {ŷi}1 . . . m allocated to the synchronisation points, wherein the time series of output speech parameter vectors {ŷi}1 . . . m is provided to be used for synthesis of the speech utterance.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18, 19)
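The converting step of claim 1 can be read as a joint least-squares problem: find {yi}p . . . q that stays close to the static series {xi}p . . . q while its dynamic characteristics stay close to {Δi}p . . . q. The sketch below assumes the dynamic parameters are frame-to-frame first differences, which the claim does not fix; the operator D, the equal weighting of the two terms, and the function name `convert_chunk` are all illustrative assumptions.

```python
import numpy as np

def convert_chunk(x, delta):
    """Convert a partial series {x_i}p..q and {Delta_i}p..q into {y_i}p..q
    by minimising ||y - x||^2 + ||D y - delta||^2, where D is a
    first-difference operator standing in for the 'dynamic characteristics'
    (an assumption; the claim does not specify the operator).
    x, delta: (T, n) arrays of static and dynamic parameter vectors."""
    T = x.shape[0]
    # D maps a series y to its frame differences: (D y)_i = y_i - y_{i-1}
    D = np.eye(T) - np.eye(T, k=-1)
    D[0, 0] = 0.0                        # first frame has no predecessor
    # Setting the gradient of the objective to zero gives the normal
    # equations (I + D^T D) y = x + D^T delta, solved column-wise:
    A = np.eye(T) + D.T @ D
    b = x + D.T @ delta
    return np.linalg.solve(A, b)
```

A useful sanity check: when the prepared deltas are exactly the differences of x (e.g. a linear ramp with unit deltas), both terms of the objective can be driven to zero simultaneously and the solver returns y = x; when the deltas disagree with x, y becomes a compromise between the two.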
16. A speech synthesis processor for providing output speech parameters to be used for synthesis of a speech utterance, said processor comprising:
- receiving means for receiving an input time series of first speech parameter vectors {xi}1 . . . m allocated to synchronisation points 1 to m indexed by i, wherein each synchronisation point is defining a point in time or a time interval of the speech utterance and each first speech parameter vector xi consists of a number of n1 static speech parameters of a time interval of the speech utterance,
- preparing means for preparing at least one input time series of second speech parameter vectors {Δi}1 . . . m allocated to the synchronisation points 1 to m, wherein each second speech parameter vector Δi consists of a number of n2 dynamic speech parameters of a time interval of the speech utterance,
- extracting means for extracting from the input time series of first and second speech parameter vectors {xi}1 . . . m and {Δi}1 . . . m partial time series of first speech parameter vectors {xi}p . . . q and corresponding partial time series of second speech parameter vectors {Δi}p . . . q, wherein p is the index of the first and q is the index of the last extracted speech parameter vector,
- converting means for converting the corresponding partial time series of first and second speech parameter vectors {xi}p . . . q and {Δi}p . . . q into partial time series of third speech parameter vectors {yi}p . . . q, wherein the partial time series of third speech parameter vectors {yi}p . . . q minimises differences to the partial time series of first speech parameter vectors {xi}p . . . q, the dynamic characteristics of {yi}p . . . q minimise differences to the partial time series of second speech parameter vectors {Δi}p . . . q, and the conversion is done independently for each partial time series of third speech parameter vectors {yi}p . . . q and can be started as soon as the vectors p to q of the input time series of the first speech parameter vectors {xi}1 . . . m have been received and corresponding vectors p to q of second speech parameter vectors {Δi}1 . . . m have been prepared, and
- combining means for combining the speech parameter vectors of the partial time series of third speech parameter vectors {yi}p . . . q to form a time series of output speech parameter vectors {ŷi}1 . . . m allocated to the synchronisation points, wherein the time series of output speech parameter vectors {ŷi}1 . . . m is provided to be used for synthesis of the speech utterance.
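The "means" of claim 16 map naturally onto a small stateful processor: buffer incoming static and dynamic vectors, and as soon as a partial series is complete, convert and append it. This is a hypothetical sketch, not the claimed apparatus: the class name, the fixed chunk length, and the first-difference model of the dynamic characteristics are all assumptions made for illustration.

```python
import numpy as np

class SpeechParameterProcessor:
    """Sketch of the claimed processor: each 'means' of claim 16 appears as
    a method or a marked step. The conversion solves the stated joint
    minimisation ||y - x||^2 + ||D y - delta||^2 with D a first-difference
    operator (an assumed model of the dynamic characteristics)."""

    def __init__(self, chunk_len):
        self.chunk_len = chunk_len
        self.x_buf, self.d_buf, self.output = [], [], []

    def receive(self, x_i, delta_i):
        # receiving/preparing means: buffer static and dynamic vectors
        self.x_buf.append(x_i)
        self.d_buf.append(delta_i)
        if len(self.x_buf) == self.chunk_len:   # vectors p..q are ready
            self._flush()

    def finish(self):
        # flush any final partial chunk and return {y_hat_i}1..m
        if self.x_buf:
            self._flush()
        return np.array(self.output)

    def _flush(self):
        # extracting means: take the buffered partial series p..q
        x = np.array(self.x_buf)
        d = np.array(self.d_buf)
        self.x_buf, self.d_buf = [], []
        # converting means: solve (I + D^T D) y = x + D^T d
        T = x.shape[0]
        D = np.eye(T) - np.eye(T, k=-1)
        D[0, 0] = 0.0
        y = np.linalg.solve(np.eye(T) + D.T @ D, x + D.T @ d)
        # combining means: append the converted chunk to the output series
        self.output.extend(y)
```

Because each `_flush` depends only on the buffered chunk, conversion can start as soon as vectors p to q have arrived, matching the latency property the claims emphasise; one design caveat of this sketch is that zeroing the first row of D decouples adjacent chunks, so chunk-boundary continuity is not enforced here.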
Specification