Method and apparatus for speech analysis and synthesis

US 8,280,739 B2
Filed: 04/03/2008
Issued: 10/02/2012
Est. Priority Date: 04/04/2007
Status: Active Grant


Abstract

The present invention provides a speech analysis method comprising steps of obtaining a speech signal and a corresponding DEGG/EGG signal; regarding the speech signal as the output of a vocal tract filter in a source-filter model taking the DEGG/EGG signal as the input; and estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the features of the vocal tract filter are expressed by the state vectors of the vocal tract filter at selected time points, and the step of estimating is performed using Kalman filtering.
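
The source-filter relationship in the abstract can be illustrated with a short sketch. It assumes, as the claims do, that the vocal tract filter is described by N samples of its impulse response, so each speech sample is the inner product of that response with the N most recent DEGG samples. The function names `regressor` and `filter_output`, and the zero-padding before the first DEGG sample, are illustrative assumptions and do not come from the patent.

```python
def regressor(degg, k, n):
    """Return [e_k, e_{k-1}, ..., e_{k-n+1}], the n most recent DEGG samples at time k.

    Samples before the start of the signal are taken as 0.0 (an assumption).
    """
    return [degg[k - i] if k - i >= 0 else 0.0 for i in range(n)]


def filter_output(degg, h):
    """Noise-free observation model: v_k = e_k^T h for a fixed impulse response h."""
    n = len(h)
    return [
        sum(e_i * h_i for e_i, h_i in zip(regressor(degg, k, n), h))
        for k in range(len(degg))
    ]
```

Feeding a unit impulse through `filter_output` returns the impulse response itself, which is why the state vector can be read directly as the filter's features.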

8 Claims

1. A speech analysis method, comprising the steps of:
- obtaining a speech signal and a corresponding DEGG/EGG signal;
  
  providing the speech signal as the output of a vocal tract filter in a source-filter model taking the DEGG/EGG signal as the input; and
  
  estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the features of the vocal tract filter are expressed by the state vectors of the vocal tract filter at selected time points, and the step of estimating is performed using Kalman filtering, wherein the Kalman filtering is a two-way, bi-directional Kalman filtering comprising a forward Kalman filtering in which a future state is estimated from a past state and a backward Kalman filtering in which a past state is estimated from a future state, and wherein the forward Kalman filtering comprises forward estimation, correction and forward recursion, the backward Kalman filtering comprises backward estimation, correction and backward recursion, and estimation results of the two-way Kalman filtering are a combination of estimation results of the forward Kalman filtering and estimation results of the backward Kalman filtering, wherein Kalman filtering is based on;
  
  a state function
  $$x_k = x_{k-1} + d_k,$$
  and an observation function
  $$v_k = e_k^T x_k + n_k,$$
  wherein $x_k = [x_k(0), x_k(1), \ldots, x_k(N-1)]^T$ represents the state vector to be estimated of the vocal tract filter at time point k, wherein $x_k(0), x_k(1), \ldots, x_k(N-1)$ represent N samples of the expected unit impulse response of the vocal tract filter at time k;

  $d_k = [d_k(0), d_k(1), \ldots, d_k(N-1)]^T$ represents the disturbance added to the state vector of the vocal tract filter at time k;

  $e_k = [e_k, e_{k-1}, \ldots, e_{k-N+1}]^T$ is a vector, of which the element $e_k$ represents the DEGG signal inputted at time k;

  $v_k$ represents the speech signal outputted at time k; and

  $n_k$ represents the observation noise added to the outputted speech signal at time k, and wherein

  the forward Kalman filtering comprises the steps of:

  forward estimation:
  $$x_k^{\sim} = x_{k-1}^{*},$$
  $$P_k^{\sim} = P_{k-1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  forward recursion: $k = k + 1$;

  the backward Kalman filtering comprises the steps of:

  backward estimation:
  $$x_k^{\sim} = x_{k+1}^{*};$$
  $$P_k^{\sim} = P_{k+1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  backward recursion: $k = k - 1$;

  wherein, $x_k^{\sim}$ represents the estimated state value at time point k, $x_k^{*}$ represents the corrected state value at time point k, $P_k^{\sim}$ represents the pre-estimated value of the covariance matrix of the estimation error, $P_k$ represents the corrected value of the covariance matrix of the estimation error, $Q$ represents the covariance matrix of disturbance $d_k$, $K_k$ represents the Kalman gain, $r$ represents the variance of the observation noise $n_k$, $I$ represents the unit matrix; and

  the estimation results of the two-way Kalman filtering are the combination of the estimation results of the forward Kalman filtering and those of the backward Kalman filtering using the following formula:
  $$P_k = \left( P_{k+}^{-1} + P_{k-}^{-1} \right)^{-1},$$
  $$x_k^{*} = P_k \left( P_{k+}^{-1} x_{k+}^{*} + P_{k-}^{-1} x_{k-}^{*} \right),$$
  wherein, $P_{k+}$, $x_{k+}$ are the estimated state value and the covariance of the estimation obtained by the forward Kalman filtering respectively, and $P_{k-}$, $x_{k-}$ represent the estimated state value and the covariance of the estimation obtained by the backward Kalman filtering respectively.
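
The forward and backward recursions recited in claim 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `kalman_step`, `regressor` and `kalman_pass`, the choice Q = q·I, the zero initial state, the large initial covariance `p0`, and the zero-padding of the DEGG signal before its first sample are all hypothetical. Because the observation v_k is a scalar, the bracketed term in the gain is a scalar and its inverse is an ordinary division.

```python
def kalman_step(x_prev, P_prev, e, v, q, r):
    """One estimation + correction step for the scalar observation v = e^T x + n."""
    n = len(x_prev)
    # Estimation: the state is a random walk, so the prediction is the previous
    # corrected state, and the covariance grows by Q (assumed here to be q * I).
    x_pred = list(x_prev)
    P_pred = [[P_prev[i][j] + (q if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Correction: gain K = P e / (e^T P e + r).
    Pe = [sum(P_pred[i][j] * e[j] for j in range(n)) for i in range(n)]
    s = sum(e[i] * Pe[i] for i in range(n)) + r
    K = [pe / s for pe in Pe]
    innovation = v - sum(e[i] * x_pred[i] for i in range(n))
    x_new = [x_pred[i] + K[i] * innovation for i in range(n)]
    # P = (I - K e^T) P_pred, written as P_pred - K (e^T P_pred).
    eP = [sum(e[i] * P_pred[i][j] for i in range(n)) for j in range(n)]
    P_new = [[P_pred[i][j] - K[i] * eP[j] for j in range(n)] for i in range(n)]
    return x_new, P_new


def regressor(degg, k, n):
    """e_k = [e_k, e_{k-1}, ..., e_{k-n+1}], zero before the signal starts."""
    return [degg[k - i] if k - i >= 0 else 0.0 for i in range(n)]


def kalman_pass(degg, speech, n, q, r, backward=False, p0=1e3):
    """Filter every sample, stepping k = k + 1 (forward) or k = k - 1 (backward).

    Returns the corrected state and covariance stored at each time index.
    """
    x = [0.0] * n
    P = [[p0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    count = len(speech)
    order = range(count - 1, -1, -1) if backward else range(count)
    states = [None] * count
    covs = [None] * count
    for k in order:
        x, P = kalman_step(x, P, regressor(degg, k, n), speech[k], q, r)
        states[k], covs[k] = x, P
    return states, covs
```

Calling `kalman_pass(degg, speech, n, q, r)` yields the forward estimates and `kalman_pass(degg, speech, n, q, r, backward=True)` the backward ones; the two sets are then merged per time point with the combination formula that closes the claim.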

2. The speech analysis method according to claim 1, further comprising the step of selecting and recording the estimated state values of the vocal tract filter at selected time points obtained by the Kalman filtering, as the features of the vocal tract filter.

3. A speech synthesis method, comprising the steps of:
- obtaining a DEGG/EGG signal;
  
  obtaining the features of a vocal tract filter by;
  
  obtaining a speech signal and a corresponding DEGG/EGG signal;
  
  providing the speech signal as the output of a vocal tract filter in a source-filter model taking the DEGG/EGG signal as the input; and
  
  estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the features of the vocal tract filter are expressed by the state vectors of the vocal tract filter at selected time points, and the step of estimating is performed using Kalman filtering, wherein the Kalman filtering is a two-way, bi-directional Kalman filtering comprising a forward Kalman filtering in which a future state is estimated from a past state and a backward Kalman filtering in which a past state is estimated from a future state, and wherein the forward Kalman filtering comprises forward estimation, correction and forward recursion, the backward Kalman filtering comprises backward estimation, correction and backward recursion, and estimation results of the two-way Kalman filtering are a combination of estimation results of the forward Kalman filtering and estimation results of the backward Kalman filtering; and
  
  synthesizing speech based on the DEGG/EGG signal and the obtained features of the vocal tract filter, wherein Kalman filtering is based on;
  
  a state function
  $$x_k = x_{k-1} + d_k,$$
  and an observation function
  $$v_k = e_k^T x_k + n_k,$$
  wherein $x_k = [x_k(0), x_k(1), \ldots, x_k(N-1)]^T$ represents the state vector to be estimated of the vocal tract filter at time point k, wherein $x_k(0), x_k(1), \ldots, x_k(N-1)$ represent N samples of the expected unit impulse response of the vocal tract filter at time k;

  $d_k = [d_k(0), d_k(1), \ldots, d_k(N-1)]^T$ represents the disturbance added to the state vector of the vocal tract filter at time k;

  $e_k = [e_k, e_{k-1}, \ldots, e_{k-N+1}]^T$ is a vector, of which the element $e_k$ represents the DEGG signal inputted at time k;

  $v_k$ represents the speech at time k; and

  $n_k$ represents the observation noise added to the outputted speech signal at time k, and wherein

  the forward Kalman filtering comprises the steps of:

  forward estimation:
  $$x_k^{\sim} = x_{k-1}^{*},$$
  $$P_k^{\sim} = P_{k-1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  forward recursion: $k = k + 1$;

  the backward Kalman filtering comprises the steps of:

  backward estimation:
  $$x_k^{\sim} = x_{k+1}^{*};$$
  $$P_k^{\sim} = P_{k+1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  backward recursion: $k = k - 1$;

  wherein, $x_k^{\sim}$ represents the estimated state value at time point k, $x_k^{*}$ represents the corrected state value at time point k, $P_k^{\sim}$ represents the pre-estimated value of the covariance matrix of the estimation error, $P_k$ represents the corrected value of the covariance matrix of the estimation error, $Q$ represents the covariance matrix of disturbance $d_k$, $K_k$ represents the Kalman gain, $r$ represents the variance of the observation noise $n_k$, $I$ represents the unit matrix; and

  the estimation results of the two-way Kalman filtering are the combination of the estimation results of the forward Kalman filtering and those of the backward Kalman filtering using the following formula:
  $$P_k = \left( P_{k+}^{-1} + P_{k-}^{-1} \right)^{-1},$$
  $$x_k^{*} = P_k \left( P_{k+}^{-1} x_{k+}^{*} + P_{k-}^{-1} x_{k-}^{*} \right),$$
  wherein, $P_{k+}$, $x_{k+}$ are the estimated state value and the covariance of the estimation obtained by the forward Kalman filtering respectively, and $P_{k-}$, $x_{k-}$ represent the estimated state value and the covariance of the estimation obtained by the backward Kalman filtering respectively.
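
The synthesis step of claim 3 can be sketched as a time-varying filter driven by the DEGG signal. The claim does not say how features recorded at selected time points are applied between those points; the sketch below assumes a zero-order hold (the most recent recorded impulse response stays in use until the next selected point). The function name `synthesize` and the dictionary layout of `features` are illustrative assumptions.

```python
def synthesize(degg, features):
    """Synthesize speech samples v_k = e_k^T x_k from a DEGG signal.

    features: non-empty dict mapping a sample index to the impulse response
    (list of N floats) recorded at that selected time point. Between selected
    points the latest response is held (assumption); before the first selected
    point the first response is used.
    """
    times = sorted(features)
    idx = 0
    out = []
    for k in range(len(degg)):
        # Advance to the latest selected time point that is not after k.
        while idx + 1 < len(times) and times[idx + 1] <= k:
            idx += 1
        h = features[times[idx]]
        # Inner product of the held response with the most recent DEGG samples.
        out.append(sum(h[i] * degg[k - i] for i in range(len(h)) if k - i >= 0))
    return out
```

Interpolating between neighbouring feature vectors instead of holding them would be an equally plausible design; the hold is used here only because it is the simplest choice.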

4. The speech synthesis method according to claim 3, wherein the step of obtaining the DEGG/EGG signal comprises:
- reconstructing a full DEGG/EGG signal using a DEGG/EGG signal of a single period based on a given fundamental frequency and time length.
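
One way to read the reconstruction in claim 4 is to stretch or compress the stored single period to the length implied by the fundamental frequency, then repeat it for the requested duration. The sketch below follows that reading. The function name `reconstruct_degg`, the explicit `sample_rate_hz` parameter, the constant fundamental frequency, and the cyclic linear interpolation are all assumptions; the claim itself does not specify a resampling method.

```python
def reconstruct_degg(period, f0_hz, duration_s, sample_rate_hz):
    """Tile one DEGG period into a full signal of the requested length.

    period: samples of a single DEGG period (non-empty list of floats).
    f0_hz: desired fundamental frequency, assumed constant over the signal.
    """
    target_len = max(1, round(sample_rate_hz / f0_hz))  # samples per period
    total = round(duration_s * sample_rate_hz)          # samples in the output
    m = len(period)
    # Resample the stored period to target_len samples by linear interpolation,
    # wrapping around at the end because the period is cyclic.
    resampled = []
    for j in range(target_len):
        pos = j * m / target_len
        i = int(pos)
        frac = pos - i
        resampled.append((1.0 - frac) * period[i] + frac * period[(i + 1) % m])
    # Repeat the resampled period until the requested duration is filled.
    return [resampled[k % target_len] for k in range(total)]
```

For a time-varying pitch contour, the same idea would apply period by period with a different `target_len` each time.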

5. A speech analysis apparatus, comprising:
- a processor and a storage device encoded with modules for execution by the processor, the modules including;
  
  a module for obtaining a speech signal;
  
  a module for obtaining the corresponding DEGG/EGG signal; and
  
  an estimation module for, by regarding the speech signal as the output of a vocal tract filter in a source-filter model with the DEGG/EGG signal as the input, estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the estimation module uses the state vectors of the vocal tract filter at selected time points to express the features of the vocal tract filter, and uses Kalman filtering to perform the estimation, wherein the Kalman filtering is a two-way, bi-directional Kalman filtering comprising a forward Kalman filtering in which a future state is estimated from a past state and a backward Kalman filtering in which a past state is estimated from a future state, and wherein the forward Kalman filtering comprises forward estimation, correction and forward recursion, the backward Kalman filtering comprises backward estimation, correction and backward recursion, and estimation results of the two-way Kalman filtering are a combination of estimation results of the forward Kalman filtering and estimation results of the backward Kalman filtering, wherein the Kalman filtering is based on;
  
  a state function
  $$x_k = x_{k-1} + d_k,$$
  and an observation function
  $$v_k = e_k^T x_k + n_k,$$
  wherein $x_k = [x_k(0), x_k(1), \ldots, x_k(N-1)]^T$ represents the state vector to be estimated of the vocal tract filter at time point k, wherein $x_k(0), x_k(1), \ldots, x_k(N-1)$ represent N samples of the expected unit impulse response of the vocal tract filter at time k;

  $d_k = [d_k(0), d_k(1), \ldots, d_k(N-1)]^T$ represents the disturbance added to the state vector of the vocal tract filter at time k;

  $e_k = [e_k, e_{k-1}, \ldots, e_{k-N+1}]^T$ is a vector, of which the element $e_k$ represents the DEGG signal inputted at time k;

  $v_k$ represents the speech signal outputted at time k; and

  $n_k$ represents the observation noise added to the outputted speech signal at time k, and wherein

  the forward Kalman filtering comprises the following steps:

  forward estimation:
  $$x_k^{\sim} = x_{k-1}^{*},$$
  $$P_k^{\sim} = P_{k-1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  forward recursion: $k = k + 1$;

  the backward Kalman filtering comprises the following steps:

  backward estimation:
  $$x_k^{\sim} = x_{k+1}^{*};$$
  $$P_k^{\sim} = P_{k+1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  backward recursion: $k = k - 1$;

  wherein, $x_k^{\sim}$ represents the pre-estimated state value at time point k, $x_k^{*}$ represents the corrected state value at time point k, $P_k^{\sim}$ represents the pre-estimated value of the covariance matrix of the estimation error, $P_k$ represents the corrected value of the covariance matrix of the estimation error, $Q$ represents the covariance matrix of disturbance $d_k$, $K_k$ represents the Kalman gain, $r$ represents the variance of the observation noise $n_k$, $I$ represents the unit matrix; and

  the estimation results of the two-way Kalman filter are the combination of estimation results of the forward Kalman filter and those of the backward Kalman filtering using the following formula:
  $$P_k = \left( P_{k+}^{-1} + P_{k-}^{-1} \right)^{-1},$$
  $$x_k^{*} = P_k \left( P_{k+}^{-1} x_{k+}^{*} + P_{k-}^{-1} x_{k-}^{*} \right),$$
  wherein, $P_{k+}$, $x_{k+}$ are the estimated state value and the covariance of the estimation obtained by the forward Kalman filtering respectively, and $P_{k-}$, $x_{k-}$ represent the estimated state value and the covariance of the estimation obtained by the backward Kalman filtering respectively.
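
The combination formula that closes this claim weights each directional estimate by the inverse of its covariance, so the more certain estimate dominates. A minimal sketch in plain Python follows; the helper names `invert`, `matvec` and `combine` are illustrative assumptions, and the Gauss-Jordan inverse is only one of several ways to evaluate the matrix inverses in the formula.

```python
def invert(A):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(A)
    # Augment A with the identity matrix: [A | I].
    M = [[float(v) for v in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for i in range(n):
            if i != c:
                f = M[i][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [row[n:] for row in M]


def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def combine(x_fwd, P_fwd, x_bwd, P_bwd):
    """Fuse forward and backward estimates for one time point.

    P = (P_fwd^-1 + P_bwd^-1)^-1
    x = P (P_fwd^-1 x_fwd + P_bwd^-1 x_bwd)
    """
    n = len(x_fwd)
    Pf_inv, Pb_inv = invert(P_fwd), invert(P_bwd)
    P = invert([[Pf_inv[i][j] + Pb_inv[i][j] for j in range(n)] for i in range(n)])
    info = [a + b for a, b in zip(matvec(Pf_inv, x_fwd), matvec(Pb_inv, x_bwd))]
    return matvec(P, info), P
```

In the one-dimensional case with forward variance 1 and backward variance 3, the combined variance is 0.75 and the combined estimate lies three times closer to the forward value than to the backward one.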

6. The speech analysis apparatus according to claim 5, further comprising a selection and recording module for selecting and recording the estimated state values of the vocal tract filter at selected time points obtained by the Kalman filtering, as the features of the vocal tract filter.

7. A speech synthesis apparatus, comprising:
- a processor and a storage device encoded with modules for execution by the processor, the modules including;
  
  a module for obtaining a DEGG/EGG signal;
  
  a speech analysis module comprising;
  
  a module for obtaining a speech signal;
  
  a module for obtaining the corresponding DEGG/EGG signal; and
  
  an estimation module for, by regarding the speech signal as the output of a vocal tract filter in a source-filter model with the DEGG/EGG signal as the input, estimating the features of the vocal tract filter from the speech signal as the output and the DEGG/EGG signal as the input, wherein the estimation module uses the state vectors of the vocal tract filter at selected time points to express the features of the vocal tract filter, and uses Kalman filtering to perform the estimation, wherein the Kalman filtering is a two-way, bi-directional Kalman filtering comprising a forward Kalman filtering in which a future state is estimated from a past state and a backward Kalman filtering in which a past state is estimated from a future state, and wherein the forward Kalman filtering comprises forward estimation, correction and forward recursion, the backward Kalman filtering comprises backward estimation, correction and backward recursion, and estimation results of the two-way Kalman filtering are a combination of estimation results of the forward Kalman filtering and estimation results of the backward Kalman filtering; and
  
  a speech synthesis module for synthesizing a speech signal based on the DEGG/EGG signal obtained by the module for obtaining a DEGG/EGG signal and the features of the vocal tract filter estimated by the speech analysis apparatus, wherein the Kalman filtering is based on;
  
  a state function
  $$x_k = x_{k-1} + d_k,$$
  and an observation function
  $$v_k = e_k^T x_k + n_k,$$
  wherein $x_k = [x_k(0), x_k(1), \ldots, x_k(N-1)]^T$ represents the state vector to be estimated of the vocal tract filter at time point k, wherein $x_k(0), x_k(1), \ldots, x_k(N-1)$ represent N samples of the expected unit impulse response of the vocal tract filter at time k;

  $d_k = [d_k(0), d_k(1), \ldots, d_k(N-1)]^T$ represents the disturbance added to the state vector of the vocal tract filter at time k;

  $e_k = [e_k, e_{k-1}, \ldots, e_{k-N+1}]^T$ is a vector, of which the element $e_k$ represents the DEGG signal inputted at time k;

  $v_k$ represents the speech signal outputted at time k; and

  $n_k$ represents the observation noise added to the outputted speech signal at time k, and wherein

  the forward Kalman filtering comprises the following steps:

  forward estimation:
  $$x_k^{\sim} = x_{k-1}^{*},$$
  $$P_k^{\sim} = P_{k-1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  forward recursion: $k = k + 1$;

  the backward Kalman filtering comprises the following steps:

  backward estimation:
  $$x_k^{\sim} = x_{k+1}^{*};$$
  $$P_k^{\sim} = P_{k+1} + Q$$
  correction:
  $$K_k = P_k^{\sim} e_k \left[ e_k^T P_k^{\sim} e_k + r \right]^{-1}$$
  $$x_k^{*} = x_k^{\sim} + K_k \left[ v_k - e_k^T x_k^{\sim} \right]$$
  $$P_k = \left[ I - K_k e_k^T \right] P_k^{\sim}$$
  backward recursion: $k = k - 1$;

  wherein, $x_k^{\sim}$ represents the pre-estimated state value at time point k, $x_k^{*}$ represents the corrected state value at time point k, $P_k^{\sim}$ represents the pre-estimated value of the covariance matrix of the estimation error, $P_k$ represents the corrected value of the covariance matrix of the estimation error, $Q$ represents the covariance matrix of disturbance $d_k$, $K_k$ represents the Kalman gain, $r$ represents the variance of the observation noise $n_k$, $I$ represents the unit matrix; and

  the estimation results of the two-way Kalman filter are the combination of estimation results of the forward Kalman filter and those of the backward Kalman filtering using the following formula:
  $$P_k = \left( P_{k+}^{-1} + P_{k-}^{-1} \right)^{-1},$$
  $$x_k^{*} = P_k \left( P_{k+}^{-1} x_{k+}^{*} + P_{k-}^{-1} x_{k-}^{*} \right),$$
  wherein, $P_{k+}$, $x_{k+}$ are the estimated state value and the covariance of the estimation obtained by the forward Kalman filtering respectively, and $P_{k-}$, $x_{k-}$ represent the estimated state value and the covariance of the estimation obtained by the backward Kalman filtering respectively.

8. The speech synthesis apparatus according to claim 7, wherein the module for obtaining a DEGG/EGG signal is further configured to reconstruct a full DEGG/EGG signal using a DEGG/EGG signal of a single period based on a given fundamental frequency and time length.

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Jiang, Dan Ning; Meng, Fan Ping; Qin, Yong; Shuang, Zhi Wei
Primary Examiner(s): Saint Cyr, Leonard
Application Number: US12/061,645
Publication Number: US 20080288258A1
Time in Patent Office: 1,643 Days
Field of Search: None
US Class Current: 704/261
CPC Class Codes:
G10L 13/04 Details of speech synthesis...
G10L 25/48 specially adapted for parti...
