US 7,643,990 B1 - Global boundary-centric feature extraction and associated discontinuity metrics

Abstract

Portions from time-domain speech segments are extracted. Feature vectors that represent the portions in a vector space are created. The feature vectors incorporate phase information of the portions. A distance between the feature vectors in the vector space is determined. In one aspect, the feature vectors are created by constructing a matrix W from the portions and decomposing the matrix W. In one aspect, decomposing the matrix W comprises extracting global boundary-centric features from the portions. In one aspect, the portions include at least one pitch period. In another aspect, the portions include centered pitch periods.
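
The steps summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names (`build_matrix`, `feature_vectors`, `closeness`, `d0`), the random toy data, and the chosen rank are assumptions made for illustration only. Zero padding to the longest pitch period, scaling the rows of U by Σ, and the cosine closeness follow the wording of the dependent claims.

```python
import numpy as np

def build_matrix(pitch_periods):
    """Stack variable-length pitch periods as rows of W, zero padded to N samples."""
    n = max(len(p) for p in pitch_periods)      # N: longest pitch period
    w = np.zeros((len(pitch_periods), n))
    for row, period in enumerate(pitch_periods):
        w[row, :len(period)] = period
    return w

def feature_vectors(w, rank):
    """Return one feature vector per row of W: row i is u_i * Sigma (rank-truncated SVD)."""
    u, s, _ = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank]               # scales column r of U by s_r

def closeness(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def d0(a, b):
    """Distance between two pitch periods, taken as 1 - closeness."""
    return 1.0 - closeness(a, b)

# Toy data (assumption): four random "pitch periods" of different lengths.
rng = np.random.default_rng(0)
periods = [rng.standard_normal(length) for length in (80, 96, 88, 100)]
W = build_matrix(periods)
F = feature_vectors(W, rank=3)
print(W.shape, F.shape, d0(F[0], F[1]))
```

In a real voice table the number of rows would be far larger than the retained rank R; the toy sizes here only keep the example short.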

279 Citations

100 Claims

1. A machine-implemented method comprising:
- i. extracting, via a microprocessor, portions from speech segments, the portions surrounding a segment boundary within a phoneme;
  
  ii. identifying time samples from the portions;
  
  iii. constructing a matrix W containing first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and second data corresponding to the portions;
  
  iv. deriving feature vectors that represent the portions in a vector space by decomposing the matrix W containing the first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and the second data corresponding to the portions; and
  
  v. determining a distance between the feature vectors in the vector space.
- - 2. The machine-implemented method of claim 1, wherein decomposing the matrix W comprises extracting global boundary-centric features from the portions.
  - 3. The machine-implemented method of claim 1, wherein the speech segments each include the segment boundary within the phoneme.
  - 4. The machine-implemented method of claim 3, wherein the speech segments each include at least one diphone.
  - 5. The machine-implemented method of claim 4, wherein the portions include at least one pitch period.
  - 6. The machine-implemented method of claim 5, wherein decomposing the matrix W comprises performing a pitch synchronous singular value analysis on the pitch periods of the time-domain segments.
  - 7. The machine-implemented method of claim 5, wherein the matrix W is a $2KM \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where K is the number of pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $2KM \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le 2KM)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll 2KM$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 8. The machine-implemented method of claim 7, wherein the pitch periods are zero padded to N samples.
  - 9. The machine-implemented method of claim 8, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 10. The machine-implemented method of claim 9, wherein the distance between two feature vectors is determined by a metric comprising the cosine of the angle between the two feature vectors.
  - 11. The machine-implemented method of claim 10, wherein the metric comprises a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le 2KM$.
  - 12. The machine-implemented method of claim 11, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) = 1 - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods $p_1$ and $q_1$, $p_1$ is the last pitch period of $S_1$, $q_1$ is the first pitch period of $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, and $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$.
  - 13. The machine-implemented method of claim 12, wherein the calculation for the difference between two segments in the voice table, $S_1$ and $S_2$, is expanded to include a plurality of pitch periods from each segment.
  - 14. The machine-implemented method of claim 12, wherein the difference between two segments in the voice table, $S_1$ and $S_2$, is associated with a discontinuity between $S_1$ and $S_2$.
  - 15. The machine-implemented method of claim 11, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) - \frac{d_0(p_1, \bar{p}_1) + d_0(q_1, \bar{q}_1)}{2} = \frac{C(\overline{u}_{p_1}, \overline{u}_{\bar{p}_1}) + C(\overline{u}_{q_1}, \overline{u}_{\bar{q}_1})}{2} - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods, $p_1$ is the last pitch period of $S_1$, $\bar{p}_1$ is the first pitch period of a segment contiguous to $S_1$, $q_1$ is the first pitch period of $S_2$, $\bar{q}_1$ is the last pitch period of a segment contiguous to $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$, $\overline{u}_{\bar{p}_1}$ is a feature vector associated with pitch period $\bar{p}_1$, and $\overline{u}_{\bar{q}_1}$ is a feature vector associated with pitch period $\bar{q}_1$.
  - 16. The machine-implemented method of claim 1, further comprising associating the distance between the feature vectors with speech segments in a voice table.
  - 17. The machine-implemented method of claim 16, further comprising:
    - selecting speech segments from the voice table based on the distance between the feature vectors.
  - 18. The machine-implemented method of claim 4, wherein the portions include centered pitch periods.
  - 19. The machine-implemented method of claim 18, wherein the matrix W is a $(2(K-1)+1)M \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where $K-1$ is the number of centered pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the centered pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $(2(K-1)+1)M \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le (2(K-1)+1)M)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll (2(K-1)+1)M$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 20. The machine-implemented method of claim 19, wherein the centered pitch periods are symmetrically zero padded to N samples.
  - 21. The machine-implemented method of claim 20, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a centered pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 22. The machine-implemented method of claim 21, wherein the distance between two feature vectors is determined by a metric comprising a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le (2(K-1)+1)M$.
  - 23. The machine-implemented method of claim 22, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = C(u_{\pi_{-1}}, u_{\delta_0}) + C(u_{\delta_0}, u_{\sigma_1}) - C(u_{\pi_{-1}}, u_{\pi_0}) - C(u_{\sigma_0}, u_{\sigma_1})$
    where $u_{\pi_{-1}}$ is a feature vector associated with a centered pitch period $\pi_{-1}$, $u_{\delta_0}$ is a feature vector associated with a centered pitch period $\delta_0$, $u_{\sigma_1}$ is a feature vector associated with a centered pitch period $\sigma_1$, $u_{\pi_0}$ is a feature vector associated with a centered pitch period $\pi_0$, and $u_{\sigma_0}$ is a feature vector associated with a centered pitch period $\sigma_0$.
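
The two segment-difference formulas recited in the method claims (the form with contiguous pitch periods and the form with centered pitch periods) can be written as small functions over already computed feature vectors. This is a sketch under assumptions: the function and parameter names are hypothetical, and the feature vectors are taken to be plain NumPy arrays produced elsewhere.

```python
import numpy as np

def closeness(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relative_discontinuity(u_p1, u_q1, u_p1_bar, u_q1_bar):
    """Boundary distance d0(p1, q1) minus the mean of d0(p1, p1_bar) and d0(q1, q1_bar)."""
    def d0(a, b):
        return 1.0 - closeness(a, b)
    return d0(u_p1, u_q1) - 0.5 * (d0(u_p1, u_p1_bar) + d0(u_q1, u_q1_bar))

def centered_discontinuity(u_pi_m1, u_pi_0, u_delta_0, u_sigma_0, u_sigma_1):
    """C(pi_-1, delta_0) + C(delta_0, sigma_1) - C(pi_-1, pi_0) - C(sigma_0, sigma_1)."""
    return (closeness(u_pi_m1, u_delta_0) + closeness(u_delta_0, u_sigma_1)
            - closeness(u_pi_m1, u_pi_0) - closeness(u_sigma_0, u_sigma_1))

# Toy vectors (assumption): orthogonal boundary periods, each identical to its contiguous period.
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
rel = relative_discontinuity(x, y, x, y)
cen = centered_discontinuity(x, x, x, x, x)
print(rel, cen)
```

Because $d_0 = 1 - C$, the first function is algebraically the same as averaging the two contiguous closeness values and subtracting the boundary closeness, which is the second equality in that claim.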

24. A machine-readable medium storing instructions to cause a machine to perform a machine-implemented method comprising:
- i. extracting portions from speech segments that surround a segment boundary within a phoneme;
  
  ii. identifying time samples from the portions;
  
  iii. constructing a matrix W containing first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and second data corresponding to the portions;
  
  iv. deriving feature vectors that represent the portions in a vector space by decomposing the matrix W containing the first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and the second data corresponding to the portions; and
  
  v. determining a distance between the feature vectors in the vector space.
- - 25. The machine-readable medium of claim 24, wherein decomposing the matrix W comprises extracting global boundary-centric features from the portions.
  - 26. The machine-readable medium of claim 24, wherein the speech segments each include the segment boundary within the phoneme.
  - 27. The machine-readable medium of claim 26, wherein the speech segments each include at least one diphone.
  - 28. The machine-readable medium of claim 27, wherein the portions include at least one pitch period.
  - 29. The machine-readable medium of claim 28, wherein decomposing the matrix W comprises performing a pitch synchronous singular value analysis on the pitch periods of the time-domain segments.
  - 30. The machine-readable medium of claim 28, wherein the matrix W is a $2KM \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where K is the number of pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $2KM \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le 2KM)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll 2KM$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 31. The machine-readable medium of claim 30, wherein the pitch periods are zero padded to N samples.
  - 32. The machine-readable medium of claim 31, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 33. The machine-readable medium of claim 32, wherein the distance between two feature vectors is determined by a metric comprising the cosine of the angle between the two feature vectors.
  - 34. The machine-readable medium of claim 33, wherein the metric comprises a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le 2KM$.
  - 35. The machine-readable medium of claim 34, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) = 1 - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods $p_1$ and $q_1$, $p_1$ is the last pitch period of $S_1$, $q_1$ is the first pitch period of $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, and $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$.
  - 36. The machine-readable medium of claim 35, wherein the calculation for the difference between two segments in the voice table, $S_1$ and $S_2$, is expanded to include a plurality of pitch periods from each segment.
  - 37. The machine-readable medium of claim 35, wherein the difference between two segments in the voice table, $S_1$ and $S_2$, is associated with a discontinuity between $S_1$ and $S_2$.
  - 38. The machine-readable medium of claim 34, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) - \frac{d_0(p_1, \bar{p}_1) + d_0(q_1, \bar{q}_1)}{2} = \frac{C(\overline{u}_{p_1}, \overline{u}_{\bar{p}_1}) + C(\overline{u}_{q_1}, \overline{u}_{\bar{q}_1})}{2} - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods, $p_1$ is the last pitch period of $S_1$, $\bar{p}_1$ is the first pitch period of a segment contiguous to $S_1$, $q_1$ is the first pitch period of $S_2$, $\bar{q}_1$ is the last pitch period of a segment contiguous to $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$, $\overline{u}_{\bar{p}_1}$ is a feature vector associated with pitch period $\bar{p}_1$, and $\overline{u}_{\bar{q}_1}$ is a feature vector associated with pitch period $\bar{q}_1$.
  - 39. The machine-readable medium of claim 24, wherein the method further comprises associating the distance between the feature vectors with speech segments in a voice table.
  - 40. The machine-readable medium of claim 39, wherein the method further comprises:
    - selecting speech segments from the voice table based on the distance between the feature vectors.
  - 41. The machine-readable medium of claim 27, wherein the portions include centered pitch periods.
  - 42. The machine-readable medium of claim 41, wherein the matrix W is a $(2(K-1)+1)M \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where $K-1$ is the number of centered pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the centered pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $(2(K-1)+1)M \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le (2(K-1)+1)M)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll (2(K-1)+1)M$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 43. The machine-readable medium of claim 42, wherein the centered pitch periods are symmetrically zero padded to N samples.
  - 44. The machine-readable medium of claim 43, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a centered pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 45. The machine-readable medium of claim 44, wherein the distance between two feature vectors is determined by a metric comprising a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le (2(K-1)+1)M$.
  - 46. The machine-readable medium of claim 45, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = C(u_{\pi_{-1}}, u_{\delta_0}) + C(u_{\delta_0}, u_{\sigma_1}) - C(u_{\pi_{-1}}, u_{\pi_0}) - C(u_{\sigma_0}, u_{\sigma_1})$
    where $u_{\pi_{-1}}$ is a feature vector associated with a centered pitch period $\pi_{-1}$, $u_{\delta_0}$ is a feature vector associated with a centered pitch period $\delta_0$, $u_{\sigma_1}$ is a feature vector associated with a centered pitch period $\sigma_1$, $u_{\pi_0}$ is a feature vector associated with a centered pitch period $\pi_0$, and $u_{\sigma_0}$ is a feature vector associated with a centered pitch period $\sigma_0$.
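
For the centered-pitch-period variant, the claims call for symmetric zero padding to N samples and a matrix with $(2(K-1)+1)M$ rows. The sketch below shows one way this could look; `symmetric_zero_pad` and `build_centered_matrix` are hypothetical names, the all-ones toy periods are made up, and placing any odd leftover sample on the right is an assumption the claims do not specify.

```python
import numpy as np

def symmetric_zero_pad(period, n):
    """Center `period` in a length-n vector with zeros on both sides."""
    period = np.asarray(period, dtype=float)
    extra = n - len(period)
    if extra < 0:
        raise ValueError("period is longer than the target length")
    left = extra // 2                      # assumption: odd leftover sample goes on the right
    return np.pad(period, (left, extra - left))

def build_centered_matrix(centered_periods):
    """Stack symmetrically padded centered pitch periods as the rows of W."""
    n = max(len(p) for p in centered_periods)   # N: longest centered pitch period
    return np.vstack([symmetric_zero_pad(p, n) for p in centered_periods])

K, M = 3, 2                                # toy sizes (assumption)
rows = (2 * (K - 1) + 1) * M               # row count recited in the claims
periods = [np.ones(5 + 2 * (i % 3)) for i in range(rows)]   # lengths 5, 7, 9
W = build_centered_matrix(periods)
print(W.shape)
print(W[0])
```

Symmetric padding keeps the center of each period aligned in the same column of W, which is the apparent reason for distinguishing it from one-sided padding.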

47. An apparatus comprising:
- means for extracting portions from speech segments, the portions surrounding a segment boundary within a phoneme;
  
  means for identifying time samples from the portions;
  
  means for constructing a matrix W containing first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and second data corresponding to the portions; and
  
  means for deriving feature vectors that represent the portions in a vector space by decomposing the matrix W containing the first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and the second data corresponding to the portions; and
  
  means for determining a distance between the feature vectors in the vector space.
- - 48. The apparatus of claim 47, wherein the means for decomposing the matrix W comprises means for extracting global boundary-centric features from the portions.
  - 49. The apparatus of claim 47, wherein the speech segments each include the segment boundary within the phoneme.
  - 50. The apparatus of claim 49, wherein the speech segments each include at least one diphone.
  - 51. The apparatus of claim 50, wherein the portions include at least one pitch period.
  - 52. The apparatus of claim 51, wherein the means for decomposing the matrix W comprises means for performing a pitch synchronous singular value analysis on the pitch periods of the time-domain segments.
  - 53. The apparatus of claim 51, wherein the matrix W is a $2KM \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where K is the number of pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $2KM \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le 2KM)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll 2KM$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 54. The apparatus of claim 53, wherein the pitch periods are zero padded to N samples.
  - 55. The apparatus of claim 54, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 56. The apparatus of claim 55, wherein the distance between two feature vectors is determined by a metric comprising the cosine of the angle between the two feature vectors.
  - 57. The apparatus of claim 56, wherein the metric comprises a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le 2KM$.
  - 58. The apparatus of claim 57, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) = 1 - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods $p_1$ and $q_1$, $p_1$ is the last pitch period of $S_1$, $q_1$ is the first pitch period of $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, and $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$.
  - 59. The apparatus of claim 58, wherein the calculation for the difference between two segments in the voice table, $S_1$ and $S_2$, is expanded to include a plurality of pitch periods from each segment.
  - 60. The apparatus of claim 58, wherein the difference between two segments in the voice table, $S_1$ and $S_2$, is associated with a discontinuity between $S_1$ and $S_2$.
  - 61. The apparatus of claim 57, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) - \frac{d_0(p_1, \bar{p}_1) + d_0(q_1, \bar{q}_1)}{2} = \frac{C(\overline{u}_{p_1}, \overline{u}_{\bar{p}_1}) + C(\overline{u}_{q_1}, \overline{u}_{\bar{q}_1})}{2} - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods, $p_1$ is the last pitch period of $S_1$, $\bar{p}_1$ is the first pitch period of a segment contiguous to $S_1$, $q_1$ is the first pitch period of $S_2$, $\bar{q}_1$ is the last pitch period of a segment contiguous to $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$, $\overline{u}_{\bar{p}_1}$ is a feature vector associated with pitch period $\bar{p}_1$, and $\overline{u}_{\bar{q}_1}$ is a feature vector associated with pitch period $\bar{q}_1$.
  - 62. The apparatus of claim 47, further comprising means for associating the distance between the feature vectors with speech segments in a voice table.
  - 63. The apparatus of claim 62, further comprising:
    - means for selecting speech segments from the voice table based on the distance between the feature vectors.
  - 64. The apparatus of claim 50, wherein the portions include centered pitch periods.
  - 65. The apparatus of claim 64, wherein the matrix W is a $(2(K-1)+1)M \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where $K-1$ is the number of centered pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the centered pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $(2(K-1)+1)M \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le (2(K-1)+1)M)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll (2(K-1)+1)M$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 66. The apparatus of claim 65, wherein the centered pitch periods are symmetrically zero padded to N samples.
  - 67. The apparatus of claim 66, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a centered pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 68. The apparatus of claim 67, wherein the distance between two feature vectors is determined by a metric comprising a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le (2(K-1)+1)M$.
  - 69. The apparatus of claim 68, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = C(u_{\pi_{-1}}, u_{\delta_0}) + C(u_{\delta_0}, u_{\sigma_1}) - C(u_{\pi_{-1}}, u_{\pi_0}) - C(u_{\sigma_0}, u_{\sigma_1})$
    where $u_{\pi_{-1}}$ is a feature vector associated with a centered pitch period $\pi_{-1}$, $u_{\delta_0}$ is a feature vector associated with a centered pitch period $\delta_0$, $u_{\sigma_1}$ is a feature vector associated with a centered pitch period $\sigma_1$, $u_{\pi_0}$ is a feature vector associated with a centered pitch period $\pi_0$, and $u_{\sigma_0}$ is a feature vector associated with a centered pitch period $\sigma_0$.
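
The claims on associating distances with speech segments in a voice table and selecting segments based on those distances can be pictured as a lookup followed by a minimum search. The sketch below is purely illustrative: the dictionary layout, the segment identifiers, the distance values, and `select_next_segment` are all hypothetical, and the claims do not state that the smallest distance must be chosen.

```python
def select_next_segment(distance_table, left_id, candidates):
    """Pick the candidate whose stored distance to `left_id` is smallest.

    `distance_table` maps (left segment id, right segment id) to a
    precomputed discontinuity value such as d(S1, S2).
    """
    return min(candidates, key=lambda right_id: distance_table[(left_id, right_id)])

# Made-up distances between one left segment and three candidate right segments.
toy_distances = {
    ("a1", "b1"): 0.42,
    ("a1", "b2"): 0.07,
    ("a1", "b3"): 0.31,
}
best = select_next_segment(toy_distances, "a1", ["b1", "b2", "b3"])
print(best)
```

Precomputing the table means the decomposition is done once for the voice table, and selection reduces to comparing stored numbers.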

70. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  wherein the processing unit is configured, for a process, to extract portions from speech segments, the portions surrounding a segment boundary within a phoneme, identify time samples from the portions;
  
  construct a matrix W containing first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and second data corresponding to the portions, and derive feature vectors that represent the portions in a vector space by decomposing the matrix W containing the first data corresponding to the time samples from the portions surrounding the segment boundary within the phoneme and the second data corresponding to the portions, and determine a distance between the feature vectors in the vector space.
- - 71. The system of claim 70, wherein the process further causes the processing unit, when decomposing the matrix W, to extract global boundary-centric features from the portions.
  - 72. The system of claim 70, wherein the speech segments each include the segment boundary within the phoneme.
  - 73. The system of claim 72, wherein the speech segments each include at least one diphone.
  - 74. The system of claim 73, wherein the portions include at least one pitch period.
  - 75. The system of claim 74, wherein the process further causes the processing unit, when decomposing the matrix W, to perform a pitch synchronous singular value analysis on the pitch periods of the time-domain segments.
  - 76. The system of claim 74, wherein the matrix W is a $2KM \times N$ matrix represented by
    $W = U \Sigma V^{T}$
    where K is the number of pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the $2KM \times R$ left singular matrix with row vectors $u_i$ $(1 \le i \le 2KM)$, $\Sigma$ is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \ldots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ $(1 \le j \le N)$, $R \ll 2KM$, and $^{T}$ denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 77. The system of claim 76, wherein the pitch periods are zero padded to N samples.
  - 78. The system of claim 77, wherein a feature vector $\overline{u}_i$ is calculated as
    $\overline{u}_i = u_i \Sigma$
    where $u_i$ is a row vector associated with a pitch period i, and $\Sigma$ is the singular diagonal matrix.
  - 79. The system of claim 78, wherein the distance between two feature vectors is determined by a metric comprising the cosine of the angle between the two feature vectors.
  - 80. The system of claim 79, wherein the metric comprises a closeness measure, C, between two feature vectors, $\overline{u}_k$ and $\overline{u}_l$, wherein C is calculated as
    $C(\overline{u}_k, \overline{u}_l) = \cos(u_k \Sigma, u_l \Sigma) = \frac{u_k \Sigma^{2} u_l^{T}}{\| u_k \Sigma \| \, \| u_l \Sigma \|}$
    for any $1 \le k, l \le 2KM$.
  - 81. The system of claim 80, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) = 1 - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods $p_1$ and $q_1$, $p_1$ is the last pitch period of $S_1$, $q_1$ is the first pitch period of $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, and $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$.
  - 82. The system of claim 81, wherein the calculation for the difference between two segments in the voice table, $S_1$ and $S_2$, is expanded to include a plurality of pitch periods from each segment.
  - 83. The system of claim 81, wherein the difference between two segments in the voice table, $S_1$ and $S_2$, is associated with a discontinuity between $S_1$ and $S_2$.
  - 84. The system of claim 80, wherein a difference $d(S_1, S_2)$ between two segments in the voice table, $S_1$ and $S_2$, is calculated as
    $d(S_1, S_2) = d_0(p_1, q_1) - \frac{d_0(p_1, \bar{p}_1) + d_0(q_1, \bar{q}_1)}{2} = \frac{C(\overline{u}_{p_1}, \overline{u}_{\bar{p}_1}) + C(\overline{u}_{q_1}, \overline{u}_{\bar{q}_1})}{2} - C(\overline{u}_{p_1}, \overline{u}_{q_1})$
    where $d_0$ is the distance between pitch periods, $p_1$ is the last pitch period of $S_1$, $\bar{p}_1$ is the first pitch period of a segment contiguous to $S_1$, $q_1$ is the first pitch period of $S_2$, $\bar{q}_1$ is the last pitch period of a segment contiguous to $S_2$, $\overline{u}_{p_1}$ is a feature vector associated with pitch period $p_1$, $\overline{u}_{q_1}$ is a feature vector associated with pitch period $q_1$, $\overline{u}_{\bar{p}_1}$ is a feature vector associated with pitch period $\bar{p}_1$, and $\overline{u}_{\bar{q}_1}$ is a feature vector associated with pitch period $\bar{q}_1$.
  - 85. The system of claim 70, wherein the process further causes the processing unit to associate the distance between the feature vectors with speech segments in a voice table.
  - 86. The system of claim 85, wherein the process further causes the processing unit to select speech segments from the voice table based on the distance between the feature vectors.
  - 87. The system of claim 73, wherein the portions include centered pitch periods.
  - 88. The system of claim 87, wherein the matrix W is a (2(K−1)+1)M×N matrix represented by W=UΣV^T where K−1 is the number of centered pitch periods near the segment boundary extracted from each segment, N is the maximum number of samples among the centered pitch periods, M is the number of segments in a voice table having a segment boundary within the phoneme, U is the (2(K−1)+1)M×R left singular matrix with row vectors u_i (1≦i≦(2(K−1)+1)M), Σ is the R×R diagonal matrix of singular values s₁≧s₂≧. . .≧s_R>0, V is the N×R right singular matrix with row vectors v_j (1≦j≦N), R≪(2(K−1)+1)M, and ^T denotes matrix transposition, wherein decomposing the matrix W comprises performing a singular value decomposition of W.
  - 89. The system of claim 88, wherein the centered pitch periods are symmetrically zero padded to N samples.
  - 90. The system of claim 89, wherein a feature vector ū_i is calculated as ū_i=u_iΣ where u_i is a row vector associated with a centered pitch period i, and Σ is the singular diagonal matrix.
  - 91. The system of claim 90, wherein the distance between two feature vectors is determined by a metric comprising a closeness measure, C, between two feature vectors, ū_k and ū_l, wherein C is calculated as $C({\overline{u}}_{k}, {\overline{u}}_{l}) = \cos(u_{k} \Sigma, u_{l} \Sigma) = \frac{u_{k} \Sigma^{2} u_{l}^{T}}{\| u_{k} \Sigma \| \, \| u_{l} \Sigma \|}$ for any 1≦k, l≦(2(K−1)+1)M.
  - 92. The system of claim 91, wherein a difference d(S₁,S₂) between two segments in the voice table, S₁ and S₂, is calculated as $d(S_{1}, S_{2}) = C(u_{\pi_{-1}}, u_{\delta_{0}}) + C(u_{\delta_{0}}, u_{\sigma_{1}}) - C(u_{\pi_{-1}}, u_{\pi_{0}}) - C(u_{\sigma_{0}}, u_{\sigma_{1}})$ where $u_{\pi_{-1}}$ is a feature vector associated with a centered pitch period π₋₁, $u_{\delta_{0}}$ is a feature vector associated with a centered pitch period δ₀, $u_{\sigma_{1}}$ is a feature vector associated with a centered pitch period σ₁, $u_{\pi_{0}}$ is a feature vector associated with a centered pitch period π₀, and $u_{\sigma_{0}}$ is a feature vector associated with a centered pitch period σ₀.
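
The computations recited in claims 89 through 92 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`pad_symmetric`, `feature_vector`, `closeness`, `difference`) are hypothetical, the rows of U and the singular values are assumed to have been computed already, and placing any odd leftover zero on the right when padding is an assumption the claims do not address. The claims listed here also do not spell out the roles of π₋₁, π₀, δ₀, σ₀ and σ₁, so the parameters only mirror the symbols.

```python
import math

def pad_symmetric(period, n):
    """Zero-pad a centered pitch period to n samples, split evenly on both sides (claim 89)."""
    extra = n - len(period)
    if extra < 0:
        raise ValueError("period is longer than n samples")
    left = extra // 2  # assumption: an odd leftover zero goes on the right
    return [0.0] * left + list(period) + [0.0] * (extra - left)

def feature_vector(u_row, singular_values):
    """Scale a row u_i of U by the diagonal of Sigma, giving u_i * Sigma (claim 90)."""
    return [u * s for u, s in zip(u_row, singular_values)]

def closeness(a, b):
    """Cosine of the angle between two already-scaled feature vectors (claim 91)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def difference(pi_m1, pi_0, delta_0, sigma_0, sigma_1):
    """Four-term difference d(S1, S2) of claim 92, given scaled feature vectors."""
    return (closeness(pi_m1, delta_0) + closeness(delta_0, sigma_1)
            - closeness(pi_m1, pi_0) - closeness(sigma_0, sigma_1))
```

One property follows directly from the formula: when π₀, δ₀ and σ₀ all have the same feature vector, the four terms cancel and `difference` returns zero up to rounding.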

93. A machine-implemented method comprising:
- i. gathering, via a microprocessor, time-domain samples from recorded speech segments, wherein the time-domain samples include time samples of pitch periods surrounding a segment boundary within a phoneme;
  
  ii. constructing a matrix containing first data corresponding to the time samples of the pitch periods surrounding the segment boundary within the phoneme and second data corresponding to the pitch periods and deriving feature vectors that represent the time samples in a vector space by decomposing the matrix containing the first data corresponding to the time samples of the pitch periods surrounding the segment boundary within the phoneme and the second data corresponding to the pitch periods; and
  
  iii. determining a discontinuity between the segments, the discontinuity based on a distance between the features.
- View Dependent Claims (94)
- - 94. The machine-implemented method of claim 93, wherein the features incorporate phase information of the pitch periods.
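
Claims 93 and 94 leave the decomposition and the distance unspecified. The sketch below, which assumes NumPy is available, borrows the singular value decomposition of claim 88 and the cosine closeness of claim 91 to show one way the steps could fit together and why features built from time-domain samples retain phase information. The names `boundary_features` and `closeness`, the `rank` parameter, and the toy pitch periods in the usage note are all illustrative assumptions.

```python
import numpy as np

def boundary_features(pitch_periods, rank):
    """Stack zero-padded pitch periods as rows of W, decompose W = U S V^T,
    and return the rows of U scaled by the leading `rank` singular values."""
    n = max(len(p) for p in pitch_periods)
    W = np.zeros((len(pitch_periods), n))
    for i, p in enumerate(pitch_periods):
        left = (n - len(p)) // 2          # symmetric zero padding
        W[i, left:left + len(p)] = p
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank]

def closeness(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For example, with the hypothetical periods `a = [0, 1, 2, 1, 0, -1, -2, -1]` and `b = [1, -1, 1, -1, 1, -1]`, calling `boundary_features([a, a, [-x for x in a], b], rank=2)` gives the first two rows a closeness of approximately 1, while the first and third rows (the same waveform with inverted polarity) give approximately -1. A feature built only from a magnitude spectrum could not separate those two rows.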

95. A machine-readable medium storing instructions to cause a machine to perform a machine-implemented method comprising:
- i. gathering time-domain samples from recorded speech segments, wherein the time-domain samples include time samples of pitch periods surrounding a segment boundary within a phoneme;
  
  ii. constructing a matrix containing first data corresponding to the time samples of the pitch periods surrounding the segment boundary within the phoneme and second data corresponding to the pitch periods and deriving feature vectors that represent the time samples in a vector space by decomposing the matrix containing the first data corresponding to the time samples of the pitch periods surrounding the segment boundary within the phoneme and the second data corresponding to the pitch periods; and
  
  iii. determining a discontinuity between the segments, the discontinuity based on a distance between the features.
- View Dependent Claims (96)
- - 96. The machine-readable medium of claim 95, wherein the features incorporate phase information of the pitch periods.

97. An apparatus comprising:
- means for gathering time-domain samples from recorded speech segments, wherein the time-domain samples include time samples of pitch periods surrounding a segment boundary within a phoneme;
  
  means for constructing a matrix containing first data corresponding to the time samples of the pitch periods surrounding the segment boundary within the phoneme and second data corresponding to the pitch periods and deriving feature vectors that represent the time samples in a vector space by decomposing the matrix containing the first data corresponding to the time domain samples of the pitch periods surrounding the segment boundary within the phoneme and the second data corresponding to the pitch periods;
  
  means for determining a discontinuity between the segments, the discontinuity based on a distance between the features.
- View Dependent Claims (98)
- - 98. The apparatus of claim 97, wherein the features incorporate phase information of the pitch periods.

99. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  a process executed from the memory by the processing unit to cause the processing unit to gather time-domain samples from recorded speech segments, wherein the time-domain samples include time samples of pitch periods surrounding a segment boundary within a phoneme, constructing a matrix containing first data corresponding to the time-domain samples of the pitch periods surrounding the segment boundary within the phoneme and second data corresponding to the pitch periods and deriving feature vectors that represent the time samples in a vector space by decomposing the matrix containing the first data corresponding to the time domain samples of the pitch periods surrounding the segment boundary within the phoneme and the second data corresponding to the pitch periods; and
  
  determine a discontinuity between the segments, the discontinuity based on a distance between the features.
- View Dependent Claims (100)
- - 100. The system of claim 99, wherein the features incorporate phase information of the pitch periods.
