Methods and apparatus related to pruning for concatenative text-to-speech synthesis

US 20080091428A1
Filed: 10/10/2006
Published: 04/17/2008
Est. Priority Date: 10/10/2006
Status: Active Grant

Abstract

The present invention provides, among other things, automatic identification of near-redundant units in a large TTS voice table, identifying which units are distinctive enough to keep and which are sufficiently redundant to discard. According to an aspect of the invention, pruning is treated as a clustering problem in a suitable feature space. All instances of a given unit (e.g., a word, or characters expressed as Unicode strings) are mapped onto the feature space and clustered in that space using a suitable similarity measure. Since all units in a given cluster are, by construction, closely related from the point of view of the measure used, they are suitably redundant and can be replaced by a single instance. The disclosed method can detect near-redundancy in TTS units in a completely unsupervised manner, based on an original feature extraction and clustering strategy. Each unit can be processed in parallel, and the algorithm is fully scalable, with a pruning factor that a user can set through the near-redundancy criterion. In an exemplary implementation, a matrix-style modal analysis via Singular Value Decomposition (SVD) is performed on the matrix of the observed instances for the given word unit, so that each row of the matrix is associated with a feature vector, which can then be clustered using an appropriate closeness measure. Pruning is achieved by mapping each instance to the centroid of its cluster.
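
The feature extraction step can be illustrated with a minimal sketch, assuming NumPy is available. It stacks the instances of one unit as rows of a zero-padded matrix W, takes a thin SVD, and scales each row of U by the singular values (the `u_i S` form given in the claims). The function name `svd_feature_vectors`, the tolerance used to drop numerically zero singular values, and the toy sample values are assumptions made for illustration, not details from the patent.

```python
import numpy as np

def svd_feature_vectors(instances):
    """Return (features, W): one feature vector per instance, plus the padded matrix."""
    m = len(instances)                        # M: number of instances
    n = max(len(x) for x in instances)        # N: longest instance, in samples
    w = np.zeros((m, n))                      # M x N matrix, zero padded on the right
    for i, samples in enumerate(instances):
        w[i, :len(samples)] = samples
    # Thin SVD, W = U S V^T; NumPy returns singular values in descending order.
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    # Keep only singular values that are positive up to an assumed tolerance.
    r = int(np.sum(s > 1e-10 * s[0])) if s[0] > 0 else 0
    features = u[:, :r] * s[:r]               # row i is u_i S
    return features, w

# Three toy "instances" of unequal length (hypothetical sample values).
instances = [
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.0, 0.0, -1.0, 0.0],
    [1.0, 1.0, 1.0],
]
features, w = svd_feature_vectors(instances)
```

Because V has orthonormal columns, Euclidean distances between rows of `features` match distances between the corresponding rows of W, so the first two instances, which are identical after padding, map to the same feature vector.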

104 Claims

1. A machine-implemented method comprising:
- pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 2. The machine-implemented method of claim 1 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 3. The machine-implemented method of claim 1 wherein the feature vectors incorporate phase information of the instances.
  - 4. The machine-implemented method of claim 1 wherein the plurality of speech segments are stored in a voice table.
  - 5. The machine-implemented method of claim 1 further comprising:
    - recording speech input;
      
      identifying the speech segments within the speech input; and
      
      identifying the instances within the speech segments.
  - 6. The machine-implemented method of claim 1 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

7. A machine-readable medium having instructions to cause a machine to perform a machine-implemented method comprising:
- pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 8. The machine-readable medium of claim 7 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 9. The machine-readable medium of claim 7 wherein the feature vectors incorporate phase information of the instances.
  - 10. The machine-readable medium of claim 7 wherein the plurality of speech segments are stored in a voice table.
  - 11. The machine-readable medium of claim 7 wherein the method further comprises:
    - recording speech input;
      
      identifying the speech segments within the speech input; and
      
      identifying the instances within the speech segments.
  - 12. The machine-readable medium of claim 7 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

13. An apparatus comprising:
- means for automatically pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 14. The apparatus of claim 13 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 15. The apparatus of claim 13 wherein the feature vectors incorporate phase information of the instances.
  - 16. The apparatus of claim 13 wherein the plurality of speech segments are stored in a voice table.
  - 17. The apparatus of claim 13 further comprising:
    - means for recording speech input;
      
      means for identifying the speech segments within the speech input; and
      
      means for identifying the instances within the speech segments.
  - 18. The apparatus of claim 13 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

19. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  a process executed from the memory by the processing unit to cause the processing unit to:
  
  prune redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 20. The system of claim 19 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 21. The system of claim 19 wherein the feature vectors incorporate phase information of the instances.
  - 22. The system of claim 19 wherein the plurality of speech segments are stored in a voice table.
  - 23. The system of claim 19 wherein the process further causes the processing unit to:
    - record speech input;
      
      identify the speech segments within the speech input; and
      
      identify the instances within the speech segments.
  - 24. The system of claim 19 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

25. A redundancy pruned voice table for use in a text-to-speech synthesis system.
  - 26. A redundancy pruned voice table as in claim 25, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
    - pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 27. The redundancy pruned voice table of claim 26 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 28. The redundancy pruned voice table of claim 26 wherein the feature vectors incorporate phase information of the instances.
  - 29. The redundancy pruned voice table of claim 26 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

30. A text-to-speech synthesis system comprising a redundancy pruned voice table.
  - 31. A text-to-speech synthesis system as in claim 30, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
    - pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
  - 32. The text-to-speech synthesis system of claim 31 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 33. The text-to-speech synthesis system of claim 31 wherein the feature vectors incorporate phase information of the instances.
  - 34. The text-to-speech synthesis system of claim 31 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

35. A machine-implemented method comprising:
- identifying instances in a plurality of speech segments;
  
  creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
  - 36. The machine-implemented method of claim 35 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 37. The machine-implemented method of claim 35 wherein the feature vectors incorporate phase information of the instances.
  - 38. The machine-implemented method of claim 35 wherein the plurality of speech segments are stored in a voice table.
  - 39. The machine-implemented method of claim 35 further comprising:
    - recording speech input; and
      
      identifying the speech segments within the speech input.
  - 40. The machine-implemented method of claim 35 wherein the predetermined cluster radius is controlled by a user.
  - 41. The machine-implemented method of claim 35 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 42. The machine-implemented method of claim 35 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 43. The machine-implemented method of claim 42 wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 44. The machine-implemented method of claim 43 wherein the matrix W is zero padded to N samples.
  - 45. The machine-implemented method of claim 42 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by $W = U S V^T$, where M is the number of instances, N is the maximum number of segments corresponding to an instance, U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition.
  - 46. The machine-implemented method of claim 45 wherein a feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 47. The machine-implemented method of claim 46 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]
  - 48. The machine-implemented method of claim 35 wherein the clustering process comprises a sequentially clustering process, wherein the sequentially clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.
  - 50. The machine-readable medium of claim 35 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 51. The machine-readable medium of claim 35 wherein the feature vectors incorporate phase information of the instances.
  - 52. The machine-readable medium of claim 35 wherein the plurality of speech segments are stored in a voice table.
  - 53. The machine-readable medium of claim 35 wherein the method further comprises:
    - recording speech input; and
      
      identifying the speech segments within the speech input.
  - 54. The machine-readable medium of claim 35 wherein the predetermined cluster radius is controlled by a user.
  - 55. The machine-readable medium of claim 35 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 56. The machine-readable medium of claim 35 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 57. The machine-readable medium of claim 42 wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 58. The machine-readable medium of claim 43 wherein the matrix W is zero padded to N samples.
  - 59. The machine-readable medium of claim 42 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by $W = U S V^T$, where M is the number of instances, N is the maximum number of segments corresponding to an instance, U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition.
  - 60. The machine-readable medium of claim 45 wherein a feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 61. The machine-readable medium of claim 46 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]
  - 62. The machine-readable medium of claim 35 wherein the clustering process comprises a sequentially clustering process, wherein the sequentially clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.

49. A machine-readable medium having instructions to cause a machine to perform a machine-implemented method comprising:
- identifying instances in a plurality of speech segments;
  
  creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
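
The clustering and replacement steps recited above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: it uses a simple greedy grouping around the first member of each cluster, measures the radius with Euclidean distance (the similarity measure C is not reproduced in this listing), and keeps the member nearest the cluster centroid. The names `prune` and `euclidean` and the toy vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed distance; the claims leave the similarity measure open."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prune(features, radius, distance=euclidean):
    """Map each instance index to the index of the instance retained for its cluster."""
    clusters = []                      # each cluster is a list of indices; first is the seed
    for i, f in enumerate(features):
        for members in clusters:
            if distance(f, features[members[0]]) <= radius:
                members.append(i)      # within the radius of the seed: same cluster
                break
        else:
            clusters.append([i])       # no cluster close enough: start a new one
    replacement = {}
    for members in clusters:
        dim = len(features[members[0]])
        centroid = [sum(features[i][d] for i in members) / len(members)
                    for d in range(dim)]
        # Retain the real instance closest to the centroid of the cluster.
        keep = min(members, key=lambda i: distance(features[i], centroid))
        for i in members:
            replacement[i] = keep
    return replacement

# Toy 2-D feature vectors forming two well-separated groups (hypothetical values).
features = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.3, 5.0)]
mapping = prune(features, radius=0.5)
kept = sorted(set(mapping.values()))
```

Here six instances collapse to two retained ones, a pruning factor of three. A larger `radius` merges more instances per cluster and prunes more aggressively, which is the kind of user control over the cluster radius that the dependent claims describe.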

63. An apparatus comprising:
- means for identifying instances in a plurality of speech segments;
  
  means for creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
  
  means for clustering the feature vectors using a similarity measure in the feature space; and
  
  means for replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
  - 64. The apparatus of claim 63 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 65. The apparatus of claim 63 wherein the feature vectors incorporate phase information of the instances.
  - 66. The apparatus of claim 63 wherein the plurality of speech segments are stored in a voice table.
  - 67. The apparatus of claim 63 further comprising:
    - means for recording speech input; and
      
      means for identifying the speech segments within the speech input.
  - 68. The apparatus of claim 63 wherein the predetermined cluster radius is controlled by a user.
  - 69. The apparatus of claim 63 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 70. The apparatus of claim 63 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 71. The apparatus of claim 70 wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 72. The apparatus of claim 71 wherein the matrix W is zero padded to N samples.
  - 73. The apparatus of claim 70 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by $W = U S V^T$, where M is the number of instances, N is the maximum number of segments corresponding to an instance, U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition.
  - 74. The apparatus of claim 73 wherein a feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 75. The apparatus of claim 74 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]
  - 76. The apparatus of claim 63 wherein the clustering process comprises a sequentially clustering process, wherein the sequentially clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.
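
The coarse-then-fine clustering recited in the sequential-clustering claims can be pictured with a short sketch. It is only one possible reading: the claims do not say how either partition is computed, so the greedy threshold grouping, the Euclidean distance via `math.dist`, and the two radii below are all assumptions, as are the names `greedy_partition` and `two_stage_clusters`.

```python
import math

def greedy_partition(indices, features, radius):
    """Group indices so each member lies within `radius` of its group's first member."""
    groups = []
    for i in indices:
        for g in groups:
            if math.dist(features[i], features[g[0]]) <= radius:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

def two_stage_clusters(features, coarse_radius, fine_radius):
    """Coarse partition into superclusters, then a fine partition inside each one."""
    superclusters = greedy_partition(range(len(features)), features, coarse_radius)
    clusters = []
    for sc in superclusters:
        # The fine pass only compares vectors that share a supercluster.
        clusters.extend(greedy_partition(sc, features, fine_radius))
    return superclusters, clusters

# Toy 2-D feature vectors (hypothetical values).
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (9.0, 9.0), (9.1, 9.0)]
superclusters, clusters = two_stage_clusters(features, coarse_radius=2.0, fine_radius=0.3)
```

One practical motivation for a two-stage scheme like this is cost: the fine pass never compares vectors from different superclusters, so the number of distance computations drops when the coarse partition is well separated.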

77. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  a process executed from the memory by the processing unit to cause the processing unit to:
  
  identify instances in a plurality of speech segments;
  
  create feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
  
  cluster the feature vectors using a similarity measure in the feature space; and
  
  replace the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
  - 78. The system of claim 77 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 79. The system of claim 77 wherein the feature vectors incorporate phase information of the instances.
  - 80. The system of claim 77 wherein the plurality of speech segments are stored in a voice table.
  - 81. The system of claim 77 wherein the process further causes the processing unit to:
    - record speech input; and
      
      identify the speech segments within the speech input.
  - 82. The system of claim 77 wherein the predetermined cluster radius is controlled by a user.
  - 83. The system of claim 77 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 84. The system of claim 77 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 85. The system of claim 84 wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 86. The system of claim 85 wherein the matrix W is zero padded to N samples.
  - 87. The system of claim 84 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by $W = U S V^T$, where M is the number of instances, N is the maximum number of segments corresponding to an instance, U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition.
  - 88. The system of claim 87 wherein a feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 89. The system of claim 88 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]
  - 90. The system of claim 77 wherein the clustering process comprises a sequentially clustering process, wherein the sequentially clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.

91. A voice table for use in a text-to-speech synthesis system, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- identifying instances in the original voice table;
  
  creating feature vectors derived from a machine perception transformation of speech segments in the original voice table onto a feature space;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
  - 92. The voice table of claim 91 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 93. The voice table of claim 91 wherein the feature vectors incorporate phase information of the instances.
  - 94. The voice table of claim 91 wherein the predetermined cluster radius is controlled by a user.
  - 95. The voice table of claim 91 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 96. The voice table of claim 91 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

97. A text-to-speech synthesis system comprising a voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- identifying instances in the original voice table;
  
  creating feature vectors derived from a machine perception transformation of speech segments in the original voice table onto a feature space;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
  - 98. The text-to-speech synthesis system of claim 97 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 99. The text-to-speech synthesis system of claim 97 wherein the feature vectors incorporate phase information of the instances.
  - 100. The text-to-speech synthesis system of claim 97 wherein the predetermined cluster radius is controlled by a user.
  - 101. The text-to-speech synthesis system of claim 97 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 102. The text-to-speech synthesis system of claim 97 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced in this listing]

103. A machine readable medium containing executable instructions which when executed by a machine cause the machine to perform a method comprising:
- receiving an input which comprises text;
  
  retrieving data from a voice table, stored in a machine readable medium, the voice table having redundant instances pruned according to a redundancy criterion based on a similarity measure between feature vectors derived from a machine perception transformation of speech segments in the voice table.
  - 104. A medium as in claim 103 wherein clustered instances are represented by a representative instance and wherein the redundancy criterion is based at least in part on phase information.
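
Retrieval from a pruned voice table at synthesis time might look like the following sketch. Everything in it is hypothetical: the table layout, the instance identifiers, the `representative_of` redirect map, and the assumption that a unit-selection step (not shown) has already chosen an original instance for each word. It only illustrates the idea that a pruned instance is served by the representative of its cluster.

```python
# Hypothetical pruned voice table: each word keeps only its representative instances.
pruned_table = {
    "hello": {"hello#2": [0.1, 0.3, -0.2]},
    "world": {"world#0": [0.0, 0.4], "world#5": [0.2, -0.1]},
}

# Hypothetical redirect from every original instance id to its cluster representative.
representative_of = {
    "hello#0": "hello#2", "hello#1": "hello#2", "hello#2": "hello#2",
    "world#0": "world#0", "world#3": "world#0", "world#5": "world#5",
}

def retrieve(word, instance_id):
    """Return the samples of the representative that stands in for `instance_id`."""
    return pruned_table[word][representative_of[instance_id]]

def synthesize(text, selected):
    """Concatenate samples for each word of `text`, given the selected instance ids."""
    samples = []
    for word in text.lower().split():
        samples.extend(retrieve(word, selected[word]))
    return samples

# The selected ids refer to instances that were pruned; their representatives are used.
audio = synthesize("Hello world", {"hello": "hello#0", "world": "world#3"})
```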

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Bellegarda, Jerome R.
Granted Patent: US 8,024,193 B2
US Class Current: 704/254
CPC Class Codes: G10L 13/06 Elementary speech units use...
