Methods and apparatus related to pruning for concatenative text-to-speech synthesis

US 8,024,193 B2
Filed: 10/10/2006
Issued: 09/20/2011
Est. Priority Date: 10/10/2006
Status: Expired due to Fees

Abstract

The present invention provides, among other things, automatic identification of near-redundant units in a large TTS voice table, identifying which units are distinctive enough to keep and which units are sufficiently redundant to discard. According to an aspect of the invention, pruning is treated as a clustering problem in a suitable feature space. All instances of a given unit (e.g., word or characters expressed as Unicode strings) are mapped onto the feature space and clustered in that space using a suitable similarity measure. Since all units in a given cluster are, by construction, closely related from the point of view of the measure used, they are suitably redundant and can be replaced by a single instance. The disclosed method can detect near-redundancy in TTS units in a completely unsupervised manner, based on an original feature extraction and clustering strategy. Each unit can be processed in parallel, and the algorithm is totally scalable, with a pruning factor determinable by a user through the near-redundancy criterion. In an exemplary implementation, a matrix-style modal analysis via Singular Value Decomposition (SVD) is performed on the matrix of the observed instances for the given word unit, so that each row of the matrix is associated with a feature vector, which can then be clustered using an appropriate closeness measure. Pruning results from mapping each instance to the centroid of its cluster.
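
The pipeline summarized in the abstract can be illustrated with a minimal Python/NumPy sketch. This is not the patented implementation: the function name `prune_unit`, the `threshold` parameter, the use of cosine similarity as the closeness measure, and the greedy one-pass clustering are all assumptions made for illustration. Only the zero padding, the SVD, the feature vectors taken from the rows of US, and the mapping to a cluster centroid come from the text.

```python
import numpy as np

def prune_unit(instances, threshold=0.95):
    """Return the indices of the instances to keep for one unit.

    `instances` is a list of 1-D arrays of time-domain samples.
    `threshold` is a hypothetical near-redundancy criterion on cosine similarity.
    """
    m = len(instances)
    n = max(len(x) for x in instances)
    w = np.zeros((m, n))
    for i, x in enumerate(instances):
        w[i, :len(x)] = x                     # zero pad each row to N samples

    u, s, _ = np.linalg.svd(w, full_matrices=False)
    feats = u * s                             # row i is the feature vector u_i S
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.where(norms == 0, 1.0, norms)

    # Assumed clustering: an instance joins the first cluster whose seed
    # is at least `threshold`-similar, otherwise it starts a new cluster.
    seeds, members = [], []
    for i in range(m):
        for c, seed in enumerate(seeds):
            if unit[i] @ unit[seed] >= threshold:
                members[c].append(i)
                break
        else:
            seeds.append(i)
            members.append([i])

    # Keep, per cluster, the instance closest to the cluster centroid.
    keep = []
    for idx in members:
        centroid = feats[idx].mean(axis=0)
        dists = np.linalg.norm(feats[idx] - centroid, axis=1)
        keep.append(idx[int(np.argmin(dists))])
    return sorted(keep)
```

Because each unit is handled by an independent call, units could be processed in parallel, and raising or lowering `threshold` changes how aggressively instances are pruned.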

102 Claims

1. A machine-implemented method comprising:
- pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
  - 2. The machine-implemented method of claim 1 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 3. The machine-implemented method of claim 1 wherein the feature vectors incorporate phase information of the instances.
  - 4. The machine-implemented method of claim 1 wherein the plurality of speech segments are stored in a voice table.
  - 5. The machine-implemented method of claim 1 further comprising:
    - recording speech input;
      
      identifying the speech segments within the speech input; and
      
      identifying the instances within the speech segments.
  - 6. The machine-implemented method of claim 1 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].
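
The matrix construction and decomposition recited in claim 6 can be sketched in Python/NumPy as follows. The helper names (`build_matrix`, `feature_vectors`, `cosine_similarity`) and the tolerance `tol` are hypothetical. The formula for C does not appear in this text, so the cosine of the angle between two feature vectors is used here purely as an assumed stand-in. One property worth noting is that, since W W^T = U S² U^T, inner products between feature vectors ū_i = u_i S equal inner products between the corresponding zero-padded rows of W.

```python
import numpy as np

def build_matrix(instances):
    """Stack variable-length sample arrays into an M x N matrix, zero padded to N."""
    n = max(len(x) for x in instances)
    w = np.zeros((len(instances), n))
    for i, x in enumerate(instances):
        w[i, :len(x)] = x
    return w

def feature_vectors(w, tol=1e-10):
    """Return the M x R matrix whose row i is u_i S, keeping only singular values > 0."""
    u, s, _ = np.linalg.svd(w, full_matrices=False)   # s is sorted in descending order
    r = int(np.sum(s > tol * s[0]))                   # R <= min(M, N)
    return u[:, :r] * s[:r]                           # scales column k of U by s_k

def cosine_similarity(a, b):
    """ASSUMED similarity measure; the source does not reproduce the formula for C."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data: a waveform, a shortened and rescaled copy, a phase-inverted copy, and noise.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6 * np.pi, 40))
w = build_matrix([base, 1.05 * base[:35], -base, rng.standard_normal(38)])
f = feature_vectors(w)
print(f.shape, cosine_similarity(f[0], f[2]))
```

Under this assumed measure, a phase-inverted copy of an instance lands at the opposite end of the similarity range from the original, which is one way that working from time-domain samples keeps phase information in the feature vectors.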

7. A machine-readable non-transitory storage medium having instructions to cause a machine to perform a machine-implemented method comprising:
- pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system, and wherein the redundancy pruning is performed on a representation of voice units, the representation being stored in a memory of a data processing system which includes a processor which performs the pruning.
  - 8. The machine-readable medium of claim 7 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 9. The machine-readable medium of claim 7 wherein the feature vectors incorporate phase information of the instances.
  - 10. The machine-readable medium of claim 7 wherein the plurality of speech segments are stored in a voice table.
  - 11. The machine-readable medium of claim 7 wherein the method further comprises:
    - recording speech input;
      
      identifying the speech segments within the speech input; and
      
      identifying the instances within the speech segments.
  - 12. The machine-readable medium of claim 7 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].

13. An apparatus comprising:
- means for automatically pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
  - 14. The apparatus of claim 13 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 15. The apparatus of claim 13 wherein the feature vectors incorporate phase information of the instances.
  - 16. The apparatus of claim 13 wherein the plurality of speech segments are stored in a voice table.
  - 17. The apparatus of claim 13 further comprising:
    - means for recording speech input;
      
      means for identifying the speech segments within the speech input; and
      
      means for identifying the instances within the speech segments.
  - 18. The apparatus of claim 13 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].

19. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  a process executed from the memory by the processing unit to cause the processing unit to:
  
  prune redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
  - 20. The system of claim 19 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 21. The system of claim 19 wherein the feature vectors incorporate phase information of the instances.
  - 22. The system of claim 19 wherein the plurality of speech segments are stored in a voice table.
  - 23. The system of claim 19 wherein the process further causes the processing unit to:
    - record speech input;
      
      identify the speech segments within the speech input; and
      
      identify the instances within the speech segments.
  - 24. The system of claim 19 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].

25. A redundancy pruned voice table comprising a redundancy pruned voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
  - 26. The redundancy pruned voice table of claim 25 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 27. The redundancy pruned voice table of claim 25 wherein the feature vectors incorporate phase information of the instances.
  - 28. The redundancy pruned voice table of claim 25 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].

29. A text-to-speech synthesis system comprising a redundancy pruned voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
  - 30. The text-to-speech synthesis system of claim 29 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit and wherein a first set of the instances subjected to redundancy pruning are clustered with a first feature vector and a second set of the instances subjected to redundancy pruning are clustered with a second feature vector that is discernably separated from the first feature vector.
  - 31. The text-to-speech synthesis system of claim 29 wherein the feature vectors incorporate phase information of the instances.
  - 32. The text-to-speech synthesis system of claim 29 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by W = USV^T, where U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition,
    - wherein the feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].

33. A machine-implemented method comprising:
- identifying instances in a plurality of speech segments;
  
  creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
  - 34. The machine-implemented method of claim 33 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 35. The machine-implemented method of claim 33 wherein the feature vectors incorporate phase information of the instances.
  - 36. The machine-implemented method of claim 33 wherein the plurality of speech segments are stored in a voice table.
  - 37. The machine-implemented method of claim 33 further comprising:
    - recording speech input; and
      
      identifying the speech segments within the speech input.
  - 38. The machine-implemented method of claim 33 wherein the cluster radius is controlled by a user.
  - 39. The machine-implemented method of claim 33 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 40. The machine-implemented method of claim 33 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 41. The machine-implemented method of claim 40 wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 42. The machine-implemented method of claim 41 wherein the matrix W is zero padded to N samples.
  - 43. The machine-implemented method of claim 40 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by W = USV^T, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition.
  - 44. The machine-implemented method of claim 43 wherein a feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 45. The machine-implemented method of claim 44 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].
  - 46. The machine-implemented method of claim 33 wherein the clustering process comprises a sequential clustering process, wherein the sequential clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.
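
The clustering and replacement steps recited in claims 33, 38, 39 and 46 can be illustrated with a short Python/NumPy sketch. It is only an illustration under stated assumptions: the function names, the Euclidean distance to a cluster seed as the closeness test, and the greedy seed-based partition are all hypothetical choices, since the claims do not prescribe a specific clustering algorithm. The two radii stand in for a user-controlled cluster radius, with a large radius for the coarse partition into superclusters and a smaller one for the fine partition.

```python
import numpy as np

def greedy_partition(feats, indices, radius):
    """Group `indices` so each member lies within `radius` of its cluster's seed."""
    clusters = []
    for i in indices:
        for c in clusters:
            if np.linalg.norm(feats[i] - feats[c[0]]) <= radius:
                c.append(i)
                break
        else:
            clusters.append([i])      # no seed close enough: start a new cluster
    return clusters

def sequential_cluster(feats, coarse_radius, fine_radius):
    """Coarse partition into superclusters, then fine partition of each supercluster."""
    superclusters = greedy_partition(feats, range(len(feats)), coarse_radius)
    clusters = []
    for sc in superclusters:
        clusters.extend(greedy_partition(feats, sc, fine_radius))
    return clusters

def representatives(feats, clusters):
    """For each cluster, pick the instance whose feature vector is nearest the centroid."""
    reps = []
    for c in clusters:
        centroid = feats[c].mean(axis=0)
        dists = np.linalg.norm(feats[c] - centroid, axis=1)
        reps.append(c[int(np.argmin(dists))])
    return reps

# Toy 2-D feature vectors standing in for the rows u_i S.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                  [1.0, 0.0], [1.04, 0.0], [1.1, 0.0],
                  [10.0, 10.0]])
clusters = sequential_cluster(feats, coarse_radius=2.0, fine_radius=0.25)
reps = representatives(feats, clusters)
print(clusters, reps)
```

Every instance in a cluster would then be replaced by that cluster's representative, so the number of surviving instances equals the number of fine clusters.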

47. A machine-readable non-transitory storage medium having instructions to cause a machine to perform a machine-implemented method comprising:
- identifying instances in a plurality of speech segments;
  
  creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance, wherein the identifying instances, the creating feature vectors, the clustering feature vectors, and the replacing clustered instances are performed on a representation of speech segments, the representation being stored in a memory of a data processing system which includes a processor which performs the pruning.
  - 48. The machine-readable medium of claim 47 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 49. The machine-readable medium of claim 47 wherein the feature vectors incorporate phase information of the instances.
  - 50. The machine-readable medium of claim 47 wherein the plurality of speech segments are stored in a voice table.
  - 51. The machine-readable medium of claim 47 wherein the method further comprises:
    - recording speech input; and
      
      identifying the speech segments within the speech input.
  - 52. The machine-readable medium of claim 47 wherein the cluster radius is controlled by a user.
  - 53. The machine-readable medium of claim 47 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 54. The machine-readable medium of claim 47 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 55. The machine-readable medium of claim 54 wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 56. The machine-readable medium of claim 55 wherein the matrix W is zero padded to N samples.
  - 57. The machine-readable medium of claim 54 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by W = USV^T, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition.
  - 58. The machine-readable medium of claim 57 wherein a feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 59. The machine-readable medium of claim 58 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].
  - 60. The machine-readable medium of claim 47 wherein the clustering process comprises a sequential clustering process, wherein the sequential clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.

61. An apparatus comprising:
- means for identifying instances in a plurality of speech segments;
  
  means for creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
  
  means for clustering the feature vectors using a similarity measure in the feature space; and
  
  means for replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
  - 62. The apparatus of claim 61 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 63. The apparatus of claim 61 wherein the feature vectors incorporate phase information of the instances.
  - 64. The apparatus of claim 61 wherein the plurality of speech segments are stored in a voice table.
  - 65. The apparatus of claim 61 further comprising:
    - means for recording speech input; and
      
      means for identifying the speech segments within the speech input.
  - 66. The apparatus of claim 61 wherein the cluster radius is controlled by a user.
  - 67. The apparatus of claim 61 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 68. The apparatus of claim 61 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 69. The apparatus of claim 68 wherein the matrix W is an M×N matrix where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 70. The apparatus of claim 69 wherein the matrix W is zero padded to N samples.
  - 71. The apparatus of claim 68 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by W = USV^T, where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, U is the M×R left singular matrix with row vectors u_i (1 ≤ i ≤ M), S is the R×R diagonal matrix of singular values s₁ ≥ s₂ ≥ ... ≥ s_R > 0, V is the N×R right singular matrix with row vectors v_j (1 ≤ j ≤ N), R ≤ min(M, N), and ^T denotes matrix transposition.
  - 72. The apparatus of claim 71 wherein a feature vector ū_i is calculated as ū_i = u_i S, where u_i is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 73. The apparatus of claim 72 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, ū_i and ū_j, wherein C is calculated as [equation not reproduced in the source].
  - 74. The apparatus of claim 61 wherein the clustering process comprises a sequential clustering process, wherein the sequential clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.

75. A system comprising:
- a processing unit coupled to a memory through a bus; and
  
  a process executed from the memory by the processing unit to cause the processing unit to:
  
  identify instances in a plurality of speech segments;
  
  create feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
  
  cluster the feature vectors using a similarity measure in the feature space; and
  
  replace the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
- View Dependent Claims (76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88)
- - 76. The system of claim 75 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 77. The system of claim 75 wherein the feature vectors incorporate phase information of the instances.
  - 78. The system of claim 75 wherein the plurality of speech segments are stored in a voice table.
  - 79. The system of claim 75 wherein the process further causes the processing unit to:
    - record speech input; and
      
      identify the speech segments within the speech input.
  - 80. The system of claim 75 wherein the cluster radius is controlled by a user.
  - 81. The system of claim 75 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 82. The system of claim 75 wherein creating feature vectors comprises:
    - constructing a matrix W from the instances; and
      
      decomposing the matrix W.
  - 83. The system of claim 82 wherein the matrix W is an $M \times N$ matrix, where M is the number of instances and N is the maximum number of segment samples corresponding to an instance, wherein constructing the matrix W comprises inputting the numbers of segment samples corresponding to the instances.
  - 84. The system of claim 83 wherein the matrix W is zero padded to N samples.
  - 85. The system of claim 82 wherein decomposing the matrix W comprises performing a singular value decomposition of W, represented by
    $W = U S V^T$
    where M is the number of instances, N is the maximum number of segment samples corresponding to an instance, U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition.
  - 86. The system of claim 85 wherein a feature vector is calculated as
    $\bar{u}_i = u_i S$
    where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix.
  - 87. The system of claim 86 wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced].
  - 88. The system of claim 75 wherein the clustering process comprises a sequential clustering process, wherein the sequential clustering process comprises a coarse partition into a set of superclusters, and a fine partition of the superclusters into a set of clusters.
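
The coarse-then-fine partition recited in claim 88 above can be illustrated with a two-pass sketch. The claims do not name a clustering algorithm, so this example assumes a greedy "leader" pass, run once with a loose threshold to form superclusters and again with a tight threshold inside each supercluster. Cosine similarity and the two threshold values stand in for the similarity measure and the cluster radius; they, and all function names, are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def leader_cluster(indices, vectors, threshold):
    """Greedy pass: join the first cluster whose leader is similar enough, else start a new one."""
    clusters = []  # each cluster is a list of indices; its first element is the leader
    for i in indices:
        for c in clusters:
            if cosine(vectors[i], vectors[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def two_stage_cluster(vectors, coarse=0.5, fine=0.95):
    """Coarse partition into superclusters, then a fine partition inside each supercluster."""
    superclusters = leader_cluster(range(len(vectors)), vectors, coarse)
    return [leader_cluster(sc, vectors, fine) for sc in superclusters]

# Hypothetical 2-D feature vectors for five instances of one unit.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.8, 0.6], [0.0, 1.0], [0.05, 0.99]]
result = two_stage_cluster(vectors)
print(result)  # [[[0, 1], [2]], [[3, 4]]]
```

Raising the `fine` threshold keeps more instances (less pruning), and lowering it merges more of them, which is one way a user-controlled radius could act as the pruning knob.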

89. A voice table for use in a text-to-speech synthesis system, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- identifying instances in the original voice table;
  
  creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the original voice table onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
- View Dependent Claims (90, 91, 92, 93, 94)
- - 90. The voice table of claim 89 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 91. The voice table of claim 89 wherein the feature vectors incorporate phase information of the instances.
  - 92. The voice table of claim 89 wherein the cluster radius is controlled by a user.
  - 93. The voice table of claim 89 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 94. The voice table of claim 89 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances and N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced].
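
A short worked derivation, not part of the claim, relates the feature vectors $\bar{u}_i = u_i S$ to the rows $w_i$ of W. It uses only the standard SVD property that V has orthonormal columns ($V^T V = I_R$), and it holds exactly when R equals the rank of W, that is, when no nonzero singular value is discarded. The cosine form of C in the last line is an assumption, since the expression for C is not reproduced in the claim text here.

```latex
\begin{aligned}
W W^T &= (U S V^T)(U S V^T)^T = U S\,(V^T V)\,S\,U^T = U S^2 U^T, \\
\bar{u}_i\,\bar{u}_j^T &= u_i S^2 u_j^T = (W W^T)_{ij} = w_i\,w_j^T, \\
C(\bar{u}_i, \bar{u}_j) &= \frac{\bar{u}_i\,\bar{u}_j^T}{\lVert \bar{u}_i \rVert\,\lVert \bar{u}_j \rVert}
  = \frac{w_i\,w_j^T}{\lVert w_i \rVert\,\lVert w_j \rVert}
  \quad \text{(assumed cosine form)}.
\end{aligned}
```

Under these assumptions, inner products between feature vectors equal inner products between the zero-padded time-domain rows. If R is chosen smaller than the rank of W, the equalities become approximations.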

95. A text-to-speech synthesis system comprising a voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
- identifying instances in the original voice table;
  
  creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the original voice table onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments;
  
  clustering the feature vectors using a similarity measure in the feature space; and
  
  replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
- View Dependent Claims (96, 97, 98, 99, 100)
- - 96. The text-to-speech synthesis system of claim 95 wherein the instances are the instances of a phoneme, a diphone, a syllable, a word, or a sequence unit.
  - 97. The text-to-speech synthesis system of claim 95 wherein the feature vectors incorporate phase information of the instances.
  - 98. The text-to-speech synthesis system of claim 95 wherein the cluster radius is controlled by a user.
  - 99. The text-to-speech synthesis system of claim 95 wherein the single instance is the instance corresponding to the centroid of the feature vector cluster.
  - 100. The text-to-speech synthesis system of claim 95 wherein the feature vectors representing the instances are created by matrix-style modal analysis via singular value decomposition of a matrix W,
    - wherein the matrix W is an $M \times N$ matrix, where M is the number of instances and N is the maximum number of segment samples corresponding to an instance, with the matrix W being zero padded to N samples,
    - wherein the singular value decomposition is represented by $W = U S V^T$, where U is the $M \times R$ left singular matrix with row vectors $u_i$ ($1 \le i \le M$), S is the $R \times R$ diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_R > 0$, V is the $N \times R$ right singular matrix with row vectors $v_j$ ($1 \le j \le N$), $R \le \min(M, N)$, and $^T$ denotes matrix transposition,
    - wherein the feature vector $\bar{u}_i$ is calculated as $\bar{u}_i = u_i S$, where $u_i$ is a row vector associated with an instance i, and S is the singular diagonal matrix, and
    - wherein the distance between two feature vectors is determined by a metric comprising a similarity measure, C, between two feature vectors, $\bar{u}_i$ and $\bar{u}_j$, wherein C is calculated as [equation not reproduced].
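
The replacement step shared by claims 95 and 99 above (clustered instances replaced by a single instance corresponding to the cluster centroid) can be sketched as follows. The sketch assumes that "corresponding to the centroid" means the member whose feature vector lies closest to the centroid, and it assumes Euclidean distance for that comparison; neither choice is stated in the claims. The function names and data are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representative(cluster, features):
    """Index of the cluster member closest (Euclidean, assumed) to the cluster centroid."""
    c = centroid([features[i] for i in cluster])
    return min(cluster, key=lambda i: math.dist(features[i], c))

def prune(clusters, features):
    """Map every instance index to the single instance kept for its cluster."""
    mapping = {}
    for cluster in clusters:
        keep = representative(cluster, features)
        for i in cluster:
            mapping[i] = keep
    return mapping

# Hypothetical feature vectors for six instances, already grouped into two clusters.
features = [[1.0, 0.0], [1.2, 0.1], [0.9, -0.1], [0.0, 1.0], [0.1, 1.1], [0.3, 1.2]]
clusters = [[0, 1, 2], [3, 4, 5]]
mapping = prune(clusters, features)
kept = sorted(set(mapping.values()))
print(kept)  # [0, 4]
```

Only the instances listed in `kept` would remain in the pruned voice table; every other instance resolves to its cluster's representative through `mapping`.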

101. A machine readable non-transitory storage medium containing executable instructions which when executed by a machine cause the machine to perform a method comprising:
- receiving an input which comprises text;
  
  retrieving data from a voice table, stored in a machine readable medium, the voice table having redundant instances pruned according to a redundancy criterion based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the voice table, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments which were provided in sound data for a speech synthesis system, and wherein the data retrieving is performed on a representation of voice units, the representation being stored in a memory of a data processing system which includes a processor which performs the data retrieving.
- View Dependent Claims (102)
- - 102. A medium as in claim 101 wherein clustered instances are represented by a representative instance and wherein the redundancy criterion is based at least in part on phase information.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Bellegarda, Jerome R.
Primary Examiner(s): Smits, Talivaldis Ivars
Assistant Examiner(s): Roberts, Shaun A.
Application Number: US11/546,222
Publication Number: US 20080091428A1
Time in Patent Office: 1,806 Days
Field of Search: 704/254, 704/245, 704/258-269
US Class Current: 704/269
CPC Class Codes: G10L 13/06 Elementary speech units use...
