Methods and apparatus related to pruning for concatenative text-to-speech synthesis
Abstract
The present invention provides, among other things, automatic identification of near-redundant units in a large TTS voice table, identifying which units are distinctive enough to keep and which units are sufficiently redundant to discard. According to an aspect of the invention, pruning is treated as a clustering problem in a suitable feature space. All instances of a given unit (e.g., a word, or characters expressed as Unicode strings) are mapped onto the feature space, and the units are clustered in that space using a suitable similarity measure. Since all units in a given cluster are, by construction, closely related from the point of view of the measure used, they are suitably redundant and can be replaced by a single instance. The disclosed method can detect near-redundancy in TTS units in a completely unsupervised manner, based on an original feature extraction and clustering strategy. Each unit can be processed in parallel, and the algorithm is scalable, with a pruning factor that a user can set through the near-redundancy criterion. In an exemplary implementation, a matrix-style modal analysis via Singular Value Decomposition (SVD) is performed on the matrix of the observed instances for the given word unit, so that each row of the matrix is associated with a feature vector; these feature vectors can then be clustered using an appropriate closeness measure. Pruning is achieved by mapping each instance to the centroid of its cluster.
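The exemplary implementation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method itself: the zero-padding of instances to a common length, the use of the rows of U·S as feature vectors, the cosine closeness measure, the greedy clustering rule, and the names `instance_features`, `prune`, `rank`, and `threshold` are all assumptions made for the example.

```python
import numpy as np

def instance_features(instances, rank):
    """Stack zero-padded waveforms as rows of a matrix and return one feature vector per row."""
    n = max(len(x) for x in instances)
    W = np.zeros((len(instances), n))
    for i, x in enumerate(instances):
        W[i, :len(x)] = x                      # assumption: pad shorter instances with zeros
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    r = min(rank, len(s))
    return U[:, :r] * s[:r]                    # row i of U scaled by the singular values

def cosine(a, b):
    """Cosine of the angle between two feature vectors (assumed closeness measure)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def prune(instances, rank=2, threshold=0.95):
    """Greedily cluster instances, then keep the member closest to each cluster centroid."""
    feats = instance_features(instances, rank)
    clusters = []                              # each cluster is a list of instance indices
    for i in range(len(feats)):
        for members in clusters:
            centroid = feats[members].mean(axis=0)
            if cosine(feats[i], centroid) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])               # no close cluster found: start a new one
    keep = []
    for members in clusters:
        centroid = feats[members].mean(axis=0)
        keep.append(max(members, key=lambda j: cosine(feats[j], centroid)))
    return sorted(keep), clusters

# Three hypothetical instances of one unit: two near-duplicates and one distinct rendition.
t = np.linspace(0.0, 1.0, 200)
a = np.sin(2 * np.pi * 5 * t)
instances = [a, 1.01 * a, np.sin(2 * np.pi * 13 * t[:180])]
keep, clusters = prune(instances)
```

In this sketch the `threshold` parameter plays the role of the user-determined near-redundancy criterion: lowering it merges more instances into each cluster and so prunes more aggressively. Because each unit is handled by an independent call to `prune`, units could be processed in parallel.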
102 Claims
1. A machine-implemented method comprising:
pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
(Dependent claims: 2, 3, 4, 5, 6)
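Claim 1 specifies that the transformation operates on time-domain samples retaining both amplitude and phase information. As a small illustration of why that distinction can matter, the sketch below builds two synthetic signals with identical component amplitudes but different phase and compares them in two ways. The signals, the frequencies, and the use of a magnitude spectrum as the contrasting representation are assumptions chosen for the example and are not taken from the claims.

```python
import numpy as np

n = np.arange(64)
# Two hypothetical signals: same component amplitudes, second component shifted by 90 degrees.
x = np.sin(2 * np.pi * 3 * n / 64) + 0.5 * np.sin(2 * np.pi * 7 * n / 64)
y = np.sin(2 * np.pi * 3 * n / 64) + 0.5 * np.sin(2 * np.pi * 7 * n / 64 + np.pi / 2)

# A magnitude-only spectrum discards phase, so the two signals look the same.
mag_gap = np.max(np.abs(np.abs(np.fft.rfft(x)) - np.abs(np.fft.rfft(y))))

# The raw time-domain samples keep phase, so the two signals remain distinguishable.
time_gap = np.max(np.abs(x - y))
```

Here `mag_gap` is zero up to floating-point rounding, while `time_gap` is clearly nonzero, so a feature built only from spectral magnitudes could not separate the two signals, whereas one built from the samples themselves can.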
7. A machine-readable non-transitory storage medium having instructions to cause a machine to perform a machine-implemented method comprising:
pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system, and wherein the redundancy pruning is performed on a representation of voice units, the representation being stored in a memory of a data processing system which includes a processor which performs the pruning.
(Dependent claims: 8, 9, 10, 11, 12)
13. An apparatus comprising:
means for automatically pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
(Dependent claims: 14, 15, 16, 17, 18)
19. A system comprising:
a processing unit coupled to a memory through a bus; and
a process executed from the memory by the processing unit to cause the processing unit to:
prune redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
(Dependent claims: 20, 21, 22, 23, 24)
25. A redundancy pruned voice table comprising a redundancy pruned voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
(Dependent claims: 26, 27, 28)
29. A text-to-speech synthesis system comprising a redundancy pruned voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
pruning redundancy of instances in the original voice table, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments, wherein the instances subjected to redundancy pruning are clustered together with feature vectors discernably separated from each other in the machine perception transformation and wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system.
(Dependent claims: 30, 31, 32)
33. A machine-implemented method comprising:
identifying instances in a plurality of speech segments;
creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
(Dependent claims: 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46)
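The final two steps of claim 33, clustering in the feature space and replacing the instances within a radius by a single instance, might look something like the following in outline. This is only a sketch under stated assumptions: the Euclidean distance, the "first instance seen becomes the representative" rule, and the name `replace_within_radius` are illustrative choices, not details given in the claim.

```python
from math import dist

def replace_within_radius(features, radius):
    """For each instance index, return the index of the instance that stands in for it."""
    leaders = []    # indices of instances kept as cluster representatives
    stand_in = []   # stand_in[i] is the surviving instance that replaces instance i
    for i, f in enumerate(features):
        for j in leaders:
            if dist(f, features[j]) <= radius:   # assumption: Euclidean distance
                stand_in.append(j)
                break
        else:
            leaders.append(i)                    # nothing close enough: keep this instance
            stand_in.append(i)
    return stand_in

# Five hypothetical 2-D feature vectors forming two tight groups.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (0.0, 0.15)]
mapping = replace_within_radius(features, radius=0.5)
# mapping == [0, 0, 2, 2, 0]: only instances 0 and 2 survive
```

A larger `radius` folds more instances into each representative, so it acts as the knob that trades voice-table size against the variety of retained instances.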
47. A machine-readable non-transitory storage medium having instructions to cause a machine to perform a machine-implemented method comprising:
identifying instances in a plurality of speech segments;
creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance, wherein the identifying instances, the creating feature vectors, the clustering feature vectors, and the replacing clustered instances are performed on a representation of speech segments, the representation being stored in a memory of a data processing system which includes a processor which performs the pruning.
(Dependent claims: 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60)
61. An apparatus comprising:
means for identifying instances in a plurality of speech segments;
means for creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
means for clustering the feature vectors using a similarity measure in the feature space; and
means for replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
(Dependent claims: 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74)
75. A system comprising:
a processing unit coupled to a memory through a bus; and
a process executed from the memory by the processing unit to cause the processing unit to:
identify instances in a plurality of speech segments;
create feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances in the plurality of speech segments onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
cluster the feature vectors using a similarity measure in the feature space; and
replace the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
(Dependent claims: 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88)
89. A voice table for use in a text-to-speech synthesis system, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
identifying instances in the original voice table;
creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the original voice table onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments, which were provided in sound data for a speech synthesis system;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
(Dependent claims: 90, 91, 92, 93, 94)
95. A text-to-speech synthesis system comprising a voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
identifying instances in the original voice table;
creating feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the original voice table onto a feature space, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a radius by a single instance.
(Dependent claims: 96, 97, 98, 99, 100)
101. A machine readable non-transitory storage medium containing executable instructions which when executed by a machine cause the machine to perform a method comprising:
receiving an input which comprises text;
retrieving data from a voice table, stored in a machine readable medium, the voice table having redundant instances pruned according to a redundancy criterion based on a similarity measure between feature vectors derived from a machine perception transformation of time-domain samples corresponding to the instances of speech segments in the voice table, wherein the machine perception transformation is correlated with human perception by using the time-domain samples retaining both amplitude and phase information of the speech segments which were provided in sound data for a speech synthesis system, and wherein the data retrieving is performed on a representation of voice units, the representation being stored in a memory of a data processing system which includes a processor which performs the data retrieving.
(Dependent claim: 102)
Specification