Methods and apparatus related to pruning for concatenative text-to-speech synthesis
Abstract
The present invention provides, among other things, automatic identification of near-redundant units in a large TTS voice table, determining which units are distinctive enough to keep and which are sufficiently redundant to discard. According to an aspect of the invention, pruning is treated as a clustering problem in a suitable feature space. All instances of a given unit (e.g., a word or characters expressed as Unicode strings) are mapped onto the feature space and clustered in that space using a suitable similarity measure. Since all units in a given cluster are, by construction, closely related with respect to the measure used, they are sufficiently redundant and can be replaced by a single instance. The disclosed method can detect near-redundancy in TTS units in a completely unsupervised manner, based on an original feature extraction and clustering strategy. Each unit can be processed in parallel, and the algorithm is scalable, with a pruning factor that a user can set through the near-redundancy criterion. In an exemplary implementation, a matrix-style modal analysis via Singular Value Decomposition (SVD) is performed on the matrix of the observed instances for the given word unit, so that each row of the matrix is associated with a feature vector; these vectors can then be clustered using an appropriate closeness measure. Pruning is achieved by mapping each instance to the centroid of its cluster.
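The SVD-based feature extraction described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the zero-padding of instances to a common length, the truncation to `rank` dimensions, the use of cosine as the closeness measure, and the names `svd_feature_vectors` and `cosine_closeness` are all assumptions made for illustration.

```python
import numpy as np

def svd_feature_vectors(instances, rank):
    """Return one feature vector per instance: the rows of U*S, truncated to `rank`."""
    width = max(len(x) for x in instances)
    # Assumption: each instance is a 1-D array of samples, zero-padded to a common length.
    W = np.zeros((len(instances), width))
    for i, x in enumerate(instances):
        W[i, :len(x)] = x
    # Thin SVD of the instance matrix: W = U @ diag(s) @ Vt.
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    r = min(rank, len(s))
    # Row i of U*S is the feature vector associated with instance i.
    return U[:, :r] * s[:r]

def cosine_closeness(a, b):
    """Cosine of the angle between two feature vectors (an assumed closeness measure)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy data: two nearly identical instances and one clearly different instance.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 6, 50))
other = np.cos(np.linspace(0, 20, 40))
instances = [base, base + 0.01 * rng.standard_normal(50), other]

features = svd_feature_vectors(instances, rank=2)
near = cosine_closeness(features[0], features[1])  # near-redundant pair
far = cosine_closeness(features[0], features[2])   # distinct pair
```

In this toy setup the near-redundant pair scores a closeness close to 1, while the distinct pair scores noticeably lower, which is the property a clustering step would rely on.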
104 Claims
1. A machine-implemented method comprising:
pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
7. A machine-readable medium having instructions to cause a machine to perform a machine-implemented method comprising:
pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
13. An apparatus comprising:
means for automatically pruning redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
19. A system comprising:
a processing unit coupled to a memory through a bus; and
a process executed from the memory by the processing unit to cause the processing unit to:
prune redundancy of instances in a plurality of speech segments, wherein the redundancy criterion is based on a similarity measure between feature vectors derived from a machine perception transformation of the plurality of speech segments.
25. A redundancy pruned voice table for use in a text-to-speech synthesis system.
30. A text-to-speech synthesis system comprising a redundancy pruned voice table.
35. A machine-implemented method comprising:
identifying instances in a plurality of speech segments;
creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
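The clustering and replacement steps of claim 35 can be sketched as follows. This is a hedged illustration, not the claimed method itself: it swaps in a simple greedy "leader" clustering, uses Euclidean distance as the similarity measure, and keeps the member nearest the cluster centroid as the single surviving instance. The function name `prune_by_radius` and the toy feature vectors are hypothetical.

```python
import numpy as np

def prune_by_radius(features, radius):
    """Map each instance index to the index of the single instance kept for its cluster."""
    features = np.asarray(features, dtype=float)
    leaders = []   # index of the first member of each cluster
    members = []   # member indices, one list per cluster
    for i, f in enumerate(features):
        for c, lead in enumerate(leaders):
            # Assumption: Euclidean distance to the cluster leader defines membership.
            if np.linalg.norm(f - features[lead]) <= radius:
                members[c].append(i)
                break
        else:
            # No existing cluster is within the radius: start a new one.
            leaders.append(i)
            members.append([i])
    replacement = {}
    for idxs in members:
        centroid = features[idxs].mean(axis=0)
        dists = np.linalg.norm(features[idxs] - centroid, axis=1)
        keep = idxs[int(np.argmin(dists))]  # member closest to the centroid
        for i in idxs:
            replacement[i] = keep
    return replacement

# Toy feature vectors: three near-duplicates and one outlier.
features = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]]
mapping = prune_by_radius(features, radius=0.5)
kept = sorted(set(mapping.values()))  # kept == [1, 3]
```

A larger `radius` merges more instances per cluster and so prunes more aggressively, which is one way to read the user-determinable pruning factor mentioned in the abstract.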
49. A machine-readable medium having instructions to cause a machine to perform a machine-implemented method comprising:
identifying instances in a plurality of speech segments;
creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
63. An apparatus comprising:
means for identifying instances in a plurality of speech segments;
means for creating feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
means for clustering the feature vectors using a similarity measure in the feature space; and
means for replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
77. A system comprising:
a processing unit coupled to a memory through a bus; and
a process executed from the memory by the processing unit to cause the processing unit to:
identify instances in a plurality of speech segments;
create feature vectors derived from a machine perception transformation of the plurality of speech segments onto a feature space;
cluster the feature vectors using a similarity measure in the feature space; and
replace the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
91. A voice table for use in a text-to-speech synthesis system, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
identifying instances in the original voice table;
creating feature vectors derived from a machine perception transformation of speech segments in the original voice table onto a feature space;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
97. A text-to-speech synthesis system comprising a voice table, wherein the voice table is pruned from an original voice table according to a machine-implemented method comprising:
identifying instances in the original voice table;
creating feature vectors derived from a machine perception transformation of speech segments in the original voice table onto a feature space;
clustering the feature vectors using a similarity measure in the feature space; and
replacing the clustered instances corresponding to the clustered feature vectors within a predetermined radius by a single instance.
103. A machine readable medium containing executable instructions which when executed by a machine cause the machine to perform a method comprising:
receiving an input which comprises text;
retrieving data from a voice table, stored in a machine readable medium, the voice table having redundant instances pruned according to a redundancy criterion based on a similarity measure between feature vectors derived from a machine perception transformation of speech segments in the voice table.
Specification