Devices and methods for speech unit reduction in text-to-speech synthesis systems
First Claim
1. A method comprising:
- receiving, at a device, a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes;
determining, by the device, concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term;
clustering, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and
based on a determination that the first speech sound has the given concatenation features represented in the given cluster, providing a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
2 Assignments
0 Petitions
Accused Products
Abstract
A device may receive a plurality of speech sounds that are indicative of pronunciations of a first linguistic term. The device may determine concatenation features of the plurality of speech sounds. The concatenation features may be indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated. The first speech sound may be included in the plurality of speech sounds and the second speech sound may be indicative of a pronunciation of a second linguistic term. The device may cluster the plurality of speech sounds into one or more clusters based on the concatenation features. The device may provide a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
15 Citations
20 Claims
-
1. A method comprising:
-
receiving, at a device, a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes; determining, by the device, concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term; clustering, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and based on a determination that the first speech sound has the given concatenation features represented in the given cluster, providing a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A non-transitory computer readable medium having stored therein instructions, that when executed by a device, cause the device to perform functions, the functions comprising:
-
receiving, at the device, a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes; determining, by the device, concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term; clustering, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and based on a determination that the first speech sound has the given concatenation features represented in the given cluster, providing a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated. - View Dependent Claims (10, 11, 12, 13, 14)
-
-
15. A device comprising:
-
one or more processors; and data storage configured to store instructions executable by the one or more processors to cause the device to; receive a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes; determine concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term; cluster, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and based on a determination that the first speech sound has the given concatenation features represented in the given cluster, provide a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification