Devices and methods for speech unit reduction in text-to-speech synthesis systems

US 8,751,236 B1
Filed: 10/23/2013
Issued: 06/10/2014
Est. Priority Date: 10/23/2013
Status: Active Grant

Abstract

A device may receive a plurality of speech sounds that are indicative of pronunciations of a first linguistic term. The device may determine concatenation features of the plurality of speech sounds. The concatenation features may be indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated. The first speech sound may be included in the plurality of speech sounds and the second speech sound may be indicative of a pronunciation of a second linguistic term. The device may cluster the plurality of speech sounds into one or more clusters based on the concatenation features. The device may provide a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
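
The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not prescribe a particular clustering algorithm, so plain k-means (Lloyd's algorithm) is swapped in here, each speech sound is assumed to be already reduced to a numeric vector of concatenation features, and the function names (`kmeans`, `representatives`, `reduce_units`) and sample values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(features, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k vectors."""
    centroids = [list(v) for v in features[:k]]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assign each speech sound to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in features
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


def representatives(features, labels, centroids):
    """For each cluster, the index of the member closest to the centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(features[i], centroid))
    return reps


def reduce_units(features, k):
    """Map every speech-sound index to the index of its cluster representative."""
    labels, centroids = kmeans(features, k)
    reps = representatives(features, labels, centroids)
    return [reps[lab] for lab in labels]


# Hypothetical feature vectors: [last F0 in Hz, one spectral coefficient].
features = [
    [100.0, 0.1], [102.0, 0.1], [101.0, 0.1],
    [200.0, 0.9], [196.0, 0.9], [199.0, 0.9],
]
print(reduce_units(features, 2))  # -> [2, 2, 2, 5, 5, 5]
```

In this toy input, six recordings of the same linguistic term collapse to two stored units (indices 2 and 5), and each original sound is replaced by the representative of its cluster at concatenation time.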

20 Claims

1. A method comprising:
- receiving, at a device, a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes;
  
  determining, by the device, concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term;
  
  clustering, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and
  
  based on a determination that the first speech sound has the given concatenation features represented in the given cluster, providing a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
  - 2. The method of claim 1, further comprising:
    - determining, based on the given concatenation features of the one or more speech sounds in the given cluster, a space representation of the given cluster that includes one or more dimensions, wherein a given dimension corresponds to one of the given concatenation features;
      
      determining, by the device, a centroid of the given cluster, wherein the centroid is indicative of mean values of the given concatenation features in the one or more dimensions; and
      
      identifying, from within the given cluster, the representative speech sound based on the representative speech sound having concatenation features with values that are at a minimum distance from the centroid compared to concatenation features of other speech sounds in the given cluster.
  - 3. The method of claim 1, wherein the plurality of speech sounds is a first plurality of speech sounds, the method further comprising:
    - determining, by the device, a second plurality of speech sounds that includes representative speech sounds of the one or more clusters.
  - 4. The method of claim 1, further comprising:
    - receiving, by the device, configuration input indicative of a reduction for the plurality of speech sounds; and
      
      determining, based on the reduction, a quantity of the one or more clusters.
  - 5. The method of claim 1, wherein the concatenation features of the first speech sound include one or more of a last Fundamental Frequency value (F0), at least one frame of a spectral representation of a beginning portion of the first speech sound, or at least one frame of a spectral representation of an ending portion of the first speech sound.
  - 6. The method of claim 5, wherein the first speech sound is indicative of a pronunciation of a diphone, wherein the concatenation features of the first speech sound include one or more of a duration of the pronunciation of a first half of the diphone, a duration of the pronunciation of a second half of the diphone, F0 of the pronunciation of the first half of the diphone, F0 of the pronunciation of a center portion of the diphone, or F0 of the pronunciation of the second half of the diphone.
  - 7. The method of claim 6, further comprising:
    - receiving, by the device, configuration input indicative of a selection of the concatenation features to be included in the clustering.
  - 8. The method of claim 1, wherein the clustering metric includes a centroid-based metric indicative of the given concatenation features in the given cluster having a given distance from a centroid of the given cluster that is less than a threshold distance, a distribution-based metric indicative of the given concatenation features being associated to a given statistical distribution, a density-based metric indicative of the given cluster including the one or more speech sounds such that the given cluster has a given density greater than a threshold density, or a connectivity-based metric indicative of the given concatenation features having a connectivity distance that is less than the threshold distance.
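
As a rough illustration of claims 4 through 7, the sketch below shows one way a configurable feature selection and a configured reduction could feed the clustering step. Everything named here is an assumption for illustration: the `DiphoneFeatures` container, its field names and units, and the choice to express the reduction as an integer percentage of speech sounds to drop are not taken from the patent. The spectral frames listed in claim 5 are left out for brevity.

```python
from dataclasses import dataclass, asdict


@dataclass
class DiphoneFeatures:
    """Hypothetical container for scalar features named in claims 5 and 6."""
    first_half_duration: float   # seconds
    second_half_duration: float  # seconds
    f0_first_half: float         # Hz
    f0_center: float             # Hz
    f0_second_half: float        # Hz
    last_f0: float               # Hz


def feature_vector(unit, selected):
    """Build the clustering vector from the selected feature names (claim 7)."""
    values = asdict(unit)
    return [values[name] for name in selected]


def cluster_count(num_sounds, reduction_percent):
    """Translate a configured reduction (claim 4) into a number of clusters.

    Assumes reduction_percent is the share of speech sounds to drop, so 75
    keeps about a quarter of them. Integer arithmetic avoids floating-point
    rounding surprises; at least one cluster is always kept.
    """
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in [0, 100)")
    kept_times_100 = num_sounds * (100 - reduction_percent)
    return max(1, -(-kept_times_100 // 100))  # ceiling division


unit = DiphoneFeatures(0.04, 0.05, 120.0, 119.0, 118.5, 118.0)
print(feature_vector(unit, ["last_f0", "first_half_duration"]))
print(cluster_count(200, 75))
```

Each cluster contributes one representative speech sound, so the cluster count doubles as the size of the reduced inventory for that linguistic term.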

9. A non-transitory computer readable medium having stored therein instructions, that when executed by a device, cause the device to perform functions, the functions comprising:
- receiving, at the device, a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes;
  
  determining, by the device, concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term;
  
  clustering, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and
  
  based on a determination that the first speech sound has the given concatenation features represented in the given cluster, providing a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
  - 10. The non-transitory computer readable medium of claim 9, the functions further comprising:
    - determining, based on the given concatenation features of the one or more speech sounds in the given cluster, a space representation of the given cluster that includes one or more dimensions, wherein a given dimension corresponds to one of the given concatenation features;
      
      determining, by the device, a centroid of the given cluster, wherein the centroid is indicative of mean values of the given concatenation features in the one or more dimensions; and
      
      identifying, from within the given cluster, the representative speech sound based on the representative speech sound having concatenation features with values that are at a minimum distance from the centroid compared to concatenation features of other speech sounds in the given cluster.
  - 11. The non-transitory computer readable medium of claim 9, wherein the plurality of speech sounds is a first plurality of speech sounds, the functions further comprising:
    - determining, by the device, a second plurality of speech sounds that includes representative speech sounds of the one or more clusters.
  - 12. The non-transitory computer readable medium of claim 9, the functions further comprising:
    - receiving, by the device, configuration input indicative of a reduction for the plurality of speech sounds; and
      
      determining, based on the reduction, a quantity of the one or more clusters.
  - 13. The non-transitory computer readable medium of claim 9, wherein the concatenation features of the first speech sound include one or more of a last Fundamental Frequency value (F0), at least one frame of a spectral representation of a beginning portion of the first speech sound, or at least one frame of a spectral representation of an ending portion of the first speech sound.
  - 14. The non-transitory computer readable medium of claim 13, wherein the first speech sound is indicative of a pronunciation of a diphone, wherein the concatenation features of the first speech sound include one or more of a duration of the pronunciation of a first half of the diphone, a duration of the pronunciation of a second half of the diphone, F0 of the pronunciation of the first half of the diphone, F0 of the pronunciation of a center portion of the diphone, or F0 of the pronunciation of the second half of the diphone.

15. A device comprising:
- one or more processors; and
  
  data storage configured to store instructions executable by the one or more processors to cause the device to:
  
  receive a plurality of speech sounds that are each indicative of a different full pronunciation of a first linguistic term, wherein the first linguistic term includes a representation of one or more phonemes;
  
  determine concatenation features of the plurality of speech sounds of the first linguistic term, wherein the concatenation features are indicative of an acoustic transition between a first speech sound and a second speech sound when the first speech sound and the second speech sound are concatenated, wherein the first speech sound is included in the plurality of speech sounds of the first linguistic term and the second speech sound is indicative of a pronunciation of a second linguistic term;
  
  cluster, based on the concatenation features, the plurality of speech sounds into one or more clusters, wherein a given cluster includes one or more speech sounds of the plurality of speech sounds that have given concatenation features that are related by a clustering metric; and
  
  based on a determination that the first speech sound has the given concatenation features represented in the given cluster, provide a representative speech sound of the given cluster as the first speech sound when the first speech sound and the second speech sound are concatenated.
  - 16. The device of claim 15, wherein the instructions executable by the one or more processors further cause the device to:
    - determine, based on the given concatenation features of the one or more speech sounds in the given cluster, a space representation of the given cluster that includes one or more dimensions, wherein a given dimension corresponds to one of the given concatenation features;
      
      determine a centroid of the given cluster, wherein the centroid is indicative of mean values of the given concatenation features in the one or more dimensions; and
      
      identify, from within the given cluster, the representative speech sound based on the representative speech sound having concatenation features with values that are at a minimum distance from the centroid compared to concatenation features of other speech sounds in the given cluster.
  - 17. The device of claim 15, wherein the plurality of speech sounds is a first plurality of speech sounds, wherein the instructions executable by the one or more processors further cause the device to:
    - determine a second plurality of speech sounds that includes representative speech sounds of the one or more clusters.
  - 18. The device of claim 15, wherein the instructions executable by the one or more processors further cause the device to:
    - receive configuration input indicative of a reduction for the plurality of speech sounds; and
      
      determine, based on the reduction, a quantity of the one or more clusters.
  - 19. The device of claim 15, wherein the concatenation features of the first speech sound include one or more of a last Fundamental Frequency value (F0), at least one frame of a spectral representation of a beginning portion of the first speech sound, or at least one frame of a spectral representation of an ending portion of the first speech sound.
  - 20. The device of claim 19, wherein the first speech sound is indicative of a pronunciation of a diphone, wherein the concatenation features of the first speech sound include one or more of a duration of the pronunciation of a first half of the diphone, a duration of the pronunciation of a second half of the diphone, F0 of the pronunciation of the first half of the diphone, F0 of the pronunciation of a center portion of the diphone, or F0 of the pronunciation of the second half of the diphone.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Agiomyrgiannakis, Ioannis; Fructuoso, Javier Gonzalvo; Gutkin, Alexander
Primary Examiner(s): Neway, Samuel G
Application Number: US14/061,118
Time in Patent Office: 230 Days
Field of Search: 704/258-269
US Class Current: 704/258
CPC Class Codes: G10L 13/06 Elementary speech units use...
