DEEP NETWORKS FOR UNIT SELECTION SPEECH SYNTHESIS

US 20150073804A1
Filed: 09/06/2013
Published: 03/12/2015
Est. Priority Date: 09/06/2013
Status: Active Grant

First Claim

1. A method comprising:

receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features;

determining a distance between the target acoustic features and acoustic features of a stored acoustic sample;

selecting the acoustic sample to be used in speech synthesis based at least on the determined distance; and

synthesizing speech based on the selected acoustic sample.

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing a representation based on structured data in resources. The methods, systems, and apparatus include actions of receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features. Additional actions include determining a distance between the target acoustic features and acoustic features of a stored acoustic sample. Further actions include selecting the acoustic sample to be used in speech synthesis based at least on the determined distance and synthesizing speech based on the selected acoustic sample.
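
The selection step described above can be illustrated with a minimal Python sketch. It assumes the target acoustic features and each stored sample's features are plain lists of floats of equal length, and it uses Euclidean distance as the only cost. The names `euclidean_distance`, `select_unit`, `target_features`, and `stored_units`, along with all numeric values, are hypothetical and do not come from the patent; the neural network itself is not modeled here.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_unit(target, candidates):
    """Return (unit_id, distance) for the stored unit closest to the target.

    `candidates` maps a hypothetical unit id to its stored feature vector.
    """
    best_id = min(
        candidates,
        key=lambda uid: euclidean_distance(target, candidates[uid]),
    )
    return best_id, euclidean_distance(target, candidates[best_id])


# Hypothetical values standing in for the neural network's predicted features.
target_features = [1.0, 2.0, 2.0]

# Hypothetical stored acoustic samples, each reduced to a feature vector.
stored_units = {
    "unit_a": [1.0, 2.0, 5.0],
    "unit_b": [1.0, 2.0, 3.0],
    "unit_c": [4.0, 6.0, 2.0],
}

best_id, best_distance = select_unit(target_features, stored_units)
print(best_id, best_distance)
```

In this toy data the closest stored unit is `unit_b`. A real system would then pass the selected sample to a synthesis stage, which this sketch leaves out.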

206 Citations

20 Claims

1. A method comprising:
- receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features;
  
  determining a distance between the target acoustic features and acoustic features of a stored acoustic sample;
  
  selecting the acoustic sample to be used in speech synthesis based at least on the determined distance; and
  
  synthesizing speech based on the selected acoustic sample.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, further comprising:
    - providing the synthesized speech for output.
  - 3. The method of claim 1, wherein the target acoustic features comprise a plurality of values describing acoustic characteristics.
  - 4. The method of claim 3, wherein determining a distance between the target acoustic features and acoustic features of a stored acoustic sample comprises:
    - calculating an Euclidean distance between a point represented by the values of the target acoustic features and a point represented by values describing the acoustic features of the stored acoustic sample.
  - 5. The method of claim 1, wherein selecting the acoustic sample to be used in speech synthesis based on at least the determined distance comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the target acoustic features and acoustic features of other stored acoustic samples.
  - 6. The method of claim 1, wherein selecting the acoustic sample to be used in speech synthesis is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.
  - 7. The method of claim 6, wherein selecting the acoustic sample to be used in speech synthesis based on at least the determined distance comprises:
    - determining the acoustic sample corresponds to a cost, based on (i) the determined distance and (ii) the join cost, that is less than or equal to costs based on (i) determined distances between the target acoustic features and acoustic features of other stored acoustic samples and (ii) join costs of the other stored acoustic samples.
  - 8. The method of claim 1, further comprising:
    - determining a distance between the target acoustic features and a model that includes the stored acoustic samples and other acoustic samples; and
      
      selecting, based on at least the determined distance, the model to select acoustic samples within the model.
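
Claim 8 describes a two-stage search: first choose a model by its distance to the target acoustic features, then choose acoustic samples within that model. The claims do not say how a model is represented, so the following Python sketch assumes each model is summarized by a mean feature vector and holds a small set of stored units. The structure of `models`, the function names, and all values are hypothetical.

```python
import math

# Hypothetical models: each groups stored units and is summarized by a mean
# feature vector (an assumption; the claims do not specify a representation).
models = {
    "model_low": {
        "mean": [0.0, 0.0],
        "units": {"u1": [0.0, 1.0], "u2": [1.0, 0.0]},
    },
    "model_high": {
        "mean": [10.0, 10.0],
        "units": {"u3": [9.0, 10.0], "u4": [10.0, 12.0]},
    },
}


def select_model(target, models):
    """Stage 1: pick the model whose mean is closest to the target features."""
    return min(models, key=lambda name: math.dist(target, models[name]["mean"]))


def select_unit_in_model(target, model):
    """Stage 2: pick the closest stored unit, searching only inside one model."""
    units = model["units"]
    return min(units, key=lambda uid: math.dist(target, units[uid]))


target = [9.5, 10.0]  # hypothetical target acoustic features
chosen_model = select_model(target, models)
chosen_unit = select_unit_in_model(target, models[chosen_model])
print(chosen_model, chosen_unit)
```

The point of the design is to avoid comparing the target against every stored sample: only the units inside the chosen model are scored in the second stage.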

9. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features;
  
  determining a distance between the target acoustic features and acoustic features of a stored acoustic sample;
  
  selecting the acoustic sample to be used in speech synthesis based at least on the determined distance; and
  
  synthesizing speech based on the selected acoustic sample.
- Dependent Claims (10, 11, 12, 13, 14)
  - 10. The system of claim 9, further comprising:
    - providing the synthesized speech for output.
  - 11. The system of claim 9, wherein the target acoustic features comprise a plurality of values describing acoustic characteristics.
  - 12. The system of claim 11, wherein determining a distance between the target acoustic features and acoustic features of a stored acoustic sample comprises:
    - calculating an Euclidean distance between a point represented by the values of the target acoustic features and a point represented by values describing the acoustic features of the stored acoustic sample.
  - 13. The system of claim 9, wherein selecting the acoustic sample to be used in speech synthesis based on at least the determined distance comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the target acoustic features and acoustic features of other stored acoustic samples.
  - 14. The system of claim 9, wherein selecting the acoustic sample to be used in speech synthesis is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.
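
The claims above combine a distance-based cost with a join cost that represents the discontinuity between consecutive acoustic samples. One common way to minimize such a combined cost over a whole sequence is a Viterbi-style dynamic program, sketched below in Python. This is an illustration under stated assumptions, not the method recited in the claims: each unit is a dict with hypothetical `features`, `start`, and `end` vectors, the join cost is taken to be the Euclidean distance between one unit's `end` and the next unit's `start`, and `join_weight` is an invented tuning parameter.

```python
import math


def viterbi_select(targets, candidates, join_weight=1.0):
    """Choose one candidate per position minimizing target cost + join cost.

    targets:    list of target feature vectors, one per position.
    candidates: list (same length) of lists of unit dicts with the keys
                "features", "start" and "end" (hypothetical layout).
    Returns (list of chosen candidate indices, total cost).
    """
    # best[t][j] = (lowest cost of a path ending at candidate j, backpointer)
    best = [[(math.dist(targets[0], c["features"]), None) for c in candidates[0]]]
    for t in range(1, len(targets)):
        row = []
        for cur in candidates[t]:
            target_cost = math.dist(targets[t], cur["features"])
            # Cheapest way to reach `cur` from any candidate at position t - 1.
            path_cost, back = min(
                (best[t - 1][i][0]
                 + join_weight * math.dist(prev["end"], cur["start"]), i)
                for i, prev in enumerate(candidates[t - 1])
            )
            row.append((path_cost + target_cost, back))
        best.append(row)

    # Trace the cheapest path back from the last position.
    last = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][last][0]
    path = [last]
    for t in range(len(targets) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, total


# Hypothetical one-dimensional data for two consecutive positions.
targets = [[0.0], [2.0]]
candidates = [
    [
        {"features": [0.0], "start": [0.0], "end": [1.0]},
        {"features": [0.5], "start": [0.0], "end": [3.0]},
    ],
    [
        {"features": [2.0], "start": [3.0], "end": [2.0]},
        {"features": [2.25], "start": [1.0], "end": [2.0]},
    ],
]

path_with_join, cost_with_join = viterbi_select(targets, candidates)
path_no_join, cost_no_join = viterbi_select(targets, candidates, join_weight=0.0)
print(path_with_join, cost_with_join)
print(path_no_join, cost_no_join)
```

With the join cost ignored (`join_weight=0.0`), the search picks the unit with the lowest target cost at each position. With the join cost included, it prefers the second candidate at position two, because that unit's `start` matches the previous unit's `end` even though its features are slightly further from the target.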

15. A computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features;
  
  determining a distance between the target acoustic features and acoustic features of a stored acoustic sample;
  
  selecting the acoustic sample to be used in speech synthesis based at least on the determined distance; and
  
  synthesizing speech based on the selected acoustic sample.
- Dependent Claims (16, 17, 18, 19, 20)
  - 16. The medium of claim 15, further comprising:
    - providing the synthesized speech for output.
  - 17. The medium of claim 15, wherein the target acoustic features comprise a plurality of values describing acoustic characteristics.
  - 18. The medium of claim 17, wherein determining a distance between the target acoustic features and acoustic features of a stored acoustic sample comprises:
    - calculating an Euclidean distance between a point represented by the values of the target acoustic features and a point represented by values describing the acoustic features of the stored acoustic sample.
  - 19. The medium of claim 15, wherein selecting the acoustic sample to be used in speech synthesis based on at least the determined distance comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the target acoustic features and acoustic features of other stored acoustic samples.
  - 20. The medium of claim 15, wherein selecting the acoustic sample to be used in speech synthesis is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Senior, Andrew W.; Fructuoso, Javier Gonzalvo

Granted Patent: US 9,460,704 B2
US Class Current: 704/259
CPC Class Codes:
- G10L 13/06 Elementary speech units use...
- G10L 25/30 using neural networks
