Deep networks for unit selection speech synthesis

US 9,460,704 B2
Filed: 09/06/2013
Issued: 10/04/2016
Est. Priority Date: 09/06/2013
Status: Active Grant


Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for providing a representation based on structured data in resources. The methods, systems, and apparatus include actions of receiving target acoustic features output from a neural network that has been trained to predict acoustic features given linguistic features. Additional actions include determining a distance between the target acoustic features and acoustic features of a stored acoustic sample. Further actions include selecting the acoustic sample to be used in speech synthesis based at least on the determined distance and synthesizing speech based on the selected acoustic sample.
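
The distance-based selection summarized in the abstract can be illustrated with a minimal sketch. It assumes the target acoustic features have already been produced by the trained neural network and that each stored acoustic sample is described by a feature vector of the same length. The names `euclidean_distance`, `select_unit`, and the sample identifiers are hypothetical and do not come from the patent.

```python
import math
from typing import Dict, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors treated as points."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_unit(target: Sequence[float],
                candidates: Dict[str, Sequence[float]]) -> str:
    """Return the id of the stored sample whose features are closest to target.

    `target` stands in for the network's output; no network is modeled here.
    """
    if not candidates:
        raise ValueError("at least one candidate sample is required")
    return min(candidates,
               key=lambda uid: euclidean_distance(target, candidates[uid]))


# Hypothetical two-dimensional features for three stored samples.
stored = {"unit_a": [0.0, 0.0], "unit_b": [1.0, 2.5], "unit_c": [4.0, 6.0]}
chosen = select_unit([1.0, 2.0], stored)
```

Here `chosen` is the sample with the smallest distance to the target point. A real system would use far higher-dimensional features and typically many candidates per phone.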

17 Claims

1. A method comprising:
- obtaining a set of phones that is associated with text that is to be synthesized into speech;
  
  accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
  
  providing a particular set of phones for input to the neural network;
  
  receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
  
  determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
  
  selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
  
  synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
  
  providing the speech for output.
- - 2. The method of claim 1, wherein the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones comprise a plurality of values describing acoustic characteristics.
  - 3. The method of claim 2, wherein determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample comprises:
    - calculating a Euclidean distance between a point represented by the values of the set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and a point represented by values describing the set of acoustic features that is associated with the stored acoustic sample.
  - 4. The method of claim 1, wherein selecting the acoustic sample to be used in synthesizing the text into speech based on at least the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the particular set of target acoustic features and sets of acoustic features of other stored acoustic samples.
  - 5. The method of claim 1, wherein selecting the acoustic sample to be used in synthesizing the text into speech is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.
  - 6. The method of claim 5, wherein selecting the acoustic sample to be used in synthesizing the text into speech based on at least the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample comprises:
    - determining the acoustic sample corresponds to a cost, based on (i) the determined distance and (ii) the join cost, that is less than or equal to costs based on (i) determined distances between the target acoustic features and acoustic features of other stored acoustic samples and (ii) join costs of the other stored acoustic samples.
  - 7. The method of claim 1, further comprising:
    - determining a distance between the particular set of target acoustic features and a model that includes the stored acoustic samples and other acoustic samples; and
      
      selecting, based on at least the determined distance, the model to select acoustic samples within the model.
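
Claims 5 and 6 combine the distance-based cost with a join cost between consecutive samples. The sketch below shows one way such a combined cost could be minimized over a whole utterance. Several points are assumptions for illustration and are not stated in the claims: the dynamic-programming (Viterbi-style) search, the use of a single feature vector per sample, the join cost computed as the Euclidean distance between consecutive samples' feature vectors, and the `join_weight` parameter. The function name `select_sequence` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Unit = Tuple[str, Sequence[float]]  # (sample id, acoustic feature vector)


def select_sequence(targets: List[Sequence[float]],
                    candidates: List[List[Unit]],
                    join_weight: float = 1.0) -> Tuple[List[str], float]:
    """Pick one candidate per position, minimizing target cost plus join cost.

    targets[i] is the network's target feature vector for position i;
    candidates[i] lists the stored samples that may fill position i.
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("targets and candidates must be non-empty and aligned")
    if any(not c for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j] = (cost, path) of the cheapest path ending in candidate j
    # at the current position.
    best = [(math.dist(targets[0], feats), [uid])
            for uid, feats in candidates[0]]

    for pos in range(1, len(targets)):
        new_best = []
        for uid, feats in candidates[pos]:
            target_cost = math.dist(targets[pos], feats)
            # Cheapest predecessor, including the (assumed) join cost.
            prev_cost, prev_path = min(
                ((cost + join_weight * math.dist(prev_feats, feats), path)
                 for (cost, path), (_, prev_feats)
                 in zip(best, candidates[pos - 1])),
                key=lambda item: item[0])
            new_best.append((prev_cost + target_cost, prev_path + [uid]))
        best = new_best

    total, path = min(best, key=lambda item: item[0])
    return path, total


# Hypothetical one-dimensional example with two positions.
targets = [[1.0], [5.0]]
candidates = [[("a1", [0.0]), ("a2", [4.0])], [("b1", [5.0])]]
path_with_join, cost_with_join = select_sequence(targets, candidates)
path_no_join, cost_no_join = select_sequence(targets, candidates, join_weight=0.0)
```

With `join_weight=0.0` the search reduces to picking the nearest sample at each position independently, as in claim 4. With a positive weight, a sample that is slightly farther from its target can win if it joins more smoothly with its neighbor. `math.dist` requires Python 3.8 or later.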

8. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  obtaining a set of phones that is associated with text that is to be synthesized into speech;
  
  accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
  
  providing a particular set of phones for input to the neural network;
  
  receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
  
  determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
  
  selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
  
  synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
  
  providing the speech for output.
- - 9. The system of claim 8, wherein the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones comprise a plurality of values describing acoustic characteristics.
  - 10. The system of claim 9, wherein determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample comprises:
    - calculating a Euclidean distance between a point represented by the values of the set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and a point represented by values describing the set of acoustic features that is associated with the stored acoustic sample.
  - 11. The system of claim 8, wherein selecting the acoustic sample to be used in synthesizing the text into speech based on at least the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the particular set of target acoustic features and sets of acoustic features of other stored acoustic samples.
  - 12. The system of claim 8, wherein selecting the acoustic sample to be used in synthesizing the text into speech is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.

13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- obtaining a set of phones that is associated with text that is to be synthesized into speech;
  
  accessing a neural network that has been trained to estimate a set of target acoustic features that represent a close acoustic match to a given set of phones;
  
  providing a particular set of phones for input to the neural network;
  
  receiving, from the neural network, a particular set of target acoustic features that represents the acoustic match to the particular set of phones;
  
  determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample;
  
  selecting the acoustic sample to be used in synthesizing the text into speech based at least on the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample;
  
  synthesizing, using an automated speech synthesizer, the text into speech using the selected acoustic sample; and
  
  providing the speech for output.
- - 14. The medium of claim 13, wherein the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones comprise a plurality of values describing acoustic characteristics.
  - 15. The medium of claim 14, wherein determining a distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) a set of acoustic features that is associated with a stored acoustic sample comprises:
    - calculating a Euclidean distance between a point represented by the values of the set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and a point represented by values describing the set of acoustic features that is associated with the stored acoustic sample.
  - 16. The medium of claim 13, wherein selecting the acoustic sample to be used in synthesizing the text into speech based on at least the determined distance between (i) the particular set of target acoustic features that the neural network indicates represents the acoustic match to the particular set of phones and (ii) the set of acoustic features that is associated with the stored acoustic sample comprises:
    - determining the acoustic sample corresponds to a cost based on the determined distance that is less than or equal to costs based on other determined distances between the particular set of target acoustic features and sets of acoustic features of other stored acoustic samples.
  - 17. The medium of claim 13, wherein selecting the acoustic sample to be used in synthesizing the text into speech is further based on at least a join cost of the acoustic sample representing discontinuity of the acoustic sample and another acoustic sample consecutive with the acoustic sample.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Senior, Andrew W.; Fructuoso, Javier Gonzalvo
Primary Examiner(s): Saint Cyr, Leonard
Application Number: US14/019,967
Publication Number: US 20150073804A1
Time in Patent Office: 1,124 Days
Field of Search: 704/235, 704/258-260
US Class Current: 1/1
CPC Class Codes:
G10L 13/06 Elementary speech units use...
G10L 25/30 using neural networks
