UNIT-SELECTION TEXT-TO-SPEECH SYNTHESIS BASED ON PREDICTED CONCATENATION PARAMETERS

US 20170345411A1
Filed: 09/15/2016
Published: 11/30/2017
Est. Priority Date: 05/26/2016
Status: Active Grant


Abstract

Systems and processes for performing unit-selection text-to-speech synthesis are provided. In an example process, text to be converted to speech is received. The text is represented as a sequence of target units. A plurality of candidate speech segments corresponding to the sequence of target units are selected. Predicted statistical parameters of acoustic features associated with the sequence of target units are determined. The predicted statistical parameters of acoustic features are used to determine target costs and concatenation costs associated with the plurality of candidate speech segments. Based on a combined cost determined from the target costs and concatenation costs, a subset of candidate speech segments is selected from the plurality of candidate speech segments. Speech corresponding to the received text is generated using the subset of candidate speech segments.
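
The selection step summarized above can be illustrated with a small dynamic-programming search. This is a minimal sketch, not the patented implementation: the abstract only says that a subset is selected based on a combined cost, and the Viterbi-style search, the function name `select_segments`, and the callback signatures are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_segments(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    concat_cost: Callable[[int, str, str], float],
) -> Tuple[List[str], float]:
    """Return the lowest combined-cost path, one candidate per target unit.

    candidates[t] lists the candidate segment ids for target unit t.
    target_cost(t, seg) scores a candidate against target unit t.
    concat_cost(t, seg, nxt) scores joining seg (unit t) to nxt (unit t + 1).
    All three names are hypothetical; they are not taken from the patent.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target unit needs at least one candidate")

    # best[t][j]: cheapest cost of any path ending at candidates[t][j].
    best = [[target_cost(0, seg) for seg in candidates[0]]]
    back: List[List[int]] = [[-1] * len(candidates[0])]

    for t in range(1, len(candidates)):
        row, ptr = [], []
        for seg in candidates[t]:
            # Pick the predecessor minimizing accumulated cost plus join cost.
            costs = [
                best[t - 1][i] + concat_cost(t - 1, prev, seg)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[i_min] + target_cost(t, seg))
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[str] = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = back[t][j]
    path.reverse()
    return path, total
```

Here the combined cost is taken to be the plain sum of target and concatenation costs along the path; a weighted sum would fit the same structure.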

25 Claims

1. A system for unit-selection text-to-speech synthesis, the system comprising:
- one or more processors; and
  
  memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to:
  
  receive text to be converted to speech;
  
  generate a sequence of target units representing a spoken pronunciation of the text;
  
  determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
  
  select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
  
  for each candidate speech segment of the plurality of candidate speech segments:
  
  determine a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
  
  determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
  
  select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
  
  generate speech corresponding to the received text using the subset of candidate speech segments.
- - 2. The system of claim 1, wherein the second acoustic feature represents a change of the first acoustic feature.
  - 3. The system of claim 2, wherein the change of the first acoustic feature is with respect to an end of the respective target unit.
  - 4. The system of claim 1, wherein the first acoustic feature comprises fundamental frequency and the second acoustic feature comprises a change in the fundamental frequency at an end of the respective target unit.
  - 5. The system of claim 1, wherein the first acoustic feature comprises a mel-frequency cepstral coefficient and the second acoustic feature comprises a change in the mel-frequency cepstral coefficient at an end of the respective target unit.
  - 6. The system of claim 1, wherein the plurality of acoustic features include a fundamental frequency at a first portion of the respective target unit and a fundamental frequency at a second portion of the respective target unit.
  - 7. The system of claim 1, wherein the plurality of acoustic features includes a first plurality of mel-frequency cepstral coefficients at a first portion of the respective target unit and a second plurality of mel-frequency cepstral coefficients at a second portion of the respective target unit.
  - 8. The system of claim 1, wherein the plurality of acoustic features includes a duration of the respective target unit.
  - 9. The system of claim 1, wherein the predicted statistical parameters of the second acoustic feature are not derived from the predicted statistical parameters of the first acoustic feature.
  - 10. The system of claim 1, wherein the predicted statistical parameters for each of the plurality of acoustic features include a mean parameter for each of the plurality of acoustic features and a variance parameter for each of the plurality of acoustic features.
  - 11. The system of claim 1, wherein the target cost for a respective candidate speech segment is based on a weighted difference between an actual value of the first acoustic feature for the respective candidate speech segment and a first predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit, and wherein the weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit.
  - 12. The system of claim 1, wherein a concatenation cost of the plurality of concatenation costs for a respective candidate speech segment includes a second weighted difference between an actual value of the second acoustic feature for the respective candidate speech segment with respect to a subsequent candidate speech segment of the plurality of subsequent candidate speech segments and a first predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit, and wherein the second weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit.
  - 13. The system of claim 12, wherein the actual value of the second acoustic feature for the respective candidate speech segment with respect to the subsequent candidate speech segment of the plurality of subsequent candidate speech segments comprises a difference between an actual value of the first acoustic feature at an end of the respective candidate speech segment and an actual value of the first acoustic feature at a beginning of the subsequent candidate speech segment.
  - 14. The system of claim 1, wherein the predicted statistical parameters for each of the plurality of acoustic features associated with each target unit are determined using a statistical model.
  - 15. The system of claim 14, wherein the statistical model is composed of a mixture of probability distributions.
  - 16. The system of claim 14, wherein the statistical model is configured to:
    - receive, as inputs, the plurality of linguistic features associated with a respective target unit; and
      
      output the predicted statistical parameters for each of the plurality of acoustic features associated with the respective target unit.
  - 17. The system of claim 16, wherein the statistical model is further configured to output one or more density weights for each of the plurality of acoustic features associated with the respective target unit.
  - 18. The system of claim 14, wherein the statistical model is a mixture density network comprising:
    - an input layer configured to receive as inputs the plurality of linguistic features associated with a respective target unit;
      
      an output layer configured to output the predicted statistical parameters for each of the plurality of acoustic features associated with the respective target unit; and
      
      at least one hidden layer between the input layer and the output layer.
  - 19. The system of claim 14, wherein the statistical model is configured to determine, for each target unit, the predicted statistical parameters of the second acoustic feature independent of the predicted statistical parameters of the first acoustic feature.
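
Claims 15 through 18 describe a mixture density network with an input layer, at least one hidden layer, and an output layer that emits statistical parameters and density weights. The forward pass below is a minimal sketch of one such network, assuming a single tanh hidden layer, Gaussian mixture components, a softmax over density weights, and an exponential to keep variances positive. None of these choices, nor the name `mdn_forward` or the output layout, come from the claims.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def _affine(w: Matrix, b: List[float], x: List[float]) -> List[float]:
    """Compute w @ x + b for a row-major weight matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def _softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mdn_forward(
    x: List[float],
    w_h: Matrix,
    b_h: List[float],
    w_o: Matrix,
    b_o: List[float],
    n_features: int,
    n_mix: int,
) -> List[Dict[str, List[float]]]:
    """Map linguistic features x to mixture parameters per acoustic feature.

    Assumed output layout (hypothetical): the output layer has
    3 * n_features * n_mix units, ordered as all density-weight logits,
    then all means, then all log-variances.
    """
    hidden = [math.tanh(v) for v in _affine(w_h, b_h, x)]
    raw = _affine(w_o, b_o, hidden)
    k = n_features * n_mix
    if len(raw) != 3 * k:
        raise ValueError("output layer size must be 3 * n_features * n_mix")

    out = []
    for f in range(n_features):
        lo = f * n_mix
        out.append({
            "weights": _softmax(raw[lo:lo + n_mix]),
            "means": raw[k + lo:k + lo + n_mix],
            "variances": [math.exp(v) for v in raw[2 * k + lo:2 * k + lo + n_mix]],
        })
    return out
```

Because each acoustic feature gets its own slice of output units, the parameters of a second feature (for example a change in fundamental frequency) are produced directly by the network rather than computed from those of the first, which is one way to read claims 9 and 19.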

20. A method for unit-selection text-to-speech synthesis, comprising:
- at an electronic device having a processor and memory:
  
  receiving text to be converted to speech;
  
  generating a sequence of target units representing a spoken pronunciation of the text;
  
  determining, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
  
  selecting, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
  
  for each candidate speech segment of the plurality of candidate speech segments:
  
  determining a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
  
  determining a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
  
  selecting from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
  
  generating speech corresponding to the received text using the subset of candidate speech segments.
- - 21. The method of claim 20, wherein the second acoustic feature represents a change of the first acoustic feature.
  - 22. The method of claim 20, wherein the target cost for a respective candidate speech segment is based on a weighted difference between an actual value of the first acoustic feature for the respective candidate speech segment and a first predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit, and wherein the weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit.
  - 23. The method of claim 20, wherein a concatenation cost of the plurality of concatenation costs for a respective candidate speech segment includes a second weighted difference between an actual value of the second acoustic feature for the respective candidate speech segment with respect to a subsequent candidate speech segment of the plurality of subsequent candidate speech segments and a first predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit, and wherein the second weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit.
  - 24. The method of claim 23, wherein the actual value of the second acoustic feature for the respective candidate speech segment with respect to the subsequent candidate speech segment of the plurality of subsequent candidate speech segments comprises a difference between an actual value of the first acoustic feature at an end of the respective candidate speech segment and an actual value of the first acoustic feature at a beginning of the subsequent candidate speech segment.
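
The weighted differences recited in claims 22 through 24 can be sketched as follows. This is an illustrative reading only: it assumes the first predicted statistical parameter is a mean and the second is a variance (as in claim 10), that the weighting is a squared difference divided by the variance, and that the change at a join is measured as the next segment's starting value minus the current segment's ending value. The claims do not fix any of these choices, and the names `Gaussian`, `target_cost`, and `concatenation_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """Predicted statistical parameters for one acoustic feature (assumed form)."""
    mean: float
    variance: float


def weighted_sq_diff(actual: float, pred: Gaussian) -> float:
    """Squared deviation from the predicted mean, scaled by inverse variance."""
    if pred.variance <= 0:
        raise ValueError("variance must be positive")
    return (actual - pred.mean) ** 2 / pred.variance


def target_cost(candidate_value: float, pred_first: Gaussian) -> float:
    """Cost of a candidate's actual first-feature value (e.g. F0) for a target unit."""
    return weighted_sq_diff(candidate_value, pred_first)


def concatenation_cost(
    end_value: float, next_start_value: float, pred_second: Gaussian
) -> float:
    """Cost of joining a candidate to a subsequent candidate.

    The actual change across the join is compared with the predicted change
    (the second acoustic feature) for the same target unit.
    """
    actual_change = next_start_value - end_value  # sign convention is an assumption
    return weighted_sq_diff(actual_change, pred_second)
```

With inverse-variance weighting, a confident prediction (small variance) penalizes mismatches heavily, while an uncertain one tolerates them. For instance, a candidate at 120 Hz against a predicted mean of 110 Hz and variance of 25 costs 4.0 under this sketch.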

25. A non-transitory computer-readable storage medium comprising computer-readable instructions which, when executed by one or more processors, cause the one or more processors to:
- receive text to be converted to speech;
  
  generate a sequence of target units representing a spoken pronunciation of the text;
  
  determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit;
  
  select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units;
  
  for each candidate speech segment of the plurality of candidate speech segments;
  
  determine a target cost based on the predicted statistical parameters of a first acoustic feature of the plurality of acoustic features associated with a respective target unit of the sequence of target units; and
  
  determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of a second acoustic feature of the plurality of acoustic features associated with the respective target unit of the sequence of target units;
  
  select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and
  
  generate speech corresponding to the received text using the subset of candidate speech segments.

Current Assignee
Apple Inc.
Original Assignee
Apple Inc.
Inventors
RAITIO, Tuomo J., PRAHALLAD, Kishore Sunkeswari, CONKIE, Alistair D., GOLIPOUR, Ladan, WINARSKY, David A.

Granted Patent

US 9,934,775 B2
CPC Class Codes

G10L 13/0335   Pitch control

G10L 13/06   Elementary speech units use...

G10L 13/07   Concatenation rules

G10L 13/10   Prosody rules derived from ...
