Speech synthesis system, speech synthesis program product, and speech synthesis method

US 9,275,631 B2
Filed: 12/31/2012
Issued: 03/01/2016
Est. Priority Date: 09/07/2007
Status: Active Grant

8 Assignments

0 Petitions

Abstract

Waveform concatenation speech synthesis with high sound quality. Prosody with both high accuracy and high sound quality is achieved by performing a two-path search consisting of a speech segment search and a prosody modification value search. An accurate accent is secured by evaluating the consistency of the prosody against a statistical model of prosody variations (the slope of the fundamental frequency) in both paths: the speech segment selection and the modification value search. The prosody modification value search looks for the prosody modification value sequence that minimizes a modified prosody cost. The result is a modification value sequence that raises the likelihood of the absolute values or variations of the prosody under the statistical model as far as possible while keeping the modification values as small as possible.
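
The two searches described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the patent: each segment is reduced to a single F0 value in Hz, `TARGET_SLOPE` stands in for the statistical slope model, the helper names are hypothetical, and exhaustive search replaces whatever search strategy a real system would use.

```python
from itertools import product

TARGET_SLOPE = 10.0  # assumed "ideal" F0 rise between neighbours, in Hz


def slope_error(f0s):
    """Squared deviation of each neighbour-to-neighbour F0 slope from the target."""
    return sum((b - a - TARGET_SLOPE) ** 2 for a, b in zip(f0s, f0s[1:]))


def select_segments(candidates):
    """First search: choose one candidate per position, minimising the first cost."""
    return min(product(*candidates), key=slope_error)


def search_modifications(segments, shifts, penalty=0.5):
    """Second search: segments stay fixed; choose one F0 shift per segment."""
    def second_cost(mods):
        modified = [f + m for f, m in zip(segments, mods)]
        # Slope consistency after modification plus a penalty on the amount of change.
        return slope_error(modified) + penalty * sum(abs(m) for m in mods)
    return min(product(shifts, repeat=len(segments)), key=second_cost)


# Hypothetical candidate F0 values for three positions in the utterance.
candidates = [[100.0, 120.0], [105.0, 140.0], [125.0, 150.0]]
selected = select_segments(candidates)
mods = search_modifications(selected, shifts=(-5.0, 0.0, 5.0))
modified = [f + m for f, m in zip(selected, mods)]
print(selected, mods, modified)
```

In this sketch the second cost differs from the first only by the added modification penalty, and the modified sequence keeps the same number of segments as the selected one.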

106 Citations

15 Claims

1. At least one computer-readable storage device encoded with a speech synthesis program which causes a system for synthesizing speech from text to perform:
- determining a first speech segment sequence corresponding to an input text, by selecting speech segments from a speech segment database according to a first cost calculated based at least in part on a statistical model stochastically representing frequency slope variations, wherein each segment in the first speech segment sequence is to be used in generating speech corresponding to the input text;
  
  determining prosody modification values for the first speech segment sequence, after the first speech segment sequence is selected, by using a second cost calculated based at least in part on the statistical model stochastically representing frequency slope variations, wherein the first cost is different from the second cost; and
  
  applying the determined prosody modification values to the first speech segment sequence to produce a second speech segment sequence having a same number of speech segments as the first speech segment sequence and whose prosodic characteristics are different from prosodic characteristics of the first speech segment sequence, wherein the second cost for determining the prosody modification values includes a sum of an absolute frequency likelihood cost, a frequency slope likelihood cost, a frequency linear approximation error cost, and a prosody modification cost.

2. The at least one computer-readable storage device of claim 1, wherein the first cost for determining the first speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
3. The at least one computer-readable storage device of claim 1, wherein the statistical model uses a decision tree and a Gaussian mixture model.
4. The at least one computer-readable storage device of claim 3, wherein the Gaussian mixture model associates features of speech segments with respective frequency slope values.
5. The at least one computer-readable storage device of claim 1, wherein the program further causes the system to increase the prosody modification cost of at least one continuous speech segment in the first speech segment sequence having a slope likelihood greater than a given value.
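
The second cost recited in claim 1 can be written out as a formula. The symbols below are illustrative assumptions and not the patent's notation: s denotes the first speech segment sequence and m the sequence of prosody modification values.

```latex
C_{2}(\mathbf{s}, \mathbf{m})
  = C_{\mathrm{abs}}(\mathbf{s}, \mathbf{m})
  + C_{\mathrm{slope}}(\mathbf{s}, \mathbf{m})
  + C_{\mathrm{lin}}(\mathbf{s}, \mathbf{m})
  + C_{\mathrm{mod}}(\mathbf{m})
```

Here the four terms are the absolute frequency likelihood cost, the frequency slope likelihood cost, the frequency linear approximation error cost, and the prosody modification cost. The claim says the second cost "includes" this sum, so further terms or weights are not excluded. The first cost of claim 2 shares the first three terms, adds spectrum continuity, duration error, and volume error costs, and has no modification term.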

6. A speech synthesis method for synthesizing speech from text by computer processing, the method comprising:
- determining a first speech segment sequence corresponding to an input text, by selecting speech segments from a speech segment database according to a first cost calculated based at least in part on a statistical model stochastically representing frequency slope variations, wherein each segment in the first speech segment sequence is to be used in generating speech corresponding to the input text;
  
  determining prosody modification values for the first speech segment sequence, after the first speech segment sequence is selected, by using a second cost calculated based at least in part on the statistical model stochastically representing frequency slope variations, wherein the first cost is different from the second cost; and
  
  applying the determined prosody modification values to the first speech segment sequence to produce a second speech segment sequence having a same number of speech segments as the first speech segment sequence and whose prosodic characteristics are different from prosodic characteristics of the first speech segment sequence, wherein the second cost for determining the prosody modification values includes a sum of an absolute frequency likelihood cost, a frequency slope likelihood cost, a frequency linear approximation error cost, and a prosody modification cost.

7. The method of claim 6, wherein the first cost for determining the first speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
8. The method of claim 6, wherein the statistical model uses a decision tree and a Gaussian mixture model.
9. The method of claim 8, wherein the Gaussian mixture model associates features of speech segments with respective frequency slope values.
10. The method of claim 6, wherein the method further comprises increasing the prosody modification cost of at least one continuous speech segment in the first speech segment sequence having a slope likelihood greater than a given value.
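
One possible reading of claims 8 and 9 is that a decision tree maps segment features to a leaf, and each leaf holds a Gaussian mixture over frequency slope values. The sketch below follows that reading; the feature name `accented`, the leaf labels, the mixture parameters, and the use of a negative log-likelihood as the slope likelihood cost are all hypothetical choices, not details from the patent.

```python
import math


def gmm_neg_log_likelihood(x, components):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture.

    components: list of (weight, mean, std) tuples whose weights sum to 1.
    """
    density = sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in components
    )
    return -math.log(density)


def leaf_for(features):
    """Hypothetical one-question decision tree over segment features."""
    return "rising" if features["accented"] else "flat"


# Hypothetical per-leaf mixtures over F0 slope (Hz per segment).
GMMS = {
    "rising": [(1.0, 10.0, 5.0)],
    "flat": [(0.5, 0.0, 5.0), (0.5, -5.0, 5.0)],
}


def slope_likelihood_cost(features, slope):
    """Lower cost means the observed slope is more likely for these features."""
    return gmm_neg_log_likelihood(slope, GMMS[leaf_for(features)])


likely = slope_likelihood_cost({"accented": True}, 10.0)
unlikely = slope_likelihood_cost({"accented": True}, -10.0)
print(likely < unlikely)
```

A slope that matches the leaf's mixture gets a lower cost than one that contradicts it, which is the property both searches would rely on.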

11. A speech synthesis system for synthesizing speech from text, the system comprising:
- at least one processor configured to:
  
  determine a first speech segment sequence corresponding to an input text, by selecting speech segments from a speech segment database according to a first cost calculated based at least in part on a statistical model stochastically representing frequency slope variations, wherein each segment in the first speech segment sequence is to be used in generating speech corresponding to the input text;
  
  determine prosody modification values for the first speech segment sequence, after the first speech segment sequence is selected, by using a second cost calculated based at least in part on the statistical model stochastically representing frequency slope variations, wherein the first cost is different from the second cost; and
  
  apply the determined prosody modification values to the first speech segment sequence to produce a second speech segment sequence having a same number of speech segments as the first speech segment sequence and whose prosodic characteristics are different from prosodic characteristics of the first speech segment sequence, wherein the second cost for determining the prosody modification values includes a sum of an absolute frequency likelihood cost, a frequency slope likelihood cost, a frequency linear approximation error cost, and a prosody modification cost.

12. The system of claim 11, wherein the first cost for determining the first speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
13. The system of claim 11, wherein the statistical model uses a decision tree and a Gaussian mixture model.
14. The system of claim 13, wherein the Gaussian mixture model associates features of speech segments with respective frequency slope values.
15. The system of claim 11, wherein the at least one processor is further configured to increase the prosody modification cost of at least one continuous speech segment in the first speech segment sequence having a slope likelihood greater than a given value.
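
Claims 5, 10, and 15 raise the prosody modification cost for continuous segments whose slope likelihood exceeds a given value. A minimal sketch of that weighting follows. It assumes that "continuous" means the segment was adjacent to its neighbour in the source recording, that the slope likelihood is available as a number per segment, and that the increase is a fixed multiplier; the function names, `boost`, and `threshold` are hypothetical.

```python
def modification_weights(slope_likelihoods, continuous, threshold, base=1.0, boost=4.0):
    """Per-segment weight applied to the prosody modification cost.

    Segments that are continuous and already have a high slope likelihood get a
    larger weight, so the search is discouraged from modifying them.
    """
    return [
        base * boost if is_cont and likelihood > threshold else base
        for likelihood, is_cont in zip(slope_likelihoods, continuous)
    ]


def modification_cost(mods, weights):
    """Weighted sum of absolute modification amounts."""
    return sum(w * abs(m) for w, m in zip(weights, mods))


weights = modification_weights(
    slope_likelihoods=[0.9, 0.2, 0.8],
    continuous=[True, True, False],
    threshold=0.5,
)
cost = modification_cost([5.0, -5.0, -5.0], weights)
print(weights, cost)
```

Only the first segment is both continuous and above the threshold, so only its modification is penalised more heavily.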

Current Assignee: Cerence Operating Company (Cerence Inc.)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Tachibana, Ryuki; Nishimura, Masafumi
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Sirjani, Fariba
Application Number: US13/731,268
Publication Number: US 20130268275A1
Time in Patent Office: 1,156 Days
Field of Search: 704/260
US Class Current: 1/1
CPC Class Codes:
G10L 13/00   Speech synthesis; Text to s...
G10L 13/07   Concatenation rules
G10L 13/10   Prosody rules derived from ...
