SPEECH SYNTHESIS SYSTEM, SPEECH SYNTHESIS PROGRAM PRODUCT, AND SPEECH SYNTHESIS METHOD

US 20090070115A1
Filed: 08/15/2008
Published: 03/12/2009
Est. Priority Date: 09/07/2007
Status: Active Grant

First Claim

1. A speech synthesis system for synthesizing speech from text, comprising:

a speech segment database for storing data of speech segments having prosody information;

means for entering a text to be speech-synthesized;

means for determining a speech segment sequence corresponding to the input text from the speech segment database so as to minimize a cost including at least a frequency slope likelihood cost on the basis of a statistical model of prosody variations;

means for determining prosody modification values so as to minimize a cost including at least the frequency slope likelihood cost and a prosody modification cost on the basis of the statistical model of prosody variations regarding the determined speech segment sequence; and

means for applying the determined prosody modification values to the determined speech segment sequence.

8 Assignments

0 Petitions

Abstract

An objective of the present invention is to provide waveform concatenation speech synthesis that exploits its advantage of high sound quality when a large quantity of speech segments is available, while still producing accurate accents in other cases. Prosody with both high accuracy and high sound quality is achieved by a two-path search consisting of a speech segment search and a prosody modification value search. In the preferred embodiment, an accurate accent is secured by evaluating the consistency of the prosody with a statistical model of prosody variations (the slope of the fundamental frequency) in both paths, namely the speech segment selection and the modification value search. The prosody modification value search looks for the sequence of prosody modification values that minimizes a modified prosody cost. This yields a modification value sequence that raises the likelihood of the absolute values or variations of the prosody under the statistical model as far as possible while keeping the modification values as small as possible.
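The two searches described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each segment is summarized by a single F0 value, that the statistical model supplies one Gaussian (mean, variance) for the F0 slope at each segment boundary, and that the modification cost is proportional to the absolute F0 shift. The names `slope_cost`, `viterbi`, `two_pass_prosody`, `deltas`, and `mod_weight` are all hypothetical.

```python
import math

def slope_cost(slope, mean, var):
    """Negative log-likelihood of an F0 slope under one Gaussian (assumed model)."""
    return 0.5 * math.log(2 * math.pi * var) + (slope - mean) ** 2 / (2 * var)

def viterbi(states, pair_cost, node_cost=lambda i, s: 0.0):
    """Choose one state per position, minimizing the sum of node costs and
    the pair costs between adjacent positions (dynamic programming)."""
    best = [node_cost(0, s) for s in states[0]]
    back = []  # back[i - 1][j]: best predecessor index for state j at position i
    for i in range(1, len(states)):
        new_best, ptrs = [], []
        for b in states[i]:
            costs = [best[k] + pair_cost(i - 1, a, b)
                     for k, a in enumerate(states[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + node_cost(i, b))
            ptrs.append(k_min)
        best = new_best
        back.append(ptrs)
    # Trace the cheapest path back from the last position.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for ptrs in reversed(back):
        k = ptrs[k]
        path.append(k)
    path.reverse()
    return [states[i][k] for i, k in enumerate(path)]

def two_pass_prosody(candidate_f0, slope_model, deltas, mod_weight):
    """candidate_f0[i]: F0 values of the candidate segments for position i.
    slope_model[i]: (mean, variance) of the F0 slope between positions i and i + 1.
    deltas: allowed F0 shifts; mod_weight: penalty per unit of shift."""
    # First search: segment sequence minimizing the slope likelihood cost.
    selected = viterbi(
        candidate_f0,
        lambda i, a, b: slope_cost(b - a, *slope_model[i]),
    )
    # Second search: shifts minimizing slope likelihood cost plus modification cost.
    chosen = viterbi(
        [deltas] * len(selected),
        lambda i, da, db: slope_cost(
            (selected[i + 1] + db) - (selected[i] + da), *slope_model[i]),
        lambda i, d: mod_weight * abs(d),
    )
    # Apply the chosen modification values to the selected sequence.
    modified = [f0 + d for f0, d in zip(selected, chosen)]
    return selected, modified
```

Calling `two_pass_prosody` with a few candidates per position and a small grid of shifts returns both the selected F0 sequence and the modified one, so the effect of the second search can be compared against the first. A larger `mod_weight` keeps the selected segments closer to their recorded prosody.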

15 Claims

1. A speech synthesis system for synthesizing speech from text, comprising:
- a speech segment database for storing data of speech segments having prosody information;
  
  means for entering a text to be speech-synthesized;
  
  means for determining a speech segment sequence corresponding to the input text from the speech segment database so as to minimize a cost including at least a frequency slope likelihood cost on the basis of a statistical model of prosody variations;
  
  means for determining prosody modification values so as to minimize a cost including at least the frequency slope likelihood cost and a prosody modification cost on the basis of the statistical model of prosody variations regarding the determined speech segment sequence; and
  
  means for applying the determined prosody modification values to the determined speech segment sequence.
  - 2. A speech synthesis system according to claim 1, further comprising means for increasing the prosody modification cost of continuous speech segments having a slope likelihood greater than a given value before determining the prosody modification values in response to detection of the continuous speech segments in the speech segment sequence.
  - 3. A speech synthesis system according to claim 1, wherein the cost for determining the speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
  - 4. A speech synthesis system according to claim 1, wherein the cost for determining the prosody modification values includes the absolute frequency likelihood cost, the frequency slope likelihood cost, the frequency linear approximation error cost, and the prosody modification cost.
  - 5. A speech synthesis system according to claim 1, wherein the statistical model uses a decision tree and Gaussian mixture models.
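Claim 5 names a decision tree and Gaussian mixture models as the statistical model. One possible reading, sketched below, is that the tree maps a segment's linguistic context to a leaf holding a mixture over F0 slopes, and the slope likelihood cost is the negative log-likelihood under that mixture. The tree layout, the context questions (`is_accented`, `is_phrase_final`), and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def gmm_neg_log_likelihood(x, mixture):
    """Negative log-likelihood of x under a 1-D Gaussian mixture.
    mixture: list of (weight, mean, variance) tuples."""
    log_terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for w, mean, var in mixture
    ]
    # Log-sum-exp keeps the computation stable for small likelihoods.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))

# Hypothetical tree: inner nodes ask a yes/no question about the context,
# leaves hold a mixture over F0 slopes.
TREE = {
    "question": "is_accented",
    "yes": {
        "question": "is_phrase_final",
        "yes": {"gmm": [(1.0, -30.0, 100.0)]},
        "no": {"gmm": [(0.7, 20.0, 64.0), (0.3, 5.0, 100.0)]},
    },
    "no": {"gmm": [(1.0, 0.0, 49.0)]},
}

def find_leaf_gmm(tree, context):
    """Walk the tree using boolean context features until a leaf is reached."""
    node = tree
    while "question" in node:
        node = node["yes"] if context[node["question"]] else node["no"]
    return node["gmm"]

def slope_likelihood_cost(slope, context, tree=TREE):
    """Cost of an observed F0 slope given the segment's context."""
    return gmm_neg_log_likelihood(slope, find_leaf_gmm(tree, context))
```

A slope close to a component mean of the selected leaf gets a lower cost than one far from every component, which is the behavior the two searches rely on.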

6. A speech synthesis program product which causes a system for synthesizing speech from text, the system storing a speech segment database which holds data of speech segments having prosody information, to perform the steps of:
- entering the text to be speech-synthesized;
  
  determining a speech segment sequence corresponding to the input text from the speech segment database so as to minimize a cost including at least a frequency slope likelihood cost on the basis of a statistical model of prosody variations;
  
  determining prosody modification values so as to minimize a cost including at least the frequency slope likelihood cost and a prosody modification cost on the basis of the statistical model of prosody variations regarding the determined speech segment sequence; and
  
  applying the determined prosody modification values to the determined speech segment sequence.
  - 7. A program product according to claim 6, further comprising the step of increasing the prosody modification cost of continuous speech segments having a slope likelihood greater than a given value in the speech segment sequence before determining the prosody modification values in response to detection of the continuous speech segments.
  - 8. A program product according to claim 6, wherein the cost for determining the speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
  - 9. A program product according to claim 6, wherein the cost for determining the prosody modification values includes the absolute frequency likelihood cost, the frequency slope likelihood cost, the frequency linear approximation error cost, and the prosody modification cost.
  - 10. A program product according to claim 6, wherein the statistical model uses a decision tree and a Gaussian mixture model.
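Claims 8 and 9 list which component costs enter each search. The claims do not say how the components are combined; the sketch below assumes a weighted sum with default weights of 1.0, and the identifier names are illustrative spellings of the claimed terms.

```python
# Terms named in claim 8 for the speech segment search.
SEGMENT_SEARCH_TERMS = (
    "spectrum_continuity",
    "duration_error",
    "volume_error",
    "absolute_frequency_likelihood",
    "frequency_slope_likelihood",
    "frequency_linear_approximation_error",
)

# Terms named in claim 9 for the prosody modification value search.
MODIFICATION_SEARCH_TERMS = (
    "absolute_frequency_likelihood",
    "frequency_slope_likelihood",
    "frequency_linear_approximation_error",
    "prosody_modification",
)

def combined_cost(term_costs, terms, weights=None):
    """Weighted sum of the listed terms (the weighting scheme is an assumption).
    term_costs: mapping from term name to its cost value.
    weights: optional mapping from term name to weight; missing names use 1.0."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * term_costs[name] for name in terms)
```

The two term tuples share the three frequency-related costs, so the same `term_costs` mapping can be evaluated for either search by passing a different tuple.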

11. A speech synthesis method for synthesizing speech from text by computer processing, comprising the steps of:
- entering the text to be speech-synthesized;
  
  determining a speech segment sequence corresponding to the input text from a speech segment database including speech segment data having prosody information so as to minimize a cost including at least a frequency slope likelihood cost on the basis of a statistical model of prosody variations;
  
  determining prosody modification values so as to minimize a cost including at least the frequency slope likelihood cost and a prosody modification cost on the basis of the statistical model of prosody variations regarding the determined speech segment sequence; and
  
  applying the determined prosody modification values to the determined speech segment sequence.
  - 12. A speech synthesis method according to claim 11, further comprising the step of increasing the prosody modification cost of continuous speech segments having a slope likelihood greater than a given value in the speech segment sequence before determining the prosody modification values in response to detection of the continuous speech segments.
  - 13. A speech synthesis method according to claim 11, wherein the cost for determining the speech segment sequence includes a spectrum continuity cost, a duration error cost, a volume error cost, an absolute frequency likelihood cost, a frequency slope likelihood cost, and a frequency linear approximation error cost.
  - 14. A speech synthesis method according to claim 11, wherein the cost for determining the prosody modification values includes the absolute frequency likelihood cost, the frequency slope likelihood cost, the frequency linear approximation error cost, and the prosody modification cost.
  - 15. A speech synthesis method according to claim 11, wherein the statistical model uses a decision tree and a Gaussian mixture model.
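Claims 2, 7, and 12 raise the prosody modification cost of continuous speech segments whose slope likelihood exceeds a given value before the modification values are searched. The sketch below reads "continuous" as "adjacent in the same source recording", which is an assumption, as are the `Segment` fields, the multiplicative `boost`, and the per-segment weight representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    utterance_id: str  # hypothetical: which recording the segment came from
    index: int         # hypothetical: position of the segment in that recording

def is_continuous(a, b):
    """True when b directly follows a in the same recording (assumed meaning)."""
    return a.utterance_id == b.utterance_id and b.index == a.index + 1

def boost_modification_weights(segments, slope_likelihoods, threshold,
                               base_weight=1.0, boost=10.0):
    """Return one modification-cost weight per selected segment.
    slope_likelihoods[i] is the slope likelihood between segments i and i + 1.
    Pairs that are continuous and already likely get a raised weight, so the
    later modification search tends to leave them unmodified."""
    weights = [base_weight] * len(segments)
    for i in range(len(segments) - 1):
        if is_continuous(segments[i], segments[i + 1]) and slope_likelihoods[i] > threshold:
            weights[i] = weights[i + 1] = base_weight * boost
    return weights
```

The returned weights could then scale the per-segment prosody modification cost in the second search, which protects runs of naturally recorded speech from unnecessary modification.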

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
International Business Machines Corporation
Inventors
Nishimura, Masafumi; Tachibana, Ryuki

Granted Patent

US 8,370,149 B2
US Class Current

704/260
CPC Class Codes

G10L 13/00   Speech synthesis; Text to s...

G10L 13/07   Concatenation rules

G10L 13/10   Prosody rules derived from ...
