Speech synthesizer, speech synthesizing method and program product

US 8,494,856 B2
Filed: 10/12/2011
Issued: 07/23/2013
Est. Priority Date: 04/15/2009
Status: Expired due to Fees

Abstract

According to one embodiment, a speech synthesizer includes an analyzer, a first estimator, a selector, a generator, a second estimator, and a synthesizer. The analyzer analyzes text and extracts a linguistic feature. The first estimator selects a first prosody model adapted to the linguistic feature and estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model. The selector selects speech units that minimize a cost function determined in accordance with the prosody information. The generator generates a second prosody model that is a model of the prosody information of the speech units. The second estimator estimates prosody information that maximizes a third likelihood calculated on the basis of the first likelihood and a second likelihood representing probability of the second prosody model. The synthesizer generates synthetic speech by concatenating the speech units on the basis of the prosody information estimated by the second estimator.
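
The second estimator's step can be illustrated with a minimal sketch. It assumes, purely for illustration, that both prosody models are one-dimensional Gaussians over a single prosody value (for example a pitch target) and that the "third likelihood" is a weighted sum of the two log-likelihoods. The names `Gaussian`, `coupled_log_likelihood`, `reestimate`, `w1`, and `w2` are hypothetical and do not come from the patent. The sketch also differentiates with respect to the prosody value itself, which is a simplification of the claim wording.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    """Hypothetical 1-D prosody model: mean and variance of one prosody value."""
    mean: float
    var: float

    def log_likelihood(self, x: float) -> float:
        return -0.5 * (math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var)


def coupled_log_likelihood(x: float, first: Gaussian, second: Gaussian,
                           w1: float, w2: float) -> float:
    """Linear coupling of the first and second log-likelihoods (assumed form)."""
    return w1 * first.log_likelihood(x) + w2 * second.log_likelihood(x)


def reestimate(first: Gaussian, second: Gaussian,
               w1: float = 0.5, w2: float = 0.5) -> float:
    """Return the x that maximizes the coupled log-likelihood.

    Setting the derivative with respect to x to zero gives a
    precision-weighted average of the two model means.
    """
    p1 = w1 / first.var
    p2 = w2 / second.var
    return (p1 * first.mean + p2 * second.mean) / (p1 + p2)


if __name__ == "__main__":
    first = Gaussian(mean=100.0, var=1.0)   # stands in for the first prosody model
    second = Gaussian(mean=110.0, var=4.0)  # stands in for the model built from speech units
    print(reestimate(first, second))
```

Under these assumptions the re-estimated value lies between the two model means and is pulled toward the model with the smaller variance, which is one way to read the idea of reconciling the predicted prosody with the prosody of the speech units actually available.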

5 Claims

1. A speech synthesizer comprising:
- a processor;
  
  an analyzer that performs a text analysis of an input document and extracts a linguistic feature used for prosody control;
  
  a first estimator that selects a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information and that estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model;
  
  a selector that selects, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated by the first estimator;
  
  a generator that generates a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
  
  a second estimator that re-estimates prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model; and
  
  a synthesizer that generates synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated by the second estimator, wherein the processor executes at least one of the analyzer, the first estimator, the selector, the generator, the second estimator, and the synthesizer.

2. The speech synthesizer according to claim 1, wherein
- the selector newly selects the candidates of the speech unit string that minimize the cost function determined in accordance with the prosody information estimated by the second estimator, and
  
  the synthesizer generates synthetic speech by concatenating the speech units included in the newly selected candidates on the basis of the prosody information estimated by the second estimator.

3. The speech synthesizer according to claim 2, wherein
- the generator further generates the second prosody model of the speech units included in the newly selected candidates,
  
  the second estimator further estimates prosody information that maximizes the third likelihood calculated by linearly coupling the second likelihood of the second prosody model generated from the speech units included in the newly selected candidates and the first likelihood, and
  
  the synthesizer generates synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated by the second estimator when the number of estimations of prosody information performed by the second estimator exceeds a predetermined threshold.
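
The select, model, and re-estimate cycle described in claims 1 to 3 can be sketched as a loop. This is a hedged illustration only: it handles a single prosody value for a single speech unit position, uses an absolute-difference cost, and assumes Gaussian models with equal coupling weights. The function names, the `inventory` list, and the parameters `n`, `w1`, `w2`, and `threshold` are hypothetical and are not taken from the patent.

```python
from statistics import fmean, pvariance


def select_candidates(inventory, target, n):
    """Pick the n stored values with the lowest cost (assumed cost: |value - target|)."""
    return sorted(inventory, key=lambda value: abs(value - target))[:n]


def fit_second_model(candidates, var_floor=1e-3):
    """Fit a mean and a floored variance to the selected candidates."""
    return fmean(candidates), max(pvariance(candidates), var_floor)


def reestimate(m1, v1, m2, v2, w1, w2):
    """Maximize the weighted sum of two Gaussian log-likelihoods in closed form."""
    p1, p2 = w1 / v1, w2 / v2
    return (p1 * m1 + p2 * m2) / (p1 + p2)


def iterate_prosody(inventory, m1, v1, n=3, w1=0.5, w2=0.5, threshold=2):
    """Alternate selection and re-estimation until the count exceeds the threshold."""
    target = m1  # first estimate: the maximum of the first model
    candidates = select_candidates(inventory, target, n)
    estimations = 0
    while estimations <= threshold:
        m2, v2 = fit_second_model(candidates)             # second prosody model
        target = reestimate(m1, v1, m2, v2, w1, w2)       # second estimator
        estimations += 1
        candidates = select_candidates(inventory, target, n)  # new selection
    return target, candidates


if __name__ == "__main__":
    stored = [90.0, 95.0, 100.0, 118.0, 120.0, 122.0, 150.0]
    print(iterate_prosody(stored, m1=115.0, v1=25.0))
```

In this toy setup the returned target moves from the first model's mean toward the values actually present in the inventory, and the returned candidates are the ones that would be handed to the concatenation stage.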

4. A speech synthesis method comprising:
- performing a text analysis of an input document and extracting a linguistic feature used for prosody control;
  
  selecting a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information, and first estimating in which prosody information that maximizes a first likelihood representing probability of the selected first prosody model is estimated;
  
  selecting, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated in the first estimating;
  
  generating a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
  
  second estimating in which prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model is estimated; and
  
  generating synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated in the second estimating.

5. Non-transitory computer readable medium including programmed instructions, wherein the instructions, when executed by a computer, cause the computer to perform:
- performing a text analysis of an input document and extracting a linguistic feature used for prosody control;
  
  selecting a first prosody model adapted to the extracted linguistic feature from predetermined first prosody models that are models of speech prosody information, and first estimating in which prosody information that maximizes a first likelihood representing probability of the selected first prosody model is estimated;
  
  selecting, from a speech unit storage storing speech units, a plurality of candidates of a speech unit string that minimizes a cost function determined in accordance with the prosody information estimated in the first estimating;
  
  generating a second prosody model that is a statistical model of prosody information of the speech unit included in the selected candidates, for each speech unit;
  
  second estimating in which prosody information that maximizes a third likelihood by differentiating the third likelihood with respect to a parameter of the second prosody model, the third likelihood being calculated by linearly coupling the first likelihood and a second likelihood representing probability of the second prosody model is estimated; and
  
  generating synthetic speech by concatenating the speech units included in the selected candidates on the basis of the prosody information estimated in the second estimating.

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Latorre, Javier; Akamine, Masami
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Sharma, Neeraj
Application Number: US13/271,321
Publication Number: US 20120089402A1
Time in Patent Office: 650 Days
Field of Search: 704/260, 704/3, 704/244, 704/250, 704/264, 704/268, 345/440.1
US Class Current: 704/260
CPC Class Codes: G10L 13/10 Prosody rules derived from ...
