Text to speech processing system and method, and an acoustic model training system and method
Abstract
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
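The abstract's central step, jointly clustering unlabeled speech data and estimating per-cluster parameters under a single maximum likelihood criterion, can be illustrated with a minimal expectation-maximisation (EM) sketch. This is not the patent's implementation: it assumes one-dimensional features and two Gaussian clusters, and all names are hypothetical.

```python
import math

def em_cluster(data, iters=50):
    """Jointly cluster unlabeled 1-D data into two clusters and estimate
    per-cluster Gaussian parameters, with both the (soft) clustering and
    the parameter estimates driven by one maximum likelihood criterion.
    Illustrative sketch only."""
    # Deterministic initialisation: one mean at each data extreme.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each cluster for each point
        # (this is the "clustering" of the unlabeled data).
        resp = []
        for x in data:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate parameters to increase the same likelihood.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(1e-4, sum(r[j] * (x - means[j]) ** 2
                                         for r, x in zip(resp, data)) / nj)
            weights[j] = nj / len(data)
    return means, variances, weights
```

In the patent's setting the clusters would range over values of a speech factor (e.g. speaker or expression) and the estimated quantities would be cluster-dependent acoustic model parameters, but the joint maximum-likelihood structure is the same.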
9 Claims
1. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising:
inputting audio data with said target value of a speech factor;
adapting an acoustic model to said target value of a speech factor;
inputting text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
outputting said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a set of speech factor parameters that enable the acoustic model to accommodate speech for the different values of a speech factor,
and wherein said set of speech factor parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor to which they relate is unknown,
and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown;
clustering said speech data according to the value of said speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
Dependent claims: 2 and 8.
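The pipeline claimed above (adapt the model to a target, divide text into acoustic units, convert units to speech vectors, output the result) can be sketched as a toy. Everything here is a hypothetical stand-in: acoustic units are letters, the "acoustic model" shifts each unit by a weighted combination of per-cluster offsets in the spirit of cluster-adaptive training, and adaptation simply sets those (unlabeled) cluster weights.

```python
def divide_into_acoustic_units(text):
    """Divide the inputted text into a sequence of acoustic units
    (toy stand-in: lower-case letters)."""
    return [c for c in text.lower() if c.isalpha()]

def adapt_weights(target_weights):
    """Adapt the acoustic model to the target value of the speech factor
    by adjusting its cluster interpolation weights (hypothetical)."""
    return list(target_weights)

def units_to_speech_vectors(units, cluster_offsets, weights):
    """Convert acoustic units into 'speech vectors': each unit's base
    value is shifted by a weighted combination of per-cluster offsets."""
    vectors = []
    for u in units:
        base = float(ord(u))
        vectors.append(base + sum(w * o for w, o in zip(weights, cluster_offsets)))
    return vectors

# End-to-end: adapt, divide, convert.
weights = adapt_weights([0.25, 0.75])
units = divide_into_acoustic_units("Hi!")
vectors = units_to_speech_vectors(units, cluster_offsets=[0.0, 2.0],
                                  weights=weights)
```

A real system would use phonemes or context-dependent states as acoustic units and a statistical acoustic model (e.g. HMM-based) to produce the speech vectors; the sketch only mirrors the claimed sequence of steps.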
3. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.
Dependent claims: 4 and 9.
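Claim 3 determines expression parameters by extracting expressive features from the input text into a linguistic feature vector (first space) and mapping it into a synthesis feature vector (second space). A toy sketch of that two-stage step is below; the particular features and the linear map are assumptions for illustration, not the patent's method.

```python
def extract_expressive_features(text):
    """Form an expressive linguistic feature vector from input text
    (hypothetical features: '!' rate, '?' rate, upper-case rate)."""
    n = max(1, len(text))
    return [text.count("!") / n,
            text.count("?") / n,
            sum(c.isupper() for c in text) / n]

def map_to_synthesis_space(linguistic_vec, transform):
    """Map the expressive linguistic feature vector (first space) into an
    expressive synthesis feature vector (second space); here the mapping
    is a plain linear transform."""
    return [sum(row[i] * linguistic_vec[i] for i in range(len(linguistic_vec)))
            for row in transform]

# A 2-row transform takes the 3-D linguistic space to a 2-D synthesis space.
feats = extract_expressive_features("Stop!")
synth = map_to_synthesis_space(feats, transform=[[1.0, 1.0, 0.0],
                                                 [0.0, 0.0, 1.0]])
```

The two spaces need not have the same dimensionality, which is why the mapping, rather than the raw linguistic features, is what feeds the synthesis side.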
5. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said second set of parameters by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said text-to-speech method includes training said acoustic model using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering, and wherein said first and second set of parameters and said first and second set of clusters do not overlap.
6. A system configured to output speech having a target value of a speech factor, said system comprising:
an audio input for receiving audio data with said target value of a speech factor;
a text input for receiving text; and
a processor configured to:
adapt an acoustic model to said target value of a speech factor;
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
output said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a first set of parameters that enable the acoustic model to accommodate speech for the different values of the speech factor, and wherein said first set of parameters are unlabeled, such that for a given one or more parameters, the value of said speech factor is unknown,
and wherein adapting the acoustic model comprises adjusting the speech factor parameters to substantially match the target value of a speech factor,
wherein said acoustic model is trained using a method comprising:
receiving speech data, said speech data comprising data corresponding to different values of the speech factor, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said speech factor is unknown;
clustering said speech data according to the value of said speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
7. A text-to-speech system, the system comprising:
an input for receiving input text; and
a processor configured to:
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a first set of parameters and a first set of clusters relating to speaker voice and a second set of parameters and a second set of clusters relating to expression, and wherein the first and second set of parameters and the first and second set of clusters do not overlap;
output said sequence of speech vectors as audio; and
determine at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space,
wherein said acoustic model is trained using a method comprising:
receiving speech data, said speech data further comprising speech data from one or more speakers speaking with neutral speech, said speech data comprising data corresponding to different values of a first speech factor, wherein the first speech factor is speaker, and speech data corresponding to different values of a second speech factor, wherein the second speech factor is expression, and wherein said speech data is unlabeled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters and clustering said speech data according to the value of said second speech factor into a second set of clusters; and
estimating a first set and a second set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor and the second speech factor respectively,
wherein said clustering and the parameter estimation are jointly performed according to a common maximum likelihood criterion which is common to both parameter estimation and said clustering.
Specification