SPEECH PROCESSING SYSTEM AND METHOD
Abstract
A method of training an acoustic model for a text-to-speech system, the method comprising: receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabelled, such that for a given item of speech data, the value of said first speech factor is unknown; clustering said speech data according to the value of said first speech factor into a first set of clusters; and estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor, wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
24 Claims
1. A method of training an acoustic model for a text-to-speech system, the method comprising:
receiving speech data, said speech data comprising data corresponding to different values of a first speech factor, and wherein said speech data is unlabelled, such that for a given item of speech data, the value of said first speech factor is unknown;
clustering said speech data according to the value of said first speech factor into a first set of clusters; and
estimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor,
wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 22)
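The joint clustering and parameter estimation of claim 1 can be illustrated with expectation-maximisation: the E-step soft-clusters the unlabelled data and the M-step re-estimates the parameters, and both steps increase one shared log-likelihood. The following is a minimal one-dimensional Gaussian-mixture sketch; the Gaussian form, the function name, and the quantile initialisation are illustrative assumptions, not the patent's model.

```python
import numpy as np

def train_joint(data, n_clusters, n_iters=50):
    # Hypothetical sketch: clustering (E-step) and parameter estimation
    # (M-step) jointly maximise one common log-likelihood, as in claim 1.
    means = np.quantile(data, np.linspace(0.2, 0.8, n_clusters))
    var = np.full(n_clusters, np.var(data)) + 1e-6
    weights = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iters):
        # E-step: soft-cluster the unlabelled data over the latent factor
        log_p = (-0.5 * (data[:, None] - means) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(weights))
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate parameters under the same ML criterion
        nk = resp.sum(axis=0)
        means = (resp * data[:, None]).sum(axis=0) / nk
        var = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        weights = nk / len(data)
    return means, var, weights, float(log_norm.sum())
```

Because both steps optimise the same criterion, the returned log-likelihood is non-decreasing over iterations, which is the sense in which clustering and estimation are "jointly performed".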
14. A text-to-speech method configured to output speech having a target value of a speech factor, said method comprising:
inputting audio data with said target value of a speech factor;
adapting an acoustic model to said target value of a speech factor;
inputting text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
outputting said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a set of speech factor parameters relating to said speech factor, and a set of speech factor clusters relating to said speech factor, and wherein said set of speech factor parameters and said set of speech factor clusters relating to said speech factor are unlabelled, such that for a given one or more clusters and a given one or more parameters, the value of said speech factor to which they relate is unknown.
(Dependent claims: 15, 23)
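The adapt-then-synthesise flow of claim 14 can be sketched with a toy cluster-adaptive model: the unlabelled clusters carry per-unit mean vectors, adaptation fits interpolation weights over those clusters from the target-value audio, and synthesis reuses the fitted weights for new text. All names (`ClusterAdaptiveModel`, `adapt`, `synthesize`), the character-level acoustic units, and the least-squares weight fit are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

class ClusterAdaptiveModel:
    # Toy cluster-adaptive acoustic model (illustrative). Each acoustic
    # unit maps to an array of shape (n_clusters, dim); the model output
    # is a weighted interpolation over the unlabelled clusters.
    def __init__(self, cluster_means):
        self.cluster_means = cluster_means
        n = next(iter(cluster_means.values())).shape[0]
        self.weights = np.full(n, 1.0 / n)

    def adapt(self, adaptation_data):
        # Fit cluster weights to (unit, observed vector) pairs from the
        # target-value audio by least squares.
        A = np.vstack([self.cluster_means[u].T for u, _ in adaptation_data])
        b = np.concatenate([v for _, v in adaptation_data])
        self.weights, *_ = np.linalg.lstsq(A, b, rcond=None)

    def synthesize(self, text):
        units = list(text)  # divide inputted text into acoustic units
        # convert units to speech vectors using the adapted weights
        return [self.cluster_means[u].T @ self.weights for u in units]
```

A usage sketch: `model.adapt(pairs)` with target-value audio, then `model.synthesize("ab")` yields speech vectors carrying that target value; a real system would vocode these into audio.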
16. A text-to-speech method, the method comprising:
receiving input text;
dividing said inputted text into a sequence of acoustic units;
converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a set of speaker parameters and a set of speaker clusters relating to speaker voice and a set of expression parameters and a set of expression clusters relating to expression, and wherein the sets of speaker and expression parameters and the sets of speaker and expression clusters do not overlap; and
outputting said sequence of speech vectors as audio,
the method further comprising determining at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space.
(Dependent claims: 17, 24)
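The final two steps of claim 16, a linguistic feature vector constructed in a first space mapped to a synthesis feature vector constructed in a second space, can be sketched under the assumption of an affine map fitted by least squares; the patent does not fix the mapping's form, and the function names are illustrative.

```python
import numpy as np

def fit_expression_mapping(linguistic_feats, synthesis_feats):
    # Least-squares fit of an affine map from the first (linguistic)
    # space to the second (synthesis) space; an illustrative choice.
    X = np.hstack([linguistic_feats, np.ones((len(linguistic_feats), 1))])
    W, *_ = np.linalg.lstsq(X, synthesis_feats, rcond=None)
    return W

def map_expression(W, linguistic_vec):
    # Map one expressive linguistic feature vector into the second space.
    return np.append(linguistic_vec, 1.0) @ W
```

The two spaces may have different dimensionality; here `W` has shape `(d_linguistic + 1, d_synthesis)`, so the map both projects and translates between the spaces.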
19. A system for training an acoustic model for a text-to-speech system, said system comprising:
an input for receiving speech data corresponding to different values of a first speech factor, wherein said speech data is unlabelled, such that, for a given item of data, the value of said first speech factor is unknown; and
a processor configured to:
cluster said speech data according to the value of said first speech factor into a first set of clusters; and
estimate a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor,
wherein said clustering and said first parameter estimation are jointly performed according to a single maximum likelihood criterion which is common to both said first parameter estimation and said clustering into said first set of clusters.
20. A system configured to output speech having a target value of a speech factor, said system comprising:
an input for receiving adaptation data with said target value of a speech factor;
an input for receiving text; and
a processor configured to:
adapt an acoustic model to said target value of a speech factor;
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units into a sequence of speech vectors using said acoustic model; and
output said sequence of speech vectors as audio with said target value of a speech factor,
wherein said acoustic model comprises a first set of parameters relating to said speech factor, and a first set of clusters relating to said speech factor, and wherein said first set of parameters and said first set of clusters relating to said speech factor are unlabelled, such that for a given one or more clusters and a given one or more parameters, the value of said first speech factor is unknown.
21. A text-to-speech system, the system comprising:
an input for receiving input text; and
a processor configured to:
divide said inputted text into a sequence of acoustic units;
convert said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said acoustic model comprises a first set of parameters and a first set of clusters relating to speaker voice and a second set of parameters and a second set of clusters relating to expression, and wherein the first and second sets of parameters and the first and second sets of clusters do not overlap;
output said sequence of speech vectors as audio; and
determine at least some of said parameters relating to expression by:
extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and
mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space.