SPEECH PROCESSING SYSTEM AND METHOD

US 20120041764A1
Filed: 08/10/2011
Published: 02/16/2012
Est. Priority Date: 08/16/2010
Status: Active Grant

First Claim

1. A speech processing method, comprising:

receiving a speech input which comprises a sequence of feature vectors;

determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:

providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and

adapting the acoustic model to the mismatched speech input,

the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and

combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,

wherein adapting the acoustic model to the mismatched speaker input comprises:

relating speech from the mismatched speaker input to the speech used to train the acoustic model using:

a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and

a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:

y=f(F(x,v),u)

where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and

jointly estimating u and v.
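
The composition y=f(F(x,v),u) in the claim can be illustrated with a minimal sketch. The concrete forms below are assumptions chosen for illustration and are not taken from the patent text: F is taken to be a linear transform A x + b (so v = (A, b)), and f is taken to be a log-spectral mismatch log(exp(z + h) + exp(n)) with u = (n, h), where n stands for additive noise and h for convolutional noise. The function names are hypothetical.

```python
import math

def _log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def speaker_transform(x, v):
    """F(x, v): assumed linear speaker transform A x + b, with v = (A, b)."""
    A, b = v
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

def mismatch(z, u):
    """f(z, u): assumed log-spectral mismatch with u = (n, h).

    Each dimension is log(exp(z + h) + exp(n)): the channel h shifts the
    speech, and the additive noise n is summed in the linear domain.
    """
    n, h = u
    return [_log_add(z_i + h_i, n_i) for z_i, n_i, h_i in zip(z, n, h)]

def mismatched_speech(x, v, u):
    """y = f(F(x, v), u): speaker transform first, then environment mismatch."""
    return mismatch(speaker_transform(x, v), u)

# Identity speaker transform, negligible additive noise, channel offset 0.5.
x = [1.0, 2.0]
v = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
u = ([-50.0, -50.0], [0.5, 0.5])
y = mismatched_speech(x, v, u)
```

With negligible additive noise the result reduces to the speaker-transformed speech shifted by the channel term, which is a quick sanity check on the ordering of F and f.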

Abstract

A speech processing method, comprising:

- receiving a speech input which comprises a sequence of feature vectors;
- determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:

y=f(F(x,v),u)

- where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
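
As a rough illustration of what "jointly estimating u and v" can mean, the toy sketch below fits both parameters against the same likelihood instead of fixing one and then fitting the other. Everything in it is an assumption made for illustration: the acoustic model is a single scalar Gaussian N(mu_x, var_x), the speaker transform is a scalar scaling F(x, v) = v * x with v drawn from a discrete grid, and the mismatch function is a pure offset f(z, u) = z + h with u = h (no additive noise). It is not the estimation procedure of the patent, which this record does not spell out.

```python
import math

def gaussian_loglik(ys, mean, var):
    """Total log-likelihood of scalar observations under N(mean, var)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)
               for y in ys)

def joint_estimate(ys, mu_x, var_x, v_grid):
    """Pick (v, h) maximising the likelihood of ys under y = v * x + h.

    Under the toy model the adapted Gaussian is N(v * mu_x + h, v^2 * var_x).
    For each candidate v, the best offset h has a closed form, and the pair
    with the highest likelihood is returned as (v, h, log_likelihood).
    """
    y_bar = sum(ys) / len(ys)
    best = None
    for v in v_grid:
        h = y_bar - v * mu_x                      # ML offset given this v
        ll = gaussian_loglik(ys, v * mu_x + h, v * v * var_x)
        if best is None or ll > best[2]:
            best = (v, h, ll)
    return best

# Synthetic observations generated with v = 1.2 and h = 0.5 from
# training-domain points 1, 3, 1, 3 (mean 2, variance 1).
ys = [1.2 * x + 0.5 for x in (1.0, 3.0, 1.0, 3.0)]
v_hat, h_hat, ll_hat = joint_estimate(ys, mu_x=2.0, var_x=1.0,
                                      v_grid=[0.8, 0.9, 1.0, 1.1, 1.2, 1.3])
```

The point of the sketch is the coupling: the offset that looks best depends on which speaker scaling is assumed, so the two are scored together.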

22 Citations

14 Claims

1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
  
  determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
  
  providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
  
  adapting the acoustic model to the mismatched speech input,
  
  the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
  
  combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
  
  wherein adapting the acoustic model to the mismatched speaker input comprises:
  
  relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
  
  a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
  
  a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
  
  y=f(F(x,v),u)
  
  where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
  
  jointly estimating u and v.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13):
  - 2. A method according to claim 1, wherein said at least one parameter u comprises parameters n and h where n is used to model additive noise and h is used to model convolutional noise.
  - 3. A method according to claim 2, wherein said mismatch function f is of the form:
  - 4. A method according to claim 1, wherein said speaker transform is a linear transform.
  - 5. A method according to claim 4, wherein said speaker transform is a vocal tract length normalisation transform.
  - 6. A method according to claim 5, wherein at least one speaker transform parameter is a discrete parameter.
  - 7. A method according to claim 1, wherein said joint estimation of u and v is performed using the expectation maximisation algorithm.
  - 8. A method according to claim 1, wherein adapting the acoustic model further comprising using PCMLLR with the CMLLR form:
    - p(y|m) = |A_c^(r)| N(A_c^(r) y + b_c^(r); μ_x^(m), Σ_x^(m))
    - where m denotes the m-th mixture in a Gaussian distribution of a Hidden Markov Model with mean μ_x^(m) and variance Σ_x^(m), A_c^(r) and b_c^(r) are CMLLR transforms to be estimated using PCMLLR techniques which minimises the divergence between the CMLLR form and a target distribution, said target distribution being derived from y=f(F(x, v), u).
  - 9. A method according to claim 1, wherein the adaptation is provided in an adaptive training framework.
  - 10. A method according to claim 1, wherein adapting the acoustic model to the mismatched speaker input comprises receiving speech from the mismatched speaker input corresponding to known text.
  - 11. A method according to claim 1, wherein adapting the acoustic model to the mismatched speaker input comprises receiving speech from said new speaker and making a first estimate of the text corresponding to said speech.
  - 13. A carrier medium carrying computer readable instructions for controlling the computer to carry out the method of claim 1.
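
The CMLLR form in claim 8 can be evaluated directly once the transform is known. The sketch below computes log p(y|m) = log|A| + log N(A y + b; μ, Σ) for a single mixture component. It assumes a diagonal covariance Σ (a common simplification, not stated in the claim) and uses hypothetical function names; estimating A and b with PCMLLR is not shown.

```python
import math

def log_abs_det(A):
    """log|det A| by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]          # work on a copy
    n = len(M)
    total = 0.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("singular transform")
        M[k], M[p] = M[p], M[k]        # row swap only flips the sign
        total += math.log(abs(M[k][k]))
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= factor * M[k][j]
    return total

def cmllr_loglik(y, A, b, mean, var):
    """log p(y|m) = log|A| + log N(A y + b; mean, diag(var)).

    The Jacobian term log|A| keeps the density normalised in the space
    of the observed feature y after the affine feature transform.
    """
    z = [sum(a * y_j for a, y_j in zip(row, y)) + b_i
         for row, b_i in zip(A, b)]
    log_gauss = sum(-0.5 * (math.log(2.0 * math.pi * v) + (z_i - mu) ** 2 / v)
                    for z_i, mu, v in zip(z, mean, var))
    return log_abs_det(A) + log_gauss

# Identity transform: plain Gaussian log-density at the mean.
ll_identity = cmllr_loglik([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                           [0.0, 0.0], [1.0, 1.0])
# Scaling one axis by 2 adds log(2) through the Jacobian term.
ll_scaled = cmllr_loglik([0.0, 0.0], [[2.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                         [0.0, 0.0], [1.0, 1.0])
```

Because the transform acts on the feature vector rather than on each Gaussian, one (A, b) pair can be shared by every component in a regression class, which is presumably what the superscript (r) indexes.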

12. A method of adapting an acoustic model for speech processing to a mismatched speech input, said mismatched speech input being received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained, the method comprising:
- receiving a mismatched speech input which comprises a sequence of feature vectors; and
  
  providing an acoustic model for performing speech processing on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector,
  
  wherein adapting the acoustic model to the mismatched speaker input comprises:
  
  relating speech from mismatched speaker input to the speech used to train the acoustic model using:
  
  a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
  
  a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
  
  y=f(F(x,v),u)
  
  where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
  
  jointly estimating u and v.

14. A speech processing system, comprising:
- a receiver for receiving a speech input which comprises a sequence of feature vectors; and
  
  a processor configured to:
  
  determine the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
  
  provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
  
  adapt the acoustic model to the mismatched speech input,
  
  the processor being further configured to determine the likelihood of a sequence of features occurring in a given language using a language model; and
  
  combine the likelihoods determined by the acoustic model and the language model,
  
  the system further comprising an output configured to output a sequence of words identified from said speech input signal,
  
  wherein adapting the acoustic model to the mismatched speaker input comprises:
  
  relating speech from mismatched speaker input to the speech used to train the acoustic model using:
  
  a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
  
  a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
  
  y=f(F(x,v),u)
  
  where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
  
  jointly estimating u and v.
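
The step of combining the likelihoods from the acoustic model and the language model, recited in claims 1 and 14, is often done in the log domain. The sketch below is one common way to do it and is an assumption rather than the claimed implementation: each candidate word sequence carries a hypothetical acoustic log-likelihood and language-model log-probability, and a hypothetical weight `lm_scale` balances the two. The scores in the example are made up for illustration.

```python
def combined_score(hyp, lm_scale=1.0):
    """Log-domain combination: acoustic log-likelihood plus weighted LM log-probability."""
    return hyp["acoustic_logp"] + lm_scale * hyp["lm_logp"]

def best_hypothesis(hyps, lm_scale=1.0):
    """Return the candidate word sequence with the highest combined score."""
    return max(hyps, key=lambda h: combined_score(h, lm_scale))

# Hypothetical candidates with made-up scores.
hyps = [
    {"words": "recognise speech", "acoustic_logp": -11.0, "lm_logp": -2.0},
    {"words": "wreck a nice beach", "acoustic_logp": -10.0, "lm_logp": -5.0},
]
with_lm = best_hypothesis(hyps, lm_scale=1.0)
acoustic_only = best_hypothesis(hyps, lm_scale=0.0)
```

Setting `lm_scale` to zero falls back to the acoustic score alone, which makes the contribution of the language model easy to see in a small example.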

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Chin, Kean Kheong; Gales, Mark John Francis; Xu, Haitian
Granted Patent: US 8,620,655 B2
US Class Current: 704/256.1
CPC Class Codes: G10L 15/065 Adaptation
