Speech processing system and method
First Claim
1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining a likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
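The claim ends by reciting that the mismatch function f takes a particular form, but the formula itself is not reproduced in this text (it appeared as an image in the source). For orientation only: the conventional vector Taylor series (VTS) mismatch function used in this kind of joint speaker/noise adaptation, written in the cepstral domain with DCT matrix C, additive noise n, and convolutional noise h, is commonly stated as below. This specific form is a standard textbook expression offered as an assumption, not a reproduction of the patented formula.

```latex
\[
y \;=\; x + h + C \,\log\!\Big(\mathbf{1} + \exp\!\big(C^{-1}\,(n - x - h)\big)\Big)
\]
```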
Abstract
A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
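The relation y = f(F(x, v), u) composes a speaker transform with an environment mismatch function. A minimal NumPy sketch of that composition, assuming a CMLLR-style affine speaker transform F(x, v) = Ax + b with v = (A, b), and a log-spectral-domain VTS-style mismatch f(z, u) = z + h + log(1 + exp(n − z − h)) with u = (n, h); these concrete functional forms are illustrative assumptions, not taken from the patent text:

```python
import numpy as np

def speaker_transform(x, v):
    """F(x, v): affine (CMLLR-style) speaker transform, v = (A, b). Illustrative assumption."""
    A, b = v
    return A @ x + b

def mismatch_function(z, u):
    """f(z, u): log-spectral combination of speech z with additive noise n and
    convolutional noise h, u = (n, h). Illustrative assumption."""
    n, h = u
    return z + h + np.log1p(np.exp(n - z - h))

def corrupted_speech(x, v, u):
    """y = f(F(x, v), u): training-domain speech x mapped into the mismatched domain."""
    return mismatch_function(speaker_transform(x, v), u)

# Tiny demo on a 3-dimensional feature vector.
x = np.array([1.0, 2.0, 3.0])
A = np.diag([1.1, 0.9, 1.0])
b = np.array([0.2, -0.1, 0.0])
h = np.array([0.3, 0.3, 0.3])
n = np.full(3, -50.0)   # effectively no additive noise
y = corrupted_speech(x, (A, b), (n, h))
# With negligible additive noise, y reduces to F(x, v) + h.
```

Raising n moves y further from the speaker-transformed speech, which is the qualitative behavior the mismatch function is meant to capture.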
11 Claims
1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining a likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 10
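Claim 1 requires u and v to be optimized jointly in a single maximization step of the EM algorithm. A toy sketch of that idea, assuming a deliberately simplified mismatch y = v·μₖ + u (a scalar speaker scale v and environment offset u applied to known acoustic-model component means μₖ) so that the joint M-step has a closed form; the patented method operates on the full mismatch function, and every concrete value and name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known "acoustic model": a 3-component Gaussian mixture over training speech.
mu = np.array([-2.0, 0.0, 2.0])   # component means
sigma = 0.3                        # shared standard deviation
pi = np.ones(3) / 3                # uniform component weights

# Mismatched observations: y = v_true * mu_k + u_true + noise (toy mismatch).
v_true, u_true = 2.0, 0.5
k = rng.integers(0, 3, size=600)
y = v_true * mu[k] + u_true + rng.normal(0.0, sigma, size=600)

v, u = 1.0, 0.0                    # initial guesses
for _ in range(20):
    # E-step: component responsibilities under the current (v, u).
    d = y[:, None] - (v * mu + u)[None, :]
    logp = np.log(pi)[None, :] - 0.5 * (d / sigma) ** 2
    g = np.exp(logp - logp.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # Single M-step: maximize the expected log-likelihood over (v, u) JOINTLY,
    # which here is a weighted least-squares fit of y against [mu, 1].
    W = g.sum()
    Sm = (g * mu[None, :]).sum()
    Smm = (g * mu[None, :] ** 2).sum()
    Sy = (g * y[:, None]).sum()
    Sym = (g * (y[:, None] * mu[None, :])).sum()
    v, u = np.linalg.solve(np.array([[Smm, Sm], [Sm, W]]),
                           np.array([Sym, Sy]))
```

The point of the sketch is the M-step: both parameters are solved for in one maximization rather than alternating between a speaker-update pass and a noise-update pass.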
9. A method of adapting an acoustic model for speech processing to a mismatched speech input, said mismatched speech input being received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained, the method comprising:
- receiving a mismatched speech input which comprises a sequence of feature vectors; and
- providing an acoustic model for performing speech processing on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
11. A speech processing system, comprising:
- a receiver for receiving a speech input which comprises a sequence of feature vectors; and
- a processor configured to:
- determine a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapt the acoustic model to the mismatched speech input,
- the processor being further configured to determine a likelihood of a sequence of features occurring in a given language using a language model; and
- combine the likelihoods determined by the acoustic model and the language model,
- the system further comprising an output configured to output a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
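Claim 11's processor combines the acoustic-model and language-model likelihoods before the output stage emits a word sequence. A minimal sketch of that combination in the log domain, assuming per-hypothesis log-likelihoods and the usual language-model scale factor (a common decoding convention; the weight, the function names, and the example scores are illustrative, not recited in the claim):

```python
def combine_and_output(hypotheses, lm_weight=1.0):
    """Return the word sequence maximizing log P_am + lm_weight * log P_lm.

    hypotheses: list of (words, acoustic_loglik, lm_loglik) tuples.
    All scores are illustrative stand-ins for real model outputs.
    """
    def score(h):
        _, am, lm = h
        return am + lm_weight * lm
    words, _, _ = max(hypotheses, key=score)
    return words

# Two competing hypotheses: acoustically similar, very different under the LM.
hyps = [
    (("recognise", "speech"), -110.0, -4.0),
    (("wreck", "a", "nice", "beach"), -108.0, -9.0),
]
best = combine_and_output(hyps, lm_weight=1.0)
```

With the language model switched off (lm_weight=0) the acoustically better but linguistically implausible hypothesis wins instead, which is why the claim combines both likelihoods rather than using the acoustic score alone.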
Specification