Speech processing system and method
First Claim
1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining a likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
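The claim ends by reciting that the mismatch function f takes a particular form, but the formula itself is not reproduced in this text (it appeared as an image in the source). For orientation only: the conventional vector Taylor series (VTS) mismatch function used in this kind of joint speaker/noise adaptation, written in the cepstral domain with DCT matrix C, additive noise n, and convolutional noise h, is commonly stated as below. This specific form is a standard textbook expression offered as an assumption, not a reproduction of the patented formula.

```latex
\[
y \;=\; x + h + C \,\log\!\Big(\mathbf{1} + \exp\!\big(C^{-1}\,(n - x - h)\big)\Big)
\]
```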
Abstract
A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
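The relation y = f(F(x, v), u) composes a speaker transform with an environment mismatch function. A minimal NumPy sketch of that composition, assuming a CMLLR-style affine speaker transform F(x, v) = Ax + b with v = (A, b), and a log-spectral-domain VTS-style mismatch f(z, u) = z + h + log(1 + exp(n − z − h)) with u = (n, h); these concrete functional forms are illustrative assumptions, not taken from the patent text:

```python
import numpy as np

def speaker_transform(x, v):
    """F(x, v): affine (CMLLR-style) speaker transform, v = (A, b). Illustrative assumption."""
    A, b = v
    return A @ x + b

def mismatch_function(z, u):
    """f(z, u): log-spectral combination of speech z with additive noise n and
    convolutional noise h, u = (n, h). Illustrative assumption."""
    n, h = u
    return z + h + np.log1p(np.exp(n - z - h))

def corrupted_speech(x, v, u):
    """y = f(F(x, v), u): training-domain speech x mapped into the mismatched domain."""
    return mismatch_function(speaker_transform(x, v), u)

# Tiny demo on a 3-dimensional feature vector.
x = np.array([1.0, 2.0, 3.0])
A = np.diag([1.1, 0.9, 1.0])
b = np.array([0.2, -0.1, 0.0])
h = np.array([0.3, 0.3, 0.3])
n = np.full(3, -50.0)   # effectively no additive noise
y = corrupted_speech(x, (A, b), (n, h))
# With negligible additive noise, y reduces to F(x, v) + h.
```

Raising n moves y further from the speaker-transformed speech, which is the qualitative behavior the mismatch function is meant to capture.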
11 Claims
1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining a likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 10
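Claim 1 requires u and v to be optimized jointly in a single maximization step of the EM algorithm. A toy sketch of that idea, assuming a deliberately simplified mismatch y = v·μₖ + u (a scalar speaker scale v and environment offset u applied to known acoustic-model component means μₖ) so that the joint M-step has a closed form; the patented method operates on the full mismatch function, and every concrete value and name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known "acoustic model": a 3-component Gaussian mixture over training speech.
mu = np.array([-2.0, 0.0, 2.0])   # component means
sigma = 0.3                        # shared standard deviation
pi = np.ones(3) / 3                # uniform component weights

# Mismatched observations: y = v_true * mu_k + u_true + noise (toy mismatch).
v_true, u_true = 2.0, 0.5
k = rng.integers(0, 3, size=600)
y = v_true * mu[k] + u_true + rng.normal(0.0, sigma, size=600)

v, u = 1.0, 0.0                    # initial guesses
for _ in range(20):
    # E-step: component responsibilities under the current (v, u).
    d = y[:, None] - (v * mu + u)[None, :]
    logp = np.log(pi)[None, :] - 0.5 * (d / sigma) ** 2
    g = np.exp(logp - logp.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    # Single M-step: maximize the expected log-likelihood over (v, u) JOINTLY,
    # which here is a weighted least-squares fit of y against [mu, 1].
    W = g.sum()
    Sm = (g * mu[None, :]).sum()
    Smm = (g * mu[None, :] ** 2).sum()
    Sy = (g * y[:, None]).sum()
    Sym = (g * (y[:, None] * mu[None, :])).sum()
    v, u = np.linalg.solve(np.array([[Smm, Sm], [Sm, W]]),
                           np.array([Sym, Sy]))
```

The point of the sketch is the M-step: both parameters are solved for in one maximization rather than alternating between a speaker-update pass and a noise-update pass.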
9. A method of adapting an acoustic model for speech processing to a mismatched speech input, said mismatched speech input being received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained, the method comprising:
- receiving a mismatched speech input which comprises a sequence of feature vectors; and
- providing an acoustic model for performing speech processing on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
11. A speech processing system, comprising:
- a receiver for receiving a speech input which comprises a sequence of feature vectors; and
- a processor configured to:
- determine a likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapt the acoustic model to the mismatched speech input,
- the processor being further configured to determine a likelihood of a sequence of features occurring in a given language using a language model; and
- combine the likelihoods determined by the acoustic model and the language model,
- the system further comprising an output configured to output a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modeling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modeling differences between the speaker of the mismatched speaker input, such that:
- y = f(F(x, v), u), where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modeling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v, wherein said joint estimation of u and v is performed using the expectation maximization algorithm and comprises optimizing u and v in a single maximization step of said algorithm, wherein said at least one parameter u comprises parameters n and h, where n is used to model additive noise and h is used to model convolutional noise, and wherein said mismatch function f is of the form:
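Claim 11's processor combines the acoustic-model and language-model likelihoods before the output stage emits a word sequence. A minimal sketch of that combination in the log domain, assuming per-hypothesis log-likelihoods and the usual language-model scale factor (a common decoding convention; the weight, the function names, and the example scores are illustrative, not recited in the claim):

```python
def combine_and_output(hypotheses, lm_weight=1.0):
    """Return the word sequence maximizing log P_am + lm_weight * log P_lm.

    hypotheses: list of (words, acoustic_loglik, lm_loglik) tuples.
    All scores are illustrative stand-ins for real model outputs.
    """
    def score(h):
        _, am, lm = h
        return am + lm_weight * lm
    words, _, _ = max(hypotheses, key=score)
    return words

# Two competing hypotheses: acoustically similar, very different under the LM.
hyps = [
    (("recognise", "speech"), -110.0, -4.0),
    (("wreck", "a", "nice", "beach"), -108.0, -9.0),
]
best = combine_and_output(hyps, lm_weight=1.0)
```

With the language model switched off (lm_weight=0) the acoustically better but linguistically implausible hypothesis wins instead, which is why the claim combines both likelihoods rather than using the acoustic score alone.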
Specification