SPEECH PROCESSING SYSTEM AND METHOD
Abstract
A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using: a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
y=f(F(x,v),u)
- where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
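The composition y=f(F(x,v),u) can be illustrated with a minimal sketch. The text above does not fix the form of either function, so the choices here are assumptions made only for illustration: F is taken to be an affine transform with v = (A, b), and f is taken to be a log-domain combination of a channel offset h and an additive noise level n with u = (h, n). The function names are hypothetical.

```python
import numpy as np

def speaker_transform(x, v):
    """F(x, v): assumed affine speaker transform, v = (A, b)."""
    A, b = v
    return x @ A.T + b

def mismatch_function(z, u):
    """f(z, u): assumed environment mismatch in a log-spectral domain.
    u = (h, n), where h is a channel offset and n an additive-noise level."""
    h, n = u
    return np.logaddexp(z + h, n)

def mismatched_speech(x, v, u):
    """y = f(F(x, v), u): speaker transform first, environment mismatch second."""
    return mismatch_function(speaker_transform(x, v), u)

# Two 3-dimensional feature vectors standing in for training-condition speech.
x = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.0, -1.0]])

# Identity speaker transform and a noise-free, offset-free environment.
v_identity = (np.eye(3), np.zeros(3))
u_clean = (np.zeros(3), np.full(3, -np.inf))
y_matched = mismatched_speech(x, v_identity, u_clean)

# Same speaker transform, but with additive noise at log-level 0.
u_noisy = (np.zeros(3), np.zeros(3))
y_noisy = mismatched_speech(x, v_identity, u_noisy)
```

With the identity speaker transform and no noise or channel offset, the sketch returns x unchanged, which corresponds to the matched condition. Raising the noise level lifts every component of y above the transformed input.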
14 Claims
1. A speech processing method, comprising:
- receiving a speech input which comprises a sequence of feature vectors;
- determining the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- providing an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapting the acoustic model to the mismatched speech input,
- the speech processing method further comprising determining the likelihood of a sequence of features occurring in a given language using a language model; and
- combining the likelihoods determined by the acoustic model and the language model and outputting a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from the mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
y=f(F(x,v),u)
- where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13.
12. A method of adapting an acoustic model for speech processing to a mismatched speech input, said mismatched speech input being received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained, the method comprising:
- receiving a mismatched speech input which comprises a sequence of feature vectors; and
- providing an acoustic model for performing speech processing on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
y=f(F(x,v),u)
- where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
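As a toy illustration of estimating u and v jointly rather than one after the other, the sketch below updates both parameter sets in the same gradient step. The claim does not say how the estimation is carried out, so everything here is an assumption: scalar features, F(x,v) = a*x + b with v = (a, b), f(z,u) = log(exp(z) + exp(n)) with u = n, paired x and y frames, and a squared-error objective. A real system would more plausibly estimate the parameters against the acoustic model from the mismatched input alone.

```python
import numpy as np

def predict(x, a, b, n):
    """y = f(F(x, v), u) with assumed F(x, v) = a*x + b and f(z, u) = log(e^z + e^n)."""
    return np.logaddexp(a * x + b, n)

def mse(x, y, a, b, n):
    """Mean squared error between predicted and observed mismatched frames."""
    return float(np.mean((predict(x, a, b, n) - y) ** 2))

def joint_fit(x, y, steps=500, lr=0.05):
    """Update v = (a, b) and u = n together by gradient descent on squared error."""
    a, b, n = 1.0, 0.0, -2.0                  # start: identity speaker transform, low noise
    for _ in range(steps):
        z = a * x + b
        r = np.logaddexp(z, n) - y            # residual per frame
        s = 1.0 / (1.0 + np.exp(n - z))       # d y_hat / d z; (1 - s) is d y_hat / d n
        grad_a = np.mean(2 * r * s * x)
        grad_b = np.mean(2 * r * s)
        grad_n = np.mean(2 * r * (1 - s))
        # One simultaneous step: neither parameter set is frozen while the other moves.
        a, b, n = a - lr * grad_a, b - lr * grad_b, n - lr * grad_n
    return a, b, n

rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # synthetic training-condition frames
y = predict(x, 1.2, 0.3, -0.5)                # synthetic mismatched frames

loss_before = mse(x, y, 1.0, 0.0, -2.0)
a, b, n = joint_fit(x, y)
loss_after = mse(x, y, a, b, n)
```

Because the speaker parameters and the environment parameter both shape the same prediction, updating them in one step lets each account for the other's current value, which is the point of a joint estimate in this toy setting.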
14. A speech processing system, comprising:
- a receiver for receiving a speech input which comprises a sequence of feature vectors; and
- a processor configured to:
- determine the likelihood of a sequence of words arising from the sequence of feature vectors using an acoustic model and a language model, comprising:
- provide an acoustic model for performing speech recognition on an input signal which comprises a sequence of feature vectors, said model having a plurality of model parameters relating to the probability distribution of a word or part thereof being related to a feature vector, wherein said speech input is a mismatched speech input which is received from a speaker in an environment which is not matched to the speaker or environment under which the acoustic model was trained; and
- adapt the acoustic model to the mismatched speech input,
- the processor being further configured to determine the likelihood of a sequence of features occurring in a given language using a language model; and
- combine the likelihoods determined by the acoustic model and the language model,
- the system further comprising an output configured to output a sequence of words identified from said speech input signal,
- wherein adapting the acoustic model to the mismatched speaker input comprises:
- relating speech from mismatched speaker input to the speech used to train the acoustic model using:
- a mismatch function f for primarily modelling differences between the environment of the speaker and the environment under which the acoustic model was trained; and
- a speaker transform F for primarily modelling differences between the speaker of the mismatched speaker input, such that:
y=f(F(x,v),u)
- where y represents the speech from the mismatched speaker input, x is the speech used to train the acoustic model, u represents at least one parameter for modelling changes in the environment and v represents at least one parameter used for mapping differences between speakers; and
- jointly estimating u and v.
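The step of combining the acoustic-model and language-model likelihoods can be sketched as follows. The claims do not specify the combination rule, so this assumes a common convention: add the two log-likelihoods, with a weight on the language-model term, and output the highest-scoring word sequence. The hypotheses, their scores and the function names are invented for illustration.

```python
def combined_score(acoustic_logp, lm_logp, lm_weight=1.0):
    """Assumed combination rule: acoustic log-likelihood plus weighted LM log-likelihood."""
    return acoustic_logp + lm_weight * lm_logp

def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the word sequence with the highest combined score.

    hypotheses maps a word sequence to (acoustic_logp, lm_logp).
    """
    return max(hypotheses,
               key=lambda words: combined_score(*hypotheses[words], lm_weight))

# Hypothetical scores: the second hypothesis fits the audio slightly better,
# but the language model considers it much less likely.
hypotheses = {
    "recognise speech": (-120.0, -4.0),
    "wreck a nice beach": (-118.5, -9.0),
}

with_lm = best_hypothesis(hypotheses, lm_weight=1.0)
acoustic_only = best_hypothesis(hypotheses, lm_weight=0.0)
```

With the language model included, the combined scores are -124.0 and -127.5, so the first hypothesis is output. With the language-model weight set to zero, the acoustically closer second hypothesis wins instead.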
Specification