On-line background noise adaptation of parallel model combination HMM with discriminative learning using weighted HMM for noisy speech recognition
Abstract
A system for adaptively generating a composite noisy speech model to process speech in, e.g., a nonstationary environment comprises a speech recognizer, a re-estimation circuit, a combiner circuit, a classifier circuit, and a discrimination circuit. In particular, the speech recognizer generates frames of current input utterances based on received speech data and determines which of the generated frames are aligned with noisy states to produce a current noise model. The re-estimation circuit re-estimates the produced current noise model by interpolating the number of frames in the current noise model with parameters from a previous noise model. The combiner circuit combines the parameters of the current noise model with model parameters of a corresponding current clean speech model to generate model parameters of a composite noisy speech model. The classifier circuit determines a discrimination function by generating a weighted PMC HMM model. The discrimination learning circuit determines a distance function by measuring the degree of mis-recognition based on the discrimination function, determines a loss function based on the distance function, which is approximately equal to the distance function, determines a risk function representing the mean value of the loss function, and generates a current discriminative noise model based in part on the risk function, such that the input utterances correspond more accurately with the predetermined model parameters of the composite noisy speech model.
147 Citations
20 Claims
-
1. A method of generating a composite noisy speech model, comprising the steps of:
-
generating frames of current input utterances based on received speech data, determining which of said generated frames are aligned with noisy states to produce a current noise model, re-estimating the produced current noise model by interpolating the number of frames in said current noise model with parameters from a previous noise model, combining the parameters of said current noise model with templates of a corresponding current clean speech model to generate templates of a composite noisy speech model, determining a discrimination function by generating a weighted current noise model based on said composite noisy speech model, determining a distance function by measuring the degree of mis-recognition based on said discrimination function, determining a loss function based on said distance function, said loss function being approximately equal to said distance function, determining a risk function representing the mean value of said loss function, and generating a current discriminative noise model based in part on said risk function, such that the input utterances correspond more accurately with the predetermined templates of the composite noisy speech model.
where λ(n) represents said parameters of said previous noise model, λ(κ) represents the parameters of frames of said current noise model, and λ(n+κ) represents said re-estimated current noise model.
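The re-estimation formula itself did not survive in the text above. As a rough illustration only, the sketch below assumes the interpolation is a frame-count-weighted average of the previous parameters λ(n) and the current parameters λ(κ), where n and κ are the numbers of noise frames behind each estimate. The function name and the restriction to mean vectors are hypothetical.

```python
def reestimate_noise_mean(prev_mean, n_prev, cur_mean, n_cur):
    """Count-weighted interpolation of two noise-mean estimates.

    Assumed form (not taken from the claim text):
        lambda(n + k) = (n * lambda(n) + k * lambda(k)) / (n + k)
    """
    total = n_prev + n_cur
    if total == 0:
        raise ValueError("at least one noise frame is required")
    return [(n_prev * p + n_cur * c) / total
            for p, c in zip(prev_mean, cur_mean)]


# Previous model built from 30 noise frames, current one from 10.
updated = reestimate_noise_mean([1.0, 2.0], 30, [3.0, 6.0], 10)
print(updated)  # [1.5, 3.0]
```

Under this assumption, an estimate backed by more frames pulls the result toward itself, so a short noisy segment cannot overwrite a long-running estimate.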
-
-
3. The method of claim 2, wherein said generated frames aligned with noisy states are determined by a Viterbi decoding scheme.
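To illustrate how Viterbi decoding can pick out noise-aligned frames, the sketch below decodes a toy two-state model (state 0 = noise, state 1 = speech) and collects the frame indices aligned with the noise state. The model sizes, probabilities, and names are assumptions for illustration and are not drawn from the patent.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_emit[t][s] is the log-likelihood of frame t under state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best = max(range(n_states),
                       key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


NOISE = 0
log = math.log
log_init = [log(0.5), log(0.5)]
log_trans = [[log(0.8), log(0.2)], [log(0.2), log(0.8)]]
noise_like = [log(0.9), log(0.1)]   # frame that looks like noise
speech_like = [log(0.1), log(0.9)]  # frame that looks like speech
frames = [noise_like, noise_like, speech_like, speech_like, noise_like]

path = viterbi(log_init, log_trans, frames)
noise_frames = [t for t, s in enumerate(path) if s == NOISE]
print(path, noise_frames)  # [0, 0, 1, 1, 0] [0, 1, 4]
```

The frames listed in `noise_frames` are the ones that would feed the current noise model in this toy setup.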
-
4. The method of claim 3, wherein said combining the parameters of the re-estimated current noise model with parameters of a corresponding current clean speech model to generate a composite noisy speech model is done by using a method of parallel model combination.
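The claim names parallel model combination (PMC) without giving details. One widely used simplification is the log-add approximation, which combines clean-speech and noise means in the log-spectral domain. The sketch below shows that approximation for mean vectors only. The function name and the `gain` parameter are hypothetical, and a full PMC implementation would also handle cepstral-domain transforms and variances.

```python
import math


def pmc_log_add(clean_log_means, noise_log_means, gain=1.0):
    """Log-add approximation for combining log-spectral means.

    Speech and noise are assumed additive in the linear spectral domain:
        mu_noisy = log(exp(mu_clean) + gain * exp(mu_noise))
    """
    return [math.log(math.exp(s) + gain * math.exp(n))
            for s, n in zip(clean_log_means, noise_log_means)]


clean = [math.log(4.0), math.log(10.0)]
noise = [math.log(1.0), math.log(0.001)]
noisy = pmc_log_add(clean, noise)
# First bin: noise raises the mean to about log(5).
# Second bin: noise is negligible, so the mean stays close to log(10).
print(noisy)
```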
-
5. The method of claim 4, wherein said discrimination function being:
-
where O = o_1, o_2, ..., o_T represents an input feature vector of T number of frames, K is the total number of states, SC_{j,i} represents the corresponding accumulated log probability of state i in class j, and W_{j,i} represents the corresponding weight of state i in class j.
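The discrimination function's formula is missing from the text above. Given the symbols that are defined (K states, accumulated state log probabilities SC, and state weights W), a natural reading is a weighted sum over states. The sketch below assumes that form; the function names and numbers are illustrative only.

```python
def discriminant(weights_j, state_scores_j):
    """Assumed form: g_j(O) = sum over i of W[j][i] * SC[j][i]."""
    return sum(w * sc for w, sc in zip(weights_j, state_scores_j))


def classify(weights, state_scores):
    """Return the index of the class with the largest discriminant, and all scores."""
    scores = [discriminant(w, sc) for w, sc in zip(weights, state_scores)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Two classes with K = 3 states each; all weights start at 1.0.
W = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
SC = [[-10.0, -12.0, -8.0], [-11.0, -9.0, -13.0]]
print(classify(W, SC))  # (0, [-30.0, -33.0])
```

With all weights equal to 1.0 this reduces to an ordinary sum of state log probabilities, so the weights only matter once discriminative training moves them away from that starting point.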
-
-
6. The method of claim 1, wherein the current parameter is generated by the steps of:
-
determining a distance function by measuring the degree of mis-recognition based on the discrimination function, determining a loss function based on the distance function, determining a risk function for representing the mean value of the loss function, and generating the current weighted parameters based in part on the risk function.
-
-
7. The method of claim 6, wherein said distance function being:
-
8. The method of claim 6, wherein said loss function being:
-
where d0 is a positive function.
-
-
9. The method of claim 6, wherein said risk function being:
-
where O = O_1, O_2, ..., O_N, and O_k represents the kth training speech data.
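The distance, loss, and risk formulas of claims 7 to 9 are not reproduced in the text above. The sketch below shows one common minimum-classification-error setup as a stand-in, not the claimed formulas: the distance compares the correct class score with the best competitor, the loss is a sigmoid of that distance, and the risk is the mean loss over N training utterances. All names and the `slope` parameter are assumptions.

```python
import math


def distance(scores, correct):
    """Misclassification measure: positive when a competitor beats the correct class."""
    competitors = [g for m, g in enumerate(scores) if m != correct]
    return -scores[correct] + max(competitors)


def sigmoid_loss(d, slope=1.0):
    """Smooth 0..1 loss (stand-in for the claimed loss function)."""
    return 1.0 / (1.0 + math.exp(-slope * d))


def risk(all_scores, labels):
    """Mean loss over the training utterances O_1 ... O_N."""
    losses = [sigmoid_loss(distance(s, y)) for s, y in zip(all_scores, labels)]
    return sum(losses) / len(losses)


scores_per_utterance = [[-30.0, -33.0], [-20.0, -18.0]]
labels = [0, 0]  # the second utterance is misrecognized
print(distance(scores_per_utterance[0], 0))  # -3.0
print(distance(scores_per_utterance[1], 0))  # 2.0
print(risk(scores_per_utterance, labels))
```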
-
-
10. The method of claim 9, wherein said current discriminative noise model being represented by:
-
where τ (τ > 0) is a preset margin, ε(l) is a learning constant that is a decreasing function of l, and U is a positive-definite matrix, such as an identity matrix.
-
-
11. A system for generating a composite noisy speech model, comprising:
-
a speech recognizer for generating frames of current input utterances based on received speech data, and for determining which of said generated frames are aligned with noisy states to produce a current noise model, a re-estimation circuit for re-estimating the produced current noise model by interpolating the number of frames in said current noise model with parameters from a previous noise model, a combiner circuit for combining the parameters of said current noise model with templates of a corresponding current clean speech model to generate templates of a composite noisy speech model, a classifier circuit for determining a discrimination function by generating a weighted current noise model based on said composite noisy speech model, and a discrimination learning circuit, for determining a distance function by measuring the degree of mis-recognition based on said discrimination function, for determining a loss function based on said distance function, said loss function being approximately equal to said distance function, for determining a risk function representing the mean value of said loss function, and for generating a current discriminative noise model based in part on said risk function, such that the input utterances correspond more accurately with the predetermined templates of the composite noisy speech model.
where λ(n) represents said parameters of said previous noise model, λ(κ) represents the parameters of frames of said current noise model, and λ(n+κ) represents said re-estimated current noise model.
-
-
13. The system of claim 12, wherein said generated frames aligned with noisy states are determined by a Viterbi decoding scheme.
-
14. The system of claim 13, wherein said combining the parameters of the re-estimated current noise model with parameters of a corresponding current clean speech model to generate a composite noisy speech model is done by using a method of parallel model combination.
-
15. The system of claim 11, wherein the current parameter is generated by the steps of:
-
determining a distance function by measuring the degree of mis-recognition based on the discrimination function, determining a loss function based on the distance function, determining a risk function for representing the mean value of the loss function, and generating the current weighted parameters based in part on the risk function.
-
-
16. The system of claim 14, wherein said discrimination function being:
-
where O = o_1, o_2, ..., o_T represents an input feature vector of T number of frames, K is the total number of states, SC_{j,i} represents the corresponding accumulated log probability of state i in class j, and W_{j,i} represents the corresponding weight of state i in class j.
-
-
17. The system of claim 15, wherein said distance function being:
-
18. The system of claim 15, wherein said loss function being:
-
where d0 is a positive function.
-
-
19. The system of claim 15, wherein said risk function being:
-
where O = O_1, O_2, ..., O_N, and O_k represents the kth training speech data.
-
-
20. The system of claim 19, wherein said current discriminative noise model being represented by:
-
where τ (τ > 0) is a preset margin, ε(l) is a learning constant that is a decreasing function of l, and U is a positive-definite matrix, such as an identity matrix.
-
Specification