Estimating speaker-specific affine transforms for neural network based speech recognition systems
First Claim
1. A computer-implemented method comprising:
under control of one or more computing devices configured with specific computer-executable instructions, obtaining a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtaining a neural network-based (“NN-based”) acoustic model;
receiving an audio signal comprising speech;
computing a first sequence of feature vectors from the audio signal;
computing a GMM-based transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the GMM-based transform comprises a first linear portion and a first bias portion;
computing a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the second linear portion and the first linear portion;
computing a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the second bias portion and the first bias portion;
computing a second sequence of feature vectors from the audio signal;
computing a third sequence of feature vectors by applying the second linear portion and the second bias portion of the NN-based transform to the second sequence of feature vectors;
performing speech recognition using the third sequence of feature vectors and the NN-based acoustic model to generate speech processing results; and
determining, using the speech processing results, an action to perform.
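The claimed flow can be sketched in code. This is a minimal NumPy sketch, not the patented implementation: it assumes the GMM-based and NN-based transforms share one feature dimensionality, in which case minimizing the unconstrained least-squares differences simply returns the GMM-derived linear and bias parts, which are then applied frame by frame; the function names (`estimate_nn_transform`, `apply_affine`) are illustrative.

```python
import numpy as np

def estimate_nn_transform(A_gmm, b_gmm):
    # Minimize ||A_nn - A_gmm||_F^2 and ||b_nn - b_gmm||^2.
    # With no further structural constraints the minimizers are the
    # GMM-derived parts themselves; a real system would impose extra
    # structure (e.g. over stacked LFBE frames of a different size).
    return A_gmm.copy(), b_gmm.copy()

def apply_affine(features, A, b):
    # features: (num_frames, dim); returns A @ x + b for each frame x
    return features @ A.T + b

rng = np.random.default_rng(0)
A_gmm = rng.standard_normal((20, 20))   # first linear portion (cMLLR-style)
b_gmm = rng.standard_normal(20)         # first bias portion
feats = rng.standard_normal((100, 20))  # second sequence of feature vectors

A_nn, b_nn = estimate_nn_transform(A_gmm, b_gmm)
adapted = apply_affine(feats, A_nn, b_nn)  # third sequence of feature vectors
```

The `adapted` features would then be passed to the NN-based acoustic model for recognition.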
Abstract
Features are disclosed for estimating affine transforms in Log Filter-Bank Energy Space (“LFBE” space) in order to adapt artificial neural network-based acoustic models to a new speaker or environment. Neural network-based acoustic models may be trained using concatenated LFBEs as input features. The affine transform may be estimated by minimizing the least squares error between corresponding linear and bias transform parts for the resultant neural network feature vector and some standard speaker-specific feature vector obtained for a GMM-based acoustic model using constrained Maximum Likelihood Linear Regression (“cMLLR”) techniques. Alternatively, the affine transform may be estimated by minimizing the least squares error between the resultant transformed neural network feature and some standard speaker-specific feature obtained for a GMM-based acoustic model.
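The alternative estimation mentioned at the end of the abstract, minimizing the least squares error between the transformed neural network features and the speaker-specific features from the GMM-based model, amounts to an ordinary affine least-squares fit. A hedged NumPy sketch (the helper name `fit_affine_lstsq` and the synthetic data are illustrative, not taken from the patent):

```python
import numpy as np

def fit_affine_lstsq(X, Y):
    """Least-squares affine map so that Y ~= X @ A.T + b.

    X: (T, d_in) neural-network-side feature vectors.
    Y: (T, d_out) target speaker-specific (e.g. cMLLR-adapted) features.
    """
    T = X.shape[0]
    X_aug = np.hstack([X, np.ones((T, 1))])        # append a bias column
    W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)  # (d_in + 1, d_out)
    A, b = W[:-1].T, W[-1]                         # linear part, bias part
    return A, b

# Synthetic check: recover a known affine transform from noiseless data.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((6, 6))
b_true = rng.standard_normal(6)
X = rng.standard_normal((500, 6))
Y = X @ A_true.T + b_true
A, b = fit_affine_lstsq(X, Y)
```

With enough frames and full-rank features, the fit recovers the underlying linear and bias portions up to numerical precision.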
20 Claims
1. (Independent claim, reproduced above under “First Claim”; dependent claims 2-6 not shown.)
7. A system comprising:
a computer-readable memory storing executable instructions; and
one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
obtain a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtain a neural network-based (“NN-based”) acoustic model;
compute a first sequence of feature vectors from audio data;
compute a first transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the first transform comprises a first linear portion and a first bias portion;
compute a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the first linear portion and the second linear portion;
compute a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the first bias portion and the second bias portion;
compute a second sequence of feature vectors from an audio signal comprising user speech;
compute a third sequence of feature vectors by applying the second linear portion and the second bias portion to the second sequence of feature vectors;
perform speech recognition on the audio signal using the third sequence of feature vectors and the NN-based acoustic model; and
determine, using results of the speech recognition, an action to perform.
(Dependent claims 8-13 not shown.)
14. One or more non-transitory computer-readable media comprising executable code that, when executed, causes one or more computing devices to perform a process comprising:
obtaining a Gaussian mixture model-based (“GMM-based”) acoustic model;
obtaining a neural network-based (“NN-based”) acoustic model;
computing a first sequence of feature vectors from audio data;
computing a first transform using the GMM-based acoustic model and the first sequence of feature vectors, wherein the first transform comprises a first linear portion and a first bias portion;
computing a second linear portion of a NN-based transform by minimizing a first least squares difference function, wherein the first least squares difference function comprises a difference between the first linear portion and the second linear portion;
computing a second bias portion of the NN-based transform by minimizing a second least squares difference function, wherein the second least squares difference function comprises a difference between the first bias portion and the second bias portion;
computing a second sequence of feature vectors from an audio signal comprising speech;
computing a third sequence of feature vectors by applying the second linear portion and the second bias portion to the second sequence of feature vectors;
performing speech recognition on the audio signal using the third sequence of feature vectors and the NN-based acoustic model; and
determining, using results of the speech recognition, an action to perform.
(Dependent claims 15-20 not shown.)
Specification