Channel-compensated low-level features for speaker recognition
First Claim
1. A system for generating channel-compensated low level features for speaker recognition, the system comprising:
an acoustic channel simulator configured to receive a recognition speech signal, degrade the recognition speech signal to include characteristics of an audio channel, and output a degraded speech signal;
a first feed forward convolutional neural network configured, in a training mode, to receive the degraded speech signal, and to derive from the degraded speech signal a plurality of channel-compensated low-level features, and further configured, in a test and enrollment mode, to receive the recognition speech signal and to calculate from the recognition speech signal a plurality of the channel-compensated low-level features;
a speech signal analyzer configured, in the training mode, to extract features of the recognition speech signal;
a loss function processor configured to calculate a loss based on the features from the speech signal analyzer and the channel-compensated low-level features from the first feed forward convolutional neural network;
wherein the calculated loss at each of a plurality of training iterations is lowered by modifying one or more connection weights of the first feed forward convolutional neural network, and if the calculated loss is less than or equal to a threshold loss, or a maximum number of training iterations has been met, the training mode is terminated.
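The training procedure recited in claim 1 — degrade the input, extract features, compare against a handcrafted target, update weights, and stop on a loss threshold or an iteration cap — can be sketched as a minimal loop. This is an illustrative stand-in only: the additive-noise "channel simulator", the single linear layer standing in for the feed forward CNN, and the fixed-projection "handcrafted" target are all assumptions, not the patent's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(signal):
    """Stand-in acoustic channel simulator: additive noise (assumption)."""
    return signal + 0.05 * rng.standard_normal(signal.shape)

# Hypothetical "handcrafted" target features: a fixed linear projection of
# the clean recognition speech signal (standing in for e.g. spectral features).
signal = rng.standard_normal(64)
proj = rng.standard_normal((8, 64))
target = proj @ signal

# Stand-in for the first feed forward CNN: one trainable linear layer.
W = np.zeros((8, 64))

threshold_loss, max_iters = 1e-3, 5000
for it in range(max_iters):
    x = degrade(signal)                  # training-mode input (degraded speech)
    feats = W @ x                        # channel-compensated low-level features
    err = feats - target
    loss = float(np.mean(err ** 2))      # loss function processor (MSE here)
    if loss <= threshold_loss:           # termination: threshold met
        break                            # (or the loop's iteration cap is hit)
    W -= 0.01 * (2.0 / err.size) * np.outer(err, x)  # lower loss via weight update
```

In test/enrollment mode the same layer would simply be applied to the clean recognition speech signal, with no degradation and no weight updates.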
Abstract
A system for generating channel-compensated features of a speech signal includes a channel noise simulator that degrades the speech signal, a feed forward convolutional neural network (CNN) that generates channel-compensated features of the degraded speech signal, and a loss function that computes a difference between the channel-compensated features and handcrafted features for the same raw speech signal. Each loss result may be used to update connection weights of the CNN until a predetermined threshold loss is satisfied, and the CNN may be used as a front-end for a deep neural network (DNN) for speaker recognition/verification. The DNN may include convolutional layers, a bottleneck features layer, multiple fully-connected layers and an output layer. The bottleneck features may be used to update connection weights of the convolutional layers, and dropout may be applied to the convolutional layers.
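The speaker-recognition DNN described in the abstract — convolutional layers with dropout, a narrow bottleneck features layer, fully-connected layers, and an output layer — can be caricatured as a forward pass. All dimensions, kernel sizes, dropout rate, and the 10-speaker output here are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)

def conv1d(x, kernel):
    """Valid 1-D convolution over the time axis (one channel, for brevity)."""
    return np.convolve(x, kernel, mode="valid")

def dropout(x, p, training):
    """Inverted dropout, applied to the convolutional activations."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def forward(x, params, training=True):
    # Convolutional layers with dropout, as the abstract describes
    h = relu(conv1d(x, params["k1"]))
    h = dropout(h, 0.2, training)
    h = relu(conv1d(h, params["k2"]))
    h = dropout(h, 0.2, training)
    # Bottleneck features layer: a narrow dense layer whose activations
    # serve as compact speaker features
    bottleneck = relu(params["Wb"] @ h)
    # Fully-connected layers and a softmax output over speaker identities
    h = relu(params["W1"] @ bottleneck)
    logits = params["W2"] @ h
    e = np.exp(logits - logits.max())
    return bottleneck, e / e.sum()

params = {
    "k1": rng.standard_normal(5) * 0.1,
    "k2": rng.standard_normal(5) * 0.1,
    "Wb": rng.standard_normal((16, 92)) * 0.1,  # 100 - 4 - 4 = 92 after two valid convs
    "W1": rng.standard_normal((64, 16)) * 0.1,
    "W2": rng.standard_normal((10, 64)) * 0.1,  # 10 hypothetical speakers
}
feat, probs = forward(rng.standard_normal(100), params, training=False)
```

At verification time the bottleneck activations (`feat`), rather than the softmax output, would be taken as the speaker embedding.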
28 Claims
1. A system for generating channel-compensated low level features for speaker recognition (set forth in full above under "First Claim"). View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
19. A method of training a deep neural network (DNN) with channel-compensated low-level features for speaker recognition, the method comprising:
receiving a recognition speech signal;
degrading the recognition speech signal to produce a channel-compensated speech signal;
extracting, using a first feed forward convolutional neural network, a plurality of low-level features from the channel-compensated speech signal;
calculating a loss result using the channel-compensated low-level features extracted from the channel-compensated speech signal and hand-crafted features extracted from the recognition speech signal;
modifying connection weights of the first feed forward convolutional neural network to lower the calculated loss result at each of a plurality of training iterations; and
terminating the training if the calculated loss result is less than or equal to a threshold loss or a maximum number of training iterations has been met.
View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28)
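Claim 19's "hand-crafted features extracted from the recognition speech signal" are not pinned to a specific feature type in the claim itself. A toy stand-in, such as log band energies of one windowed frame, can be sketched as follows; the band count, frame length, and Hann windowing are illustrative assumptions.

```python
import numpy as np

def handcrafted_features(signal, n_bands=8, frame_len=256):
    """Toy stand-in for hand-crafted features (e.g. filter-bank energies):
    log power in n_bands equal-width frequency bands of one frame."""
    frame = signal[:frame_len] * np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2     # power spectrum of the frame
    bands = np.array_split(spectrum, n_bands)      # equal-width frequency bands
    return np.log([b.sum() + 1e-10 for b in bands])

rng = np.random.default_rng(2)
target = handcrafted_features(rng.standard_normal(256))
```

The loss result of claim 19 would then be, for example, a mean-squared difference between such targets and the CNN's features for the degraded signal.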
Specification