Dimensionality reduction of Baum-Welch statistics for speaker recognition
First Claim
1. A speaker recognition apparatus comprising:
- a computer configured to:
extract audio features from a received recognition speech signal;
generate first order Gaussian mixture model (GMM) statistics from the extracted audio features based on a universal background model that includes a plurality of speaker models;
normalize the first order GMM statistics with regard to a duration of the received speech signal;
train a deep neural network having a plurality of fully connected layers using a set of recognition speech signals; and
execute the deep neural network having the plurality of fully connected layers to reduce a dimensionality of the normalized first order GMM statistics and output a voiceprint corresponding to the recognition speech signal, the fully connected layers of the deep neural network including:
an input layer configured to receive the normalized first order GMM statistics;
one or more sequentially arranged first hidden layers configured to receive coefficients from the input layer; and
a last hidden layer arranged to receive coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers and configured to output the voiceprint corresponding to the recognition speech signal.
2 Assignments
0 Petitions
Abstract
In a speaker recognition apparatus, audio features are extracted from a received recognition speech signal, and first order Gaussian mixture model (GMM) statistics are generated therefrom based on a universal background model that includes a plurality of speaker models. The first order GMM statistics are normalized with regard to a duration of the received speech signal. A deep neural network reduces a dimensionality of the normalized first order GMM statistics, and outputs a voiceprint corresponding to the recognition speech signal.
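The statistics step summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the patent: it assumes a diagonal-covariance UBM, and it assumes that "normalized with regard to a duration" means dividing the first order statistics by the number of frames, which the text above does not specify. The function names `first_order_stats` and `normalize_by_duration` are hypothetical.

```python
import numpy as np

def first_order_stats(features, weights, means, variances):
    """Zeroth and first order statistics of frames against a diagonal GMM.

    features:  (T, D) frame-level audio features
    weights:   (C,)   mixture weights of the UBM (assumed diagonal covariance)
    means:     (C, D) component means
    variances: (C, D) component variances
    """
    diff = features[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                            # (T, C)
    log_post = np.log(weights) + log_gauss
    # Normalize in the log domain so each frame's posteriors sum to 1.
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                      # (T, C)
    zeroth = post.sum(axis=0)                                    # (C,)
    first = post.T @ features                                    # (C, D)
    return zeroth, first

def normalize_by_duration(first, num_frames):
    """Assumption: duration normalization = division by the frame count."""
    return (first / num_frames).reshape(-1)                      # (C * D,)
```

Under this assumed normalization, repeating the same frames twice yields the same normalized vector, so the network input does not scale with utterance length.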
85 Citations
14 Claims
1. A speaker recognition apparatus comprising:
- a computer configured to:
extract audio features from a received recognition speech signal;
generate first order Gaussian mixture model (GMM) statistics from the extracted audio features based on a universal background model that includes a plurality of speaker models;
normalize the first order GMM statistics with regard to a duration of the received speech signal;
train a deep neural network having a plurality of fully connected layers using a set of recognition speech signals; and
execute the deep neural network having the plurality of fully connected layers to reduce a dimensionality of the normalized first order GMM statistics and output a voiceprint corresponding to the recognition speech signal, the fully connected layers of the deep neural network including:
an input layer configured to receive the normalized first order GMM statistics;
one or more sequentially arranged first hidden layers configured to receive coefficients from the input layer; and
a last hidden layer arranged to receive coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers and configured to output the voiceprint corresponding to the recognition speech signal.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method of generating a speaker model, the method comprising:
generating, by a computer, first order Gaussian mixture model (GMM) statistics from audio features extracted from a recognition speech signal, said GMM statistics being generated based on a universal background model that includes a plurality of speakers;
normalizing, by the computer, the first order GMM statistics with regard to a duration of the received speech signal;
training, by the computer, a deep neural network using a set of recognition speech signals; and
reducing, by the computer, a dimensionality of the normalized first order GMM statistics using a plurality of fully connected feed-forward convolutional layers of the deep neural network and deriving a voiceprint corresponding to the recognition speech signal, wherein the reducing of the dimensionality of the normalized first order GMM statistics includes:
receiving, by the computer, the normalized first order GMM statistics at an input layer of the plurality of fully connected feed-forward convolutional layers;
receiving, by the computer, coefficients from the input layer at a first hidden layer of one or more sequentially arranged first hidden layers of the fully connected feed-forward convolutional layers, each first hidden layer receiving coefficients from a preceding layer of the plurality of fully connected feed-forward convolutional layers;
receiving, by the computer at a last hidden layer, coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers; and
outputting, by the computer from the last hidden layer, the voiceprint corresponding to the recognition speech signal.
Dependent claims: 9, 10, 11, 12, 13, 14
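The layer arrangement recited in claims 1 and 8 (an input layer, one or more first hidden layers, and a smaller last hidden layer whose output is the voiceprint) can be illustrated with a short forward-pass sketch. Everything concrete here is an assumption for illustration only: the layer sizes, the tanh activation, the random untrained weights, the use of plain dense layers, and the hypothetical names `init_layers` and `extract_voiceprint`. The training step recited in the claims is omitted.

```python
import numpy as np

def init_layers(sizes, seed=0):
    """Create random (weight, bias) pairs for consecutive dense layers.

    sizes[0] is the input dimension, sizes[1:-1] are the first hidden
    layers, and sizes[-1] is the last hidden (bottleneck) layer.
    Weights are random placeholders; no training is performed here.
    """
    hidden, bottleneck = sizes[1:-1], sizes[-1]
    # Mirror the claim language: the last hidden layer is smaller than
    # each of the first hidden layers.
    assert hidden and bottleneck < min(hidden)
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((m, n)) * np.sqrt(1.0 / m), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def extract_voiceprint(x, layers):
    """Feed normalized statistics through every layer in sequence.

    The activations of the last hidden layer are returned as the
    low-dimensional voiceprint. tanh is an assumed activation.
    """
    h = x
    for weight, bias in layers:
        h = np.tanh(h @ weight + bias)
    return h
```

For example, `extract_voiceprint(stats, init_layers([stats.size, 512, 512, 64]))` would map a normalized statistics vector `stats` to a 64-dimensional vector; the sizes 512 and 64 are illustrative and do not come from the claims.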
Specification