Dimensionality reduction of Baum-Welch statistics for speaker recognition

US 10,553,218 B2
Filed: 09/19/2017
Issued: 02/04/2020
Est. Priority Date: 09/19/2016
Status: Active Grant

Abstract

In a speaker recognition apparatus, audio features are extracted from a received recognition speech signal, and first order Gaussian mixture model (GMM) statistics are generated therefrom based on a universal background model that includes a plurality of speaker models. The first order GMM statistics are normalized with regard to a duration of the received speech signal. A deep neural network reduces a dimensionality of the normalized first order GMM statistics, and outputs a voiceprint corresponding to the recognition speech signal.
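To make the first two steps of the abstract concrete, the following Python sketch accumulates first order statistics against a small diagonal-covariance GMM and rescales them to a fixed duration. The function names, the diagonal-covariance form, and the scaling rule (multiplying by target_frames / n_frames) are illustrative assumptions, not details stated in the text above.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def first_order_stats(frames, weights, means, variances):
    """Return (zero_order, first_order) statistics for one utterance.

    zero_order[c]     = sum over frames of the posterior of component c
    first_order[c][d] = sum over frames of posterior * feature value
    """
    n_comp, dim = len(weights), len(means[0])
    zero = [0.0] * n_comp
    first = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        log_p = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(log_p)  # subtract the maximum for numerical stability
        unnorm = [math.exp(lp - peak) for lp in log_p]
        total = sum(unnorm)
        for c in range(n_comp):
            gamma = unnorm[c] / total  # posterior of component c for this frame
            zero[c] += gamma
            for d in range(dim):
                first[c][d] += gamma * x[d]
    return zero, first


def normalize_duration(first, n_frames, target_frames):
    """Assumed rule: rescale the statistics as if the utterance had target_frames frames."""
    scale = target_frames / n_frames
    return [[scale * v for v in row] for row in first]


def flatten(first):
    """Concatenate the per-component rows into one input vector for a network."""
    return [v for row in first for v in row]
```

With C mixture components and D-dimensional features, the flattened vector has C x D entries, which is the high-dimensional input that the network is then asked to reduce.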

14 Claims

1. A speaker recognition apparatus comprising:
- a computer configured to:
  
  extract audio features from a received recognition speech signal;
  
  generate first order Gaussian mixture model (GMM) statistics from the extracted audio features based on a universal background model that includes a plurality of speaker models;
  
  normalize the first order GMM statistics with regard to a duration of the received speech signal;
  
  train a deep neural network having a plurality of fully connected layers using a set of recognition speech signals; and
  
  execute the deep neural network having the plurality of fully connected layers to reduce a dimensionality of the normalized first order GMM statistics and output a voiceprint corresponding to the recognition speech signal, the fully connected layers of the deep neural network including:
  
  an input layer configured to receive the normalized first order GMM statistics;
  
  one or more sequentially arranged first hidden layers configured to receive coefficients from the input layer; and
  
  a last hidden layer arranged to receive coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers and configured to output the voiceprint corresponding to the recognition speech signal.
  - 2. The speaker recognition apparatus according to claim 1, wherein the fully connected layers of the deep neural network further include an output layer for use in a training mode of the deep neural network, the computer configured to execute the output layer to receive coefficients from the last hidden layer and to calculate a plurality of output coefficients at a respective plurality of output units that correspond to distinct speakers represented in the set of recognition speech signals used for training the deep neural network; and
    - the computer further configured to:
      
      receive the plurality of output coefficients and to calculate a loss result from the plurality of output coefficients, and
      
      lower the calculated loss result at each of a plurality of iterations by modifying one or more connection weights of the fully connected layers.
  - 3. The speaker recognition apparatus according to claim 2, wherein the computer is configured to utilize backpropagation to modify the connection weights of the fully connected layers based on the loss result during the training mode.
  - 4. The speaker recognition apparatus according to claim 2, wherein the computer is configured to utilize a categorical cross entropy function to calculate the loss result.
  - 5. The speaker recognition apparatus according to claim 1, wherein the number of the one or more first hidden layers is four.
  - 6. The speaker recognition apparatus according to claim 1, wherein for each received recognition speech signal the computer is configured to measure a duration of the received recognition speech signal and modify the first order statistics to correspond to a predetermined uniform duration.
  - 7. The speaker recognition apparatus according to claim 6, wherein the computer is configured to randomly exclude up to 90% of the first order statistics from being received by the deep neural network.
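The layer arrangement recited in claims 1 and 5 can be pictured with a small forward-pass sketch in plain Python. The ReLU activations, the linear last hidden layer, the random initialization, and every name and dimension below are assumptions chosen for illustration; the claims only fix that the layers are fully connected and that the last hidden layer is narrower than each preceding hidden layer.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create one fully connected layer as (weights, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activate=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if activate else out


def build_network(input_dim, hidden_dims, voiceprint_dim, seed=0):
    """Stack the first hidden layers and a narrower last hidden layer."""
    if any(voiceprint_dim >= h for h in hidden_dims):
        raise ValueError("last hidden layer must be smaller than each first hidden layer")
    rng = random.Random(seed)
    dims = [input_dim] + list(hidden_dims) + [voiceprint_dim]
    return [make_layer(a, b, rng) for a, b in zip(dims, dims[1:])]


def extract_voiceprint(stats, layers):
    """Run normalized statistics through the network; the last hidden layer is the output."""
    h = stats
    for layer in layers[:-1]:
        h = dense(h, layer)
    # Assumption: the last hidden layer is left linear so the voiceprint can be signed.
    return dense(h, layers[-1], activate=False)
```

For example, `build_network(12, [16, 16, 16, 16], 4)` gives four first hidden layers followed by a 4-dimensional last hidden layer, and `extract_voiceprint` then returns a 4-element vector for a 12-element input.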

8. A method of generating a speaker model, the method comprising:
- generating, by a computer, first order Gaussian mixture model (GMM) statistics from audio features extracted from a recognition speech signal, said GMM statistics being generated based on a universal background model that includes a plurality of speakers;
  
  normalizing, by the computer, the first order GMM statistics with regard to a duration of the received speech signal;
  
  training, by the computer, a deep neural network using a set of recognition speech signals; and
  
  reducing, by the computer, a dimensionality of the normalized first order GMM statistics using a plurality of fully connected feed-forward convolutional layers of the deep neural network and deriving a voiceprint corresponding to the recognition speech signal, wherein the reducing of the dimensionality of the normalized first order GMM statistics includes:
  
  receiving, by the computer, the normalized first order GMM statistics at an input layer of the plurality of fully connected feed-forward convolutional layers;
  
  receiving, by the computer, coefficients from the input layer at a first hidden layer of one or more sequentially arranged first hidden layers of the fully connected feed-forward convolutional layers, each first hidden layer receiving coefficients from a preceding layer of the plurality of fully connected feed-forward convolutional layers;
  
  receiving, by the computer at a last hidden layer, coefficients from one hidden layer of the one or more first hidden layers, the last hidden layer having a dimension smaller than each of the one or more first hidden layers; and
  
  outputting, by the computer from the last hidden layer, the voiceprint corresponding to the recognition speech signal.
  - 9. The method according to claim 8, further comprising:
    - in a training mode, receiving, by the computer at an output layer of the fully connected feed-forward convolutional layers of the deep neural network, coefficients from the last hidden layer;
      
      calculating, by the computer, a plurality of output coefficients for output at a respective plurality of output units of the output layer, the number of output units corresponding to distinct speakers represented in the set of recognition speech signals used for training the deep neural network;
      
      receiving, by the computer, the plurality of output coefficients; and
      
      calculating, by the computer, a loss result from the plurality of output coefficients, wherein the computer lowers the calculated loss result at each of a plurality of training iterations by modifying one or more connection weights of the fully connected layers.
  - 10. The method according to claim 9, further comprising:
    - performing, by the computer, backpropagation to modify connection weights of the fully connected feed-forward convolutional layers based on the loss result.
  - 11. The method according to claim 9, further comprising:
    - calculating, by the computer, the loss result utilizing a categorical cross entropy function.
  - 12. The method according to claim 8, wherein the number of the one or more first hidden layers is four.
  - 13. The method according to claim 8, wherein for each received recognition speech signal said normalizing the first order GMM statistics includes:
    - measuring, by the computer, a duration of the received recognition speech signal, and modifying, by the computer, the first order statistics to correspond to a predetermined uniform duration.
  - 14. The method according to claim 8, further comprising:
    - excluding, by the computer, a majority of the first statistics from being received by the deep neural network by using a dropout technique.
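The training-mode steps in claims 9 to 11 and the dropout step in claim 14 can be sketched as follows. This is a simplified illustration under stated assumptions: only the output layer is updated (a full implementation would backpropagate through every layer), the output coefficients are turned into probabilities with a softmax, dropout uses the common "inverted" rescaling, and all function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert output coefficients into per-speaker probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, speaker):
    """Categorical cross entropy for a single labeled utterance."""
    return -math.log(probs[speaker])


def dropout(values, rate, rng):
    """Zero each input with probability `rate`; rescale survivors (assumed inverted dropout)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def output_step(weights, bottleneck, speaker, lr):
    """One gradient step on the output-layer weights; returns the loss before the update.

    weights[k] holds the connection weights from the last hidden layer
    to the output unit for training speaker k.
    """
    logits = [sum(w * h for w, h in zip(row, bottleneck)) for row in weights]
    probs = softmax(logits)
    for k, row in enumerate(weights):
        # Gradient of softmax + cross entropy with respect to logit k.
        grad = probs[k] - (1.0 if k == speaker else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad * bottleneck[j]
    return cross_entropy(probs, speaker)
```

Calling `output_step` repeatedly on the same example lowers the loss at each iteration, which mirrors the "lowers the calculated loss result at each of a plurality of training iterations" language. A call such as `dropout(stats, 0.9, random.Random(0))` would zero roughly 90% of the input statistics before they reach the network.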

Current Assignee: Pindrop Security, Inc.
Original Assignee: Pindrop Security, Inc.
Inventors: Khoury, Elie; Garland, Matthew
Primary Examiner(s): Guerra-Erazo, Edgar X
Application Number: US15/709,232
Publication Number: US 20180082691A1
Time in Patent Office: 868 Days
CPC Class Codes

G10L 17/02   Preprocessing operations, e...

G10L 17/04   Training, enrolment or mode...

G10L 17/06   Decision making techniques;...

G10L 17/18   Artificial neural networks;...