SHARED HIDDEN LAYER COMBINATION FOR SPEECH RECOGNITION SYSTEMS
First Claim
1. A method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, a DNN feature transformation with a criterion utilizing the received at least one utterance, the DNN feature transformation comprising a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a plurality of scores from a first ASR system and a second ASR system in the plurality of ASR systems; and
combining, by the computing device, the plurality of scores from the first ASR system and the second ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation.
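Claim 1 above recites extracting the weights between the top hidden layer and a bottleneck layer and using them as a feature dimension reduction. The NumPy sketch below illustrates that step only under assumed layer sizes (a 2048-unit top hidden layer and a 39-dimensional bottleneck) and with random stand-in weights; none of the names, sizes, or values are taken from the patent itself.

```python
import numpy as np

# Illustrative sizes only; the claim does not fix these values.
TOP_HIDDEN_DIM = 2048    # units in the top hidden layer (assumed)
BOTTLENECK_DIM = 39      # low-dimension bottleneck size (assumed)

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for the top hidden layer output of a trained DNN for one
# utterance: one row per acoustic frame.
top_hidden_output = rng.standard_normal((300, TOP_HIDDEN_DIM))

# Stand-in for the weights learned between the top hidden layer and the
# bottleneck layer of the small bottleneck-plus-output network; extracting
# this matrix yields a linear feature dimension reduction.
W_bottleneck = 0.01 * rng.standard_normal((TOP_HIDDEN_DIM, BOTTLENECK_DIM))
b_bottleneck = np.zeros(BOTTLENECK_DIM)

# DNN-derived, dimension-reduced feature shared by the merged ASR systems.
dnn_derived_feature = sigmoid(top_hidden_output @ W_bottleneck + b_bottleneck)
print(dnn_derived_feature.shape)   # (300, 39)
```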
Abstract
A framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation.
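As a rough illustration of the final combination step described in the abstract, the sketch below linearly interpolates frame-level senone scores from two ASR systems that consume the same DNN-derived feature. The interpolation weight, array sizes, and variable names are assumptions made for illustration, not values from the patent.

```python
import numpy as np

def combine_scores(scores_a, scores_b, weight_a=0.5):
    """Linearly interpolate frame-level senone scores from two ASR systems.

    scores_a, scores_b: (frames, senones) arrays of log likelihood scores
    produced by back ends that consume the same DNN-derived feature.
    weight_a is an illustrative, assumed combination coefficient.
    """
    return weight_a * scores_a + (1.0 - weight_a) * scores_b

rng = np.random.default_rng(1)
scores_first = rng.standard_normal((300, 6000))    # first ASR system (synthetic)
scores_second = rng.standard_normal((300, 6000))   # second ASR system (synthetic)
merged_scores = combine_scores(scores_first, scores_second, weight_a=0.6)
```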
20 Claims
1. A method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, a DNN feature transformation with a criterion utilizing the received at least one utterance, the DNN feature transformation comprising a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a plurality of scores from a first ASR system and a second ASR system in the plurality of ASR systems; and
combining, by the computing device, the plurality of scores from the first ASR system and the second ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A speech recognition system comprising:
a DNN system for generating a DNN-derived feature;
a plurality of back end systems for utilizing the DNN-derived feature; and
a feature transformation for receiving a plurality of utterances, the feature transformation being generated by the DNN system and being shared by the plurality of back end systems, an output of the shared feature transformation being utilized by the plurality of back end systems to generate a single senone log likelihood for speech recognition.
View Dependent Claims (12, 13, 14, 15)
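Independent claim 11 above describes a single feature transformation shared by a plurality of back end systems whose outputs are merged into a single senone log likelihood. The sketch below mirrors that structure with placeholder classes; the layer sizes, senone count, linear scorers, and merge weights are assumptions for illustration, not details from the specification.

```python
import numpy as np

class SharedFeatureTransform:
    """Front end shared by every back end: acoustic frames -> DNN-derived features.

    Random weights stand in for a trained DNN plus its extracted bottleneck
    transform; all sizes are assumed.
    """
    def __init__(self, in_dim=440, hidden_dim=2048, bottleneck_dim=39, seed=0):
        rng = np.random.default_rng(seed)
        self.w_hidden = 0.01 * rng.standard_normal((in_dim, hidden_dim))
        self.w_bottleneck = 0.01 * rng.standard_normal((hidden_dim, bottleneck_dim))

    def __call__(self, frames):
        hidden = np.tanh(frames @ self.w_hidden)
        return np.tanh(hidden @ self.w_bottleneck)

class BackEnd:
    """Placeholder acoustic back end scoring senones from the shared feature."""
    def __init__(self, feat_dim=39, num_senones=6000, seed=1):
        rng = np.random.default_rng(seed)
        self.w = 0.01 * rng.standard_normal((feat_dim, num_senones))

    def senone_log_likelihood(self, features):
        # A linear scorer used purely for illustration.
        return features @ self.w

def single_senone_log_likelihood(frames, transform, back_ends, weights):
    """Run every back end on the shared feature and merge into one score matrix."""
    features = transform(frames)
    scores = [be.senone_log_likelihood(features) for be in back_ends]
    return sum(w * s for w, s in zip(weights, scores))

frames = np.random.default_rng(3).standard_normal((300, 440))
merged = single_senone_log_likelihood(
    frames, SharedFeatureTransform(), [BackEnd(seed=1), BackEnd(seed=2)], [0.6, 0.4])
```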
16. A computer-readable storage medium storing computer executable instructions which, when executed by a computer, will cause the computer to perform a method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, the method comprising:
receiving a plurality of training utterances for speech recognition;
training a DNN system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers;
generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances;
utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones;
extracting one or more weights between the top hidden layer and the low dimension hidden bottleneck layer, the one or more weights representing a feature dimension reduction;
utilizing the feature dimension reduction to train a model following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer;
generating a plurality of log likelihood scores from the DNN system and another ASR system in the plurality of ASR systems;
combining the plurality of scores from the DNN system and the another ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation; and
training senone dependent combination coefficients with the one or more of the cross entropy criterion and a sequential training criterion.
View Dependent Claims (19, 20)
18. The computer-readable storage medium of claim 18, wherein the DNN system comprises a Context Dependent-Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and the another ASR system comprises a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
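Claim 16 above recites training senone dependent combination coefficients with a cross entropy criterion. The sketch below shows one way per-senone coefficients could be fit by gradient descent on a frame-level cross entropy, using synthetic score matrices standing in for outputs of, for example, a CD-DNN-HMM and a GMM-HMM back end; the optimization details, sizes, and learning rate are illustrative assumptions rather than the patented training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
FRAMES, SENONES = 500, 100           # small, assumed sizes for illustration

# Synthetic frame-level senone log likelihoods from two systems plus
# frame-level senone labels (all placeholders).
log_lik_a = rng.standard_normal((FRAMES, SENONES))
log_lik_b = rng.standard_normal((FRAMES, SENONES))
labels = rng.integers(0, SENONES, size=FRAMES)

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

# One combination coefficient per senone, kept in (0, 1) via a sigmoid.
alpha_logit = np.zeros(SENONES)
lr = 0.5

for step in range(200):
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))                 # (SENONES,)
    combined = alpha * log_lik_a + (1.0 - alpha) * log_lik_b   # (FRAMES, SENONES)
    post = softmax(combined)
    loss = -np.mean(np.log(post[np.arange(FRAMES), labels] + 1e-12))

    # Gradient of the cross entropy w.r.t. the combined scores, then w.r.t.
    # the per-senone combination coefficients.
    grad_combined = post.copy()
    grad_combined[np.arange(FRAMES), labels] -= 1.0
    grad_combined /= FRAMES
    grad_alpha = np.sum(grad_combined * (log_lik_a - log_lik_b), axis=0)
    alpha_logit -= lr * grad_alpha * alpha * (1.0 - alpha)

print(loss)   # cross entropy after the assumed number of update steps
```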
Specification