Shared hidden layer combination for speech recognition systems
First Claim
1. A method of providing a framework for merging two or more automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system;
combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and
training, for the merged system, senone coefficient data for evaluation of spoken utterances.
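The extraction and reduction steps recited above can be sketched in a few lines: frames pass through the shared hidden layers, a bottleneck layer is appended on top, and the weight matrix between the top hidden layer and the bottleneck acts as the feature dimension reduction. All dimensions, the ReLU activation, and the random initialization below are illustrative assumptions, not details taken from the claim.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the claim).
feat_dim, hidden_dim, bottleneck_dim, n_senones = 40, 512, 64, 1000

# Shared DNN feature transformation: a stack of hidden layers.
hidden_weights = [
    rng.standard_normal((feat_dim, hidden_dim)) * 0.01,
    rng.standard_normal((hidden_dim, hidden_dim)) * 0.01,
]

def top_hidden_output(frame):
    """Forward one utterance frame through the shared hidden layers."""
    h = frame
    for W in hidden_weights:
        h = np.maximum(h @ W, 0.0)  # ReLU activation (an assumption)
    return h

# Network generated from the top hidden layer output:
# a bottleneck layer followed by an output (senone) layer.
W_bottleneck = rng.standard_normal((hidden_dim, bottleneck_dim)) * 0.01
W_output = rng.standard_normal((bottleneck_dim, n_senones)) * 0.01

def reduced_feature(frame):
    """Apply the extracted bottleneck weights as a dimension reduction:
    hidden_dim-dimensional activations -> bottleneck_dim features."""
    return top_hidden_output(frame) @ W_bottleneck

frame = rng.standard_normal(feat_dim)
print(reduced_feature(frame).shape)  # low-dimensional feature vector
```

The reduced features would then feed the models of both ASR systems, whose scores are combined to form the merged system.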
Abstract
Providing a framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation.
20 Claims
1. A method of providing a framework for merging two or more automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system;
combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and
training, for the merged system, senone coefficient data for evaluation of spoken utterances.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A system comprising:
at least one processor; and
a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system;
combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and
training, for the merged system, senone coefficient data for evaluation of spoken utterances.
- View Dependent Claims (12, 13, 14, 15)
16. A computer-readable storage device storing computer-executable instructions which, when executed by a computer, cause the computer to perform a method of providing a framework for merging systems having a shared deep neural network (DNN) feature transformation, the method comprising:
receiving a plurality of training utterances for speech recognition;
training a first system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers;
generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances;
utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones;
extracting one or more weights between the top hidden layer and the low dimension bottleneck hidden layer, the one or more weights representing a feature dimension reduction;
utilizing the feature dimension reduction to train a model for a second system following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer;
generating a first log likelihood score from the first system based on application of the feature dimension reduction to the first system and a second log likelihood score from the second system based on application of the feature dimension reduction to the model of the second system;
combining the first log likelihood score and the second log likelihood score to create a merged system from the first system and the second system, wherein the first system and the second system share the DNN feature transformation; and
training senone dependent combination coefficients from the merged system with the one or more of the cross entropy criterion and the sequential training criterion.
- View Dependent Claims (18, 19, 20)
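The score-combination step in claim 16 can be sketched as a per-senone interpolation of the two systems' log likelihood scores using senone dependent combination coefficients. The toy scores, the toy senone count, and the uniform coefficient initialization below are assumptions for illustration; the claim itself trains the coefficients with the cross entropy or sequential training criterion.

```python
import numpy as np

n_senones = 4  # toy size for illustration

# Hypothetical per-senone log likelihood scores from two systems
# that share the DNN feature transformation.
score_sys1 = np.array([-5.0, -7.2, -3.1, -9.4])
score_sys2 = np.array([-4.4, -8.0, -2.9, -10.1])

# Senone dependent combination coefficients; one weight per senone,
# initialized uniformly here (the claim trains these on the merged system).
alpha = np.full(n_senones, 0.5)

# Merged system score: per-senone weighted combination of both systems.
merged = alpha * score_sys1 + (1.0 - alpha) * score_sys2
print(merged)
```

Because `alpha` is indexed by senone, training can weight one system more heavily for senones it models well, which is the point of making the coefficients senone dependent rather than global.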
17. The computer-readable storage device of claim 16, wherein the method further comprises receiving a spoken utterance, processing the spoken utterance using the senone dependent combination coefficients, and outputting automatic speech recognition (ASR) results data based on the processing of the spoken utterance.
Specification