SHARED HIDDEN LAYER COMBINATION FOR SPEECH RECOGNITION SYSTEMS
First Claim
1. A method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, a DNN feature transformation with a criterion utilizing the received at least one utterance, the DNN feature transformation comprising a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a plurality of scores from a first ASR system and a second ASR system in the plurality of ASR systems; and
combining, by the computing device, the plurality of scores from the first ASR system and the second ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation.
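Claim 1 above recites extracting the weights between the top hidden layer and a bottleneck layer and using them as a feature dimension reduction. The NumPy sketch below illustrates that step only under assumed layer sizes (a 2048-unit top hidden layer and a 39-dimensional bottleneck) and with random stand-in weights; none of the names, sizes, or values are taken from the patent itself.

```python
import numpy as np

# Illustrative sizes only; the claim does not fix these values.
TOP_HIDDEN_DIM = 2048    # units in the top hidden layer (assumed)
BOTTLENECK_DIM = 39      # low-dimension bottleneck size (assumed)

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in for the top hidden layer output of a trained DNN for one
# utterance: one row per acoustic frame.
top_hidden_output = rng.standard_normal((300, TOP_HIDDEN_DIM))

# Stand-in for the weights learned between the top hidden layer and the
# bottleneck layer of the small bottleneck-plus-output network; extracting
# this matrix yields a linear feature dimension reduction.
W_bottleneck = 0.01 * rng.standard_normal((TOP_HIDDEN_DIM, BOTTLENECK_DIM))
b_bottleneck = np.zeros(BOTTLENECK_DIM)

# DNN-derived, dimension-reduced feature shared by the merged ASR systems.
dnn_derived_feature = sigmoid(top_hidden_output @ W_bottleneck + b_bottleneck)
print(dnn_derived_feature.shape)   # (300, 39)
```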
Abstract
A framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation.
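As a rough illustration of the final combination step described in the abstract, the sketch below linearly interpolates frame-level senone scores from two ASR systems that consume the same DNN-derived feature. The interpolation weight, array sizes, and variable names are assumptions made for illustration, not values from the patent.

```python
import numpy as np

def combine_scores(scores_a, scores_b, weight_a=0.5):
    """Linearly interpolate frame-level senone scores from two ASR systems.

    scores_a, scores_b: (frames, senones) arrays of log likelihood scores
    produced by back ends that consume the same DNN-derived feature.
    weight_a is an illustrative, assumed combination coefficient.
    """
    return weight_a * scores_a + (1.0 - weight_a) * scores_b

rng = np.random.default_rng(1)
scores_first = rng.standard_normal((300, 6000))    # first ASR system (synthetic)
scores_second = rng.standard_normal((300, 6000))   # second ASR system (synthetic)
merged_scores = combine_scores(scores_first, scores_second, weight_a=0.6)
```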
20 Claims
1. A method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
receiving, by a computing device, at least one utterance;
training, by the computing device, a DNN feature transformation with a criterion utilizing the received at least one utterance, the DNN feature transformation comprising a plurality of hidden layers;
generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
generating, by the computing device, a plurality of scores from a first ASR system and a second ASR system in the plurality of ASR systems; and
combining, by the computing device, the plurality of scores from the first ASR system and the second ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A speech recognition system comprising:
a DNN system for generating a DNN-derived feature;
a plurality of back end systems for utilizing the DNN-derived feature; and
a feature transformation for receiving a plurality of utterances, the feature transformation being generated by the DNN system and being shared by the plurality of back end systems, an output of the shared feature transformation being utilized by the plurality of back end systems to generate a single senone log likelihood for speech recognition.
View Dependent Claims (12, 13, 14, 15)
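Independent claim 11 above describes a single feature transformation shared by a plurality of back end systems whose outputs are merged into a single senone log likelihood. The sketch below mirrors that structure with placeholder classes; the layer sizes, senone count, linear scorers, and merge weights are assumptions for illustration, not details from the specification.

```python
import numpy as np

class SharedFeatureTransform:
    """Front end shared by every back end: acoustic frames -> DNN-derived features.

    Random weights stand in for a trained DNN plus its extracted bottleneck
    transform; all sizes are assumed.
    """
    def __init__(self, in_dim=440, hidden_dim=2048, bottleneck_dim=39, seed=0):
        rng = np.random.default_rng(seed)
        self.w_hidden = 0.01 * rng.standard_normal((in_dim, hidden_dim))
        self.w_bottleneck = 0.01 * rng.standard_normal((hidden_dim, bottleneck_dim))

    def __call__(self, frames):
        hidden = np.tanh(frames @ self.w_hidden)
        return np.tanh(hidden @ self.w_bottleneck)

class BackEnd:
    """Placeholder acoustic back end scoring senones from the shared feature."""
    def __init__(self, feat_dim=39, num_senones=6000, seed=1):
        rng = np.random.default_rng(seed)
        self.w = 0.01 * rng.standard_normal((feat_dim, num_senones))

    def senone_log_likelihood(self, features):
        # A linear scorer used purely for illustration.
        return features @ self.w

def single_senone_log_likelihood(frames, transform, back_ends, weights):
    """Run every back end on the shared feature and merge into one score matrix."""
    features = transform(frames)
    scores = [be.senone_log_likelihood(features) for be in back_ends]
    return sum(w * s for w, s in zip(weights, scores))

frames = np.random.default_rng(3).standard_normal((300, 440))
merged = single_senone_log_likelihood(
    frames, SharedFeatureTransform(), [BackEnd(seed=1), BackEnd(seed=2)], [0.6, 0.4])
```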
16. A computer-readable storage medium storing computer executable instructions which, when executed by a computer, will cause the computer to perform a method of providing a framework for merging a plurality of automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, the method comprising:
receiving a plurality of training utterances for speech recognition;
training a DNN system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers;
generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances;
utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones;
extracting one or more weights between the top hidden layer and the low dimension hidden bottleneck layer, the one or more weights representing a feature dimension reduction;
utilizing the feature dimension reduction to train a model following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer;
generating a plurality of log likelihood scores from the DNN system and another ASR system in the plurality of ASR systems;
combining the plurality of scores from the DNN system and the another ASR system to merge the plurality of ASR systems, the plurality of ASR systems sharing the DNN feature transformation; and
training senone dependent combination coefficients with the one or more of the cross entropy criterion and a sequential training criterion.
View Dependent Claims (19, 20)
18. The computer-readable storage medium of claim 18, wherein the DNN system comprises a Context Dependent-Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and the another ASR system comprises a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
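Claim 16 above recites training senone dependent combination coefficients with a cross entropy criterion. The sketch below shows one way per-senone coefficients could be fit by gradient descent on a frame-level cross entropy, using synthetic score matrices standing in for outputs of, for example, a CD-DNN-HMM and a GMM-HMM back end; the optimization details, sizes, and learning rate are illustrative assumptions rather than the patented training procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
FRAMES, SENONES = 500, 100           # small, assumed sizes for illustration

# Synthetic frame-level senone log likelihoods from two systems plus
# frame-level senone labels (all placeholders).
log_lik_a = rng.standard_normal((FRAMES, SENONES))
log_lik_b = rng.standard_normal((FRAMES, SENONES))
labels = rng.integers(0, SENONES, size=FRAMES)

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

# One combination coefficient per senone, kept in (0, 1) via a sigmoid.
alpha_logit = np.zeros(SENONES)
lr = 0.5

for step in range(200):
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))                 # (SENONES,)
    combined = alpha * log_lik_a + (1.0 - alpha) * log_lik_b   # (FRAMES, SENONES)
    post = softmax(combined)
    loss = -np.mean(np.log(post[np.arange(FRAMES), labels] + 1e-12))

    # Gradient of the cross entropy w.r.t. the combined scores, then w.r.t.
    # the per-senone combination coefficients.
    grad_combined = post.copy()
    grad_combined[np.arange(FRAMES), labels] -= 1.0
    grad_combined /= FRAMES
    grad_alpha = np.sum(grad_combined * (log_lik_a - log_lik_b), axis=0)
    alpha_logit -= lr * grad_alpha * alpha * (1.0 - alpha)

print(loss)   # cross entropy after the assumed number of update steps
```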
Specification