Shared hidden layer combination for speech recognition systems

US 9,520,127 B2
Filed: 04/29/2014
Issued: 12/13/2016
Est. Priority Date: 04/29/2014
Status: Active Grant

First Claim

Patent Images

1. A method of providing a framework for merging two or more automatic speech recognition (ASR) system having a shared deep neural network (DNN) feature transformation, comprising:

receiving, by a computing device, at least one utterance;

training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprising a plurality of hidden layers;

generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;

utilizing, by the computing device, the top hidden later output to generate a network comprising a bottleneck layer and an output layer;

extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;

generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR;

combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and

training, for the merged system, senone coefficient data for evaluation of spoken utterances.

View all claims

2 Assignments

0 Petitions

Abstract

A framework is provided for merging automatic speech recognition (ASR) systems that share a deep neural network (DNN) feature transformation. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be used to generate a network including a bottleneck layer and an output layer. Weights between the top hidden layer and the bottleneck layer, which represent a feature dimension reduction, may then be extracted. Scores may then be generated and combined to merge the ASR systems that share the DNN feature transformation.
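The feature path described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the layer sizes, the sigmoid activation, the random untrained weights, and the names `top_hidden_output` and `reduce_feature` are all assumptions chosen only to show how the weights between the top hidden layer and the bottleneck layer act as a dimension-reducing transform.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def make_matrix(rows, cols, rng):
    """Small random matrix; stands in for trained weights (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM, BOTTLENECK_DIM = 8, 16, 4  # toy sizes, not from the source

# Shared DNN feature transformation: a stack of hidden layers.
hidden_layers = [
    (make_matrix(HIDDEN_DIM, INPUT_DIM, rng), [0.0] * HIDDEN_DIM),
    (make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng), [0.0] * HIDDEN_DIM),
]

# Weights between the top hidden layer and the bottleneck layer.
bottleneck_w = make_matrix(BOTTLENECK_DIM, HIDDEN_DIM, rng)
bottleneck_b = [0.0] * BOTTLENECK_DIM

def top_hidden_output(frame):
    """Propagate one acoustic frame through every hidden layer."""
    h = frame
    for w, b in hidden_layers:
        h = [sigmoid(z) for z in affine(w, b, h)]
    return h

def reduce_feature(frame):
    """Apply the extracted weights as a feature dimension reduction."""
    return affine(bottleneck_w, bottleneck_b, top_hidden_output(frame))

frame = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
top = top_hidden_output(frame)
reduced = reduce_feature(frame)
print(len(top), len(reduced))
```

In this sketch both recognizers would consume `reduced`, so the hidden-layer stack is computed once per frame and shared.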

90 Citations

20 Claims

1. A method of providing a framework for merging two or more automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation, comprising:
- receiving, by a computing device, at least one utterance;
  
  training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers;
  
  generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
  
  utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
  
  extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
  
  generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system;
  
  combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and
  
  training, for the merged system, senone coefficient data for evaluation of spoken utterances.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1, further comprising receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.
  - 3. The method of claim 2, wherein the senone coefficient data is used to evaluate the spoken utterance to determine ASR results.
  - 4. The method of claim 1, wherein receiving, by a computing device, at least one utterance comprises receiving a plurality of training utterances for speech recognition.
  - 5. The method of claim 1, wherein the training of the at least one utterance comprises:
    - training the first ASR system with a cross entropy criterion, the first ASR system comprising a DNN system; and
      
      deriving the DNN feature transformation from a top hidden layer of the DNN system.
  - 6. The method of claim 1, wherein the training of the at least one utterance comprises:
    - training the first ASR system with a sequential training criterion, the first ASR system comprising a DNN system; and
      
      deriving the DNN feature transformation from a top hidden layer of the DNN system.
  - 7. The method of claim 1, wherein utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer comprises generating a network comprising a low dimension bottleneck hidden layer and a plurality of senones.
  - 8. The method of claim 1, wherein generating, by the computing device, the first score and the second score comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
  - 9. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a linear combination of the first score from the first ASR system and the second score from the second ASR system.
  - 10. The method of claim 1, wherein combining, by the computing device, the first score and the second score comprises performing a non-linear combination of the first score from the first ASR system and the second score from the second ASR system.
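Claims 8 to 10 refer to linear and non-linear combinations of log likelihood scores without fixing a formula. The sketch below shows one possible form of each, under stated assumptions: a single interpolation weight, and a log-sum-exp mixture as the non-linear variant. The function names and the example scores are hypothetical and do not come from the patent.

```python
import math

def linear_combination(score_a, score_b, weight):
    """Weighted sum of two log likelihood scores (log-domain interpolation)."""
    return weight * score_a + (1.0 - weight) * score_b

def log_sum_exp_combination(score_a, score_b, weight):
    """One possible non-linear combination (assumption):
    log(weight * exp(score_a) + (1 - weight) * exp(score_b)),
    computed in a numerically stable way."""
    terms = []
    if weight > 0.0:
        terms.append(math.log(weight) + score_a)
    if weight < 1.0:
        terms.append(math.log(1.0 - weight) + score_b)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical per-senone log likelihoods from two recognizers for one frame.
dnn_scores = {"senone_a": -1.2, "senone_b": -3.5}
gmm_scores = {"senone_a": -2.0, "senone_b": -2.5}

linear = {s: linear_combination(dnn_scores[s], gmm_scores[s], 0.5)
          for s in dnn_scores}
nonlinear = {s: log_sum_exp_combination(dnn_scores[s], gmm_scores[s], 0.5)
             for s in dnn_scores}
print(linear)
print(nonlinear)
```

The linear form averages in the log domain, while the log-sum-exp form averages the likelihoods themselves, so it is pulled toward whichever system scores a senone higher.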

11. A system comprising:
- at least one processor; and
  
  a memory operatively connected with the at least one processor, wherein the memory stores computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute a method that comprises:
  
  receiving, by a computing device, at least one utterance;
  
  training, by the computing device, the at least one utterance using a DNN feature transformation with a criterion, wherein the DNN feature transformation comprises a plurality of hidden layers;
  
  generating, by the computing device, an output from a top hidden layer in the plurality of hidden layers for the at least one utterance;
  
  utilizing, by the computing device, the top hidden layer output to generate a network comprising a bottleneck layer and an output layer;
  
  extracting, by the computing device, one or more weights between the top hidden layer and the bottleneck layer, the one or more weights representing a feature dimension reduction;
  
  generating, by the computing device, a first score from a first ASR system based on application of the feature dimension reduction to a model of the first ASR system and generating a second score from a second ASR system based on application of the feature dimension reduction to a model of the second ASR system;
  
  combining, by the computing device, the first score and the second score to merge the first ASR system and the second ASR system to create a merged system; and
  
  training, for the merged system, senone coefficient data for evaluation of spoken utterances.
- Dependent claims (12, 13, 14, 15):
  - 12. The system according to claim 11, wherein the method, executed by the at least one processor, further comprises receiving a spoken utterance, and executing ASR recognition for the spoken utterance using the merged system.
  - 13. The system according to claim 11, wherein the training of the at least one utterance comprises:
    - training the first ASR system with at least one of a cross entropy criterion and a sequential training criterion, and deriving the DNN feature transformation from a top hidden layer of a DNN system.
  - 14. The system according to claim 11, wherein the generating of the first score and the second score further comprises generating log likelihood scores from a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
  - 15. The system according to claim 11, wherein the combining of the first score and the second score occurs by executing at least one selected from a group consisting of:
    - performing a non-linear combination of the first score and the second score, and performing a linear combination of the first score and the second score.

16. A computer-readable storage device storing computer executable instructions which, when executed by a computer, cause the computer to perform a method of providing a framework for merging systems having a shared deep neural network (DNN) feature transformation, the method comprising:
- receiving a plurality of training utterances for speech recognition;
  
  training a first system with one or more of a cross entropy criterion and a sequential training criterion utilizing the plurality of training utterances, the DNN feature transformation comprising a plurality of hidden layers;
  
  generating an output from a top hidden layer in the plurality of hidden layers for the plurality of training utterances;
  
  utilizing the top hidden layer output to generate a network comprising a low dimension bottleneck hidden layer and a plurality of senones;
  
  extracting one or more weights between the top hidden layer and the low dimension bottleneck hidden layer, the one or more weights representing a feature dimension reduction;
  
  utilizing the feature dimension reduction to train a model for a second system following the extraction of the one or more weights between the top hidden layer and the low dimension bottleneck hidden layer;
  
  generating a first log likelihood score from the first system based on application of the feature dimension reduction to the first system and a second log likelihood score from the second system based on application of the feature dimension reduction to the model of the second system;
  
  combining the first log likelihood score and the second log likelihood score to create a merged system from the first system and the second system, wherein the first system and the second system share the DNN feature transformation; and
  
  training senone dependent combination coefficients from the merged system with the one or more of the cross entropy criterion and the sequential training criterion.
- Dependent claims (18, 19, 20):
  - 18. The computer-readable storage device of claim 16, wherein the first system is a Context Dependent Deep Neural Network-Hidden Markov Model (CD-DNN-HMM) system and the second system is a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) system.
  - 19. The computer-readable storage device of claim 16, wherein combining of the first log likelihood score and the second log likelihood score comprises performing a linear combination of the first log likelihood score from the first system and the second log likelihood score from the second system.
  - 20. The computer-readable storage device of claim 16, wherein combining of the first log likelihood score and the second log likelihood score comprises performing a non-linear combination of the first log likelihood score from the first system and the second log likelihood score from the second system.

17. The computer-readable storage medium, wherein the method further comprises receiving a spoken utterance, processing the spoken utterance using the senone dependent combination coefficients, and outputting automatic speech recognition (ASR) results data based on the processing of the spoken utterance.
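Claims 16 and 17 mention senone dependent combination coefficients trained with a cross entropy criterion. As a rough illustration only, the sketch below assumes one coefficient per senone that interpolates the two systems' log likelihoods, a softmax over the combined scores, and plain gradient descent on frame-level cross entropy. This parameterization, the toy data, and every name in the code are assumptions, not details taken from the patent.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(alphas, first, second):
    """Per-senone interpolation: alpha * first + (1 - alpha) * second."""
    return [a * f + (1.0 - a) * g for a, f, g in zip(alphas, first, second)]

def cross_entropy(alphas, frames):
    """Mean negative log posterior of the labeled senone."""
    loss = 0.0
    for first, second, label in frames:
        loss -= math.log(softmax(combine(alphas, first, second))[label])
    return loss / len(frames)

def train(frames, num_senones, learning_rate=0.1, epochs=200):
    """Gradient descent on the coefficients (left unconstrained here)."""
    alphas = [0.5] * num_senones
    for _ in range(epochs):
        grads = [0.0] * num_senones
        for first, second, label in frames:
            post = softmax(combine(alphas, first, second))
            for s in range(num_senones):
                target = 1.0 if s == label else 0.0
                # d(loss)/d(alpha_s) = (posterior - target) * (first - second)
                grads[s] += (post[s] - target) * (first[s] - second[s])
        alphas = [a - learning_rate * g / len(frames)
                  for a, g in zip(alphas, grads)]
    return alphas

# Toy frames: (first-system log likelihoods, second-system log likelihoods, label)
frames = [
    ([-0.5, -2.0, -3.0], [-1.5, -1.0, -2.5], 0),
    ([-2.5, -0.7, -2.0], [-1.0, -1.2, -3.0], 1),
    ([-3.0, -2.0, -0.4], [-2.0, -2.5, -1.5], 2),
]
initial_loss = cross_entropy([0.5] * 3, frames)
alphas = train(frames, num_senones=3)
final_loss = cross_entropy(alphas, frames)
print(initial_loss, final_loss)
```

At recognition time, the trained `alphas` would be applied through `combine` to the two systems' scores for each frame of a spoken utterance before decoding.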

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Li, Jinyu; Xue, Jian; Gong, Yifan
Primary Examiner(s): Hudspeth, David
Assistant Examiner(s): OGUNBIYI, OLUWADAMILOL M
Application Number: US14/265,110
Publication Number: US 20150310858A1
Time in Patent Office: 959 Days
Field of Search: 704/232
US Class Current: 1/1
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/063   Training
G10L 15/16   using artificial neural net...
G10L 15/32   Multiple recognisers used i...
G10L 2015/025   Phonemes, fenemes or fenone...
G10L 25/30   using neural networks
