FULL-SEQUENCE TRAINING OF DEEP STRUCTURES FOR SPEECH RECOGNITION
First Claim
1. A method comprising the following computer-executable acts:
- causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto, transition probabilities between states, and language model scores; and
- jointly optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model.
Abstract
A method is disclosed herein that includes an act of causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto, transition probabilities between states, and language model scores. The method can further include the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using an optimization criterion that is based on a sequence rather than a set of unrelated frames.
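One way to make the sequence-level criterion concrete (in our own CRF-style notation, not language taken from the patent): with top-layer activations h(v_t) for frame v_t, per-state weight vectors λ_s, transition scores γ, and partition function Z, the log conditional probability of a full state sequence ℓ_{1:T} given the observations v_{1:T} can be written as

\[
\log p(\ell_{1:T} \mid v_{1:T}) = \sum_{t=1}^{T} \Bigl( \gamma_{\ell_{t-1},\,\ell_t} + \lambda_{\ell_t}^{\top} h(v_t) \Bigr) - \log Z(v_{1:T})
\]

Because the frame scores, the transition scores, and (after interpolation with the language model scores) the word-level scores all enter one objective, a gradient step on this log-probability updates all three parameter groups together, which is what distinguishes full-sequence training from frame-by-frame training.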
20 Claims
1. A method comprising the following computer-executable acts:

- causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model comprises a plurality of layers with weights assigned thereto, transition probabilities between states, and language model scores; and
- jointly optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model.

Dependent claims: 2-10.
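A minimal sketch of what claim 1's joint optimization could look like in code, assuming a PyTorch-style setup (the network shape, the single learned language-model scale, and the unnormalized sequence score are all our assumptions, not the patent's; a full criterion would also subtract the log partition function computed with the forward algorithm):

```python
import torch

n_states, n_hidden, n_feats = 42, 256, 39   # toy sizes, chosen arbitrarily

# "plurality of layers with weights assigned thereto"
net = torch.nn.Sequential(
    torch.nn.Linear(n_feats, n_hidden), torch.nn.Sigmoid(),
    torch.nn.Linear(n_hidden, n_states),
)
# "transition probabilities between states" (kept as unnormalized scores)
trans = torch.nn.Parameter(torch.zeros(n_states, n_states))
# "language model scores" (here, a single learned scale on fixed LM log-probs)
lm_scale = torch.nn.Parameter(torch.tensor(1.0))

# One optimizer over all three parameter groups: the joint part of the claim.
opt = torch.optim.SGD(list(net.parameters()) + [trans, lm_scale], lr=1e-3)

def sequence_loss(feats, states, lm_logprob):
    """Negative unnormalized log-score of one state sequence.
    A real criterion would subtract log Z from the forward algorithm."""
    emis = net(feats)                                   # (T, n_states)
    score = emis[torch.arange(len(states)), states].sum()   # frame scores
    score = score + trans[states[:-1], states[1:]].sum()    # transition scores
    score = score + lm_scale * lm_logprob                   # LM contribution
    return -score

feats = torch.randn(100, n_feats)            # toy 100-frame utterance
states = torch.randint(n_states, (100,))     # toy state alignment
loss = sequence_loss(feats, states, torch.tensor(-17.3))
loss.backward()                              # gradients reach all 3 groups
opt.step()
```

The design point the claim turns on is visible in the optimizer construction: one parameter list spanning network weights, transition scores, and language model scores, so a single sequence-level loss drives every update.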
11. A computer-implemented system comprising:
- a processor; and
- a memory that comprises a plurality of components that are executable by the processor, the components comprising:
  - a receiver component that receives a pretrained deep-structured model, wherein the deep-structured model comprises a plurality of layers, weights between the layers, transition parameters, and language model scores; and
  - a trainer component that jointly substantially optimizes weights of the pretrained deep-structured model, state transition parameters of the pretrained deep-structured model, and language model scores of the pretrained deep-structured model.

Dependent claims: 12-19.
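Read structurally, claim 11 separates the component that holds the pretrained model from the component that updates it. A hypothetical sketch of that decomposition (class and field names are ours, not the patent's):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class PretrainedDeepModel:
    layer_weights: List[Any]   # "weights between the layers"
    transitions: Any           # "transition parameters"
    lm_scores: Any             # "language model scores"

class ReceiverComponent:
    def receive(self, source) -> PretrainedDeepModel:
        # In practice: deserialize a pretrained model from memory or disk.
        return source

class TrainerComponent:
    def train(self, model: PretrainedDeepModel, utterances) -> PretrainedDeepModel:
        for utt in utterances:
            # One sequence-level loss per utterance; its gradient updates
            # all three parameter groups together (see the sketch after claim 1).
            pass
        return model
```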
20. A non-transitory computer-readable medium comprising instructions that, when executed by a processor, cause the processor to perform acts comprising:
- greedily learning each layer of a deep belief network (DBN) that is configured for employment in an automatic speech recognition (ASR) system, wherein the DBN is temporally parameter-tied;
- providing training data to the DBN to optimize the log of the conditional probabilities of output states of the DBN; and
- jointly optimizing weights in the DBN, transition probabilities in a top layer of the DBN, and language model scores in the DBN based at least in part upon the log of the conditional probabilities of output sequences produced by the DBN.
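The first act, greedy layer-wise learning, is commonly realized by stacking restricted Boltzmann machines, each trained on the activations of the layer below. A rough numpy sketch under that assumption (CD-1 contrastive divergence; sizes, learning rate, and epoch count are arbitrary; the same weights apply to every frame, consistent with temporal parameter tying):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train one RBM with one-step contrastive divergence (CD-1)."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_h)
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step back down and up.
        v_recon = sigmoid(h_samp @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # Approximate log-likelihood gradient.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h

def pretrain_dbn(data, layer_sizes):
    """Greedy pretraining: each RBM sees the hidden activations of the
    layer below, so layers are learned one at a time, never jointly."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(x, n_hidden)
        weights.append((W, b_h))
        x = sigmoid(x @ W + b_h)       # feed activations upward
    return weights

features = rng.random((256, 39))        # toy 39-dim acoustic frames
stack = pretrain_dbn(features, [128, 128, 128])
```

After this stage, the stacked weights initialize the deep network; the second and third acts then fine-tune everything against the sequence-level conditional log-probability, per the joint-optimization sketch after claim 1.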