Full-sequence training of deep structures for speech recognition
First Claim
1. A method comprising the following computer-executable acts:
- accessing a deep belief network (DBN) retained in computer-readable data storage, wherein the DBN comprises:
a plurality of stacked hidden layers, each hidden layer comprising a respective plurality of stochastic units, each stochastic unit in each layer connected to stochastic units in an adjacent hidden layer of the DBN by way of connections, the connections assigned weights learned during a pretraining procedure; and
a linear-chain conditional random field (CRF), the CRF comprising:
a hidden layer that comprises a plurality of stochastic units; and
a plurality of output units that are representative of output states, each state in the output states being one of a phone or senone, the plurality of stochastic units connected to the plurality of output units by way of second connections, the second connections having weights learned during the pretraining procedure, the output units having transition probabilities corresponding thereto that are indicative of probabilities of transitioning between output states represented by the output units; and
jointly optimizing the weights assigned to the connections, the weights assigned to the second connections, the transition probabilities, and language model scores of the DBN based upon training data, wherein a processor performs the jointly optimizing of the weights.
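The sequence-level criterion behind this joint optimization can be made concrete: with per-frame unary scores produced through the second connections and transition scores between output states (phones or senones), the conditional log-likelihood of a label sequence under the linear-chain CRF is the reference-path score minus the log partition function, computed with the forward algorithm. A minimal NumPy sketch, with illustrative function names, shapes, and scores (not taken from the patent):

```python
import numpy as np

def logsumexp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def sequence_log_likelihood(unary, trans, labels):
    """log P(labels | input) under a linear-chain CRF.

    unary:  (T, S) per-frame scores for each output state, e.g. top-layer
            activations passed through the "second connection" weights.
    trans:  (S, S) transition scores between output states.
    labels: (T,) reference sequence of output states (phones/senones).
    """
    T = unary.shape[0]
    # Score of the reference path: unary terms plus transition terms.
    path = unary[np.arange(T), labels].sum() + trans[labels[:-1], labels[1:]].sum()
    # Log partition function over all label sequences (forward algorithm).
    alpha = unary[0].copy()
    for t in range(1, T):
        alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
    return path - logsumexp(alpha, axis=0)
```

Joint optimization in the claim then amounts to ascending this quantity with respect to the connection weights (through `unary`), the transition probabilities, and the language model scores together, rather than optimizing a frame-by-frame criterion.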
Abstract
A method includes an act of causing a processor to access a deep-structured model retained in a computer-readable medium, wherein the deep-structured model includes a plurality of layers with respective weights assigned to the plurality of layers, transition probabilities between states, and language model scores. The method further includes the act of jointly substantially optimizing the weights, the transition probabilities, and the language model scores of the deep-structured model using an optimization criterion that is based on a sequence rather than a set of unrelated frames.
20 Claims
1. A method comprising the computer-executable acts set forth above as the First Claim. (Dependent claims: 2, 3, 4, 5, 6, 7.)
8. A computer-implemented system comprising:
a processor; and
a memory that comprises a plurality of components that are executable by the processor, the components comprising:
a receiver component that receives a pretrained deep belief network (DBN), wherein the DBN comprises a plurality of hidden layers, weights between the hidden layers, a linear conditional random field (CRF) that comprises output units that each represent possible output states, transition probabilities between output units, and language model scores, the transition probabilities representative of probabilities of transitioning between output states represented by the output units, each output state being one of a phone or senone; and
a trainer component that jointly optimizes weights of the pretrained DBN, the transition probabilities of the pretrained DBN, and language model scores of the pretrained DBN based upon a set of training data.
(Dependent claims: 9, 10, 11, 12, 13, 14, 15.)
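The joint optimization the trainer component performs needs gradients of the sequence log-likelihood with respect to both the per-frame unary scores (which backpropagate into the DBN weights) and the transition scores. For a linear-chain CRF these come from the forward-backward algorithm, in the familiar "empirical counts minus expected counts" form. A hedged, self-contained NumPy sketch with illustrative names and shapes:

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def crf_gradients(unary, trans, labels):
    """Gradients of log P(labels | input) w.r.t. the unary scores and the
    transition scores of a linear-chain CRF, via forward-backward.
    The unary gradient would be backpropagated into the DBN weights;
    the transition gradient updates the transition scores."""
    T, S = unary.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + logsumexp(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(trans + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[-1], axis=0)

    # Unary gradient: indicator of the reference label minus the
    # posterior state probability P(y_t = s | input).
    g_unary = -np.exp(alpha + beta - log_Z)
    g_unary[np.arange(T), labels] += 1.0

    # Transition gradient: reference bigram counts minus expected
    # pairwise posteriors P(y_t = i, y_{t+1} = j | input).
    g_trans = np.zeros((S, S))
    for t in range(T - 1):
        g_trans -= np.exp(alpha[t][:, None] + trans
                          + (unary[t + 1] + beta[t + 1])[None, :] - log_Z)
        g_trans[labels[t], labels[t + 1]] += 1.0
    return g_unary, g_trans
```

A trainer component in the sense of the claim would apply these gradients in the same update step, so that the hidden-layer weights and the transition probabilities are optimized against the one sequence-level criterion.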
16. A computing device comprising a computer-readable medium, the computer-readable medium comprising instructions that, when executed by a processor, cause the processor to perform acts comprising:
greedily learning weights between hidden layers of a deep belief network (DBN) that is configured for employment in an automatic speech recognition (ASR) system, wherein the DBN is temporally parameter-tied and an uppermost layer in the DBN is a linear-chain conditional random field (CRF), the linear-chain CRF comprising a plurality of output units that are representative of respective output states, each output state being one of a phone or senone;
providing training data to the DBN to optimize a log of conditional probabilities of output sequences of the DBN, an output sequence comprising a sequence of output states represented by the output units; and
jointly optimizing the weights between the hidden layers of the DBN, transition probabilities between the output units in the CRF, and language model scores in the DBN based upon the log of the conditional probabilities of output sequences produced by the DBN.
(Dependent claims: 17, 18, 19, 20.)
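The "greedily learning weights" act is conventionally realized as layer-wise pretraining: each adjacent pair of layers is trained as a restricted Boltzmann machine with contrastive divergence, and the trained layer's hidden probabilities become the next layer's input. A minimal sketch under those assumptions (CD-1, binary units, biases omitted for brevity; layer sizes and learning rate are illustrative, not from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.05):
    """One-step contrastive divergence (CD-1) for a binary RBM."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                         # positive phase
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h_sample @ W.T)                 # reconstruction
        h1 = sigmoid(v1 @ W)                         # negative phase
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W

def greedy_pretrain(data, layer_sizes):
    """Stack RBMs greedily: each layer's hidden probabilities
    feed the next layer as its visible data."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)
    return weights
```

The weights returned this way would serve only as the initialization; the claim's final act then fine-tunes them jointly with the CRF transition scores and language model scores against the log of the conditional probabilities of output sequences.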