USER SPECIFIED KEYWORD SPOTTING USING LONG SHORT TERM MEMORY NEURAL NETWORK FEATURE EXTRACTOR
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for recognizing keywords using a long short term memory neural network. One of the methods includes receiving, by a device for each of multiple variable length enrollment audio signals, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, processing each of the plurality of enrollment feature vectors using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector, and generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether another audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal.
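The template-generation step described in the abstract can be illustrated with a small sketch. It is not the patented implementation. The LSTM is not implemented here: its per-frame output vectors are assumed to be already available as lists of floats. The function names `build_template` and `cosine_similarity` are hypothetical, as are three choices the abstract does not specify: keeping the last k output vectors, combining them by concatenation, and zero-padding short utterances so every template has the same length k * dim. The cosine comparison is likewise only one plausible way to use a template.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def build_template(lstm_outputs: Sequence[Sequence[float]], k: int, dim: int) -> Vector:
    """Combine at most k LSTM output vectors into one vector of length k * dim."""
    if k <= 0 or dim <= 0:
        raise ValueError("k and dim must be positive")
    # Assumption: keep the k most recent output vectors (fewer if the
    # utterance produced fewer than k frames).
    selected = list(lstm_outputs)[-k:]
    template: Vector = []
    for vec in selected:
        if len(vec) != dim:
            raise ValueError("every LSTM output vector must have length dim")
        # Assumption: "combining" is modelled as plain concatenation.
        template.extend(float(x) for x in vec)
    # Assumption: zero-pad at the end so all templates share one length.
    template.extend([0.0] * (k * dim - len(template)))
    return template


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical comparison between a stored template and a new utterance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two enrollment utterances of different lengths (1 frame and 4 frames),
# each frame already mapped to a 2-dimensional LSTM output vector.
short_utterance = [[1.0, 2.0]]
long_utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

short_template = build_template(short_utterance, k=3, dim=2)
long_template = build_template(long_utterance, k=3, dim=2)
```

Both templates have length 6 even though the utterances differ in length, which is the fixed-length property the abstract relies on. A device could then compare a template built from a new audio signal against the stored ones, for example with `cosine_similarity` and a tuned threshold.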
20 Claims
1. A method comprising:
receiving, by a device for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the device determines that another audio signal encodes another spoken utterance of the enrollment phrase, the device performs a particular action assigned to the enrollment phrase; and
for each of the multiple variable length enrollment audio signals:
processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the enrollment LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
Dependent claims: 2, 3, 4, 5, 6.
7. A system comprising:
a computer and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, by the computer for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the computer determines that another audio signal encodes another spoken utterance of the enrollment phrase, the computer performs a particular action assigned to the enrollment phrase; and
for each of the multiple variable length enrollment audio signals:
processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
Dependent claims: 8, 9, 10, 11, 12, 13, 14.
15. A computer-readable medium storing software comprising instructions executable by a computer which, upon such execution, cause the computer to perform operations comprising:
receiving, by the computer for each of multiple variable length enrollment audio signals each encoding a respective spoken utterance of an enrollment phrase, a respective plurality of enrollment feature vectors that represent features of the respective variable length enrollment audio signal, wherein when the computer determines that another audio signal encodes another spoken utterance of the enrollment phrase, the computer performs a particular action assigned to the enrollment phrase; and
for each of the multiple variable length enrollment audio signals:
processing each of the plurality of enrollment feature vectors for the respective variable length enrollment audio signal using a long short term memory (LSTM) neural network to generate a respective enrollment LSTM output vector for each enrollment feature vector; and
generating, for the respective variable length enrollment audio signal, a template fixed length representation for use in determining whether the other audio signal encodes another spoken utterance of the enrollment phrase by combining at most a quantity k of the LSTM output vectors for the enrollment audio signal, wherein a predetermined length of each of the template fixed length representations is the same.
Dependent claims: 16, 17, 18, 19, 20.
Specification