Deep learning internal state index-based search and classification
1 Assignment
0 Petitions
Abstract
Systems and methods are disclosed for generating internal state representations of a neural network during processing and using those representations for classification or search. In some embodiments, the internal state representations are generated from the activation outputs of a subset of nodes of the neural network. The internal state representations may be used for classification by training a classification model on internal state representations and their corresponding classifications. They may be used for search by producing a search feature from a search input and comparing the search feature with one or more stored feature representations to find the one with the highest degree of similarity.
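The core capture step described in the abstract and claims, quantizing activation outputs of a monitored node subset into a feature vector per timestep, can be sketched as follows. The patent does not specify a quantization scheme; the binary thresholding and the 0.5 threshold below are illustrative assumptions only.

```python
def quantize_activations(activations, threshold=0.5):
    """Quantize a node subset's activation outputs into a binary feature vector.

    Each element of `activations` is the activation output of one monitored
    node; values at or above `threshold` become 1, others 0. The 0.5
    threshold is an illustrative assumption, not specified by the patent.
    """
    return [1 if a >= threshold else 0 for a in activations]


def capture_internal_state(activations_per_timestep, threshold=0.5):
    """Build one quantized feature representation per timestep."""
    return [quantize_activations(a, threshold) for a in activations_per_timestep]


# Example: activations of a 4-node subset at 3 timesteps during transcription.
states = capture_internal_state([
    [0.1, 0.9, 0.4, 0.7],
    [0.6, 0.2, 0.8, 0.3],
    [0.95, 0.05, 0.5, 0.49],
])
print(states)  # prints [[0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0]]
```

Claims 12 and 17 similarly recite feature representations comprising "thresholded output values" of the node subset, which a scheme like this one would produce.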
106 Citations
20 Claims
1. A non-transitory computer-readable medium comprising instructions for:
- providing a trained speech recognition neural network, the speech recognition neural network including a plurality of layers each having a plurality of nodes;
- transcribing speech audio by the speech recognition neural network;
- while the speech recognition neural network is transcribing the speech audio, generating one or more feature representations from a subset of nodes, the one or more feature representations representing an internal state of the speech recognition neural network at a plurality of timestamps during transcription, each of the feature representations comprising a vector of quantized values where each of the quantized values is obtained by quantizing an activation output of a node in the subset of nodes;
- storing the one or more feature representations;
- receiving a first set of classifications for a first portion of the speech audio;
- training a classification model, the classification model being different than the trained speech recognition neural network, on a first set of feature representations corresponding to the first portion of the speech audio and the first set of classifications, the first set of feature representations comprising a first subset of the feature representations generated during the speech audio transcription; and
- determining a second set of classifications for a second portion of the speech audio by inputting a second set of feature representations corresponding to the second portion of the speech audio into the trained classification model, the second set of feature representations comprising a second subset of the feature representations generated during the speech audio transcription.

Dependent claims: 2, 3, 4, 5.
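Claim 1's train-then-classify flow, fit a separate model on labeled feature representations from a first portion of the audio, then classify feature representations from a second portion, can be illustrated with a deliberately simple stand-in. The patent does not specify the classification model; the nearest-centroid scheme below is purely an assumption for illustration.

```python
from collections import defaultdict


def train_centroids(features, labels):
    """Train a minimal classifier on quantized feature vectors:
    one centroid (per-class mean vector) per classification label."""
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for f, y in zip(features, labels):
        if sums[y] is None:
            sums[y] = [0.0] * len(f)
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def classify(centroids, feature):
    """Assign the label whose centroid is nearest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, feature))
    return min(centroids, key=lambda y: dist(centroids[y]))


# First portion: labeled feature representations (hypothetical labels A/B,
# e.g. two speakers or two audio conditions).
centroids = train_centroids(
    [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1]],
    ["A", "A", "B", "B"],
)
# Second portion: an unlabeled feature representation from the same transcription.
print(classify(centroids, [1, 0, 1, 1]))  # prints "A"
```

Because the classifier consumes stored feature representations rather than the audio itself, classifications for the second portion can be produced without re-running the speech recognition network, which is the point of capturing the internal state during transcription.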
6. A non-transitory computer-readable medium comprising instructions for:
- providing a trained speech recognition neural network, the speech recognition neural network including a plurality of layers each having a plurality of nodes;
- performing inference by the speech recognition neural network on input data, wherein the input data comprises speech audio and performing inference comprises transcribing the speech audio;
- while the speech recognition neural network is performing inference on the input data, generating one or more feature representations from a subset of nodes, the one or more feature representations representing an internal state of the speech recognition neural network at a plurality of timesteps, each of the feature representations comprising a vector of quantized values where each of the quantized values is obtained by quantizing an activation output of a node in the subset of nodes;
- storing the one or more feature representations;
- receiving a first set of classifications for a first portion of the input data;
- training a classification model, the classification model being different than the trained speech recognition neural network, on a first set of feature representations corresponding to the first portion of the input data and the first set of classifications, each of the first set of feature representations comprising one of the vectors of quantized values generated during inference on the first portion of the input data; and
- determining a second set of classifications for a second portion of the input data by inputting a second set of feature representations corresponding to the second portion of the input data into the trained classification model, each of the second set of feature representations comprising one of the vectors of quantized values generated during inference on the second portion of the input data.

Dependent claims: 7, 8, 9, 10.
11. A non-transitory computer-readable medium comprising instructions for:
- providing a trained speech recognition neural network, the speech recognition neural network including a plurality of layers each having a plurality of nodes;
- transcribing speech audio by the speech recognition neural network;
- while the speech recognition neural network is transcribing the speech audio, generating one or more feature representations from a subset of nodes, the one or more feature representations representing an internal state of the speech recognition neural network at a plurality of timestamps during transcription, each of the feature representations comprising a vector of quantized values where each of the quantized values is obtained by quantizing an activation output of a node in the subset of nodes;
- storing the one or more feature representations;
- training a second neural network to generate a search feature based on a text input, the search feature comprising a vector of values, wherein the training is performed based on one or more training examples, each training example comprising a portion of the speech audio and a corresponding subset of the one or more feature representations generated during the speech audio transcription, and wherein the second neural network is different than the trained speech recognition neural network;
- receiving a text query and inputting the text query to the second neural network to generate the search feature;
- determining a similarity between the search feature and each of the one or more feature representations;
- selecting a feature representation with the greatest similarity with the search feature; and
- outputting an indication of a portion of the speech audio corresponding to the feature representation with the greatest similarity with the search feature.

Dependent claims: 13, 14, 15.
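The search steps of claim 11, compare a query-derived search feature against each stored feature representation, pick the most similar, and report the corresponding portion of audio, can be sketched as below. The second neural network that embeds the text query is out of scope here: the sketch assumes the query has already been mapped into the feature space, and cosine similarity is an assumed measure (the claim does not name one).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def best_match(search_feature, stored_features, timestamps):
    """Return the timestamp whose stored feature representation is most
    similar to the search feature (the 'indication of a portion of the
    speech audio' in the claim's terms)."""
    scores = [cosine_similarity(search_feature, f) for f in stored_features]
    return timestamps[max(range(len(scores)), key=scores.__getitem__)]


# Stored feature representations captured at three transcription timestamps.
stored = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
timestamps = [0.0, 1.5, 3.0]
# Hypothetical search feature produced by the query-embedding network.
print(best_match([1, 0, 1, 1], stored, timestamps))  # prints 0.0
```

A linear scan is shown for clarity; an index over the stored vectors would serve the same selection step at scale.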
12. The non-transitory computer-readable medium of claim 11, wherein the one or more feature representations comprise thresholded output values of the subset of nodes.
16. A non-transitory computer-readable medium comprising instructions for:
- providing a trained speech recognition neural network, the speech recognition neural network including a plurality of layers each having a plurality of nodes;
- performing inference by the speech recognition neural network on input data, wherein the input data comprises speech audio and performing inference comprises transcribing the speech audio;
- while the speech recognition neural network is performing inference on the input data, generating one or more feature representations from a subset of nodes, the one or more feature representations representing an internal state of the speech recognition neural network at a plurality of timesteps, each of the feature representations comprising a vector of quantized values where each of the quantized values is obtained by quantizing an activation output of a node in the subset of nodes;
- storing the one or more feature representations;
- training a second neural network to generate a search feature based on a text input, the search feature comprising a vector of values, wherein the training is performed based on one or more training examples, each training example comprising a portion of the speech audio and the corresponding one or more vectors of quantized values generated during the transcription of the portion of the speech audio, and wherein the second neural network is different than the trained speech recognition neural network;
- receiving a text query and inputting the text query to the second neural network to generate the search feature;
- determining a similarity between the search feature and each of the one or more feature representations;
- selecting a feature representation with the greatest similarity with the search feature; and
- outputting an indication of a portion of the speech audio corresponding to the feature representation with the greatest similarity with the search feature.

Dependent claims: 18, 19, 20.
17. The non-transitory computer-readable medium of claim 16, wherein the one or more feature representations comprise thresholded output values of the subset of nodes.
Specification