Object recognition from videos using recurrent neural networks
First Claim
Patent Images
1. A computer-implemented method, comprising:
- obtaining multiple frames from a video, wherein each frame of the multiple frames depicts an object to be recognized; and
processing, using an object recognition model, the multiple frames to generate data that represents a classification of the object to be recognized,wherein the object recognition model is a recurrent neural network that comprises a long short-term memory (LSTM) layer and multiple feature extraction layers,wherein the LSTM layer includes a convolutional input gate, a convolutional forget gate, a convolutional memory block, and a convolutional output gate that use convolutions to process data, and wherein the processing comprises, for each frame of the multiple frames;
processing, using the multiple feature extraction layers, the frame to generate feature data that represents features of the frame; and
processing, using the LSTM layer, the feature data to generate an LSTM output and to update an internal state of the LSTM layer.
2 Assignments
0 Petitions
Accused Products
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying an object from a video. One of the methods includes obtaining multiple frames from a video, where each frame of the multiple frames depicts an object to be recognized, and processing, using an object recognition model, the multiple frames to generate data that represents a classification of the object to be recognized.
64 Citations
30 Claims
-
1. A computer-implemented method, comprising:
-
obtaining multiple frames from a video, wherein each frame of the multiple frames depicts an object to be recognized; and processing, using an object recognition model, the multiple frames to generate data that represents a classification of the object to be recognized, wherein the object recognition model is a recurrent neural network that comprises a long short-term memory (LSTM) layer and multiple feature extraction layers, wherein the LSTM layer includes a convolutional input gate, a convolutional forget gate, a convolutional memory block, and a convolutional output gate that use convolutions to process data, and wherein the processing comprises, for each frame of the multiple frames; processing, using the multiple feature extraction layers, the frame to generate feature data that represents features of the frame; and processing, using the LSTM layer, the feature data to generate an LSTM output and to update an internal state of the LSTM layer. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
-
-
11. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising:
-
obtaining multiple frames from a video, wherein each frame of the multiple frames depicts an object to be recognized; and processing, using an object recognition model, the multiple frames to generate data that represents a classification of the object to be recognized, wherein the object recognition model is a recurrent neural network that comprises a long short-term memory (LSTM) layer and multiple feature extraction layers, wherein the LSTM layer includes a convolutional input gate, a convolutional forget gate, a convolutional memory block, and a convolutional output gate that use convolutions to process data, and wherein the processing comprises, for each frame of the multiple frames; processing, using the multiple feature extraction layers, the frame to generate feature data that represents features of the frame; and processing, using the LSTM layer, the feature data to generate an LSTM output and to update an internal state of the LSTM layer. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
-
-
21. A computer program product encoded on one or more non-transitory computer storage media, the computer program product comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
-
obtaining multiple frames from a video, wherein each frame of the multiple frames depicts an object to be recognized; and processing, using an object recognition model, the multiple frames to generate data that represents a classification of the object to be recognized, wherein the object recognition model is a recurrent neural network that comprises a long short-term memory (LSTM) layer and multiple feature extraction layers, wherein the LSTM layer includes a convolutional input gate, a convolutional forget gate, a convolutional memory block, and a convolutional output gate that use convolutions to process data, and wherein the processing comprises, for each frame of the multiple frames; processing, using the multiple feature extraction layers, the frame to generate feature data that represents features of the frame; and processing, using the LSTM layer, the feature data to generate an LSTM output and to update an internal state of the LSTM layer. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30)
-
Specification