Systems and methods for classifying activities captured within images
First Claim
1. A system for classifying activities captured within images, the system comprising:
- one or more physical processors configured by machine-readable instructions to:
access an image, the image including a visual capture of a scene;
process the image through a convolutional neural network, the convolutional neural network generating a set of two-dimensional feature maps based on the image;
process the set of two-dimensional feature maps through a contextual long short-term memory unit, the contextual long short-term memory unit generating a set of two-dimensional outputs based on the set of two-dimensional feature maps, wherein the contextual long short-term memory unit includes a loss function characterized by a non-overlapping loss, an entropy loss, and a cross-entropy loss and the non-overlapping loss, the entropy loss, and the cross-entropy loss are combined into the loss function through a linear combination with a first hyper parameter for the non-overlapping loss, a second hyper parameter for the entropy loss, and a third hyper parameter for the cross-entropy loss;
generate a set of attention-masks for the image based on the set of two-dimensional outputs and the set of two-dimensional feature maps, the set of attention-masks defining dimensional portions of the image; and
classify the scene based on the set of two-dimensional outputs.
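The linear combination recited in the claim can be illustrated with a minimal Python sketch. The claim names the three loss terms and their hyperparameters but gives no formulas, so the names `combined_loss`, `lambda_1`, `lambda_2`, and `lambda_3`, the Shannon-entropy and one-hot cross-entropy definitions, and all numeric values below are assumptions made for illustration only.

```python
import math

def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

def cross_entropy(predicted, true_index):
    """Cross-entropy against a one-hot label: -log of the true class probability."""
    return -math.log(predicted[true_index])

def combined_loss(non_overlap, entropy_term, cross_entropy_term,
                  lambda_1, lambda_2, lambda_3):
    """Linear combination of the three loss terms, one hyperparameter per term."""
    return (lambda_1 * non_overlap
            + lambda_2 * entropy_term
            + lambda_3 * cross_entropy_term)

# Hypothetical values for a single training example.
mask = [0.25, 0.25, 0.25, 0.25]   # assumed: uniform attention over four positions
class_probs = [0.5, 0.25, 0.25]   # assumed: predicted class distribution
loss = combined_loss(
    non_overlap=0.0,                                   # assumed: no overlap between masks
    entropy_term=entropy(mask),                        # log(4) for a uniform mask
    cross_entropy_term=cross_entropy(class_probs, 0),  # log(2)
    lambda_1=1.0, lambda_2=0.5, lambda_3=1.0,          # assumed hyperparameter values
)
```

Keeping a separate weight on each term lets the relative influence of mask overlap, mask sharpness, and classification error be tuned independently.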
Abstract
An image including a visual capture of a scene may be accessed. The image may be processed through a convolutional neural network. The convolutional neural network may generate a set of two-dimensional feature maps based on the image. The set of two-dimensional feature maps may be processed through a contextual long short-term memory unit. The contextual long short-term memory unit may generate a set of two-dimensional outputs based on the set of two-dimensional feature maps. A set of attention-masks for the image may be generated based on the set of two-dimensional outputs and the set of two-dimensional feature maps. The set of attention-masks may define dimensional portions of the image. The scene may be classified based on the two-dimensional outputs.
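As a rough illustration of the attention-mask step, the sketch below derives one mask from a single two-dimensional output and a set of two-dimensional feature maps. The abstract does not say how the two are combined; the rule used here (multiply the output by the mean feature activation at each position, then apply a spatial softmax) and the names `spatial_softmax` and `attention_mask` are assumptions, not the patented method.

```python
import math

def spatial_softmax(scores):
    """Normalise a 2D grid of scores into non-negative weights that sum to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def attention_mask(output_2d, feature_maps):
    """Assumed rule: score = output * mean feature activation, then spatial softmax."""
    count = len(feature_maps)
    scores = [
        [output_2d[i][j] * sum(fm[i][j] for fm in feature_maps) / count
         for j in range(len(output_2d[0]))]
        for i in range(len(output_2d))
    ]
    return spatial_softmax(scores)

# Hypothetical 2x2 data: two feature maps and one recurrent output.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[3.0, 0.0], [0.0, 0.0]]]
output_2d = [[2.0, 1.0], [1.0, 1.0]]
mask = attention_mask(output_2d, feature_maps)
```

In this toy case the mask concentrates on the top-left position, the only place where both the output and the feature maps are strongly active.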
16 Claims
1. A system for classifying activities captured within images (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for classifying activities captured within images, the method comprising:
accessing an image, the image including a visual capture of a scene;
processing the image through a convolutional neural network, the convolutional neural network generating a set of two-dimensional feature maps based on the image;
processing the set of two-dimensional feature maps through a contextual long short-term memory unit, the contextual long short-term memory unit generating a set of two-dimensional outputs based on the set of two-dimensional feature maps, wherein the contextual long short-term memory unit includes a loss function characterized by a non-overlapping loss, an entropy loss, and a cross-entropy loss and the non-overlapping loss, the entropy loss, and the cross-entropy loss are combined into the loss function through a linear combination with a first hyper parameter for the non-overlapping loss, a second hyper parameter for the entropy loss, and a third hyper parameter for the cross-entropy loss;
generating a set of attention-masks for the image based on the set of two-dimensional outputs and the set of two-dimensional feature maps, the set of attention-masks defining dimensional portions of the image; and
classifying the scene based on the set of two-dimensional outputs.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A system for classifying activities captured within images, the system comprising:
- one or more physical processors configured by machine-readable instructions to:
access an image, the image including a visual capture of a scene;
process the image through a convolutional neural network, the convolutional neural network generating a set of two-dimensional feature maps based on the image;
process the set of two-dimensional feature maps through a contextual long short-term memory unit, the contextual long short-term memory unit generating a set of two-dimensional outputs based on the set of two-dimensional feature maps, wherein:
the contextual long short-term memory unit includes a loss function characterized by a non-overlapping loss, an entropy loss, and a cross-entropy loss; and
the non-overlapping loss, the entropy loss, and the cross-entropy loss are combined into the loss function through a linear combination with a first hyper parameter for the non-overlapping loss, a second hyper parameter for the entropy loss, and a third hyper parameter for the cross-entropy loss;
generate a set of attention-masks for the image based on the set of two-dimensional outputs and the set of two-dimensional feature maps, the set of attention-masks defining dimensional portions of the image, wherein the loss function discourages the set of attention masks defining a same dimensional portion of the image across multiple time-steps; and
classify the scene based on the set of two-dimensional outputs.
Dependent claims: 16
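One way a loss term could discourage attention masks from defining the same portion of the image across multiple time-steps is to penalise pairwise overlap between masks. The claim does not give a formula; the element-wise-product penalty and the names `overlap` and `non_overlapping_loss` below are assumptions chosen for illustration.

```python
def overlap(mask_a, mask_b):
    """Sum of element-wise products of two 2D masks; 0 when their supports are disjoint."""
    return sum(a * b
               for row_a, row_b in zip(mask_a, mask_b)
               for a, b in zip(row_a, row_b))

def non_overlapping_loss(masks):
    """Assumed form: total pairwise overlap between masks from different time-steps."""
    total = 0.0
    for t in range(len(masks)):
        for s in range(t + 1, len(masks)):
            total += overlap(masks[t], masks[s])
    return total

# Hypothetical 2x2 masks over two time-steps.
disjoint = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 1.0], [0.0, 0.0]]]   # each step attends to a different position
repeated = [[[1.0, 0.0], [0.0, 0.0]],
            [[1.0, 0.0], [0.0, 0.0]]]   # both steps attend to the same position

loss_disjoint = non_overlapping_loss(disjoint)
loss_repeated = non_overlapping_loss(repeated)
```

Under this assumed penalty, masks that revisit the same position incur a positive loss, while masks covering different positions incur none.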
Specification