SYSTEMS AND METHODS FOR ATTENTION-BASED CONFIGURABLE CONVOLUTIONAL NEURAL NETWORKS (ABC-CNN) FOR VISUAL QUESTION ANSWERING
First Claim
1. A computer-implemented method of improving accuracy in generating an answer to a question input related to an image input, the method comprising:
receiving an image input;
receiving a question input related to the image input;
inputting the question input and the image input into an Attention-Based Configurable Convolutional Neural Networks (ABC-CNN) framework to generate an answer, the ABC-CNN framework comprising:
an image feature map extraction component comprising a CNN that extracts an image feature map from the image input;
a semantic question embedding component that obtains question embeddings from the question input;
a question-guided attention map generation component that receives the image feature map and the question embeddings and that obtains a question-guided attention map focusing on a region or regions asked by question input; and
an answer generation component that obtains an attention weighted image feature map by weighting image feature map using the question-guided attention map and generates answers based on a fusion of the image feature map, the question embeddings, and the attention weighted image feature map.
Abstract
Described herein are systems and methods for generating and using attention-based deep learning architectures for the visual question answering (VQA) task, which automatically generate answers to questions about images (still or video). To generate correct answers, it is important for a model's attention to focus on the image regions relevant to the question, because different questions may ask about the attributes of different image regions. In embodiments, such question-guided attention is learned with an attention-based configurable convolutional neural network (ABC-CNN). Embodiments of the ABC-CNN models determine the attention maps by convolving the image feature map with configurable convolutional kernels determined by the question's semantics. In embodiments, the question-guided attention maps focus on the question-related regions and filter out noise in the unrelated regions.
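The attention step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the linear-plus-sigmoid projection, the 1x1 kernel size, the softmax normalization, the random toy inputs, and all function and variable names are assumptions made for illustration only.

```python
import math
import random


def project_to_kernel(question_emb, weights, bias):
    """Project a question embedding (length D) into a 1x1 kernel over C channels.

    Assumption: a linear map followed by a sigmoid; the abstract only says the
    kernels are determined by the question semantics.
    """
    kernel = []
    for row, b in zip(weights, bias):
        z = sum(w * q for w, q in zip(row, question_emb)) + b
        kernel.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid keeps values in (0, 1)
    return kernel


def attention_map(feature_map, kernel):
    """Convolve feature_map[c][y][x] with a 1x1 kernel, then normalize spatially.

    Returns an H x W map whose entries are positive and sum to 1 (softmax is
    an assumed normalization).
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    scores = [[sum(kernel[c] * feature_map[c][y][x] for c in range(channels))
               for x in range(width)]
              for y in range(height)]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


# Toy inputs standing in for CNN features and an LSTM question embedding.
rng = random.Random(0)
C, H, W, D = 4, 3, 3, 5
feature_map = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
question_emb = [rng.uniform(-1, 1) for _ in range(D)]
weights = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
bias = [0.0] * C

kernel = project_to_kernel(question_emb, weights, bias)
attn = attention_map(feature_map, kernel)
```

Changing `question_emb` changes `kernel`, and therefore which spatial positions receive the most attention, which is the sense in which the kernel is "configurable" by the question.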
20 Claims
1. A computer-implemented method of improving accuracy in generating an answer to a question input related to an image input, the method comprising:
receiving an image input;
receiving a question input related to the image input;
inputting the question input and the image input into an Attention-Based Configurable Convolutional Neural Networks (ABC-CNN) framework to generate an answer, the ABC-CNN framework comprising:
an image feature map extraction component comprising a CNN that extracts an image feature map from the image input;
a semantic question embedding component that obtains question embeddings from the question input;
a question-guided attention map generation component that receives the image feature map and the question embeddings and that obtains a question-guided attention map focusing on a region or regions asked by question input; and
an answer generation component that obtains an attention weighted image feature map by weighting image feature map using the question-guided attention map and generates answers based on a fusion of the image feature map, the question embeddings, and the attention weighted image feature map.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-implemented method of generating an answer to a question related to an image, the method comprising steps of:
extracting an image feature map from an input image comprising a plurality of pixels using a deep convolutional neural network;
obtaining a dense question embedding from an input question related to the input image using a long short term memory (LSTM) layer;
producing a plurality of question-configured kernels by projecting the dense question embedding from semantic space into visual space;
convolving the question-configured kernels with the image feature map to generate a question-guided attention map;
obtaining at a multi-class classifier an attention weighted image feature map by spatially weighting the image feature map using the question-guided attention map, the attention weighted feature map lowering weights of regions irrelevant to the question; and
generating an answer to the question based on a fusion of the image feature map, the deep question embedding, and the attention weighted image feature map.
Dependent claims: 13, 14, 15, 16, 17
18. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising:
responsive to receiving a question input, extracting a dense question embedding of the question input;
responsive to receiving an image input related to the question input, generating an image feature map;
generating a question-guided attention map based on at least the image feature map and the dense question embedding, the question-guided attention map selectively focusing on areas queried by the question input;
spatially weighting the image feature map using the question-guided attention map to obtain an attention weighted image; and
fusing semantic information, the image feature map, and the attention weighted image to generate an answer to the question input.
Dependent claims: 19, 20
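The spatial weighting and fusion steps recited in the independent claims can be sketched as follows. This is an illustrative reading, not the claimed implementation: sum-pooling each feature map, fusing by concatenation, the linear softmax classifier, the hand-set weights, and the two-word answer vocabulary are all assumptions introduced for the example.

```python
import math


def weight_feature_map(feature_map, attn):
    """Multiply every channel of feature_map[c][y][x] by the attention map attn[y][x]."""
    return [[[v * a for v, a in zip(frow, arow)]
             for frow, arow in zip(channel, attn)]
            for channel in feature_map]


def pool(feature_map):
    """Sum each channel over space, giving one number per channel (assumed pooling)."""
    return [sum(sum(row) for row in channel) for channel in feature_map]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_answer(feature_map, attn, question_emb, clf_weights, answers):
    """Fuse image features, question embedding, and attended features, then classify.

    Fusion here is plain concatenation, which is an assumption.
    """
    weighted = weight_feature_map(feature_map, attn)
    fused = pool(feature_map) + question_emb + pool(weighted)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in clf_weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy example: channel 0 fires at the top-left, channel 1 at the bottom-right.
feature_map = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 1.0]]]
attn = [[0.9, 0.1], [0.0, 0.0]]          # the question points at the top row
question_emb = [1.0]
answers = ["red", "blue"]                # hypothetical answer vocabulary
clf_weights = [[0.0, 0.0, 0.0, 1.0, 0.0],   # "red" reads attended channel 0
               [0.0, 0.0, 0.0, 0.0, 1.0]]   # "blue" reads attended channel 1

label, probs = generate_answer(feature_map, attn, question_emb, clf_weights, answers)
```

In this toy setup both channels have the same unweighted pooled value, so only the attention weighted features separate the two answers. The bottom-right activation receives zero attention weight, which mirrors the claim language about lowering the weights of regions irrelevant to the question.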
Specification