SYSTEMS AND METHODS FOR ATTENTION-BASED CONFIGURABLE CONVOLUTIONAL NEURAL NETWORKS (ABC-CNN) FOR VISUAL QUESTION ANSWERING

US 20170124432A1
Filed: 06/16/2016
Published: 05/04/2017
Est. Priority Date: 11/03/2015
Status: Active Grant

Abstract

Described herein are systems and methods for generating and using attention-based deep learning architectures for the visual question answering (VQA) task, in which answers are automatically generated for questions about images (still or video). To generate correct answers, it is important for a model's attention to focus on the relevant regions of an image according to the question, because different questions may ask about the attributes of different image regions. In embodiments, such question-guided attention is learned with an attention-based configurable convolutional neural network (ABC-CNN). Embodiments of the ABC-CNN models determine the attention maps by convolving the image feature map with configurable convolutional kernels determined by the question's semantics. In embodiments, the question-guided attention maps focus on the question-related regions and filter out noise in the unrelated regions.
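
The convolution described in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation. The function names, the plain linear projection, the single kernel, the odd kernel size, and the zero padding are all assumptions made for illustration; the abstract only states that the kernels are determined by the question semantics and convolved with the image feature map.

```python
import numpy as np

def question_configured_kernel(question_embedding, projection, channels, size):
    """Project a question embedding into one (channels, size, size) kernel."""
    # Assumption: a plain linear projection from semantic to visual space.
    return (projection @ question_embedding).reshape(channels, size, size)

def attention_scores(feature_map, kernel):
    """Slide the kernel over a (D, H, W) feature map; return (H, W) scores."""
    channels, height, width = feature_map.shape
    size = kernel.shape[-1]  # assumed odd, so the output keeps H x W
    pad = size // 2
    # Zero-pad the spatial axes only, so every cell gets a score.
    padded = np.pad(feature_map, ((0, 0), (pad, pad), (pad, pad)))
    scores = np.empty((height, width))
    for row in range(height):
        for col in range(width):
            window = padded[:, row:row + size, col:col + size]
            # Cross-correlation, as is conventional in CNN layers.
            scores[row, col] = np.sum(window * kernel)
    return scores

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 5, 6))      # D=4 channels, 5x6 grid
question_embedding = rng.standard_normal(8)       # hypothetical 8-dim embedding
projection = rng.standard_normal((4 * 3 * 3, 8))  # hypothetical projection matrix
kernel = question_configured_kernel(question_embedding, projection, channels=4, size=3)
print(attention_scores(feature_map, kernel).shape)  # (5, 6)
```

Because the kernel is recomputed from each question, the same image yields a different score map for a different question, which is the sense in which the kernels are configurable.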

20 Claims

1. A computer-implemented method of improving accuracy in generating an answer to a question input related to an image input, the method comprising:
- receiving an image input;
  
  receiving a question input related to the image input;
  
  inputting the question input and the image input into an Attention-Based Configurable Convolutional Neural Networks (ABC-CNN) framework to generate an answer, the ABC-CNN framework comprising:
  
  an image feature map extraction component comprising a CNN that extracts an image feature map from the image input;
  
  a semantic question embedding component that obtains question embeddings from the question input;
  
  a question-guided attention map generation component that receives the image feature map and the question embeddings and that obtains a question-guided attention map focusing on a region or regions asked about by the question input; and

  an answer generation component that obtains an attention weighted image feature map by weighting the image feature map using the question-guided attention map and generates answers based on a fusion of the image feature map, the question embeddings, and the attention weighted image feature map.
  - 2. The computer-implemented method of claim 1 wherein the semantic question embedding part comprises a long short term memory (LSTM) layer to generate the question embeddings to characterize semantic meanings of the question input.
  - 3. The computer-implemented method of claim 1 wherein the question-guided attention map generation part comprises configurable convolutional kernels produced by projecting the question embeddings from a semantic space into a visual space and utilized to convolve with the image feature map to produce the question-guided attention map.
  - 4. The computer-implemented method of claim 3 wherein the convolutional kernels have the same number of channels as the image feature map.
  - 5. The computer-implemented method of claim 3 wherein the question-guided attention map has the same size as the image feature map.
  - 6. The computer-implemented method of claim 1 wherein the image feature map is extracted by dividing the image input into a plurality of grids, and extracting a D-dimension feature vector in each cell of the grids.
  - 7. The computer-implemented method of claim 1 wherein the image feature map is spatially weighted by the question-guided attention map to obtain the attention weighted image feature map.
  - 8. The computer-implemented method of claim 7 wherein the spatial weighting is achieved by element-wise product between each channel of the image feature map and the question-guided attention map.
  - 9. The computer-implemented method of claim 8 wherein the spatial weighting is further defined by softmax normalization for a spatial attention distribution.
  - 10. The computer-implemented method of claim 1 wherein the ABC-CNN framework is pre-trained in an end-to-end way with stochastic gradient descent.
  - 11. The question-guided attention-based deep learning method of claim 10 wherein the ABC-CNN framework has initialization weights randomly adjusted to ensure that each dimension of the activations of all layers within the ABC-CNN framework has zero mean and one standard deviation during pre-training.
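
The spatial weighting recited in claims 7 to 9 can be sketched as follows. This is a hedged illustration in NumPy, not the claimed implementation. The function names and the (D, H, W) array layout are assumptions, and the softmax is taken over all grid cells jointly, which is one plausible reading of a "spatial attention distribution".

```python
import numpy as np

def spatial_softmax(scores):
    """Normalize an (H, W) score map into a distribution over grid cells."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def attention_weighted_features(feature_map, attention_map):
    """Multiply every channel of a (D, H, W) feature map by an (H, W) map."""
    # Broadcasting applies the same attention map to each of the D channels.
    return feature_map * attention_map[np.newaxis, :, :]

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 3, 3))  # D=8 features per cell of a 3x3 grid
scores = rng.standard_normal((3, 3))          # stand-in for raw attention scores
attention = spatial_softmax(scores)
weighted = attention_weighted_features(feature_map, attention)
print(weighted.shape)  # (8, 3, 3)
```

Cells with low attention are scaled toward zero in every channel, so features from regions unrelated to the question contribute less to later stages.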

12. A computer-implemented method of generating an answer to a question related to an image, the method comprising steps of:
- extracting an image feature map from an input image comprising a plurality of pixels using a deep convolutional neural network;
  
  obtaining a dense question embedding from an input question related to the input image using a long short term memory (LSTM) layer;
  
  producing a plurality of question-configured kernels by projecting the dense question embedding from semantic space into visual space;
  
  convolving the question-configured kernels with the image feature map to generate a question-guided attention map;
  
  obtaining at a multi-class classifier an attention weighted image feature map by spatially weighting the image feature map using the question-guided attention map, the attention weighted feature map lowering weights of regions irrelevant to the question; and
  
  generating an answer to the question based on a fusion of the image feature map, the deep question embedding, and the attention weighted image feature map.
  - 13. The method of claim 12 wherein the spatial weighting is achieved by element-wise product between each channel of the image feature map and the question-guided attention map.
  - 14. The method of claim 12 wherein the question-guided attention map adaptively represents each pixel's degree of attention according to the input question.
  - 15. The method of claim 12 wherein the question-guided attention map is obtained by applying the question-configured kernels on the image feature map.
  - 16. The method of claim 12 wherein the image feature map, the deep question embedding, and the attention weighted image feature map are fused by a nonlinear projection.
  - 17. The method of claim 16 wherein the nonlinear projection is an element-wise scaled hyperbolic tangent function.
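
The fusion step of claims 16 and 17 can be sketched as below. Everything beyond "nonlinear projection" and "element-wise scaled hyperbolic tangent" is an assumption: the constants 1.7159 and 2/3 are a commonly used scaled-tanh parameterization that the claims do not specify, the three inputs are assumed to be flattened into vectors, the fusion is assumed to be a sum of linear projections, and the softmax output layer stands in for the multi-class classifier. All names are hypothetical.

```python
import numpy as np

def scaled_tanh(x, amplitude=1.7159, slope=2.0 / 3.0):
    """Element-wise scaled hyperbolic tangent (constants are an assumption)."""
    return amplitude * np.tanh(slope * x)

def answer_probabilities(image_vec, weighted_vec, question_vec, weights):
    """Fuse three feature vectors and score a fixed set of candidate answers."""
    # Assumed fusion: project each input, sum, then apply the nonlinearity.
    hidden = scaled_tanh(
        weights["image"] @ image_vec
        + weights["weighted"] @ weighted_vec
        + weights["question"] @ question_vec
        + weights["hidden_bias"]
    )
    logits = weights["output"] @ hidden + weights["output_bias"]
    exp = np.exp(logits - logits.max())  # stable softmax over the answers
    return exp / exp.sum()

rng = np.random.default_rng(0)
hidden_dim, num_answers = 16, 5
weights = {
    "image": rng.standard_normal((hidden_dim, 72)),
    "weighted": rng.standard_normal((hidden_dim, 72)),
    "question": rng.standard_normal((hidden_dim, 8)),
    "hidden_bias": np.zeros(hidden_dim),
    "output": rng.standard_normal((num_answers, hidden_dim)),
    "output_bias": np.zeros(num_answers),
}
probs = answer_probabilities(
    rng.standard_normal(72), rng.standard_normal(72), rng.standard_normal(8), weights
)
print(int(np.argmax(probs)))  # index of the highest-scoring candidate answer
```

Passing the unweighted image features alongside the attention weighted ones mirrors the claim language, which fuses both with the question embedding rather than relying on the attended features alone.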

18. A non-transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, causes the steps to be performed comprising:
- responsive to receiving a question input, extracting a dense question embedding of the question input;
  
  responsive to receiving an image input related to the question input, generating an image feature map;
  
  generating a question-guided attention map based on at least the image feature map and the dense question embedding, the question-guided attention map selectively focusing on areas queried by the question input;
  
  spatially weighting the image feature map using the question-guided attention map to obtain an attention weighted image; and
  
  fusing semantic information, the image feature map, and the attention weighted image to generate an answer to the question input.
  - 19. The non-transitory computer-readable medium or media of claim 18 wherein generating a question-guided attention map further comprises softmax normalization of a spatial attention distribution of the attention map.
  - 20. The non-transitory computer-readable medium or media of claim 19 wherein generating a question-guided attention map comprises configuring a set of convolutional kernels according to the dense question embedding and applying the convolutional kernels on the image feature map to generate the question-guided attention map.

Current Assignee: Baidu USA LLC (Baidu Incorporated)
Original Assignee: Baidu USA LLC (Baidu Incorporated)
Inventors: Wang, Jiang; Xu, Wei; Chen, Kan
Granted Patent: US 9,965,705 B2

CPC Class Codes

G06F 18/214   Generating training pattern...

G06F 18/24   Classification techniques

G06F 40/30   Semantic analysis

G06N 3/02   Neural networks

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 5/04   Inference or reasoning models

G06T 1/60   Memory management
