Dense captioning with joint interference and visual context

US 10,198,671 B1
Filed: 11/10/2016
Issued: 02/05/2019
Est. Priority Date: 11/10/2016
Status: Active Grant

Abstract

A dense captioning system and method is provided for processing an image to produce a feature map of the image, analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image, analyzing the feature map to determine a plurality of region features of the image, and analyzing the feature map to determine a context feature for the image. For each region feature of the plurality of region features of the image, the dense captioning system further provides for analyzing the region feature to determine a detection score for the region feature, calculating a caption for a bounding box for a visual concept in the image using the region feature and the context feature, and localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
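The flow described in the abstract can be sketched as a small driver function. This is a minimal illustration of the control flow only, not the patented implementation: every callable passed in (`extract_features`, `propose_boxes`, `region_feature_fn`, `context_feature_fn`, `score_fn`, `caption_fn`, `localize_fn`) is a hypothetical placeholder for a stage the abstract names, and the `(x1, y1, x2, y2)` box layout is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2). The source does not fix a convention.
Box = Tuple[float, float, float, float]


@dataclass
class DenseCaption:
    box: Box        # adjusted bounding box for the visual concept
    caption: str    # caption generated for that box
    score: float    # detection score of the region feature


def dense_caption(
    image: Any,
    extract_features: Callable[[Any], Any],
    propose_boxes: Callable[[Any], Sequence[Box]],
    region_feature_fn: Callable[[Any, Box], Any],
    context_feature_fn: Callable[[Any, Sequence[Box]], Any],
    score_fn: Callable[[Any], float],
    caption_fn: Callable[[Any, Any], str],
    localize_fn: Callable[[Box, Any, str], Box],
) -> List[DenseCaption]:
    """Run the stages named in the abstract; all callables are hypothetical."""
    feature_map = extract_features(image)                 # image -> feature map
    boxes = propose_boxes(feature_map)                    # proposed bounding boxes
    regions = [region_feature_fn(feature_map, b) for b in boxes]
    context = context_feature_fn(feature_map, boxes)      # one context feature per image

    results: List[DenseCaption] = []
    for box, region in zip(boxes, regions):
        score = score_fn(region)                          # detection score
        caption = caption_fn(region, context)             # uses region + context
        adjusted = localize_fn(box, region, caption)      # box adjusted based on caption
        results.append(DenseCaption(adjusted, caption, score))
    return results
```

Passing the stages as callables keeps the sketch independent of any particular network or framework; in practice each would be backed by a trained model.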

89 Citations

21 Claims

1. A method comprising:
- processing an image to produce a feature map of the image;
  
  analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
  
  cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
  
  analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
  
  for each region feature of the plurality of region features of the image:
  
  analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
  
  generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
  
  localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
  - 2. The method of claim 1, wherein the feature map is produced using a fully convolutional neural network.
  - 3. The method of claim 1, wherein the proposed bounding boxes are generated using a region proposal network to predict visual concept locations and generate bounding boxes with a confidence of enclosing some visual concept in the image.
  - 4. The method of claim 1, wherein a visual concept comprises an object, an object part, an interaction between objects, a scene, or an event.
  - 5. The method of claim 1, wherein each cropped region from the feature map for each proposed bounding box undergoes an operation to generate a region feature.
  - 6. The method of claim 1, wherein region of interest (ROI) pooling is used to ensure that the dimensions of the region features are the same for all of the proposed bounding boxes.
  - 7. The method of claim 1, wherein the context feature is determined based on the entire feature map.
  - 8. The method of claim 1, wherein region of interest (ROI) pooling is used for the context feature.
  - 9. The method of claim 1, where the caption is generated and the visual concept is localized for a region feature of the plurality of region features only if the detection score for the region feature is above a predetermined threshold.
  - 10. The method of claim 1, further comprising:
    - storing the adjusted bounding box for the visual concept and the caption for the bounding box.
  - 11. The method of claim 1, wherein the caption is calculated using two Long Short Term Memories (LSTMs) to generate each word of the caption, wherein a first LSTM of the two LSTMs uses the region feature as an input, and a second LSTM of the two LSTMs uses the context feature as an input.
  - 12. The method of claim 11, wherein the output of the two LSTMs is fed into a fusion operator to generate a word for the caption.
  - 13. The method of claim 1, wherein the visual concept is localized using a Long Short Term Memory (LSTM) that takes a region feature of the plurality of region features for the image as an input and each word generated for the caption as an input.
  - 14. The method of claim 13, wherein the bounding box is adjusted around the visual concept for each word input in the LSTM and wherein the adjusted bounding box for the visual concept is generated after the final word of the caption.
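Three of the steps recited in claims 1, 6, 8, and 9 are simple enough to sketch directly: choosing the largest proposed box for the context feature, pooling any box to a fixed-size region feature, and keeping only regions whose detection score exceeds a threshold. The code below is a minimal pure-Python illustration under stated assumptions: the feature map is a single-channel 2D list (a real feature map would have many channels), boxes are integer `(x1, y1, x2, y2)` cell coordinates with exclusive upper bounds, and max pooling is used as the ROI pooling operation. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed convention: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def largest_box(boxes: Sequence[Box]) -> Box:
    """Pick the proposal with the largest area, to be used for the context feature."""
    return max(boxes, key=box_area)


def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def roi_max_pool(
    feature_map: List[List[float]], box: Box, out_h: int, out_w: int
) -> List[List[float]]:
    """Max-pool the cells inside `box` into an out_h x out_w grid.

    Every box yields the same output size, whatever its own dimensions.
    """
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    if h < 1 or w < 1:
        raise ValueError("box must cover at least one feature-map cell")
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor(i*h/out_h) up to ceil((i+1)*h/out_h).
        r0 = y1 + (i * h) // out_h
        r1 = y1 + _ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + _ceil_div((j + 1) * w, out_w)
            row.append(
                max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
            )
        pooled.append(row)
    return pooled


def regions_above_threshold(scores: Sequence[float], threshold: float) -> List[int]:
    """Indices of region features whose detection score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

Under these assumptions, a context feature could be obtained with `roi_max_pool(feature_map, largest_box(boxes), out_h, out_w)`, and captioning would run only for the indices returned by `regions_above_threshold`.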

15. A dense captioning system comprising:
- a processor; and
  
  a computer readable medium coupled with the processor, the computer readable medium comprising instructions stored thereon that are executable by the processor to cause a computing device to perform operations comprising:
  
  processing an image to produce a feature map of the image;
  
  analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
  
  cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
  
  analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
  
  for each region feature of the plurality of region features of the image:
  
  analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
  
  generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
  
  localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
  - 16. The dense captioning system of claim 15, wherein the caption is generated using two Long Short Term Memories (LSTMs) to generate each word of the caption, wherein a first LSTM of the two LSTMs uses the region feature as an input, and a second LSTM of the two LSTMs uses the context feature as an input.
  - 17. The dense captioning system of claim 16, wherein the output of the two LSTMs is fed into a fusion operator to generate a word for the caption.
  - 18. The dense captioning system of claim 15, wherein the visual concept is localized using a Long Short Term Memory (LSTM) that takes a region feature of the plurality of region features for the image as an input and each word generated for the caption as an input.
  - 19. The dense captioning system of claim 18, wherein the bounding box is adjusted around the visual concept for each word input in the LSTM and wherein the adjusted bounding box for the visual concept is generated after the final word of the caption.
  - 20. The dense captioning system of claim 18, wherein a visual concept comprises an object, an object part, an interaction between objects, a scene, or an event.
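Claims 16 to 19 describe two mechanisms that a short sketch can make concrete: fusing the outputs of the region LSTM and the context LSTM, and adjusting the bounding box once per caption word so that the final box is available after the last word. The claims only say "fusion operator"; the three modes below (element-wise sum, element-wise product, concatenation) are assumptions chosen for illustration. Likewise, `step_fn` is a hypothetical stand-in for one step of the localization LSTM, and no actual LSTM is implemented here.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout


def fuse(region_out: Sequence[float], context_out: Sequence[float],
         mode: str = "sum") -> Vector:
    """Combine the two LSTM outputs; the available modes are illustrative assumptions."""
    if mode == "concat":
        return list(region_out) + list(context_out)
    if len(region_out) != len(context_out):
        raise ValueError("sum and product fusion need equal-length vectors")
    if mode == "sum":
        return [a + b for a, b in zip(region_out, context_out)]
    if mode == "product":
        return [a * b for a, b in zip(region_out, context_out)]
    raise ValueError(f"unknown fusion mode: {mode}")


def refine_box(
    box: Box,
    words: Sequence[str],
    step_fn: Callable[[Optional[Any], Box, str], Tuple[Any, Box]],
) -> Box:
    """Feed caption words one at a time and adjust the box after each word.

    `step_fn(state, box, word)` is a hypothetical single step of the
    localization model; it returns the new state and the adjusted box.
    The box left after the final word is the result.
    """
    state: Optional[Any] = None
    for word in words:
        state, box = step_fn(state, box, word)
    return box
```

In a full model the fused vector would typically pass through a further layer that scores vocabulary words; that part is omitted because the claims do not specify it.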

21. A non-transitory computer readable medium comprising instructions stored thereon that are executable by at least one processor to cause a computing device to perform operations comprising:
- processing an image to produce a feature map of the image;
  
  analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
  
  cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
  
  analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
  
  for each region feature of the plurality of region features of the image:
  
  analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
  
  generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
  
  localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.

Current Assignee: Snap, Inc.
Original Assignee: Snap, Inc.
Inventors: Yang, Linjie; Tang, Kevin Dechau; Yang, Jianchao; Li, Jia
Primary Examiner(s): Shin, Soo
Application Number: US15/348,501
Time in Patent Office: 817 Days
Field of Search: None
US Class Current
CPC Class Codes

G06F 18/2411   based on the proximity to a...

G06F 18/24143   Distances to neighbourhood ...

G06T 11/60   Editing figures and text; C...

G06T 2210/12   Bounding box

G06T 7/11   Region-based segmentation

G06V 10/25   Determination of region of ...

G06V 10/764   using classification, e.g. ...

G06V 10/768   using context analysis, e.g...

G06V 10/82   using neural networks

G06V 20/20   in augmented reality scenes

G06V 20/70   Labelling scene content, e....
