Dense captioning with joint interference and visual context
First Claim
1. A method comprising:
processing an image to produce a feature map of the image;
analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
for each region feature of the plurality of region features of the image:
analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
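The cropping and context-selection steps recited above can be illustrated with a minimal sketch. It is not the patented implementation: the feature map is assumed to be a plain single-channel grid, boxes are assumed to be half-open (x0, y0, x1, y1) tuples in feature-map cells, and the names `box_area`, `crop`, and `region_and_context_features` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Tuple

# Assumed conventions (not from the claim text): a box is (x0, y0, x1, y1),
# half-open, in feature-map cells; the feature map is a single-channel grid.
Box = Tuple[int, int, int, int]
FeatureMap = List[List[float]]


def box_area(box: Box) -> int:
    """Area of a box in feature-map cells (0 for degenerate boxes)."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def crop(feature_map: FeatureMap, box: Box) -> FeatureMap:
    """Cut the cells covered by `box` out of the feature map."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in feature_map[y0:y1]]


def region_and_context_features(feature_map: FeatureMap, boxes: List[Box]):
    """Crop one region feature per proposed box, and take the context
    feature from the proposed box that is largest in size."""
    region_features = [crop(feature_map, b) for b in boxes]
    context_box = max(boxes, key=box_area)
    context_feature = crop(feature_map, context_box)
    return region_features, context_box, context_feature


# Toy 4x4 feature map whose cell value encodes its position.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
boxes = [(0, 0, 2, 2), (1, 1, 4, 4), (2, 0, 3, 1)]
regions, context_box, context = region_and_context_features(feature_map, boxes)
```

In this toy input the second proposal covers the most cells, so it supplies the context feature while every proposal, including that one, still yields its own region feature.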
Abstract
A dense captioning system and method is provided for processing an image to produce a feature map of the image, analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image, analyzing the feature map to determine a plurality of region features of the image, and analyzing the feature map to determine a context feature for the image. For each region feature of the plurality of region features of the image, the dense captioning system further provides for analyzing the region feature to determine a detection score for the region feature, calculating a caption for a bounding box for a visual concept in the image using the region feature and the context feature, and localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
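The per-region steps in the abstract (detection score, caption, adjusted bounding box) can be sketched as a loop over region features. This is only an illustration under stated assumptions: `score_fn`, `caption_fn`, and `offset_fn` are hypothetical stand-ins for trained models that the source does not specify here, and the offset parameterization (center shift scaled by box size, log-scale width and height) is one common in box-regression detectors, not something the abstract states.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]      # assumed (x, y, w, h)
Offsets = Tuple[float, float, float, float]  # assumed (dx, dy, dw, dh)


@dataclass
class DenseCaption:
    box: Box        # adjusted bounding box
    score: float    # detection score for the region
    caption: str    # generated caption


def apply_offsets(box: Box, offsets: Offsets) -> Box:
    """Shift the box by a fraction of its size and rescale it.
    Assumed parameterization; zero offsets leave the box unchanged."""
    x, y, w, h = box
    dx, dy, dw, dh = offsets
    return (x + dx * w, y + dy * h, w * math.exp(dw), h * math.exp(dh))


def describe_regions(
    region_features: Sequence[Any],
    boxes: Sequence[Box],
    context_feature: Any,
    score_fn: Callable[[Any], float],          # hypothetical detector head
    caption_fn: Callable[[Any, Any], str],     # hypothetical captioner
    offset_fn: Callable[[Any, Any, str], Offsets],  # hypothetical localizer
) -> List[DenseCaption]:
    results = []
    for feature, box in zip(region_features, boxes):
        score = score_fn(feature)
        # The caption uses both the region feature and the context feature.
        caption = caption_fn(feature, context_feature)
        # The box adjustment is conditioned on the caption.
        offsets = offset_fn(feature, context_feature, caption)
        results.append(DenseCaption(apply_offsets(box, offsets), score, caption))
    return results


# Usage with trivial stub models.
out = describe_regions(
    region_features=["f0", "f1"],
    boxes=[(10.0, 20.0, 4.0, 8.0), (0.0, 0.0, 2.0, 2.0)],
    context_feature="ctx",
    score_fn=lambda f: 0.9,
    caption_fn=lambda f, c: f"caption for {f} given {c}",
    offset_fn=lambda f, c, cap: (0.5, 0.25, 0.0, 0.0),
)
```

Passing the models in as callables keeps the sketch independent of any particular network architecture; only the data flow between the three steps is shown.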
89 Citations
21 Claims
1. A method comprising:
processing an image to produce a feature map of the image;
analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
for each region feature of the plurality of region features of the image:
analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A dense captioning system comprising:
a processor; and
a computer readable medium coupled with the processor, the computer readable medium comprising instructions stored thereon that are executable by the processor to cause a computing device to perform operations comprising:
processing an image to produce a feature map of the image;
analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
for each region feature of the plurality of region features of the image:
analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
Dependent claims: 16, 17, 18, 19, 20
21. A non-transitory computer readable medium comprising instructions stored thereon that are executable by at least one processor to cause a computing device to perform operations comprising:
processing an image to produce a feature map of the image;
analyzing the feature map to generate proposed bounding boxes for a plurality of visual concepts within the image;
cropping a respective region from the feature map for each proposed bounding box to generate a plurality of region features of the image;
analyzing the feature map to determine a context feature for the image using a proposed bounding box that is a largest in size of the proposed bounding boxes; and
for each region feature of the plurality of region features of the image:
analyzing the region feature to determine for the region feature a detection score that indicates a likelihood that the region feature comprises an actual object;
generating a caption for a bounding box for a visual concept in the image using the region feature and the context feature; and
localizing the visual concept by adjusting the bounding box around the visual concept based on the caption to generate an adjusted bounding box for the visual concept.
Specification