Deep reinforcement learning-based captioning with embedding reward

US 10,467,274 B1
Filed: 11/09/2017
Issued: 11/05/2019
Est. Priority Date: 11/10/2016
Status: Active Grant

Abstract

An image captioning system and method are provided for generating a caption for an image. The image captioning system utilizes a policy network and a value network to generate the caption. The policy network serves as local guidance and the value network serves as global and lookahead guidance.
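
The interplay between local guidance (policy network) and lookahead guidance (value network) can be illustrated with a minimal sketch. It assumes the two scores are combined as a weighted sum of the policy log-probability and the value score; the weight `lam`, the function name `select_next_word`, and the toy numbers are hypothetical and are not taken from the patent text.

```python
import math

def select_next_word(policy_probs, value_scores, lam=0.4):
    """Pick the next word from policy probabilities and value scores.

    policy_probs: dict word -> probability from the policy network (local guidance).
    value_scores: dict word -> scalar from the value network (lookahead guidance).
    lam: hypothetical mixing weight (an assumption, not given in the source).
    """
    combined = {
        w: lam * math.log(p) + (1.0 - lam) * value_scores[w]
        for w, p in policy_probs.items()
        if p > 0.0  # log is undefined for zero-probability options
    }
    # Rank the options by combined score, best first.
    ranking = sorted(combined, key=combined.get, reverse=True)
    return ranking[0], ranking

# Toy inputs: the policy prefers "dog", the value network prefers "frisbee".
policy_probs = {"dog": 0.5, "cat": 0.3, "frisbee": 0.2}
value_scores = {"dog": 0.1, "cat": 0.2, "frisbee": 0.9}
word, ranking = select_next_word(policy_probs, value_scores)
print(word, ranking)
```

With these toy numbers the policy alone would choose "dog", while the combined score promotes "frisbee", which shows how a value estimate can override a locally likely word.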

20 Claims

1. A method comprising:
- extracting, by an image captioning system, an image feature from an image;
  
  analyzing, by a policy network of the image captioning system, the image feature to compute a probability of a next word to be generated for a caption describing the image feature, the probability comprising a list of options for the next word and a policy network score for each possible option in the list of options;
  
  ranking, by the policy network of the image captioning system, the list of options for the next word of the caption based on the policy network score for each possible option in the list of options;
  
  analyzing, by a value network of the image captioning system, the image feature and the probability of the next word generated by the policy network to generate a value network score for each possible option in the list of options;
  
  ranking, by the value network, the list of options for the next word of the caption based on the value network score; and
  
  selecting, by the image captioning system, a next word for the caption based on the ranking of the list of options by the policy network and the ranking of the list of options by the value network.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 18):
  - 2. The method of claim 1, wherein the selected next word has the highest combined score of the policy network score and the value network score.
  - 3. The method of claim 1, wherein the value network score is generated based on previously generated words in the caption that have been generated before the probability of the next word when each possible option is combined with the previously generated words.
  - 4. The method of claim 1, wherein the policy network is pre-trained using supervised learning with cross entropy loss.
  - 5. The method of claim 4, wherein the value network is pre-trained with mean square loss.
  - 6. The method of claim 5, wherein after the pre-training of the policy network and the pre-training of the value network, the policy network and the value network are trained by deep reinforcement learning.
  - 7. The method of claim 1, wherein the policy network comprises a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
  - 8. The method of claim 7, wherein the policy network computes the probability of the next word to be generated by encoding visual information of the image feature using the CNN and inputting the encoded visual information into the RNN.
  - 9. The method of claim 1, wherein the value network comprises a CNN, a RNN, and a Multilayer Perceptron (MLP).
  - 10. The method of claim 9, wherein the value network score for each possible option in the list of options is generated by the value network by encoding visual information of the image feature using the CNN, encoding semantic information of a partially generated sentence using the RNN, and regressing a scalar reward from a concatenated visual and semantic feature vector based on the encoded visual information and semantic information.
  - 11. The method of claim 1, wherein selecting the next word for the caption further comprises utilizing a lookahead beam search to correct errors using a later word context.
  - 18. The image captioning system of claim 8, wherein the value network comprises a CNN, a RNN, and a Multilayer Perceptron (MLP).
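
The two pre-training objectives named in claims 4 and 5 can be written out as a short sketch. It assumes the policy network emits a per-step probability for each candidate word and the value network emits one scalar per training example; the function names and toy data are hypothetical.

```python
import math

def cross_entropy_loss(step_probs, target_words):
    """Mean negative log-likelihood of the ground-truth words.

    step_probs: list of dicts, one per time step, word -> policy probability.
    target_words: ground-truth caption, one word per time step.
    """
    nll = [-math.log(probs[w]) for probs, w in zip(step_probs, target_words)]
    return sum(nll) / len(nll)

def mean_square_loss(predicted_values, target_rewards):
    """Mean squared error between value predictions and observed rewards."""
    sq = [(v - r) ** 2 for v, r in zip(predicted_values, target_rewards)]
    return sum(sq) / len(sq)

# Toy data: a two-word caption and two value predictions.
step_probs = [{"a": 0.5, "the": 0.5}, {"dog": 0.25, "cat": 0.75}]
ce = cross_entropy_loss(step_probs, ["a", "dog"])
mse = mean_square_loss([0.5, 1.0], [1.0, 1.0])
print(ce, mse)
```

In a real system these losses would be minimized by gradient descent over network parameters; the sketch only shows what each loss measures.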

12. An image captioning system comprising:
- one or more processors; and
  
  a computer-readable medium coupled with the processor, the computer-readable medium comprising instructions stored thereon that are executable by the one or more processors to cause the image captioning system to perform operations comprising:
  
  extracting an image feature from an image;
  
  analyzing, by a policy network of the image captioning system, the image feature to compute a probability of a next word to be generated for a caption describing the image feature, the probability comprising a list of options for the next word and a policy network score for each possible option in the list of options;
  
  ranking, by the policy network of the image captioning system, the list of options for the next word of the caption based on the policy network score for each possible option in the list of options;
  
  analyzing, by a value network of the image captioning system, the image feature and the probability of the next word generated by the policy network to generate a value network score for each possible option in the list of options;
  
  ranking, by the value network, the list of options for the next word of the caption based on the value network score; and
  
  selecting, by the image captioning system, a next word for the caption based on the ranking of the list of options by the policy network and the ranking of the list of options by the value network.
- Dependent claims (13, 14, 15, 16, 17, 19):
  - 13. The image captioning system of claim 12, wherein the selected next word has the highest combined score of the policy network score and the value network score.
  - 14. The image captioning system of claim 12, wherein the value network score is generated based on previously generated words in the caption that have been generated before the probability of the next word when each possible option is combined with the previously generated words.
  - 15. The image captioning system of claim 12, wherein the policy network is pre-trained using supervised learning with cross entropy loss and wherein the value network is pre-trained with mean square loss.
  - 16. The image captioning system of claim 15, wherein after the pre-training of the policy network and the pre-training of the value network, the policy network and the value network are trained by deep reinforcement learning.
  - 17. The image captioning system of claim 12, wherein the policy network comprises a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) and wherein the policy network computes the probability of the next word to be generated by encoding visual information of the image feature using the CNN and inputting the encoded visual information into the RNN.
  - 19. The image captioning system of claim 16, wherein the value network score for each possible option in the list of options is generated by the value network by encoding visual information of the image feature using the CNN, encoding semantic information of a partially generated sentence using the RNN, and regressing a scalar reward from a concatenated visual and semantic feature vector based on the encoded visual information and semantic information.
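
The regression step recited in claims 10 and 19 (concatenate a visual feature and a semantic feature, then regress a scalar) can be sketched as follows. The CNN and RNN encoders are not modeled here; their outputs are assumed to be plain feature vectors. The layer sizes, weights, ReLU activation, and the name `value_head` are hypothetical choices made for illustration.

```python
def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def linear(weights, bias, x):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def value_head(visual_feat, semantic_feat, params):
    """Regress one scalar from concatenated visual and semantic features."""
    x = visual_feat + semantic_feat  # list concatenation of the two vectors
    h = relu(linear(params["w1"], params["b1"], x))
    (score,) = linear(params["w2"], params["b2"], h)  # single output unit
    return score

# Hand-picked toy parameters: 4 inputs -> 2 hidden units -> 1 output.
params = {
    "w1": [[1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[0.5, 0.5]],
    "b2": [0.1],
}
visual_feat = [0.2, 0.4]    # stand-in for the CNN encoding of the image
semantic_feat = [0.6, 0.1]  # stand-in for the RNN encoding of the partial sentence
score = value_head(visual_feat, semantic_feat, params)
print(score)
```

Calling `value_head` once per candidate word, with the semantic feature recomputed for the partial sentence extended by that word, would yield one value network score per option.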

20. A non-transitory computer-readable medium comprising instructions stored thereon that are executable by at least one processor to cause a computing device to perform operations comprising:
- extracting an image feature from an image;
  
  analyzing, by a policy network of an image captioning system, the image feature to compute a probability of a next word to be generated for a caption describing the image feature, the probability comprising a list of options for the next word and a policy network score for each possible option in the list of options;
  
  ranking, by the policy network of the image captioning system, the list of options for the next word of the caption based on the policy network score for each possible option in the list of options;
  
  analyzing, by a value network of the image captioning system, the image feature and the probability of the next word generated by the policy network to generate a value network score for each possible option in the list of options;
  
  ranking, by the value network, the list of options for the next word of the caption based on the value network score; and
  
  selecting, by the image captioning system, a next word for the caption based on the ranking of the list of options by the policy network and the ranking of the list of options by the value network.
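
Repeating the recited steps word by word gives a decoding loop. The sketch below is an ordinary beam search whose per-step score adds a value estimate to the policy log-probability, which is one possible reading of the lookahead beam search mentioned in claim 11. The mixing weight `lam`, the beam size, the `<eos>` token, and the toy policy and value functions are all assumptions for illustration.

```python
import math

def beam_search(policy_fn, value_fn, beam_size=2, max_len=4, lam=0.4, eos="<eos>"):
    """Decode a caption, scoring each extension with policy and value terms.

    policy_fn(prefix) -> dict word -> probability of the next word.
    value_fn(prefix)  -> scalar estimate for a partial caption.
    """
    beams = [((), 0.0)]  # (words so far, accumulated score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in policy_fn(prefix).items():
                if p <= 0.0:
                    continue
                new_prefix = prefix + (word,)
                step = lam * math.log(p) + (1.0 - lam) * value_fn(new_prefix)
                candidates.append((new_prefix, score + step))
        # Keep the best extensions; completed captions leave the beam.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]

def toy_policy(prefix):
    """Hypothetical policy: a small lookup table of next-word probabilities."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.7, "cat": 0.3},
        ("the",): {"dog": 0.5, "cat": 0.5},
    }
    return table.get(prefix, {"<eos>": 1.0})

def toy_value(prefix):
    """Hypothetical value: pretend the image shows a cat."""
    return 1.0 if "cat" in prefix else 0.0

caption = beam_search(toy_policy, toy_value)
print(caption)
```

In this toy setup the policy alone (`lam=1.0`) decodes "a dog", whereas the value term steers the search to "the cat", so an early word choice is effectively revised using later context.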

Current Assignee: Snap, Inc.
Original Assignee: Snap, Inc.
Inventors: Ren, Zhou; Wang, Xiaoyu; Zhang, Ning; Lv, Xutao; Li, Jia
Primary Examiner(s): Liew, Alex Kok S
Application Number: US15/808,617
Time in Patent Office: 726 Days
CPC Class Codes

G06F 16/3344   using natural language anal...

G06F 16/338   Presentation of query results

G06F 18/256   of results relating to diff...

G06N 3/006   based on simulated virtual ...

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/084   Backpropagation, e.g. using...

G06N 5/02   Knowledge representation; S...

G06N 5/022   Knowledge engineering; Know...

G06V 10/811   the classifiers operating o...

G06V 10/82   using neural networks

G06V 10/94   Hardware or software archit...
