Jointly modeling embedding and translation to bridge video and language
Abstract
Video description generation using neural network training based on relevance and coherence is described. In some examples, long short-term memory with visual-semantic embedding (LSTM-E) can maximize the probability of generating the next word given previous words and visual content, and can create a visual-semantic embedding space that enforces the relationship between the semantics of an entire sentence and the visual content. LSTM-E can include 2-D and/or 3-D deep convolutional neural networks for learning a powerful video representation, a deep recurrent neural network for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics.
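The abstract describes three cooperating pieces: a CNN video encoder, an LSTM sentence generator, and a joint embedding trained with a relevance loss (video and sentence should lie close together in the embedding space) and a coherence loss (the sentence should be a fluent word-by-word continuation). A minimal PyTorch-style sketch of that joint objective follows; the dimensions, the mean-pooled sentence representation, and all names are illustrative assumptions, and the CNN features are taken as a precomputed input rather than modeled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTME(nn.Module):
    """Minimal LSTM-E sketch: joint visual-semantic embedding plus an LSTM
    decoder. Dimensions are illustrative, not values from the patent."""

    def __init__(self, feat_dim=4096, embed_dim=512, vocab_size=10000):
        super().__init__()
        # Bias-free linear maps stand in for the two transformation matrices.
        self.T_v = nn.Linear(feat_dim, embed_dim, bias=False)   # video content
        self.T_s = nn.Linear(embed_dim, embed_dim, bias=False)  # semantics
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, captions):
        # Project video features and (mean-pooled) sentence semantics into
        # the shared visual-semantic embedding space.
        v = self.T_v(video_feat)                        # (B, embed_dim)
        s = self.T_s(self.word_emb(captions).mean(1))   # (B, embed_dim)
        relevance = F.mse_loss(v, s)  # relevance loss: embeddings should match
        # Coherence loss: predict each next word given previous words and
        # the video embedding.
        inputs = self.word_emb(captions[:, :-1]) + v.unsqueeze(1)
        hidden, _ = self.lstm(inputs)
        logits = self.out(hidden)
        coherence = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                    captions[:, 1:].reshape(-1))
        return relevance, coherence
```

The sketches after the independent claims below reuse this hypothetical module.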
Claims

1. An apparatus comprising:

a processor; and

a computer-readable medium storing modules of instructions that, when executed by the processor, configure the apparatus to perform video description generation, the modules comprising:

a training module to configure the processor to train a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold, the training module further configured to adjust one or more parameters associated with the semantics transformation matrix in response to an energy value being applied to a recurrent neural network;

a video description module to configure the processor to generate a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix; and

an output module to configure the processor to generate an output based at least in part on the textual description for the inputted video.

Dependent claims: 2, 3, 4, 5, 6, 7.
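Claim 1 recites training against a coherence loss threshold and a relevance loss threshold. One plausible reading, sketched below against the hypothetical `LSTME` module above, treats the two thresholds as a stopping criterion for joint training; the trade-off weight `lam` follows the weighted-sum combination used in the LSTM-E literature and is not language from the claim.

```python
import torch

# Hypothetical training module for claim 1. Treating the thresholds as a
# stopping criterion is an assumption; the claim does not fix their role.
def train(model, pairs, rel_threshold, coh_threshold, lam=0.7, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for video_feat, caption in pairs:  # plurality of video/descriptive text pairs
        relevance, coherence = model(video_feat, caption)
        loss = (1.0 - lam) * relevance + lam * coherence
        optimizer.zero_grad()
        loss.backward()   # adjusts T_v, T_s, and LSTM parameters jointly
        optimizer.step()
        if relevance.item() < rel_threshold and coherence.item() < coh_threshold:
            break  # both losses fell below their thresholds; stop early
```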
8. A system comprising:

a processor; and

a computer-readable medium including instructions that, when executed by the processor, configure the processor to:

train a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold;

determine an energy value for a first of the plurality of video/descriptive text pairs based at least on the video content transformation matrix and the semantics transformation matrix;

generate a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix; and

generate an output based at least in part on the textual description for the inputted video.

Dependent claims: 9, 10, 11, 12, 13, 14.
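Claim 8 additionally recites determining an energy value for a video/descriptive-text pair from the two transformation matrices. The claim does not fix the functional form; the squared Euclidean distance between the two projections, a common choice for visual-semantic embeddings, is assumed here.

```python
import torch

def energy(model, video_feat, caption):
    """Energy of one video/descriptive-text pair: squared distance between
    the projected video and sentence embeddings. The distance form and the
    mean-pooled sentence representation are assumptions, not claim language."""
    with torch.no_grad():
        v = model.T_v(video_feat)                       # video content transformation
        s = model.T_s(model.word_emb(caption).mean(1))  # semantics transformation
        return ((v - s) ** 2).sum(dim=-1)               # one energy value per pair
```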
15. A method comprising:

training a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold;

generating a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix;

generating a recurrent neural network (RNN) model to identify a relationship between the video and the textual description of the inputted video, wherein the RNN comprises one or more parameters optimized to minimize at least one of a relevance loss value or a coherence loss value; and

generating an output based at least in part on the textual description for the inputted video.

Dependent claims: 16, 17, 18, 19, 20.
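Claim 15 ends with generating a textual description; a minimal illustration is greedy decoding from the trained LSTM, conditioning each step on the embedded video. The begin/end token ids and the per-step addition of the video embedding are assumptions made for the sketch.

```python
import torch

def describe(model, video_feat, bos=1, eos=2, max_len=20):
    """Greedy decoding sketch: emit one word id per step until the
    (hypothetical) end-of-sentence id or the length cap is reached."""
    v = model.T_v(video_feat)              # (1, embed_dim) video embedding
    token = torch.tensor([[bos]])
    state, words = None, []
    for _ in range(max_len):
        inp = model.word_emb(token) + v.unsqueeze(1)   # condition on the video
        hidden, state = model.lstm(inp, state)
        token = model.out(hidden[:, -1]).argmax(-1, keepdim=True)
        if token.item() == eos:
            break
        words.append(token.item())
    return words  # word ids; mapping back to text needs a vocabulary
```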
Specification