Jointly modeling embedding and translation to bridge video and language
Abstract
Video description generation using neural network training based on relevance and coherence is described. In some examples, long short-term memory with visual-semantic embedding (LSTM-E) can maximize the probability of generating the next word given previous words and visual content, and can create a visual-semantic embedding space that enforces the relationship between the semantics of an entire sentence and the visual content. LSTM-E can include 2-D and/or 3-D deep convolutional neural networks for learning a powerful video representation, a deep recurrent neural network for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics.
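The abstract describes three cooperating pieces: a CNN video encoder, an LSTM sentence generator, and a joint embedding trained with a relevance loss (video and sentence should lie close together in the embedding space) and a coherence loss (the sentence should be a fluent word-by-word continuation). A minimal PyTorch-style sketch of that joint objective follows; the dimensions, the mean-pooled sentence representation, and all names are illustrative assumptions, and the CNN features are taken as a precomputed input rather than modeled.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTME(nn.Module):
    """Minimal LSTM-E sketch: joint visual-semantic embedding plus an LSTM
    decoder. Dimensions are illustrative, not values from the patent."""

    def __init__(self, feat_dim=4096, embed_dim=512, vocab_size=10000):
        super().__init__()
        # Bias-free linear maps stand in for the two transformation matrices.
        self.T_v = nn.Linear(feat_dim, embed_dim, bias=False)   # video content
        self.T_s = nn.Linear(embed_dim, embed_dim, bias=False)  # semantics
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, video_feat, captions):
        # Project video features and (mean-pooled) sentence semantics into
        # the shared visual-semantic embedding space.
        v = self.T_v(video_feat)                        # (B, embed_dim)
        s = self.T_s(self.word_emb(captions).mean(1))   # (B, embed_dim)
        relevance = F.mse_loss(v, s)  # relevance loss: embeddings should match
        # Coherence loss: predict each next word given previous words and
        # the video embedding.
        inputs = self.word_emb(captions[:, :-1]) + v.unsqueeze(1)
        hidden, _ = self.lstm(inputs)
        logits = self.out(hidden)
        coherence = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                    captions[:, 1:].reshape(-1))
        return relevance, coherence
```

The sketches after the independent claims below reuse this hypothetical module.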
Claims

1. An apparatus comprising:

a processor; and

a computer-readable medium storing modules of instructions that, when executed by the processor, configure the apparatus to perform video description generation, the modules comprising:

a training module to configure the processor to train a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold, the training module further configured to adjust one or more parameters associated with the semantics transformation matrix in response to an energy value being applied to a recurrent neural network;

a video description module to configure the processor to generate a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix; and

an output module to configure the processor to generate an output based at least in part on the textual description for the inputted video.

Dependent claims: 2, 3, 4, 5, 6, 7.
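Claim 1 recites training against a coherence loss threshold and a relevance loss threshold. One plausible reading, sketched below against the hypothetical `LSTME` module above, treats the two thresholds as a stopping criterion for joint training; the trade-off weight `lam` follows the weighted-sum combination used in the LSTM-E literature and is not language from the claim.

```python
import torch

# Hypothetical training module for claim 1. Treating the thresholds as a
# stopping criterion is an assumption; the claim does not fix their role.
def train(model, pairs, rel_threshold, coh_threshold, lam=0.7, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for video_feat, caption in pairs:  # plurality of video/descriptive text pairs
        relevance, coherence = model(video_feat, caption)
        loss = (1.0 - lam) * relevance + lam * coherence
        optimizer.zero_grad()
        loss.backward()   # adjusts T_v, T_s, and LSTM parameters jointly
        optimizer.step()
        if relevance.item() < rel_threshold and coherence.item() < coh_threshold:
            break  # both losses fell below their thresholds; stop early
```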
8. A system comprising:

a processor; and

a computer-readable medium including instructions that, when executed by the processor, configure the processor to:

train a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold;

determine an energy value for a first of the plurality of video/descriptive text pairs based at least on the video content transformation matrix and the semantics transformation matrix;

generate a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix; and

generate an output based at least in part on the textual description for the inputted video.

Dependent claims: 9, 10, 11, 12, 13, 14.
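Claim 8 additionally recites determining an energy value for a video/descriptive-text pair from the two transformation matrices. The claim does not fix the functional form; the squared Euclidean distance between the two projections, a common choice for visual-semantic embeddings, is assumed here.

```python
import torch

def energy(model, video_feat, caption):
    """Energy of one video/descriptive-text pair: squared distance between
    the projected video and sentence embeddings. The distance form and the
    mean-pooled sentence representation are assumptions, not claim language."""
    with torch.no_grad():
        v = model.T_v(video_feat)                       # video content transformation
        s = model.T_s(model.word_emb(caption).mean(1))  # semantics transformation
        return ((v - s) ** 2).sum(dim=-1)               # one energy value per pair
```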
15. A method comprising:

training a neural network, a video content transformation matrix, and a semantics transformation matrix based at least in part on a plurality of video/descriptive text pairs, a coherence loss threshold, and a relevance loss threshold;

generating a textual description for an inputted video based at least in part on information associated with the inputted video, the neural network, the video content transformation matrix, and the semantics transformation matrix;

generating a recurrent neural network (RNN) model to identify a relationship between the video and the textual description of the inputted video, wherein the RNN comprises one or more parameters optimized to minimize at least one of a relevance loss value or a coherence loss value; and

generating an output based at least in part on the textual description for the inputted video.

Dependent claims: 16, 17, 18, 19, 20.
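Claim 15 ends with generating a textual description; a minimal illustration is greedy decoding from the trained LSTM, conditioning each step on the embedded video. The begin/end token ids and the per-step addition of the video embedding are assumptions made for the sketch.

```python
import torch

def describe(model, video_feat, bos=1, eos=2, max_len=20):
    """Greedy decoding sketch: emit one word id per step until the
    (hypothetical) end-of-sentence id or the length cap is reached."""
    v = model.T_v(video_feat)              # (1, embed_dim) video embedding
    token = torch.tensor([[bos]])
    state, words = None, []
    for _ in range(max_len):
        inp = model.word_emb(token) + v.unsqueeze(1)   # condition on the video
        hidden, state = model.lstm(inp, state)
        token = model.out(hidden[:, -1]).argmax(-1, keepdim=True)
        if token.item() == eos:
            break
        words.append(token.item())
    return words  # word ids; mapping back to text needs a vocabulary
```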
Specification