Three-dimensional convolutional neural networks for video highlight detection
First Claim
1. A three-dimensional convolutional neural network system for video highlight detection, the system comprising:
one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.
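The detection logic recited above can be pictured with a short sketch: the claimed 3D CNN (sketched after the Abstract below) yields one spatiotemporal feature vector per segment, a long short-term memory network predicts a feature vector for each position in the sequence, and segments whose actual vectors diverge from the predicted ones are flagged as highlight candidates. The following is a minimal PyTorch illustration, not the patented implementation; the class name HighlightDetector, the 512-dimensional vectors, the cosine-distance comparison, and the 0.5 threshold are all assumptions not taken from this excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightDetector(nn.Module):
    """Minimal sketch of the claimed pipeline: per-segment spatiotemporal
    feature vectors -> LSTM-predicted feature vectors -> comparison."""

    def __init__(self, feature_size=512, hidden_size=512):
        super().__init__()
        # The LSTM consumes the sequence of per-segment feature vectors and
        # emits a predicted feature vector for each sequence position.
        self.lstm = nn.LSTM(feature_size, hidden_size, batch_first=True)
        self.project = nn.Linear(hidden_size, feature_size)

    def forward(self, features):  # features: (batch, num_segments, feature_size)
        hidden, _ = self.lstm(features)
        return self.project(hidden)  # one predicted vector per segment

# Usage sketch: compare each segment's actual feature vector with the vector
# the LSTM predicted from the preceding segments (claim 20 frames the
# prediction as characterizing "a video segment following" the current one).
# The cosine distance and 0.5 threshold are assumptions.
detector = HighlightDetector()
features = torch.randn(1, 10, 512)           # 10 segments of one video
predicted = detector(features)               # predicted[t] ~ features[t + 1]
distance = 1 - F.cosine_similarity(features[:, 1:], predicted[:, :-1], dim=-1)
is_highlight = distance > 0.5                # (1, 9) boolean mask per segment
```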
Abstract
A three-dimensional convolutional neural network may include a preliminary layer group, one or more intermediate layer groups, a final layer group, and/or other layers/layer groups. The preliminary layer group may include an input layer, a preliminary three-dimensional padding layer, a preliminary three-dimensional convolution layer, a preliminary activation layer, a preliminary normalization layer, and a preliminary downsampling layer. One or more intermediate layer groups may include an intermediate three-dimensional squeeze layer, a first intermediate normalization layer, an intermediate three-dimensional padding layer, a first intermediate three-dimensional expand layer, a second intermediate three-dimensional expand layer, an intermediate concatenation layer, a second intermediate normalization layer, an intermediate activation layer, and an intermediate combination layer. The final layer group may include a final dropout layer, a final three-dimensional convolution layer, a final activation layer, a final normalization layer, a final three-dimensional downsampling layer, and a final flatten layer.
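The intermediate layer group named in the abstract (squeeze layer, two parallel expand layers, concatenation, combination with the group's input) reads like a SqueezeNet-style "fire" module extended to 3D with a residual connection. Below is a minimal PyTorch sketch of the three layer groups in the abstract's ordering. The class names, kernel sizes, channel counts, ReLU activations, batch normalization, and pooling choices are all assumptions; none of these specifics appear in this excerpt, so this is an illustration of the layer ordering, not the patented network.

```python
import torch
import torch.nn as nn

class PreliminaryGroup(nn.Module):
    """Input -> 3D padding -> 3D conv -> activation -> norm -> downsample."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.pad = nn.ConstantPad3d(1, 0.0)        # "increases dimensionality"
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3)
        self.act = nn.ReLU()                       # assumed activating function
        self.norm = nn.BatchNorm3d(out_channels)   # assumed normalization
        self.pool = nn.MaxPool3d(kernel_size=2)    # downsampling

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.pool(self.norm(self.act(self.conv(self.pad(x)))))

class IntermediateGroup(nn.Module):
    """Squeeze -> norm -> pad -> two expand convs -> concat -> norm ->
    activation -> combine with the group's input (residual)."""
    def __init__(self, channels=64, squeeze=16):
        super().__init__()
        self.squeeze = nn.Conv3d(channels, squeeze, kernel_size=1)  # fewer channels
        self.norm1 = nn.BatchNorm3d(squeeze)
        self.pad = nn.ConstantPad3d(1, 0.0)
        self.expand1 = nn.Conv3d(squeeze, channels // 2, kernel_size=1)  # 2nd maps
        self.expand3 = nn.Conv3d(squeeze, channels // 2, kernel_size=3)  # 3rd maps
        self.norm2 = nn.BatchNorm3d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        s = self.norm1(self.squeeze(x))
        out = torch.cat([self.expand1(s), self.expand3(self.pad(s))], dim=1)
        return self.act(self.norm2(out)) + x       # combine with group input

class FinalGroup(nn.Module):
    """Dropout -> 3D conv -> activation -> norm -> downsample -> flatten."""
    def __init__(self, channels=64, feature_size=512):
        super().__init__()
        self.drop = nn.Dropout(0.5)                # "reduces an overfitting"
        self.conv = nn.Conv3d(channels, feature_size, kernel_size=1)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm3d(feature_size)
        self.pool = nn.AdaptiveAvgPool3d(1)        # downsample to 1x1x1
        self.flatten = nn.Flatten()                # -> spatiotemporal vector

    def forward(self, x):
        x = self.norm(self.act(self.conv(self.drop(x))))
        return self.flatten(self.pool(x))          # (batch, feature_size)

# Usage: one 16-frame RGB segment at 112x112 -> one 512-d feature vector.
net = nn.Sequential(PreliminaryGroup(), IntermediateGroup(), FinalGroup())
segment = torch.randn(1, 3, 16, 112, 112)
vector = net(segment)                              # torch.Size([1, 512])
```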
Claims
1. A three-dimensional convolutional neural network system for video highlight detection (independent claim; recited in full above under First Claim). Dependent claims: 2-10.
11. A method for using a three-dimensional convolutional neural network for video highlight detection, the method comprising:
accessing video content, the video content having a duration;
segmenting the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
inputting the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
inputting the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determining a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.
Dependent claims: 12-19.
20. A three-dimensional convolutional neural network system for video highlight detection, the system comprising:
one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors, individual predicted spatiotemporal feature vectors corresponding to the individual video segments characterizing a prediction of a video segment following the individual video segments within the duration and/or a video segment preceding the individual video segments within the duration; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors;
wherein:
the first three-dimensional convolutional neural network is initialized with pre-trained weights from a trained two-dimensional convolutional neural network, the pre-trained weights from the trained two-dimensional convolutional neural network being stacked along a time dimension; and
the long short-term memory network is trained with second video content including highlights.
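Claim 20's initialization, pre-trained 2D convolution weights "stacked along a time dimension", can be sketched in a few lines; it mirrors the well-known inflation trick used by I3D-style networks. The function name inflate_2d_weights is hypothetical, and whether the stacked weights are rescaled (e.g., divided by the temporal depth to preserve the 2D network's activation scale) is not stated in the claim, so the rescaling below is an assumption.

```python
import torch
import torch.nn as nn

def inflate_2d_weights(conv2d: nn.Conv2d, time_depth: int) -> nn.Conv3d:
    """Build a 3D conv initialized by stacking a trained 2D conv's
    weights along a new time dimension (claim 20's initialization)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_depth, *conv2d.kernel_size),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, time_depth, kH, kW): repeat the 2D
        # kernel at every temporal offset. Dividing by time_depth keeps the
        # 2D network's output scale; the claim does not specify this step.
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
        conv3d.weight.copy_(weight / time_depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: inflate one layer of a trained 2D network to temporal depth 3.
conv2d = nn.Conv2d(3, 64, kernel_size=3)   # stands in for a trained 2D layer
conv3d = inflate_2d_weights(conv2d, time_depth=3)
```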
Specification