
Three-dimensional convolutional neural networks for video highlight detection

  • US 9,836,853 B1
  • Filed: 09/06/2016
  • Issued: 12/05/2017
  • Est. Priority Date: 09/06/2016
  • Status: Active Grant
First Claim

1. A three-dimensional convolutional neural network system for video highlight detection, the system comprising:

  • one or more physical processors configured by machine-readable instructions to:

    access video content, the video content having a duration;

    segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;

    input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:

    a preliminary layer group that, for the individual video segments:

    accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;

    increases the dimensionality of the video segment map;

    convolves the video segment map to produce a first set of feature maps;

    applies a first activating function to the first set of feature maps;

    normalizes the first set of feature maps; and

    downsamples the first set of feature maps;

    one or more intermediate layer groups that, for the individual video segments:

    receives a first output from a layer preceding the individual intermediate layer group;

    convolves the first output to reduce a number of channels of the first output;

    normalizes the first output;

    increases the dimensionality of the first output;

    convolves the first output to produce a second set of feature maps;

    convolves the first output to produce a third set of feature maps;

    concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;

    normalizes the set of concatenated feature maps;

    applies a second activating function to the set of concatenated feature maps; and

    combines the set of concatenated feature maps and the first output; and

    a final layer group that, for the individual video segments:

    receives a second output from a layer preceding the final layer group;

    reduces an overfitting from the second output;

    convolves the second output to produce a fourth set of feature maps;

    applies a third activating function to the fourth set of feature maps;

    normalizes the fourth set of feature maps;

    downsamples the fourth set of feature maps; and

    converts the fourth set of feature maps into a spatiotemporal feature vector;

    input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and

    determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.
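The claim reads as a concrete pipeline: split the video into fixed-length segments, push each segment through a 3-D CNN built from the three layer groups recited above, feed the resulting feature vectors to an LSTM that predicts what should come next, and flag a highlight where observation and prediction diverge. The PyTorch sketches below illustrate each claimed stage; every function name, channel width, and kernel size is an illustrative assumption, since the claim fixes none of them. First, segmentation into segments of a fixed "first number of video frames":

```python
import torch
import torch.nn as nn

def segment_video(frames, segment_length=16):
    """Split a frame tensor (num_frames, channels, H, W) into consecutive
    video segments of segment_length frames each. Trailing frames that do
    not fill a full segment are dropped; the claim does not specify this."""
    n = frames.shape[0] // segment_length * segment_length
    segments = frames[:n].reshape(-1, segment_length, *frames.shape[1:])
    # rearrange to (num_segments, channels, frames, H, W) for nn.Conv3d
    return segments.permute(0, 2, 1, 3, 4)
```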
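The preliminary layer group recites five steps in order: increase dimensionality, convolve, activate, normalize, downsample. A minimal sketch, reusing the imports above and reading "increases the dimensionality" as zero-padding of the spatial and temporal dimensions (one plausible interpretation, not stated in the claim):

```python
class PreliminaryLayerGroup(nn.Module):
    """First layer group of the claimed 3-D CNN. Channel counts,
    kernel sizes, and the choice of ReLU/BatchNorm/MaxPool are
    illustrative assumptions; the claim does not fix them."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        # "increases the dimensionality of the video segment map"
        self.pad = nn.ConstantPad3d(padding=1, value=0.0)
        # "convolves the video segment map to produce a first set of feature maps"
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3)
        # "applies a first activating function to the first set of feature maps"
        self.act = nn.ReLU(inplace=True)
        # "normalizes the first set of feature maps"
        self.norm = nn.BatchNorm3d(out_channels)
        # "downsamples the first set of feature maps"
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        # x: (batch, channels, frames, height, width) -- the claimed video
        # segment map with its height, width, frame, and channel dimensions
        return self.pool(self.norm(self.act(self.conv(self.pad(x)))))
```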
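Each intermediate layer group resembles a 3-D "fire" module with a residual bypass: a channel-reducing squeeze convolution, two parallel expand convolutions whose outputs are concatenated, then normalization, activation, and combination with the group's input. A sketch under that reading; padding only the 3×3×3 branch (rather than both, as a literal reading might suggest) keeps the two branches the same size for concatenation:

```python
class IntermediateLayerGroup(nn.Module):
    """One intermediate layer group. The squeeze/expand widths are
    illustrative; the residual add requires the concatenated output
    to match the input channel count."""
    def __init__(self, channels=128, squeeze=16):
        super().__init__()
        # "convolves the first output to reduce a number of channels"
        self.squeeze = nn.Conv3d(channels, squeeze, kernel_size=1)
        # "normalizes the first output"
        self.squeeze_norm = nn.BatchNorm3d(squeeze)
        # "increases the dimensionality of the first output"
        self.pad = nn.ConstantPad3d(padding=1, value=0.0)
        # two parallel convolutions producing the claimed second and
        # third sets of feature maps
        self.expand1 = nn.Conv3d(squeeze, channels // 2, kernel_size=1)
        self.expand3 = nn.Conv3d(squeeze, channels // 2, kernel_size=3)
        # "normalizes the set of concatenated feature maps"
        self.norm = nn.BatchNorm3d(channels)
        # "applies a second activating function"
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.squeeze_norm(self.squeeze(x))
        # "concatenates the second set of feature maps and the third set
        # of feature maps" along the channel dimension
        cat = torch.cat([self.expand1(s), self.expand3(self.pad(s))], dim=1)
        # "combines the set of concatenated feature maps and the first output"
        return self.act(self.norm(cat)) + x
```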
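The final layer group "reduces an overfitting" (dropout is the natural reading), convolves once more, and collapses the resulting feature maps into the claimed spatiotemporal feature vector; global average pooling is one way to realize the claimed downsampling and conversion:

```python
class FinalLayerGroup(nn.Module):
    """Final layer group: dropout, a last convolution, and reduction of
    the feature maps to a single spatiotemporal feature vector."""
    def __init__(self, in_channels=128, out_channels=256):
        super().__init__()
        # "reduces an overfitting from the second output"
        self.drop = nn.Dropout3d(p=0.5)
        # "convolves the second output to produce a fourth set of feature maps"
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=1)
        # "applies a third activating function" / "normalizes"
        self.act = nn.ReLU(inplace=True)
        self.norm = nn.BatchNorm3d(out_channels)
        # "downsamples the fourth set of feature maps"
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, x):
        y = self.norm(self.act(self.conv(self.drop(x))))
        # "converts the fourth set of feature maps into a spatiotemporal
        # feature vector": collapse to (batch, out_channels)
        return torch.flatten(self.pool(y), start_dim=1)
```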
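Finally, the LSTM consumes the per-segment feature vectors and outputs predicted vectors, and the claim determines a highlight from a "comparison" of observed and predicted vectors. An anomaly-style reading, used below, is that a segment whose observed features diverge from what the LSTM predicted from the preceding segments is unexpected and hence a candidate highlight; the Euclidean distance and the threshold value are assumptions:

```python
class HighlightDetector(nn.Module):
    """LSTM that predicts the next spatiotemporal feature vector; a
    highlight is inferred where the observed vector diverges from the
    prediction made from the preceding segments."""
    def __init__(self, feature_dim=256, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        # projects the LSTM state back to a predicted feature vector
        self.project = nn.Linear(hidden_dim, feature_dim)

    def forward(self, features, threshold=1.0):
        # features: (batch, segments, feature_dim), one vector per segment
        hidden, _ = self.lstm(features)
        predicted = self.project(hidden)
        # compare each observed vector with the prediction computed from
        # the preceding segments (shift by one position in time)
        error = torch.norm(features[:, 1:] - predicted[:, :-1], dim=-1)
        return error > threshold  # True where a highlight moment is inferred
```

Chaining segment_video, one PreliminaryLayerGroup, a stack of IntermediateLayerGroups, a FinalLayerGroup, and HighlightDetector reproduces the claimed flow end to end.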
