Three-dimensional convolutional neural networks for video highlight detection
First Claim
1. A three-dimensional convolutional neural network system for video highlight detection, the system comprising:
one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.
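The detection logic recited above can be pictured with a short sketch: the claimed 3D CNN (sketched after the Abstract below) yields one spatiotemporal feature vector per segment, a long short-term memory network predicts a feature vector for each position in the sequence, and segments whose actual vectors diverge from the predicted ones are flagged as highlight candidates. The following is a minimal PyTorch illustration, not the patented implementation; the class name HighlightDetector, the 512-dimensional vectors, the cosine-distance comparison, and the 0.5 threshold are all assumptions not taken from this excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightDetector(nn.Module):
    """Minimal sketch of the claimed pipeline: per-segment spatiotemporal
    feature vectors -> LSTM-predicted feature vectors -> comparison."""

    def __init__(self, feature_size=512, hidden_size=512):
        super().__init__()
        # The LSTM consumes the sequence of per-segment feature vectors and
        # emits a predicted feature vector for each sequence position.
        self.lstm = nn.LSTM(feature_size, hidden_size, batch_first=True)
        self.project = nn.Linear(hidden_size, feature_size)

    def forward(self, features):  # features: (batch, num_segments, feature_size)
        hidden, _ = self.lstm(features)
        return self.project(hidden)  # one predicted vector per segment

# Usage sketch: compare each segment's actual feature vector with the vector
# the LSTM predicted from the preceding segments (claim 20 frames the
# prediction as characterizing "a video segment following" the current one).
# The cosine distance and 0.5 threshold are assumptions.
detector = HighlightDetector()
features = torch.randn(1, 10, 512)           # 10 segments of one video
predicted = detector(features)               # predicted[t] ~ features[t + 1]
distance = 1 - F.cosine_similarity(features[:, 1:], predicted[:, :-1], dim=-1)
is_highlight = distance > 0.5                # (1, 9) boolean mask per segment
```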
Abstract
A three-dimensional convolutional neural network may include a preliminary layer group, one or more intermediate layer groups, a final layer group, and/or other layers/layer groups. The preliminary layer group may include an input layer, a preliminary three-dimensional padding layer, a preliminary three-dimensional convolution layer, a preliminary activation layer, a preliminary normalization layer, and a preliminary downsampling layer. One or more intermediate layer groups may include an intermediate three-dimensional squeeze layer, a first intermediate normalization layer, an intermediate three-dimensional padding layer, a first intermediate three-dimensional expand layer, a second intermediate three-dimensional expand layer, an intermediate concatenation layer, a second intermediate normalization layer, an intermediate activation layer, and an intermediate combination layer. The final layer group may include a final dropout layer, a final three-dimensional convolution layer, a final activation layer, a final normalization layer, a final three-dimensional downsampling layer, and a final flatten layer.
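The intermediate layer group named in the abstract (squeeze layer, two parallel expand layers, concatenation, combination with the group's input) reads like a SqueezeNet-style "fire" module extended to 3D with a residual connection. Below is a minimal PyTorch sketch of the three layer groups in the abstract's ordering. The class names, kernel sizes, channel counts, ReLU activations, batch normalization, and pooling choices are all assumptions; none of these specifics appear in this excerpt, so this is an illustration of the layer ordering, not the patented network.

```python
import torch
import torch.nn as nn

class PreliminaryGroup(nn.Module):
    """Input -> 3D padding -> 3D conv -> activation -> norm -> downsample."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.pad = nn.ConstantPad3d(1, 0.0)        # "increases dimensionality"
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size=3)
        self.act = nn.ReLU()                       # assumed activating function
        self.norm = nn.BatchNorm3d(out_channels)   # assumed normalization
        self.pool = nn.MaxPool3d(kernel_size=2)    # downsampling

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.pool(self.norm(self.act(self.conv(self.pad(x)))))

class IntermediateGroup(nn.Module):
    """Squeeze -> norm -> pad -> two expand convs -> concat -> norm ->
    activation -> combine with the group's input (residual)."""
    def __init__(self, channels=64, squeeze=16):
        super().__init__()
        self.squeeze = nn.Conv3d(channels, squeeze, kernel_size=1)  # fewer channels
        self.norm1 = nn.BatchNorm3d(squeeze)
        self.pad = nn.ConstantPad3d(1, 0.0)
        self.expand1 = nn.Conv3d(squeeze, channels // 2, kernel_size=1)  # 2nd maps
        self.expand3 = nn.Conv3d(squeeze, channels // 2, kernel_size=3)  # 3rd maps
        self.norm2 = nn.BatchNorm3d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        s = self.norm1(self.squeeze(x))
        out = torch.cat([self.expand1(s), self.expand3(self.pad(s))], dim=1)
        return self.act(self.norm2(out)) + x       # combine with group input

class FinalGroup(nn.Module):
    """Dropout -> 3D conv -> activation -> norm -> downsample -> flatten."""
    def __init__(self, channels=64, feature_size=512):
        super().__init__()
        self.drop = nn.Dropout(0.5)                # "reduces an overfitting"
        self.conv = nn.Conv3d(channels, feature_size, kernel_size=1)
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm3d(feature_size)
        self.pool = nn.AdaptiveAvgPool3d(1)        # downsample to 1x1x1
        self.flatten = nn.Flatten()                # -> spatiotemporal vector

    def forward(self, x):
        x = self.norm(self.act(self.conv(self.drop(x))))
        return self.flatten(self.pool(x))          # (batch, feature_size)

# Usage: one 16-frame RGB segment at 112x112 -> one 512-d feature vector.
net = nn.Sequential(PreliminaryGroup(), IntermediateGroup(), FinalGroup())
segment = torch.randn(1, 3, 16, 112, 112)
vector = net(segment)                              # torch.Size([1, 512])
```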
Claims
1. A three-dimensional convolutional neural network system for video highlight detection (independent claim; recited in full above under First Claim). Dependent claims: 2-10.
11. A method for using a three-dimensional convolutional neural network for video highlight detection, the method comprising:
accessing video content, the video content having a duration;
segmenting the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
inputting the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
inputting the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors; and
determining a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors.
Dependent claims: 12-19.
20. A three-dimensional convolutional neural network system for video highlight detection, the system comprising:
one or more physical processors configured by machine-readable instructions to:
access video content, the video content having a duration;
segment the video content into a first set of video segments, individual video segments within the first set of video segments including a first number of video frames, the first set of video segments comprising a first video segment and a second video segment, the second video segment following the first video segment within the duration;
input the first set of video segments into a first three-dimensional convolutional neural network, the first three-dimensional convolutional neural network outputting a first set of spatiotemporal feature vectors corresponding to the first set of video segments, wherein the first three-dimensional convolutional neural network includes a sequence of layers comprising:
a preliminary layer group that, for the individual video segments:
accesses a video segment map, the video segment map characterized by a height dimension, a width dimension, a number of video frames, and a number of channels;
increases the dimensionality of the video segment map;
convolves the video segment map to produce a first set of feature maps;
applies a first activating function to the first set of feature maps;
normalizes the first set of feature maps; and
downsamples the first set of feature maps;
one or more intermediate layer groups that, for the individual video segments:
receives a first output from a layer preceding the individual intermediate layer group;
convolves the first output to reduce a number of channels of the first output;
normalizes the first output;
increases the dimensionality of the first output;
convolves the first output to produce a second set of feature maps;
convolves the first output to produce a third set of feature maps;
concatenates the second set of feature maps and the third set of feature maps to produce a set of concatenated feature maps;
normalizes the set of concatenated feature maps;
applies a second activating function to the set of concatenated feature maps; and
combines the set of concatenated feature maps and the first output; and
a final layer group that, for the individual video segments:
receives a second output from a layer preceding the final layer group;
reduces an overfitting from the second output;
convolves the second output to produce a fourth set of feature maps;
applies a third activating function to the fourth set of feature maps;
normalizes the fourth set of feature maps;
downsamples the fourth set of feature maps; and
converts the fourth set of feature maps into a spatiotemporal feature vector;
input the first set of spatiotemporal feature vectors into a long short-term memory network, the long short-term memory network determining a first set of predicted spatiotemporal feature vectors based on the first set of spatiotemporal feature vectors, individual predicted spatiotemporal feature vectors corresponding to the individual video segments characterizing a prediction of a video segment following the individual video segments within the duration and/or a video segment preceding the individual video segments within the duration; and
determine a presence of a highlight moment within the video content based on a comparison of one or more spatiotemporal feature vectors of the first set of spatiotemporal feature vectors with one or more predicted spatiotemporal feature vectors of the first set of predicted spatiotemporal feature vectors;
wherein:
the first three-dimensional convolutional neural network is initialized with pre-trained weights from a trained two-dimensional convolutional neural network, the pre-trained weights from the trained two-dimensional convolutional neural network being stacked along a time dimension; and
the long short-term memory network is trained with second video content including highlights.
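Claim 20's initialization, pre-trained 2D convolution weights "stacked along a time dimension", can be sketched in a few lines; it mirrors the well-known inflation trick used by I3D-style networks. The function name inflate_2d_weights is hypothetical, and whether the stacked weights are rescaled (e.g., divided by the temporal depth to preserve the 2D network's activation scale) is not stated in the claim, so the rescaling below is an assumption.

```python
import torch
import torch.nn as nn

def inflate_2d_weights(conv2d: nn.Conv2d, time_depth: int) -> nn.Conv3d:
    """Build a 3D conv initialized by stacking a trained 2D conv's
    weights along a new time dimension (claim 20's initialization)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_depth, *conv2d.kernel_size),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, time_depth, kH, kW): repeat the 2D
        # kernel at every temporal offset. Dividing by time_depth keeps the
        # 2D network's output scale; the claim does not specify this step.
        weight = conv2d.weight.unsqueeze(2).repeat(1, 1, time_depth, 1, 1)
        conv3d.weight.copy_(weight / time_depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: inflate one layer of a trained 2D network to temporal depth 3.
conv2d = nn.Conv2d(3, 64, kernel_size=3)   # stands in for a trained 2D layer
conv3d = inflate_2d_weights(conv2d, time_depth=3)
```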
Specification