ACTIVE LEARNING METHOD FOR TEMPORAL ACTION LOCALIZATION IN UNTRIMMED VIDEOS
First Claim
1. A computer-implemented method for training a localization model that comprises a neural network and identifies a temporal location of an action in a video stream, the method comprising:
training, by a computer system, the localization model based on a set of labeled video samples;
for each unlabeled video sample in a set of unlabeled video samples, determining, by the computer system based on a trainable selection function, a predicted performance improvement of the localization model associated with retraining the localization model;
selecting, by the computer system based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
receiving, by the computer system, a first annotation to the first unlabeled video sample, wherein the first annotation and the first unlabeled video sample form a first labeled video sample; and
retraining, by the computer system, the localization model based on the set of labeled video samples and the first labeled video sample, wherein an updated localization model is generated upon completion of the retraining.
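The loop recited in claim 1 can be sketched in Python as follows. This is a minimal, hypothetical illustration, not the patented implementation: the callables `train`, `predict_improvement`, and `annotate` stand in for the localization model's training routine, the trainable selection function, and a human annotator, and none of these names come from the patent.

```python
from typing import Any, Callable, List, Tuple

Video = Any       # hypothetical stand-in for an untrimmed video sample
Annotation = Any  # hypothetical stand-in for temporal action labels
Model = Any       # hypothetical stand-in for a trained localization model
Labeled = List[Tuple[Video, Annotation]]


def active_learning_round(
    train: Callable[[Labeled], Model],
    predict_improvement: Callable[[Model, Video], float],
    annotate: Callable[[Video], Annotation],
    labeled: Labeled,
    unlabeled: List[Video],
) -> Tuple[Model, Labeled, List[Video]]:
    """One round: train, score every unlabeled sample, select, annotate, retrain."""
    if not unlabeled:
        raise ValueError("no unlabeled samples left to select from")
    # Train the localization model on the current labeled set.
    model = train(labeled)
    # Predicted performance improvement for each unlabeled sample.
    scores = [predict_improvement(model, video) for video in unlabeled]
    # Select the sample with the largest predicted improvement.
    best = max(range(len(unlabeled)), key=lambda i: scores[i])
    chosen = unlabeled[best]
    # The annotation and the chosen sample form a new labeled sample.
    new_labeled = labeled + [(chosen, annotate(chosen))]
    remaining = unlabeled[:best] + unlabeled[best + 1:]
    # Retrain on the original labeled set plus the new labeled sample.
    updated_model = train(new_labeled)
    return updated_model, new_labeled, remaining
```

The claim recites a single selection; repeating the round with the returned lists until an annotation budget is spent is one natural way to iterate, but that extension is an assumption of this sketch.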
Abstract
Various embodiments describe active learning methods for training temporal action localization models used to localize actions in untrimmed videos. A trainable active learning selection function is used to select the unlabeled samples that are predicted to improve the temporal action localization model the most. The selected unlabeled samples are then annotated and used to retrain the temporal action localization model. In some embodiments, the trainable active learning selection function includes a trainable performance prediction model that maps a video sample and a temporal action localization model to a predicted performance improvement for the temporal action localization model.
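The abstract says the performance prediction model maps a video sample and a localization model to a predicted improvement, but does not specify its form. The sketch below assumes, purely for illustration, a linear regressor over a hand-built joint feature vector; the feature choices (mean confidence, an uncertainty proxy, the model's current mAP) and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LinearPerformancePredictor:
    """Hypothetical trainable selection function: a linear regressor over a
    joint feature vector describing (video sample, current model)."""
    weights: List[float]
    bias: float = 0.0

    def predict(self, features: Sequence[float]) -> float:
        # Predicted performance improvement (e.g. a change in mAP) if the
        # sample were labeled and the model retrained.
        return self.bias + sum(w * x for w, x in zip(self.weights, features))


def joint_features(segment_scores: Sequence[float], model_map: float) -> List[float]:
    """Hypothetical features: statistics of the current model's per-segment
    confidence scores on the video, plus the model's current validation mAP."""
    if not segment_scores:
        raise ValueError("need at least one segment score")
    n = len(segment_scores)
    mean = sum(segment_scores) / n
    # Uncertainty proxy: 1.0 when a score is 0.5, 0.0 when it is 0 or 1.
    uncertainty = sum(1.0 - abs(2.0 * s - 1.0) for s in segment_scores) / n
    return [mean, uncertainty, model_map]
```

Feeding the model's state into the features is what lets a single predictor score the same video differently as the localization model improves.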
20 Claims
1. A computer-implemented method for training a localization model that comprises a neural network and identifies a temporal location of an action in a video stream, the method comprising:
training, by a computer system, the localization model based on a set of labeled video samples;
for each unlabeled video sample in a set of unlabeled video samples, determining, by the computer system based on a trainable selection function, a predicted performance improvement of the localization model associated with retraining the localization model;
selecting, by the computer system based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
receiving, by the computer system, a first annotation to the first unlabeled video sample, wherein the first annotation and the first unlabeled video sample form a first labeled video sample; and
retraining, by the computer system, the localization model based on the set of labeled video samples and the first labeled video sample, wherein an updated localization model is generated upon completion of the retraining.
(Dependent claims: 2-11)
12. A system for training a localization model that identifies a temporal location of an action in a video stream, the system comprising:
means for training a selection function using a first set of labeled video samples;
means for training the localization model based on a second set of labeled video samples;
means for determining, based on a trainable selection function, for each unlabeled video sample in a set of unlabeled video samples, a predicted performance improvement of the localization model associated with retraining the localization model;
means for selecting, based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
means for receiving an annotation to the first unlabeled video sample, the annotation and the first unlabeled video sample forming a first labeled video sample; and
means for retraining the localization model based on the second set of labeled video samples and the first labeled video sample.
(Dependent claims: 13, 14)
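Claim 12 recites training the selection function on a first set of labeled video samples without saying how. One plausible approach, assumed here rather than taken from the patent, is to generate regression targets by retraining the localization model with each labeled sample added and recording the change in an evaluation metric, then fitting the predictor to those targets. The helper names (`measure_improvements`, `fit_linear`) are hypothetical, and stochastic gradient descent on squared error is just one choice of fitting method.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[List[float], float]  # (feature vector, observed improvement)


def measure_improvements(
    base: List[Any],
    candidates: Sequence[Any],
    train: Callable[[List[Any]], Any],
    evaluate: Callable[[Any], float],
    features_fn: Callable[[Any, Any], List[float]],
) -> List[Example]:
    """Build regression targets by retraining: for each labeled candidate, the
    target is evaluate(model trained with it) minus evaluate(model without it)."""
    base_model = train(base)
    base_score = evaluate(base_model)
    data: List[Example] = []
    for sample in candidates:
        gain = evaluate(train(base + [sample])) - base_score
        data.append((features_fn(base_model, sample), gain))
    return data


def fit_linear(
    data: List[Example], epochs: int = 500, lr: float = 0.05, seed: int = 0
) -> Tuple[List[float], float]:
    """Fit weights and bias minimizing squared error with plain SGD."""
    if not data:
        raise ValueError("need at least one training example")
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            err = b + sum(wj * xj for wj, xj in zip(w, x)) - y
            # Gradient step on 0.5 * err**2 for every weight and the bias.
            w = [wj - lr * err * xj for wj, xj in zip(w, x)]
            b -= lr * err
    return w, b
```

In this sketch each target requires a full retraining of the localization model, so building the training set dominates its cost.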
15. A computer-readable non-transitory storage medium storing computer-executable instructions for training a localization model that comprises a neural network and identifies a temporal location of an action in a video stream, wherein the instructions, when executed by one or more processing devices, cause the one or more processing devices to perform operations comprising:
training the localization model based on a set of labeled video samples;
for each unlabeled video sample in a set of unlabeled video samples, determining, based on a trainable selection function, a predicted performance improvement of the localization model associated with retraining the localization model;
selecting, based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
receiving an annotation to the first unlabeled video sample, the annotation and the first unlabeled video sample forming a first labeled video sample; and
retraining the localization model based on the set of labeled video samples and the first labeled video sample.
(Dependent claims: 16-20)
Specification