ACTIVE LEARNING METHOD FOR TEMPORAL ACTION LOCALIZATION IN UNTRIMMED VIDEOS

US 20190325275A1
Filed: 04/19/2018
Published: 10/24/2019
Est. Priority Date: 04/19/2018
Status: Active Grant

Abstract

Various embodiments describe active learning methods for training temporal action localization models used to localize actions in untrimmed videos. A trainable active learning selection function is used to select unlabeled samples that can improve the temporal action localization model the most. The selected unlabeled samples are then annotated and used to retrain the temporal action localization model. In some embodiments, the trainable active learning selection function includes a trainable performance prediction model that maps a video sample and a temporal action localization model to a predicted performance improvement for the temporal action localization model.
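
The loop summarized in the abstract (train, score each unlabeled sample, select, annotate, retrain) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `active_learning_loop` and the callables `train`, `predict_improvement`, and `annotate` are hypothetical placeholders for the localization model trainer, the trainable selection function, and the human annotation step.

```python
from typing import Callable, List, Tuple, TypeVar

Sample = TypeVar("Sample")
Label = TypeVar("Label")
Model = TypeVar("Model")


def active_learning_loop(
    labeled: List[Tuple[Sample, Label]],
    unlabeled: List[Sample],
    train: Callable[[List[Tuple[Sample, Label]]], Model],
    predict_improvement: Callable[[Model, Sample], float],
    annotate: Callable[[Sample], Label],
    rounds: int,
) -> Tuple[Model, List[Tuple[Sample, Label]]]:
    """Hypothetical sketch: one sample is selected and annotated per round."""
    labeled = list(labeled)   # copy so the caller's lists are not mutated
    pool = list(unlabeled)
    model = train(labeled)    # initial training on the labeled set
    for _ in range(rounds):
        if not pool:
            break
        # Score every remaining unlabeled sample with the selection function.
        scores = [predict_improvement(model, s) for s in pool]
        best = max(range(len(pool)), key=scores.__getitem__)
        sample = pool.pop(best)
        # The annotation and the sample together form a new labeled sample.
        labeled.append((sample, annotate(sample)))
        # Retraining yields the updated model used in the next round.
        model = train(labeled)
    return model, labeled
```

Passing the three steps in as callables keeps the sketch independent of any particular network architecture or annotation tool.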

20 Claims

1. A computer-implemented method for training a localization model that comprises a neural network and identifies a temporal location of an action in a video stream, the method comprising:
- training, by a computer system, the localization model based on a set of labeled video samples;
  
  for each unlabeled video sample in a set of unlabeled video samples, determining, by the computer system based on a trainable selection function, a predicted performance improvement of the localization model associated with retraining the localization model;
  
  selecting, by the computer system based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
  
  receiving, by the computer system, a first annotation to the first unlabeled video sample, wherein the first annotation and the first unlabeled video sample form a first labeled video sample; and
  
  retraining, by the computer system, the localization model based on the set of labeled video samples and the first labeled video sample, wherein an updated localization model is generated upon completion of the retraining.
- - 2. The computer-implemented method of claim 1, wherein retraining the localization model comprises:
    - extracting feature vectors for a plurality of temporal segments in the first labeled video sample;
      
      selecting, based on the extracted feature vectors, a temporal segment that is estimated to be associated with an action; and
      
      classifying the action associated with the selected temporal segment.
  - 3. The computer-implemented method of claim 1, further comprising:
    - for each remaining unlabeled video sample in the set of unlabeled video samples, determining, based on the trainable selection function, a predicted performance improvement of the updated localization model associated with retraining the updated localization model;
      
      selecting, based on the predicted performance improvement of the updated localization model, a second unlabeled video sample from remaining unlabeled video samples in the set of unlabeled video samples;
      
      receiving a second annotation to the second unlabeled video sample, wherein the second annotation and the second unlabeled video sample form a second labeled video sample; and
      
      retraining, by the computer system, the updated localization model based on the set of labeled video samples, the first labeled video sample, and the second labeled video sample.
  - 4. The computer-implemented method of claim 1, further comprising:
    - training the trainable selection function based on a second set of labeled video samples.
  - 5. The computer-implemented method of claim 4, further comprising:
    - retraining the trainable selection function based on the first labeled video sample.
  - 6. The computer-implemented method of claim 4, wherein:
    - training the trainable selection function comprises determining a performance prediction model that maps a video sample and a current localization model to a predicted performance improvement for the current localization model associated with retraining the current localization model based on the video sample; and
      
      determining the predicted performance improvement of the localization model comprises determining the predicted performance improvement based on the performance prediction model.
  - 7. The computer-implemented method of claim 6, wherein determining the performance prediction model comprises:
    - splitting the second set of labeled video samples into a first subset of video samples and a second subset of video samples;
      
      training the current localization model based on the first subset of video samples;
      
      determining a performance of the current localization model on a set of test video samples;
      
      for each video sample in the second subset of video samples, retraining the current localization model based on the first subset of video samples and the video sample in the second subset, wherein a new localization model is generated upon completion of the retraining;
      
      determining a performance of the new localization model on the set of test video samples; and
      
      determining a performance improvement based on the performance of the new localization model and the performance of the current localization model; and
      
      determining the performance prediction model based on parameters of the current localization model, parameters of each video sample in the second subset of video samples, and the performance improvement for each video sample in the second subset of video samples.
  - 8. The computer-implemented method of claim 7, wherein determining the performance prediction model comprises:
    - performing a regression learning process using the performance improvement for each video sample in the second subset of video samples as a target vector, and using the parameters of the current localization model and the parameters of each video sample in the second subset of video samples as a feature matrix.
  - 9. The computer-implemented method of claim 7, wherein:
    - the parameters of each video sample in the second subset of video samples are associated with a histogram of confidence scores for the video sample; and
      
      the confidence scores for the video sample are generated by applying the current localization model to temporal segments of the video sample.
  - 10. The computer-implemented method of claim 1, further comprising:
    - selecting, by the computer system based on the predicted performance improvement of the localization model, a subset of at least one unlabeled video sample from the set of unlabeled video samples; and
      
      receiving, by the computer system, annotations to the subset of at least one unlabeled video sample, the annotations and the subset of at least one unlabeled video sample forming a subset of at least one labeled video sample, wherein retraining the localization model comprises retraining the localization model based on the set of labeled video samples, the first labeled video sample, and the subset of at least one labeled video sample.
  - 11. The computer-implemented method of claim 1, further comprising:
    - selecting a plurality of unlabeled video samples based on the trainable selection function;
      
      receiving, by the computer system, annotations to the plurality of unlabeled video samples, the plurality of unlabeled video samples and the annotations forming a plurality of labeled video samples; and
      
      adding the plurality of labeled video samples to a temporal action localization dataset for training another temporal action localization model.
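
Claim 9 associates the parameters of a video sample with a histogram of the confidence scores that the current localization model produces on the sample's temporal segments. A minimal sketch of such a feature is shown below. The function name `confidence_histogram`, the bin count, the normalization to fractions, and the assumption that scores lie in [0, 1] are illustrative choices that the claims do not specify.

```python
from typing import List, Sequence


def confidence_histogram(scores: Sequence[float], bins: int = 10) -> List[float]:
    """Summarize per-segment confidence scores as a normalized histogram.

    Assumption: each score lies in [0, 1]. The result has a fixed length
    regardless of how many temporal segments the video contains.
    """
    if bins < 1:
        raise ValueError("bins must be positive")
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("confidence scores are assumed to lie in [0, 1]")
        # A score of exactly 1.0 is placed in the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(scores)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]
```

A fixed-length vector of this kind lets videos with different numbers of segments share one feature layout for the performance prediction model.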

12. A system for training a localization model that identifies a temporal location of an action in a video stream, the system comprising:
- means for training a selection function using a first set of labeled video samples;
  
  means for training the localization model based on a second set of labeled video samples;
  
  means for determining, based on a trainable selection function, for each unlabeled video sample in a set of unlabeled video samples, a predicted performance improvement of the localization model associated with retraining the localization model;
  
  means for selecting, based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
  
  means for receiving an annotation to the first unlabeled video sample, the annotation and the first unlabeled video sample forming a first labeled video sample; and
  
  means for retraining the localization model based on the second set of labeled video samples and the first labeled video sample.
- - 13. The system of claim 12, wherein:
    - the means for training the selection function comprises means for determining a performance prediction model that maps a video sample and a current localization model to a predicted performance improvement for the current localization model associated with retraining the current localization model based on the video sample; and
      
      the means for determining the predicted performance improvement of the localization model comprises means for determining the predicted performance improvement based on the performance prediction model.
  - 14. The system of claim 13, wherein the means for determining the performance prediction model comprises:
    - means for splitting the first set of labeled video samples into a first subset of video samples and a second subset of video samples;
      
      means for training the current localization model based on the first subset of video samples;
      
      means for determining a performance of the current localization model on a set of test video samples;
      
      means for retraining, for each video sample in the second subset of video samples, the current localization model based on the first subset of video samples and the video sample in the second subset, wherein a new localization model is generated upon completion of the retraining;
      
      means for determining, for each video sample in the second subset of video samples, a performance of the new localization model on the set of test video samples; and
      
      means for determining, for each video sample in the second subset of video samples, a performance improvement based on the performance of the new localization model and the performance of the current localization model; and
      
      means for determining the performance prediction model based on parameters of the current localization model, parameters of each video sample in the second subset of video samples, and the performance improvement for each video sample in the second subset of video samples.
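
The procedure recited in claims 7 and 14 for measuring per-sample performance improvement can be sketched as follows. This is an illustration only: `improvement_targets`, `train`, and `evaluate` are hypothetical names, and the evaluation metric is left to the caller because the claims do not name one.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def improvement_targets(
    first_subset: Sequence[Sample],
    second_subset: Sequence[Sample],
    test_set: Sequence[Sample],
    train: Callable[[List[Sample]], Model],
    evaluate: Callable[[Model, Sequence[Sample]], float],
) -> Tuple[Model, List[float]]:
    """Return the current model and one measured improvement per held-out sample."""
    # Train the current model on the first subset and score it on the test set.
    current_model = train(list(first_subset))
    current_score = evaluate(current_model, test_set)
    improvements = []
    for sample in second_subset:
        # Retrain with exactly one additional labeled sample.
        new_model = train(list(first_subset) + [sample])
        # Improvement is the new model's score minus the current model's score.
        improvements.append(evaluate(new_model, test_set) - current_score)
    return current_model, improvements
```

Each held-out sample triggers one full retraining, so the cost grows linearly with the size of the second subset.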

15. A computer-readable non-transitory storage medium storing computer-executable instructions for training a localization model that comprises a neural network and identifies a temporal location of an action in a video stream, wherein the instructions, when executed by one or more processing devices, cause the one or more processing devices to perform operations comprising:
- training the localization model based on a set of labeled video samples;
  
  for each unlabeled video sample in a set of unlabeled video samples, determining, based on a trainable selection function, a predicted performance improvement of the localization model associated with retraining the localization model;
  
  selecting, based on the predicted performance improvement of the localization model, a first unlabeled video sample from the set of unlabeled video samples;
  
  receiving an annotation to the first unlabeled video sample, the annotation and the first unlabeled video sample forming a first labeled video sample; and
  
  retraining the localization model based on the set of labeled video samples and the first labeled video sample.
- - 16. The computer-readable non-transitory storage medium of claim 15, wherein training the localization model comprises, for each labeled video sample in the set of labeled video samples:
    - extracting feature vectors for a plurality of temporal segments in the labeled video sample;
      
      selecting, based on the extracted feature vectors, a temporal segment that is estimated to be associated with an action;
      
      classifying the action associated with the selected temporal segment; and
      
      comparing the classified action and timing associated with the selected temporal segment with a label associated with the labeled video sample.
  - 17. The computer-readable non-transitory storage medium of claim 15, wherein the operations further comprise:
    - training the trainable selection function based on a second set of labeled video samples.
  - 18. The computer-readable non-transitory storage medium of claim 17, wherein:
    - training the trainable selection function comprises determining a performance prediction model that maps a video sample and a current localization model to a predicted performance improvement for the current localization model associated with retraining the current localization model based on the video sample; and
      
      determining the predicted performance improvement of the localization model comprises determining the predicted performance improvement based on the performance prediction model.
  - 19. The computer-readable non-transitory storage medium of claim 18, wherein determining the performance prediction model comprises:
    - splitting the second set of labeled video samples into a first subset of video samples and a second subset of video samples;
      
      training the current localization model based on the first subset of video samples;
      
      determining a performance of the current localization model on a set of test video samples;
      
      for each video sample in the second subset of video samples, retraining the current localization model based on the first subset of video samples and the video sample in the second subset, wherein a new localization model is generated upon completion of the retraining;
      
      determining a performance of the new localization model on the set of test video samples; and
      
      determining a performance improvement based on the performance of the new localization model and the performance of the current localization model; and
      
      determining the performance prediction model based on parameters of the current localization model, parameters of each video sample in the second subset of video samples, and the performance improvement for each video sample in the second subset of video samples.
  - 20. The computer-readable non-transitory storage medium of claim 19, wherein determining the performance prediction model comprises:
    - performing a regression learning process using the performance improvement for each video sample in the second subset of video samples as a target vector, and using the parameters of the current localization model and the parameters of each video sample in the second subset of video samples as a feature matrix.
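
Claims 8 and 20 recite a regression learning process with the measured improvements as the target vector and the model and sample parameters as the feature matrix. The claims do not name a regression technique, so the sketch below swaps in ordinary linear regression fitted by batch gradient descent as a stand-in. The names `fit_linear_regression` and `predict`, the learning rate, and the epoch count are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def fit_linear_regression(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    lr: float = 0.05,
    epochs: int = 5000,
) -> Tuple[List[float], float]:
    """Fit weights and a bias by minimizing mean squared error.

    Each row of `features` is assumed to concatenate the current model's
    parameters with one video sample's parameters. Each entry of `targets`
    is the measured performance improvement for that sample.
    """
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        # Residuals are computed once per epoch, before any update.
        errors = [
            sum(w * x for w, x in zip(weights, row)) + bias - t
            for row, t in zip(features, targets)
        ]
        for j in range(d):
            grad = 2.0 / n * sum(e * row[j] for e, row in zip(errors, features))
            weights[j] -= lr * grad
        bias -= lr * 2.0 / n * sum(errors)
    return weights, bias


def predict(weights: Sequence[float], bias: float, row: Sequence[float]) -> float:
    """Predicted performance improvement for one feature row."""
    return sum(w * x for w, x in zip(weights, row)) + bias
```

Once fitted, `predict` plays the role of the performance prediction model when ranking unlabeled samples.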

Current Assignee
Adobe Inc.
Original Assignee
Adobe Inc.
Inventors
Lee, Joon-Young; Jin, Hailin; Caba Heilbron, Fabian David

Granted Patent

US 10,726,313 B2
CPC Class Codes

G06F 18/24   Classification techniques

G06N 3/044   Recurrent networks, e.g. Ho...

G06N 3/045   Combinations of networks

G06N 3/08   Learning methods

G06N 7/01   Probabilistic graphical mod...

G06V 10/82   using neural networks

G06V 20/41   Higher-level, semantic clus...

G06V 20/49   Segmenting video sequences,...

G06V 30/1914   Determining representative ...

G06V 30/19167   Active pattern learning
