Method and apparatus for annotating a video stream comprising a sequence of frames

US 10,140,508 B2
Filed: 08/26/2016
Issued: 11/27/2018
Est. Priority Date: 08/26/2016
Status: Active Grant

Abstract

Systems and methods are disclosed herein for annotating video tracks obtained from video data streams. A video track is treated as positive if it contains at least one region of interest containing a particular person, and as negative if it contains no region of interest containing that person. Visual similarity models are trained using the positive tracks, which serve as positive bags for multiple-instance learning.
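The track-level labeling rule in the abstract can be sketched as follows. This is only a minimal illustration, not the patented implementation: the `Roi` record, its `person_id` field, and the `label_track` helper are hypothetical names introduced here, and the per-region identity is assumed to come from an annotator or an upstream detector.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Roi:
    """Hypothetical region-of-interest record for one frame of a track."""
    frame_index: int
    person_id: Optional[str]  # assumed annotation; None means no person identified


def label_track(track: Sequence[Roi], target_id: str) -> bool:
    """Return True (positive track) if at least one ROI contains the target person."""
    return any(roi.person_id == target_id for roi in track)


# Toy track: the target appears in only one of three regions of interest.
track = [Roi(0, None), Roi(1, "alice"), Roi(2, None)]
print(label_track(track, "alice"))  # True
print(label_track(track, "bob"))    # False
```

Under this rule a single matching region is enough to make the whole track positive, so an annotator labels a track once rather than labeling every frame.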

24 Claims

1. A method of training an image recognition tool for detecting images of a person:
- scanning a first frame in a video stream comprising a sequence of frames for images of a person;
  
  generating a representation of the region of interest of the first frame likely to contain the image of the person;
  
  forming a video track comprising the representation of a region of interest of the first frame likely to contain an image of the person;
  
  scanning each subsequent frame in the sequence of frames for images of the person in each subsequent frame, wherein the scanning each frame begins at a location in each frame based on a location of the region of interest of a preceding frame;
  
  for each subsequent frame in the sequence of frames;
  
  generating a representation of the region of interest of the subsequent frame likely to contain the image of the person;
  
  adding, to the video track, the representation of a region of interest of the subsequent frame likely to contain the image of the person;
  
  assigning a positive label to the video track when the representation of the region of interest in at least one of the first frame and the subsequent frames contains the person and no other people, the positive label identifying the video track as corresponding to the person; and
  
  designating each representation of the region of interest in the positively labeled video track as a positive instance and providing each representation of the region of interest in the positively labeled video track to the image recognition tool for training a multiple-instance learning algorithm of the image recognition tool.
- Dependent claims (2 to 12):
  - 2. The method of claim 1, wherein the representation of the region of interest of the first frame comprises a set of coordinates indicating a position and dimensions of the region of interest.
  - 3. The method of claim 1, wherein the representation of the region of interest of the first frame comprises image data extracted from the region of interest of the first frame.
  - 4. The method of claim 1, further comprising:
    - prior to the assigning, displaying the video track on a display device.
  - 5. The method of claim 1, wherein a number of frames containing the person in the labelled video track is less than a total number of frames in the labelled video track.
  - 6. The method of claim 1, wherein the scanning the first frame comprises analyzing pixel data.
  - 7. The method of claim 6, wherein the analyzing comprises computing metadata based on said pixel data.
  - 8. The method of claim 1, wherein the scanning the first frame comprises:
    - analyzing a portion of the first frame contained within a sliding window; and
      
      determining a probability that the portion contains an image of the person.
  - 9. The method of claim 1, further comprising assigning a negative label to the video track when any of the first frame and the subsequent frames do not contain the person and no other people.
  - 10. The method of claim 9, further comprising designating each representation of the region of interest in the negatively labeled video track as a negative instance and providing each representation of the region of interest in the negatively labeled video track to the image recognition tool for training the multiple-instance learning algorithm of the image recognition tool.
  - 11. The method of claim 1, comprising receiving a second video track, assigning a positive label to the second video track when the representation of the region of interest in at least one of the first frame and the subsequent frames of the second track contains the person and no other people, the positive label identifying the second video track as corresponding to the person, and designating each representation of the region of interest in the second positively labeled video track as a positive instance and providing each representation of the region of interest in the second positively labeled video track to the image recognition tool for training the multiple-instance learning algorithm of the image recognition tool.
  - 12. The method of claim 11, wherein the second video track is formed by:
    - scanning a first frame in a second video stream comprising a sequence of frames for images of the person;
      
      generating a representation of the region of interest of the first frame in the second video stream likely to contain the image of the person;
      
      forming the second video track, comprising the representation of a region of interest of the first frame likely to contain an image of the person;
      
      scanning each subsequent frame in the sequence of frames for images of the person in each subsequent frame, wherein the scanning each frame begins at a location in each frame based on a location of the region of interest of a preceding frame; and
      
      for each subsequent frame in the second sequence of frames;
      
      generating a representation of the region of interest of the subsequent frame likely to contain the image of the person; and
      
      adding, to the second video track, the representation of a region of interest of the subsequent frame likely to contain the image of the person.
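The scanning and track-forming steps of claims 1, 2, and 8 can be sketched roughly as below. This is a toy illustration under stated assumptions, not the claimed implementation: frames are plain nested lists of brightness values, `mean_brightness` is a stand-in for a real person detector that returns a probability, and all function names, the threshold, and the nearest-first window ordering are hypothetical choices made for this example.

```python
from typing import Callable, List, Optional, Tuple

Frame = List[List[float]]        # toy grayscale frame: rows of pixel values in [0, 1]
Box = Tuple[int, int, int, int]  # (x, y, width, height): position and dimensions


def window_origins(start: Tuple[int, int], max_x: int, max_y: int) -> List[Tuple[int, int]]:
    """Return every valid window origin, ordered nearest to `start` first."""
    sx, sy = start
    origins = [(x, y) for y in range(max_y + 1) for x in range(max_x + 1)]
    return sorted(origins, key=lambda p: abs(p[0] - sx) + abs(p[1] - sy))


def scan_frame(frame: Frame, size: Tuple[int, int], score: Callable[[Frame], float],
               start: Tuple[int, int] = (0, 0), threshold: float = 0.9) -> Optional[Box]:
    """Slide a window over the frame, beginning at `start`; return the first hit."""
    w, h = size
    for x, y in window_origins(start, len(frame[0]) - w, len(frame) - h):
        patch = [row[x:x + w] for row in frame[y:y + h]]
        if score(patch) >= threshold:  # score stands in for P(person in window)
            return (x, y, w, h)
    return None


def build_track(frames: List[Frame], size: Tuple[int, int],
                score: Callable[[Frame], float]) -> List[Tuple[int, Box]]:
    """Form a track of (frame index, ROI box), seeding each scan with the last ROI."""
    track: List[Tuple[int, Box]] = []
    start = (0, 0)
    for index, frame in enumerate(frames):
        box = scan_frame(frame, size, score, start)
        if box is not None:
            track.append((index, box))
            start = (box[0], box[1])  # next frame's scan begins at this location
    return track


def mean_brightness(patch: Frame) -> float:
    """Toy detector (assumption): average pixel value of the window."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# A bright 2x2 block moves one pixel to the right between two 4x4 frames.
frame0 = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
frame1 = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
track = build_track([frame0, frame1], (2, 2), mean_brightness)
print(track)  # [(0, (1, 1, 2, 2)), (1, (2, 1, 2, 2))]
```

Seeding each scan with the previous region's location means the window nearest the last known position is tested first, which is one plausible reading of "begins at a location in each frame based on a location of the region of interest of a preceding frame".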

13. A system for training an image recognition tool for detecting images of a person, the system comprising:
- a processor;
  
  a memory containing computer-readable instructions for execution by said processor, said instructions comprising;
  
  video analytics instructions for producing a video track, the video analytics instructions comprising;
  
  human body detection instructions for scanning image data in a video stream comprising a sequence of frames for a person and generating representations of regions of interest of frames in the sequence of frames likely to contain the image of the person;
  
  visual feature extraction instructions for adding, to the video track, representations of regions of interest of the sequence of frames likely to contain the person;
  
  human body tracking instructions for determining a starting location for said scanning in frames of said sequence based on a location of a region of interest in a preceding frame;
  
  labeling instructions for assigning a positive label to the video track when the representation of the region of interest in at least one of the first frame and the subsequent frames contains the person and no other people, the positive label identifying the video track as corresponding to the person;
  
  training instructions for designating each representation of the region of interest in the positively labeled video track as a positive instance and providing each representation of the region of interest in the positively labeled video track to the image recognition tool for training a multiple-instance learning algorithm of the image recognition tool; and
  
  a storage for storing the positively labeled video track and the trained image recognition tool.
- Dependent claims (14 to 23):
  - 14. The system of claim 13, wherein the representations of the regions of interest comprise a set of coordinates indicating a position and dimensions of the region of interest.
  - 15. The system of claim 13, wherein the representation of the region of interest comprises an image of the region of interest extracted from a frame.
  - 16. The system of claim 13, further comprising:
    - a display device for displaying the video track prior to labelling the video track.
  - 17. The system of claim 13, wherein a number of frames containing the particular person in the labelled video track is less than a total number of frames in the labelled video track.
  - 18. The system of claim 13, wherein the scanning comprises analyzing pixel data.
  - 19. The system of claim 18, wherein the analyzing comprises computing metadata based on the pixel data.
  - 20. The system of claim 13, wherein the scanning the first frame comprises:
    - analyzing the image data contained within a sliding window; and
      
      determining a probability that the sliding window contains the person.
  - 21. The system of claim 13, wherein said video analytics instructions further comprise further labeling instructions for assigning a negative label to the video track when any of the first frame and the subsequent frames do not contain the person and no other people.
  - 22. The system of claim 21, wherein said video analytics instructions further comprise further training instructions for designating each representation of the region of interest in the negatively labeled video track as a negative instance and providing each representation of the region of interest in the negatively labeled video track to the image recognition tool for training the multiple-instance learning algorithm of the image recognition tool.
  - 23. The system of claim 13, wherein said video analytics instructions further comprises:
    - further video analytics instructions for producing a second video track, the further video analytics instructions comprising;
      
      further human body detection instructions for scanning image data in a second video stream comprising a sequence of frames for a person and generating representations of regions of interest of frames in the sequence of frames likely to contain the image of the person;
      
      further visual feature extraction instructions for adding, to the second video track, representations of regions of interest of the sequence of frames likely to contain the person;
      
      further human body tracking instructions for determining a starting location for said scanning in frames of said sequence based on a location of a region of interest in a preceding frame;
      
      further labeling instructions for assigning a positive label to the second video track when the representation of the region of interest in at least one of the first frame and the subsequent frames contains the person and no other people, the positive label identifying the second video track as corresponding to the person;
      
      further training instructions for designating each representation of the region of interest in the positively labeled second video track as a positive instance and providing each representation of the region of interest in the positively labeled second video track to the image recognition tool for training a multiple-instance learning algorithm of the image recognition tool.
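The designation step performed by the training instructions (claims 13, 21, and 22) can be sketched as below: every region-of-interest representation inherits the label of its track before being handed to the learner. This is a hedged illustration only. `LabeledTrack`, `designate_instances`, and the string ROI identifiers are hypothetical, and the sketch stops at preparing the instances; it does not train a multiple-instance learning model.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LabeledTrack:
    """Hypothetical labeled track: an ID, its ROI representations, and a label."""
    track_id: str
    rois: Sequence[str]  # stand-ins for coordinates or extracted image data
    positive: bool


def designate_instances(tracks: Sequence[LabeledTrack]) -> List[Tuple[str, str, int]]:
    """Give each ROI its track's label, keeping the track ID as the bag ID."""
    instances = []
    for track in tracks:
        label = 1 if track.positive else 0
        for roi in track.rois:
            instances.append((track.track_id, roi, label))
    return instances


tracks = [
    LabeledTrack("t1", ["t1_f0", "t1_f1"], positive=True),
    LabeledTrack("t2", ["t2_f0"], positive=False),
]
instances = designate_instances(tracks)
print(instances)
# [('t1', 't1_f0', 1), ('t1', 't1_f1', 1), ('t2', 't2_f0', 0)]
```

Keeping the track ID alongside each instance preserves bag membership, which a multiple-instance learner would typically need.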

24. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform a method of training an image recognition tool for detecting images of a person, the method comprising:
- scanning a first frame in a video stream comprising a sequence of frames for images of a person;
  
  generating a representation of the region of interest of the first frame likely to contain the image of the person;
  
  forming a video track comprising the representation of a region of interest of the first frame likely to contain an image of the person;
  
  scanning each subsequent frame in the sequence of frames for images of the person, wherein the scanning each frame begins at a spatial location in each frame based on a location of the region of interest of a preceding frame;
  
  for each subsequent frame in the sequence of frames;
  
  generating a representation of the region of interest of the subsequent frame likely to contain the image of the person;
  
  adding, to the video track, the representation of a region of interest of the subsequent frame likely to contain the image of the person;
  
  assigning a positive label to the video track when the representation of the region of interest in at least one of the first frame and the subsequent frames contains the person and no other people, the positive label identifying the video track as corresponding to the person; and
  
  designating each representation of the region of interest in the positively labeled video track as a positive instance and providing each representation of the region of interest in the positively labeled video track to the image recognition tool for training a multiple-instance learning algorithm of the image recognition tool.

Current Assignee: Huawei Cloud Computing Technologies Company Limited (Huawei Investment & Holding Co., Ltd.)
Original Assignee: Huawei Technologies Co., Ltd. (Huawei Investment & Holding Co., Ltd.)
Inventors: Zhang, Rui
Primary Examiner(s): Tran, Thai Q
Assistant Examiner(s): Smith, Stephen R
Application Number: US15/248,684
Publication Number: US 20180060653A1
Time in Patent Office: 823 Days
Field of Search: 386278, 386282, 348143-161, 382103, 382181, 382155
CPC Class Codes

G06F 18/22   Matching criteria, e.g. pro...

G06F 18/2413   based on distances to train...

G06T 7/174   involving the use of two or...

G06T 7/20   Analysis of motion motion e...

G06T 7/292   Multi-camera tracking

G06T 7/60   Analysis of geometric attri...

G06V 10/25   Determination of region of ...

G06V 10/454   Integrating the filters int...

G06V 10/761   Proximity, similarity or di...

G06V 10/764   using classification, e.g. ...

G06V 10/82   using neural networks

G06V 20/41   Higher-level, semantic clus...

G06V 20/46   Extracting features or char...

G06V 20/52   Surveillance or monitoring ...

G06V 40/103   Static body considered as a...

G11B 27/031   Electronic editing of digit...

G11B 27/102   Programmed access in sequen...

G11B 27/19   by using information detect...

G11B 27/28   by using information signal...

G11B 27/36   Monitoring, i.e. supervisin...
