Automatic face annotation method and system

US 9,176,987 B1
Filed: 08/26/2014
Issued: 11/03/2015
Est. Priority Date: 08/26/2014
Status: Active Grant

Abstract

An automatic face annotation method is provided. The method includes dividing an input video into different sets of frames, extracting temporal and spatial information by employing camera take and shot boundary detection algorithms on the different sets of frames, and collecting weakly labeled data by crawling weakly labeled face images from social networks. The method also includes applying face detection together with an iterative refinement clustering algorithm to remove noise of the collected weakly labeled data, generating a labeled database containing refined labeled images, finding and labeling exact frames containing one or more face images in the input video matching any of the refined labeled images based on the labeled database, labeling remaining unlabeled face tracks in the input video by a semi-supervised learning algorithm to annotate the face images in the input video, and outputting the input video containing the annotated face images.

20 Claims

1. An automatic face annotation method, comprising:
- dividing an input video into different sets of frames;
  
  extracting temporal and spatial information by employing camera take and shot boundary detection algorithms on the different sets of frames of the input video;
  
  collecting weakly labeled data by crawling weakly labeled face images from social networks;
  
  applying face detection together with an iterative refinement clustering algorithm to remove noise of the collected weakly labeled data;
  
  generating a labeled database containing refined labeled images as training data;
  
  based on the refined labeled images stored in the labeled database, finding and labeling exact frames containing one or more face images in the input video matching any of the refined labeled images in the labeled database;
  
  labeling remaining unlabeled face tracks in the input video by a semi-supervised learning algorithm to annotate the face images in the input video; and
  
  outputting the input video containing the annotated face images.
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 18, 19, 20
- - 2. The method according to claim 1, wherein extracting temporal and spatial information by employing camera take and shot boundary detection algorithms on the different sets of frames of the input video further includes:
    - separating an original interleaved format into a number of sequences with each corresponding to a registered camera that is aligned to an original camera setup by registering each camera from incoming video frames;
      
      finding automatically location of faces in a sequence of video frames; and
      
      extracting face tracks from the video by processing each frame within each camera take.
  - 3. The method according to claim 2, wherein extracting face tracks from the video by processing each frame within each camera take further includes:
    - initializing a new face track by a first detected face for each camera take;
      
      for remaining frames of each camera take, when a distance between two detected faces from consecutive frames passes a pre-defined threshold, initializing a new face track; and
      
      for remaining frames of each camera take, when the distance between two detected faces from consecutive frames does not pass the pre-defined threshold, adding this face to a current face track.
  - 4. The method according to claim 1, wherein collecting weakly labeled data and applying face detection together with an iterative refinement clustering algorithm to remove noise further includes:
    - querying key words from one of social networks and an in-house database;
      
      finding automatically location of the faces in each image in a set obtained from the social networks;
      
      obtaining pure movie-relevant face images for each actor by filtering out the noise;
      
      storing the obtained movie-relevant face images;
      
      refining the labeled face images using the iterative refinement clustering algorithm; and
      
      storing the refined labeled face images in the labeled database as the training data.
  - 5. The method according to claim 2, wherein separating an original interleaved format into a number of sequences with each corresponding to a registered camera that is aligned to an original camera setup by registering each camera from incoming video frames further includes:
    - calculating frame difference using color as a measurement of similarity between two frames;
      
      detecting a number of video shots in a video sequence;
      
      selecting a key frame that represents visual content of a shot; and
      
      identifying the camera take.
  - 6. The method according to claim 5, wherein detecting a number of video shots in a video sequence further includes:
    - when the frame difference is above a preset threshold, claiming a new shot, wherein selection of the preset threshold depends on types of video programs, and certain constraints are applied in order to determine a threshold and further refine detection results.
  - 7. The method according to claim 5, further including:
    - matching each detected shot with a last shot in each detected camera take, wherein each detected shot is represented by the key frame; and
      
      when certain matching criterion is satisfied, adding a current shot to an end of a matched camera take.
  - 8. The method according to claim 1, wherein:
    - provided that S represents signature face features from a set of all face tracks with P features;
      
      $S_{i,j}$ represents a value of a jth feature dimension for a ith face track signature;
      
      K represents a total number of final clusters; and
      
      $C_i$ denotes a cluster label to which face i is assigned, an objective function of the signature face features without considering any constraint is defined by [equation not reproduced in the source text].
  - 9. The method according to claim 8, wherein:
    - provided that, in each video frame $F_i$, $B_{i,x}$ ($i = 1, \ldots, N$) is a bounding box with i representing a frame index;
      
      $C_x$ represents the assigned cluster label, a constraint as “cannot-link” faces is defined by:
      
      $C_{x_1} \neq C_{x_2}$, when $x_1 \neq x_2$ for any given $B_{i,x}$ ($i = 1, \ldots, N$).
  - 10. The method according to claim 8, wherein:
    - provided that Overlap is a function to measure how much two bounding boxes overlapped;
      
      $\theta$ is a pre-set threshold for determining whether the two boxes are overlap;
      
      CameraTake is an indicator function that depends on whether two frames are from a same camera take, a constraint as “must-link” faces is defined by:
      
      $C_{x_1} = C_{x_2}$, when $\mathrm{Overlap}(B_{i_1,x_1}, B_{i_2,x_2}) \leq \theta$ and $\mathrm{CameraTake}(i_1, i_2) = 1$.
  - 18. The system according to claim 10, wherein:
    - provided that S represents signature face features from a set of all face tracks with P features;
      
      $S_{i,j}$ represents a value of a jth feature dimension for a ith face track signature;
      
      K represents a total number of final clusters; and
      
      $C_i$ denotes a cluster label to which face i is assigned, an objective function of the signature face features without considering any constraint is defined by [equation not reproduced in the source text].
  - 19. The system according to claim 18, wherein:
    - provided that, in each video frame $F_i$, $B_{i,x}$ ($i = 1, \ldots, N$) is a bounding box with i representing a frame index;
      
      $C_x$ represents the assigned cluster label, a constraint as “cannot-link” faces is defined by:
      
      $C_{x_1} \neq C_{x_2}$, when $x_1 \neq x_2$ for any given $B_{i,x}$ ($i = 1, \ldots, N$).
  - 20. The system according to claim 18, wherein:
    - provided that Overlap is a function to measure how much two bounding boxes overlapped;
      
      $\theta$ is a pre-set threshold for determining whether the two boxes are overlap;
      
      CameraTake is an indicator function that depends on whether two frames are from a same camera take, a constraint as “must-link” faces is defined by:
      
      $C_{x_1} = C_{x_2}$, when $\mathrm{Overlap}(B_{i_1,x_1}, B_{i_2,x_2}) \leq \theta$ and $\mathrm{CameraTake}(i_1, i_2) = 1$.
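
The "cannot-link" and "must-link" constraints recited in claims 9 and 10 can be illustrated with a short sketch. It is not the patented implementation. The names `Box`, `overlap`, `build_constraints`, and `violations` are hypothetical, and the claims do not define the Overlap function. This sketch assumes Overlap is a dissimilarity (1 minus intersection-over-union), so that the claimed "≤ θ" test selects boxes that nearly coincide. It also assumes each detection carries a unique identifier x and a known camera take.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    frame: int   # frame index i
    face: int    # detection identifier x (assumed unique per detection)
    take: int    # camera take of the frame (assumed already known)
    x0: float
    y0: float
    x1: float
    y1: float


def overlap(a: Box, b: Box) -> float:
    """Assumed Overlap measure: 1 - IoU, so 0.0 means identical boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = ((a.x1 - a.x0) * (a.y1 - a.y0)
             + (b.x1 - b.x0) * (b.y1 - b.y0) - inter)
    return 1.0 - (inter / union if union > 0 else 0.0)


def build_constraints(boxes, theta):
    """Derive cannot-link and must-link pairs of detection identifiers."""
    cannot, must = set(), set()
    for a, b in combinations(boxes, 2):
        pair = (min(a.face, b.face), max(a.face, b.face))
        if a.frame == b.frame:
            # Two different faces in one frame cannot share a cluster label.
            cannot.add(pair)
        elif a.take == b.take and overlap(a, b) <= theta:
            # Same camera take and Overlap <= theta: labels must agree.
            must.add(pair)
    return cannot, must


def violations(labels, cannot, must):
    """List the constraints that a label assignment {x: C_x} breaks."""
    bad = [("cannot-link", p) for p in sorted(cannot)
           if labels[p[0]] == labels[p[1]]]
    bad += [("must-link", p) for p in sorted(must)
            if labels[p[0]] != labels[p[1]]]
    return bad


boxes = [
    Box(frame=0, face=0, take=0, x0=0, y0=0, x1=10, y1=10),
    Box(frame=0, face=1, take=0, x0=20, y0=0, x1=30, y1=10),
    Box(frame=1, face=2, take=0, x0=1, y0=0, x1=11, y1=10),
    Box(frame=2, face=3, take=1, x0=0, y0=0, x1=10, y1=10),
]
cannot, must = build_constraints(boxes, theta=0.5)
```

A clustering step could call `violations` on each candidate assignment and reject or penalize assignments that return a non-empty list. How the constraints enter the objective function is not shown here.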

11. An automatic face annotation system, comprising:
- a camera take detection module configured to extract temporal and spatial information by employing camera take and shot boundary detection algorithms on different sets of frames of an input video;
  
  a social web data analysis module configured to collect weakly labeled data by crawling weakly labeled face images from social networks, apply face detection together with an iterative refinement clustering algorithm to remove noise and generate a labeled database containing refined labeled images as training data;
  
  a face matching module configured to, based on the refined labeled images stored in the labeled database, find and label exact frames containing one or more face images in the input video matching any of the refined labeled images in the labeled database;
  
  an active semi-supervised learning module configured to label remaining unlabeled face tracks in the input video by a semi-supervised learning algorithm to annotate the face images in the input video; and
  
  an output module configured to output the input video containing the annotated face images.
- Dependent claims: 12, 13, 14, 15, 16, 17
- - 12. The system according to claim 11, wherein the camera take detection module further includes:
    - a camera take submodule configured to separate an original interleaved format into a number of sequences with each corresponding to a registered camera that is aligned to an original camera setup by registering each camera from incoming video frames;
      
      a face detection submodule configured to find automatically location of faces in a sequence of video frames; and
      
      a face track submodule configured to extract face tracks from the video by processing each frame within each camera take.
  - 13. The system according to claim 12, wherein the face track submodule is further configured to:
    - initialize a new face track by a first detected face for each camera take;
      
      for remaining frames of each camera take, when a distance between two detected faces from consecutive frames passes a pre-defined threshold, initialize a new face track; and
      
      for remaining frames of each camera take, when the distance between two detected faces from consecutive frames does not pass the pre-defined threshold, add this face to a current face track.
  - 14. The system according to claim 11, wherein the social web data analysis module further includes:
    - a search engine configured to query key words from one of social networks and an in-house database;
      
      a face detection submodule configured to find automatically location of the faces in each image in a set obtained from the social networks;
      
      a weakly labeled face submodule configured to obtain pure movie-relevant face images for each actor by filtering out the noise and store the obtained face images;
      
      an iterative refinement clustering submodule configured to refine the labeled face images using the iterative refinement clustering algorithm; and
      
      a refined labeled face submodule configured to store the refined labeled face images.
  - 15. The system according to claim 11, wherein the camera take submodule is further configured to:
    - calculate frame difference using color as a measurement of similarity between two frames;
      
      detect a number of video shots in a video sequence;
      
      select a key frame that represents visual content of a shot; and
      
      identify the camera take.
  - 16. The system according to claim 15, wherein:
    - when the frame difference is above a preset threshold, a new shot is claimed, wherein selection of the preset threshold depends on types of video programs, and certain constraints are applied in order to determine a threshold and further refine detection results.
  - 17. The system according to claim 15, wherein:
    - each detected shot is matched with a last shot in each detected camera take, wherein each detected shot is represented by the key frame; and
      
      when certain matching criterion is satisfied, a current shot is added to an end of a matched camera take.
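
The shot detection and camera take grouping recited in claims 15 to 17 (and in method claims 5 to 7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The claims say only that color is used as the similarity measure, that a new shot is claimed when the frame difference is above a preset threshold, and that each shot is matched against the last shot of each camera take by its key frame. The coarse RGB histogram, the half-L1 distance, the choice of the first frame as key frame, and the names `color_hist`, `frame_diff`, `detect_shots`, and `group_camera_takes` are all assumptions made for this example.

```python
from collections import Counter


def color_hist(frame, bins=4):
    """Normalized coarse RGB histogram of a non-empty list of (r, g, b) pixels."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in frame)
    n = len(frame)
    return {k: v / n for k, v in counts.items()}


def frame_diff(h1, h2):
    """Half the L1 distance between histograms: 0.0 identical, 1.0 disjoint."""
    keys = h1.keys() | h2.keys()
    return 0.5 * sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)


def detect_shots(frames, cut_threshold):
    """Split frames into shots; a new shot starts when the difference
    between consecutive frames is above cut_threshold."""
    hists = [color_hist(f) for f in frames]
    shots = []
    for i in range(len(frames)):
        if i == 0 or frame_diff(hists[i - 1], hists[i]) > cut_threshold:
            shots.append([i])
        else:
            shots[-1].append(i)
    return shots, hists


def group_camera_takes(shots, hists, match_threshold):
    """Append each shot to the best-matching camera take, comparing its key
    frame with the key frame of the last shot in every existing take."""
    takes = []  # each take is a list of shot indices
    for s, shot in enumerate(shots):
        key = hists[shot[0]]  # assumption: first frame is the key frame
        best = None
        for take in takes:
            last_key = hists[shots[take[-1]][0]]
            d = frame_diff(key, last_key)
            if d <= match_threshold and (best is None or d < best[0]):
                best = (d, take)
        if best is not None:
            best[1].append(s)
        else:
            takes.append([s])
    return takes


red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
frames = [red, red, blue, blue, red]  # interleaved two-camera toy video
shots, hists = detect_shots(frames, cut_threshold=0.5)
takes = group_camera_takes(shots, hists, match_threshold=0.2)
```

In the toy video the third shot returns to the first camera, so it is appended to the first take rather than opening a new one. Both thresholds are placeholders. The claims state that threshold selection depends on the type of video program.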

Current Assignee: TCL Research America, Inc. (TCL Technology Group Corp.)
Original Assignee: TCL Research America, Inc. (TCL Technology Group Corp.)
Inventors: Peng, Liang; Wang, Haohong; Yang, Yimin
Primary Examiner(s): Kassa, Yosef
Application Number: US14/468,800
Time in Patent Office: 434 Days
Field of Search: 382/115, 382/118, 382/128, 382/154, 382/209, 382/278, 340/5.81, 340/5.83
US Class Current: 1/1
CPC Class Codes

G06F 16/21   Design, administration or m...

G06F 16/5846   using extracted text

G06F 16/951   Indexing; Web crawling tech...

G06F 18/2411   based on the proximity to a...

G06F 40/169   Annotation, e.g. comment da...

G06T 2207/10016   Video; Image sequence

G06T 2207/30201   Face

G06T 5/70   Denoising; Smoothing

G06V 20/30   in albums, collections or s...

G06V 20/49   Segmenting video sequences,...

G06V 40/161   Detection; Localisation; No...

G06V 40/172   Classification, e.g. identi...
