Video retrieval system for human face content
First Claim
1. A method for processing video data, comprising:
detecting human faces in a plurality of video frames in said video data using a processor;
for at least one detected human face, identifying a face-specific set of video frames using said processor irrespective of whether said detected human face is present in said face-specific set of video frames in a substantially temporally continuous manner;
grouping all video frames in said face-specific set of video frames into a plurality of face tracks using said processor, wherein each face track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
segmenting pixels associated with said at least one detected human face in each video frame in said face-specific set of video frames using said processor so as to extract color signature of said at least one detected human face in each said face-specific video frame;
using said processor, merging two or more of said plurality of face tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected human face appearing in video frames constituting said two or more of said plurality of face tracks; and
enabling a user to view on an electronic display for said processor face-specific video segments of said at least one detected human face in said video data based on said merging of temporally disjoint face tracks.
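The claimed pipeline can be illustrated with a minimal sketch. This is an illustration only, not the patented implementation: the grouping gap, the normalized RGB histogram used as the color signature, and the histogram-intersection merge threshold are all assumed choices, and the function names are hypothetical.

```python
import numpy as np

def build_tracks(detections, max_gap=5):
    """Group the frame numbers of one detected face into temporally
    continuous tracks (frames within max_gap of each other stay together)."""
    tracks, current = [], []
    for frame in sorted(detections):
        if current and frame - current[-1] > max_gap:
            tracks.append(current)
            current = []
        current.append(frame)
    if current:
        tracks.append(current)
    return tracks

def color_signature(pixels, bins=8):
    """Normalized RGB histogram over the segmented face pixels
    (an (N, 3) array of 0-255 color values)."""
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def merge_tracks(tracks, signatures, threshold=0.7):
    """Greedily merge temporally disjoint tracks whose color signatures
    match (histogram intersection at or above threshold)."""
    merged, used = [], set()
    for i in range(len(tracks)):
        if i in used:
            continue
        group = list(tracks[i])
        for j in range(i + 1, len(tracks)):
            if j in used:
                continue
            similarity = np.minimum(signatures[i], signatures[j]).sum()
            if similarity >= threshold:
                group.extend(tracks[j])
                used.add(j)
        merged.append(sorted(group))
    return merged
```

Merging on appearance rather than on temporal continuity is what lets two tracks of the same face, separated by an absence from the video, be presented to the user as one person-specific segment set.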
3 Assignments
0 Petitions
Abstract
A method and apparatus for video retrieval and cueing that automatically detects human faces in the video and identifies face-specific video frames so as to allow retrieval and viewing of person-specific video segments. In one embodiment, the method locates human faces in the video, stores the time stamps associated with each face, displays a single image associated with each face, matches each face against a database, computes face locations with respect to a common 3D coordinate system, and provides a means of displaying: 1) information retrieved from the database associated with a selected person or people, 2) the path of travel associated with a selected person or people, 3) an interaction graph of people in the video, and 4) video segments associated with each person and/or face. The method may also provide the ability to input and store text annotations associated with each person, face, and video segment, and the ability to enroll and remove people from the database. Videos of non-human objects may be processed in a similar manner. Because of the rules governing abstracts, this abstract should not be used to construe the claims.
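The retrieval side described in the abstract can be sketched as a simple index from each identity to the time stamps of its merged tracks, so a viewer can be cued straight to person-specific segments. This is a hedged illustration under assumed structures (face ids mapped to lists of frame-number tracks, a fixed frame rate); the names are hypothetical and not taken from the patent.

```python
from collections import defaultdict

def build_segment_index(face_tracks, fps=30.0):
    """Map each face id to sorted (start_sec, end_sec) segments derived
    from the frame numbers of its merged tracks."""
    index = defaultdict(list)
    for face_id, tracks in face_tracks.items():
        for frames in tracks:
            index[face_id].append((min(frames) / fps, max(frames) / fps))
        index[face_id].sort()
    return index

def segments_for(index, face_id):
    """Return the person-specific segments a viewer would be cued to."""
    return index.get(face_id, [])
```

Storing per-face time stamps this way is what makes "show me every segment containing this person" a direct lookup rather than a rescan of the video.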
38 Claims
1. A method for processing video data, comprising:
detecting human faces in a plurality of video frames in said video data using a processor;
for at least one detected human face, identifying a face-specific set of video frames using said processor irrespective of whether said detected human face is present in said face-specific set of video frames in a substantially temporally continuous manner;
grouping all video frames in said face-specific set of video frames into a plurality of face tracks using said processor, wherein each face track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
segmenting pixels associated with said at least one detected human face in each video frame in said face-specific set of video frames using said processor so as to extract color signature of said at least one detected human face in each said face-specific video frame;
using said processor, merging two or more of said plurality of face tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected human face appearing in video frames constituting said two or more of said plurality of face tracks; and
enabling a user to view on an electronic display for said processor face-specific video segments of said at least one detected human face in said video data based on said merging of temporally disjoint face tracks.
Dependent claims: 2–17, 26–28.
18. A method for processing video data, comprising:
detecting objects in a plurality of video frames in said video data using a processor;
for at least one detected object, identifying an object-specific set of video frames using said processor irrespective of whether said detected object is present in said object-specific set of video frames in a substantially temporally continuous manner;
grouping all video frames in said object-specific set of video frames into a plurality of object tracks using said processor, wherein each object track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
segmenting pixels associated with said at least one detected object in each video frame in said object-specific set of video frames using said processor so as to extract color signature of said at least one detected object in each said object-specific video frame;
using said processor, merging two or more of said plurality of object tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected object appearing in video frames constituting said two or more of said plurality of object tracks; and
enabling a user to view on an electronic display for said processor object-specific video segments of said at least one detected object in said video data based on said merging of temporally disjoint object tracks.
Dependent claims: 29.
19. A data storage medium containing a program code, which, when executed by a processor, causes said processor to perform the following:
receive video data;
detect human faces in a plurality of video frames in said video data;
for at least one detected human face, identify a face-specific set of video frames irrespective of whether said detected human face is present in said face-specific set of video frames in a substantially temporally continuous manner;
group all video frames in said face-specific set of video frames into a plurality of face tracks, wherein each face track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
segment pixels associated with said at least one detected human face in each video frame in said face-specific set of video frames so as to extract color signature of said at least one detected human face in each said face-specific video frame;
merge two or more of said plurality of face tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected human face appearing in video frames constituting said two or more of said plurality of face tracks; and
enable a user to view face-specific video segments of said at least one detected human face in said video data based on said merger of temporally disjoint face tracks.
Dependent claims: 20–22, 30–32.
23. A system for processing video data, comprising:
means for detecting human faces in a plurality of video frames in said video data;
for at least one detected human face, means for identifying a face-specific set of video frames irrespective of whether said detected human face is present in said face-specific set of video frames in a substantially temporally continuous manner;
means for grouping all video frames in said face-specific set of video frames into a plurality of face tracks, wherein each face track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
means for segmenting pixels associated with said at least one detected human face in each video frame in said face-specific set of video frames so as to extract color signature of said at least one detected human face in each said face-specific video frame;
means for merging two or more of said plurality of face tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected human face appearing in video frames constituting said two or more of said plurality of face tracks; and
means for displaying face-specific video segments of said at least one detected human face in said video data based on said merger of temporally disjoint face tracks.
Dependent claims: 24, 33–35.
25. A computer system, which, upon being programmed, is configured to perform the following:
receive video data;
detect human faces in a plurality of video frames in said video data;
for at least one detected human face, identify a face-specific set of video frames irrespective of whether said detected human face is present in said face-specific set of video frames in a substantially temporally continuous manner;
group all video frames in said face-specific set of video frames into a plurality of face tracks, wherein each face track contains corresponding one or more video frames having at least a substantial temporal continuity therebetween;
segment pixels associated with said at least one detected human face in each video frame in said face-specific set of video frames so as to extract color signature of said at least one detected human face in each said face-specific video frame;
merge two or more of said plurality of face tracks that are disjoint in time based on a comparison of the color signatures of said at least one detected human face appearing in video frames constituting said two or more of said plurality of face tracks; and
enable a user to view face-specific video segments of said at least one detected human face in said video data based on said merger of temporally disjoint face tracks.
Dependent claims: 36–38.
Specification