Method and system for person identification using video-speech matching

US 20030154084A1
Filed: 02/14/2002
Published: 08/14/2003
Est. Priority Date: 02/14/2002
Status: Abandoned Application

1 Assignment

0 Petitions

Abstract

A method and system are disclosed for determining who is the speaking person in video data. This may be used to aid in person identification in video content analysis and retrieval applications. A correlation is used to improve the person recognition rate relying on both face recognition and speaker identification. A Latent Semantic Association (LSA) process may also be used to improve the association of a speaker's face with his voice. Other sources of data (e.g., text) may be integrated for a broader domain of video content understanding applications.

82 Citations

20 Claims

1. An audio-visual system for processing video data comprising:
- an object detection module capable of providing a plurality of object features from the video data;
  
  an audio processor module capable of providing a plurality of audio features from the video data;
  
  a processor coupled to the object detection and the audio segmentation modules, wherein the processor is arranged to determine a correlation between the plurality of object features and the plurality of audio features.
- Dependent claims (2, 3, 4, 5, 6, 7):
- - 2. The system of claim 1, wherein the processor is further arranged to determine whether an animated object in the video data is associated with audio.
  - 3. The system of claim 2, wherein the plurality of audio features comprise two or more of the following: average energy, pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral flux and 12 MFCC components.
  - 4. The system of claim 2, wherein the animated object is a face and the processor is arranged to determine whether the face is speaking.
  - 5. The system of claim 4, wherein the plurality of image features are eigenfaces that represent global features of the face.
  - 6. The system of claim 1, further comprising a latent semantic indexing module coupled to the processor and that preprocesses the plurality of object features and the plurality of audio features before the correlation is performed.
  - 7. The system of claim 6, wherein the latent semantic indexing module includes a singular value decomposition module.
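
Two of the audio features named in claim 3, average energy and zero crossing, can be illustrated with a minimal sketch. The claims do not give formulas, so the definitions below are common textbook ones and are an assumption here, as are the function names and the sample frame.

```python
def average_energy(frame):
    """Mean of the squared sample values in one audio frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical four-sample frame that flips sign on every sample.
frame = [0.5, -0.5, 0.5, -0.5]
features = (average_energy(frame), zero_crossing_rate(frame))
```

In a fuller pipeline, one such feature vector per frame would form one row of the audio feature matrix that is later correlated with the object features.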

8. A method for identifying a speaking person within video data, the method comprising the steps of:
- receiving video data including image and audio information;
  
  determining a plurality of face image features from one or more faces in the video data;
  
  determining a plurality of audio features related to audio information;
  
  calculating a correlation between the plurality of face image features and the audio features; and
  
  determining the speaking person based upon the correlation.
- Dependent claims (9, 10, 11, 12, 13, 14):
- - 9. The method according to claim 8, further comprising the step of normalizing the face image features and the audio features.
  - 10. The method according to claim 9, further comprising the step of performing a singular value decomposition on the normalized face image features and the audio features.
  - 11. The method according to claim 8, wherein the determining step includes determining the speaking person based upon the one or more faces that has the largest correlation.
  - 12. The method according to claim 10, wherein the calculating step includes forming a matrix of the face image features and the audio features.
  - 13. The method according to claim 12, further comprising the step of performing an optimal approximate fit using smaller matrices as compared to full rank matrices formed by the face image features and the audio features.
  - 14. The method according to claim 13, wherein the rank of the smaller matrices is chosen to remove noise and unrelated information from the full rank matrices.
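
The steps recited in claims 8 to 14 (normalize the features, form a joint matrix, take a singular value decomposition, keep a reduced rank, correlate, and pick the face with the largest correlation) can be sketched as follows. This is an illustrative reading under stated assumptions, not the applicants' implementation: the scoring rule (mean absolute cross-covariance), the function names, and the sinusoidal toy data are all hypothetical.

```python
import numpy as np


def normalize(features):
    """Scale each feature column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing constant columns by zero
    return (features - mean) / std


def low_rank(matrix, rank):
    """Best rank-`rank` approximation of `matrix` via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def speaking_face(faces, audio, rank):
    """Return (index of best-matching face, per-face scores).

    faces: list of (frames x dims) arrays, one per candidate face.
    audio: (frames x dims) array of audio features.
    """
    # Joint matrix: all face feature columns followed by the audio columns.
    joint = np.hstack([normalize(f) for f in faces] + [normalize(audio)])
    approx = low_rank(joint, rank)  # drop low-variance, unrelated directions
    audio_part = approx[:, -audio.shape[1]:]
    scores, start = [], 0
    for face in faces:
        stop = start + face.shape[1]
        # Assumed score: mean absolute face/audio cross-covariance.
        cross = approx[:, start:stop].T @ audio_part / len(approx)
        scores.append(float(np.abs(cross).mean()))
        start = stop
    return int(np.argmax(scores)), scores


# Hypothetical toy data: 120 frames of sinusoidal "features".
t = np.arange(120)


def tone(k):
    return np.sin(2 * np.pi * k * t / 120)


audio = np.column_stack([tone(3), tone(5)])
silent_face = np.column_stack([tone(7), tone(13)])  # unrelated to the audio
talking_face = 0.8 * audio + 0.05 * np.column_stack([tone(11), tone(17)])

# talking_face (index 1) tracks the audio, so it should score highest.
best, scores = speaking_face([silent_face, talking_face], audio, rank=2)
```

Choosing `rank=2` here keeps only the two directions shared by the audio and the talking face, which is one way to read the "remove noise and unrelated information" language of claim 14; how the rank would be chosen in practice is not stated in the claims.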

15. A memory medium including code for processing a video including images and audio, the code comprising:
- code to obtain a plurality of object features from the video;
  
  code to obtain a plurality of audio features from the video;
  
  code to determine a correlation between the plurality of object features and the plurality of audio features; and
  
  code to determine an association between one or more objects in the video and the audio.
- Dependent claims (16, 17, 18, 19, 20):
- - 16. The memory medium of claim 15, wherein the one or more objects comprises one or more faces.
  - 17. The memory medium of claim 16, further comprising code to determine a speaking face.
  - 18. The memory medium of claim 15, further comprising code to create a matrix using the plurality of object features and the audio features and code to perform a singular value decomposition on the matrix.
  - 19. The memory medium of claim 18, further comprising code to perform an optimal approximate fit using smaller matrices as compared to full rank matrices formed by the object features and the audio features.
  - 20. The memory medium according to claim 19, wherein the rank of the smaller matrices is chosen to remove noise and unrelated information from the full rank matrices.

Current Assignee
Koninklijke Philips Electronics N.V. (Koninklijke Philips N.V.)
Original Assignee
Koninklijke Philips Electronics N.V. (Koninklijke Philips N.V.)
Inventors
Li, Mingkun; Dimitrova, Nevenka; Li, Dongge

Application Number

US10/076,194
Publication Number

US 20030154084A1
US Class Current

704/273
CPC Class Codes

G06V 40/161   Detection; Localisation; No...

G10L 15/24   Speech recognition using no...

G10L 17/02   Preprocessing operations, e...
