Automatic face extraction for use in recorded meetings timelines
Abstract
Faces of speakers in a meeting or conference are automatically detected and facial images corresponding to each speaker are stored in a faces database. A timeline is created to graphically identify when each speaker is speaking during playback of a recording of the meeting. Instead of generically identifying each speaker in the timeline, a facial image is shown to identify each speaker associated with the timeline.
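Read as a data model, the abstract pairs a per-speaker timeline with at least one stored facial image, linked by a speaker identifier. A minimal sketch in Python follows; the names (`SpeakerTimeline`, `FacesDatabase`, `TimelineEntry`) are invented for illustration and do not appear in the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TimelineEntry:
    """One sampled point: when and where a speaker is speaking."""
    time_s: float
    location: tuple  # (x, y) position in the video frame

@dataclass
class SpeakerTimeline:
    speaker_id: str
    entries: list = field(default_factory=list)

    def is_speaking_at(self, t: float, tolerance: float = 0.5) -> bool:
        return any(abs(e.time_s - t) <= tolerance for e in self.entries)

class FacesDatabase:
    """Maps each speaker identifier to at least one stored facial image."""
    def __init__(self):
        self._faces = {}

    def store(self, speaker_id: str, image_bytes: bytes):
        self._faces.setdefault(speaker_id, []).append(image_bytes)

    def face_for(self, speaker_id: str) -> bytes:
        return self._faces[speaker_id][0]

# Associating a timeline and a facial image with one detected speaker:
timeline = SpeakerTimeline("spk-1", [TimelineEntry(3.2, (120, 80))])
faces = FacesDatabase()
faces.store("spk-1", b"<jpeg bytes>")  # placeholder image data
```

The shared `speaker_id` string is what ties a timeline to its face during playback.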
97 Citations
32 Claims
1. A method, comprising:
detecting two or more facial images in a video sample;
detecting two or more speakers in an audio sample that corresponds to the video sample;
detecting a primary speaker of the two or more speakers;
clustering the two or more speakers temporally and spatially;
storing a speaker timeline for each detected speaker that identifies the speaker by a speaker identifier and a speaker location at various times along the speaker timeline;
storing at least one facial image for each detected speaker in a faces database; and
associating a speaker timeline and a facial image with each detected speaker.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
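The clustering and timeline-storage steps of claim 1 can be sketched as code. The greedy centroid clustering, the 50-pixel distance threshold, and the most-frequent-speaker rule for picking the primary speaker are illustrative assumptions, not the claimed algorithm.

```python
# Hypothetical sketch: cluster raw sound-source detections spatially over
# time, assign a speaker identifier per cluster, pick a primary speaker.

def cluster_speakers(detections, max_dist=50.0):
    """detections: list of (time_s, (x, y)) sound-source estimates.
    A detection joins the first cluster whose centroid lies within
    max_dist; otherwise it starts a new speaker cluster."""
    clusters = []  # each: {"id": str, "points": [(t, (x, y)), ...]}
    for t, (x, y) in sorted(detections):
        for c in clusters:
            cx = sum(p[1][0] for p in c["points"]) / len(c["points"])
            cy = sum(p[1][1] for p in c["points"]) / len(c["points"])
            if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 <= max_dist:
                c["points"].append((t, (x, y)))
                break
        else:
            clusters.append({"id": f"spk-{len(clusters) + 1}",
                             "points": [(t, (x, y))]})
    return clusters

def primary_speaker(clusters):
    """Assumed rule: the primary speaker is the one detected most often."""
    return max(clusters, key=lambda c: len(c["points"]))["id"]

dets = [(0.0, (100, 50)), (1.0, (102, 51)), (2.0, (300, 40)), (3.0, (99, 49))]
clusters = cluster_speakers(dets)
# two spatial clusters; "spk-1" has three detections and is primary
```

Each cluster's `points` list is, in effect, the claimed speaker timeline: a speaker identifier plus a location at various times.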
10. A method, comprising:
displaying an audio/visual (A/V) sample having two or more speakers included therein;
detecting a primary speaker of the two or more speakers;
clustering the two or more speakers temporally and spatially;
displaying a speaker timeline corresponding to each speaker of the two or more speakers, the speaker timeline indicating at what points along a temporal continuum the speaker corresponding to the speaker timeline is speaking;
associating a speaker facial image with each speaker timeline, the speaker facial image corresponding to the speaker associated with the speaker timeline; and
displaying the facial image with the corresponding speaker timeline.
View Dependent Claims (11, 12)
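One way to picture the display step of claim 10 is to render each speaker timeline along a temporal continuum, with a label standing in for the facial image shown beside it. A hypothetical text rendering, in which the names and the one-second sampling grid are assumptions:

```python
# Illustrative rendering: one line per speaker, '#' where that speaker is
# speaking, with "[face:...]" standing in for the displayed facial image.

def render_timelines(speaking, duration_s):
    """speaking: {speaker_id: set of whole seconds at which that
    speaker is speaking}. Returns one timeline line per speaker."""
    lines = []
    for speaker_id, seconds in speaking.items():
        bar = "".join("#" if t in seconds else "." for t in range(duration_s))
        lines.append(f"[face:{speaker_id}] {bar}")
    return "\n".join(lines)

print(render_timelines({"alice": {0, 1, 2, 5}, "bob": {3, 4}}, 7))
# [face:alice] ###..#.
# [face:bob] ...##..
```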
13. Computer storage media containing executable instructions that, when executed, implement the following method:
identifying each speaker in an Audio/Video (“A/V”) sample by a speaker identifier;
identifying a location for each speaker in the A/V sample;
detecting a primary speaker;
clustering each identified speaker temporally and spatially;
extracting at least one facial image for each speaker identified in the A/V sample;
creating a speaker timeline for each speaker identified in the A/V sample, each speaker timeline indicating a time, a speaker identifier and a speaker location; and
associating the facial image for a speaker with a speaker timeline that corresponds to the same speaker.
View Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22)
23. Computer storage media, comprising:
a speaker timeline database that includes a speaker timeline for each speaker in an A/V sample, each speaker timeline identifying a speaker and a speaker location for multiple times along a time continuum, wherein a primary speaker has been determined and wherein each identified speaker has been clustered temporally and spatially; and
a faces database that includes at least one facial image for each speaker identified in a speaker timeline and a speaker identifier that links each facial image with the appropriate speaker timeline in the speaker timeline database.
View Dependent Claims (24)
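The two stores of claim 23, a speaker timeline database and a faces database linked by a speaker identifier, could be modeled as two SQLite tables. The schema below is an assumption made for illustration, not the claimed storage layout.

```python
# Sketch: a speaker_timeline table keyed by speaker and time, and a faces
# table whose speaker_id column links each facial image to its timeline.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE speaker_timeline (
    speaker_id TEXT NOT NULL,
    time_s     REAL NOT NULL,
    loc_x      REAL NOT NULL,   -- speaker location in the frame
    loc_y      REAL NOT NULL
);
CREATE TABLE faces (
    speaker_id TEXT NOT NULL,   -- links a face to its speaker timeline
    image      BLOB NOT NULL
);
""")
db.executemany("INSERT INTO speaker_timeline VALUES (?, ?, ?, ?)",
               [("spk-1", 0.0, 100, 50), ("spk-1", 1.0, 101, 50),
                ("spk-2", 2.0, 300, 40)])
db.executemany("INSERT INTO faces VALUES (?, ?)",
               [("spk-1", b"face-1"), ("spk-2", b"face-2")])

# Join on speaker_id: the facial image for every timeline point of spk-1.
rows = db.execute("""
    SELECT t.time_s, f.image FROM speaker_timeline t
    JOIN faces f ON f.speaker_id = t.speaker_id
    WHERE t.speaker_id = 'spk-1' ORDER BY t.time_s
""").fetchall()
```

The `speaker_id` column plays the role the claim assigns it: the identifier that links each facial image with the appropriate speaker timeline.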
25. A system, comprising:
an Audio/Video (“A/V”) sample;
means for identifying each speaker appearing in the A/V sample;
means for identifying a facial image for each speaker identified in the A/V sample;
means for detecting a primary speaker;
means for clustering each speaker temporally and spatially;
means for creating a speaker timeline for each speaker identified in the A/V sample; and
means for associating a facial image with an appropriate speaker timeline.
View Dependent Claims (26, 27, 28, 29, 30, 31, 32)
Specification