Method and apparatus for using face detection information to improve speaker segmentation

US 9,165,182 B2
Filed: 08/19/2013
Issued: 10/20/2015
Est. Priority Date: 08/19/2013
Status: Active Grant

Abstract

In one embodiment, a method includes obtaining media that includes a video stream and an audio stream. The method also includes detecting a number of faces visible in the video stream, and performing a speaker segmentation on the media. Performing the speaker segmentation on the media includes utilizing the number of faces visible in the video stream to augment the speaker segmentation.

17 Claims

1. A method comprising:
- obtaining media, the media including a video stream and an audio stream through an input/output (I/O) interface of a computing system;
- detecting a number of faces visible in the video stream; and
- performing a speaker segmentation on the media, wherein performing the speaker segmentation includes utilizing the number of faces visible in the video stream to augment the speaker segmentation, the speaker segmentation being performed by the computing system.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1 further including:
    - determining whether there is a face change associated with the video stream, wherein performing the speaker segmentation further includes obtaining an indication of the face change and utilizing the indication of the face change.
  - 3. The method of claim 2 wherein performing the speaker segmentation on the media includes setting an upper-bound for an audio clustering criterion to be approximately equal to the number of faces detected in the video stream.
  - 4. The method of claim 3 wherein performing the speaker segmentation on the media further includes performing an audio change detection identification, and wherein when it is determined that there is the face change associated with the video stream, performing the audio change detection identification includes setting the indication of the face change as a starting point for the audio change detection identification.
  - 5. The method of claim 1 further including:
    - determining a level of noise associated with the audio stream, wherein performing the speaker segmentation on the media further includes weighting an importance of the number of faces as used in performing the speaker segmentation based on the level of noise.
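
Claims 3 and 5 can be illustrated with a small sketch. It is not the patented implementation: it assumes each audio segment has already been reduced to a numeric feature vector, uses a plain centroid-distance agglomerative clustering as a stand-in for whatever audio clustering criterion is actually used, and treats the face count as an upper bound on the number of speaker clusters. The names `face_weight`, `speaker_upper_bound`, `cluster_segments`, `audio_only_estimate` and `stop_distance` are hypothetical, and the direction of the noise weighting (noisier audio gives the face count more influence) is an assumption, since the claim only says the importance is weighted based on the level of noise.

```python
from math import dist


def face_weight(noise_level):
    """Hypothetical mapping from a noise level in [0, 1] to a weight in [0, 1].

    Assumption: the noisier the audio, the more the face count is trusted.
    """
    return min(1.0, max(0.0, noise_level))


def speaker_upper_bound(num_faces, audio_only_estimate, noise_level):
    """Blend the face count with an audio-only speaker estimate (claim 5 idea)."""
    w = face_weight(noise_level)
    blended = w * num_faces + (1.0 - w) * audio_only_estimate
    return max(1, round(blended))


def centroid(features, cluster):
    """Mean feature vector of the segments whose indices are in `cluster`."""
    dims = len(features[0])
    return [sum(features[i][d] for i in cluster) / len(cluster) for d in range(dims)]


def cluster_segments(features, max_speakers, stop_distance):
    """Greedy agglomerative clustering with a cap on the cluster count.

    Merging continues while the closest pair of clusters is nearer than
    `stop_distance`, or while there are more clusters than `max_speakers`
    (the upper bound derived from the number of faces, claim 3 idea).
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist(centroid(features, clusters[a]),
                         centroid(features, clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= stop_distance and len(clusters) <= max_speakers:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: five segments that look like three speakers to the audio alone.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
bound = speaker_upper_bound(num_faces=2, audio_only_estimate=4, noise_level=1.0)
print(cluster_segments(features, max_speakers=bound, stop_distance=1.0))
```

With a noise level of 1.0 the bound equals the face count, so the clustering is forced down to two groups; with a noise level of 0.0 the audio-only estimate is used and the three natural groups survive.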

6. A tangible, non-transitory computer-readable medium comprising computer program code, the computer program code, when executed, configured to:
- obtain media, the media including a video stream and an audio stream;
- detect a number of faces visible in the video stream; and
- perform a speaker segmentation on the media, wherein the computer program code configured to perform the speaker segmentation includes computer program code operable to utilize the number of faces visible in the video stream to augment the speaker segmentation.
- Dependent claims (7, 8, 9, 10):
  - 7. The tangible, non-transitory computer-readable medium comprising computer program code of claim 6 wherein the computer program code is further configured to:
    - determine whether there is a face change associated with the video stream, wherein the computer program code configured to perform the speaker segmentation further includes computer program code configured to obtain an indication of the face change and to utilize the indication of the face change.
  - 8. The tangible, non-transitory computer-readable medium comprising computer program code of claim 7 wherein the computer program code configured to perform the speaker segmentation on the media is further configured to set an upper-bound for an audio clustering criterion to be approximately equal to the number of faces detected in the video stream.
  - 9. The tangible, non-transitory computer-readable medium comprising computer program code of claim 8 wherein the computer program code configured to perform the speaker segmentation on the media is further configured to perform an audio change detection identification, and wherein when it is determined that there is the face change associated with the video stream, the computer program code configured to perform the audio change detected identification is further configured to set the indication of the face change as a starting point for the audio change detection identification.
  - 10. The tangible, non-transitory computer-readable medium comprising computer program code of claim 6 wherein the computer program code is further configured to:
    - determine a level of noise associated with the audio stream, wherein the computer program code configured to perform the speaker segmentation on the media is further configured to weight an importance of the number of faces as used in performing the speaker segmentation based on the level of noise.
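
The face-change idea in claims 4 and 9 (using the indication of a face change as the starting point for audio change detection) might look roughly like the sketch below. It is only one possible reading: the audio is assumed to be a one-dimensional sequence of per-frame feature values, the change score is a simple difference of window means rather than any particular speaker-change statistic, and the search is confined to a window centred on the face-change index. The names `change_score`, `find_audio_change`, `half_window`, `search_radius` and `threshold` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def change_score(signal, i, half_window):
    """Absolute difference between the mean just before and just after index i."""
    left = signal[max(0, i - half_window):i]
    right = signal[i:i + half_window]
    if not left or not right:
        return 0.0
    return abs(mean(right) - mean(left))


def find_audio_change(signal, face_change_index, half_window=3,
                      search_radius=4, threshold=1.0):
    """Look for an audio change near the frame where the face change was seen.

    The search starts from `face_change_index` and only considers indices
    within `search_radius` of it. The highest-scoring index wins; ties go
    to the index closest to the face change. Returns None if no candidate
    reaches `threshold`.
    """
    lo = max(1, face_change_index - search_radius)
    hi = min(len(signal) - 1, face_change_index + search_radius)
    best = max(
        range(lo, hi + 1),
        key=lambda i: (change_score(signal, i, half_window),
                       -abs(i - face_change_index)),
        default=None,
    )
    if best is None or change_score(signal, best, half_window) < threshold:
        return None
    return best


# The audio feature steps up at index 6; the face change is reported at index 7.
signal = [0.0] * 6 + [5.0] * 6
print(find_audio_change(signal, face_change_index=7))
```

Anchoring the search at the face change keeps the audio detector from scanning the whole stream, which is the kind of benefit the claims appear to be aiming at, although the patent text quoted here does not specify the search strategy.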

11. An apparatus comprising:
- a face detection arrangement, the face detection arrangement being configured to process a video component of media to identify a number of faces in the video component; and
- a speaker segmentation arrangement, the speaker segmentation arrangement being configured to process an audio component of the media to identify a speaker change in the audio component, wherein the speaker segmentation arrangement is configured to use the number of faces in the video component when processing the audio component to identify the speaker change; and
- a processor, wherein the face detection arrangement and the speaker segmentation arrangement are embodied as logic on a tangible, non-transitory computer-readable medium, and wherein the logic is arranged to be executed by the processor.
- Dependent claims (12, 13, 14, 15, 16, 17):
  - 12. The apparatus of claim 11 wherein the video component and the audio component are part of a first segment of the media, the first segment of the media having a start point and an end point, and wherein the speaker change is a point between the start point and the end point.
  - 13. The apparatus of claim 11 wherein the face detection arrangement is further configured to process the video component to determine whether the video component includes a face change, wherein when it is determined that the video component includes the face change, the speaker segmentation arrangement is further configured to use information associated with the face change to identify the speaker change.
  - 14. The apparatus of claim 13 wherein the speaker segmentation arrangement is configured to perform an audio change detection identification and wherein the information associated with the face change is set a starting point for the audio change detection identification.
  - 15. The apparatus of claim 11 wherein the speaker segmentation arrangement is configured to set at least one audio clustering criterion, and wherein the number of faces in the video component is set as an upper-bound for the at least one audio clustering criterion.
  - 16. The apparatus of claim 11 wherein when the number of faces in the video is zero, the speaker segmentation arrangement is configured to identify the media as potentially containing no speech.
  - 17. The apparatus of claim 11 further including:
    - an input/output (I/O) interface, the I/O interface being arranged to obtain the media.
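
As a rough illustration of how the two arrangements in claim 11 could fit together, the sketch below models them as two small classes. It is a simplification under stated assumptions: face detection itself is not implemented (per-frame face counts are passed in as already computed), the number of faces for a segment is taken to be the maximum per-frame count, and a candidate speaker-change time is supplied by the caller. It shows the zero-face rule of claim 16 and the start/end constraint of claim 12. All class, field and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SegmentResult:
    start: float
    end: float
    num_faces: int
    possibly_no_speech: bool
    speaker_change: Optional[float]


class FaceDetectionArrangement:
    """Stand-in for a face detector: consumes precomputed per-frame counts."""

    def count_faces(self, faces_per_frame: Sequence[int]) -> int:
        # Assumption: the segment's face count is the largest per-frame count.
        return max(faces_per_frame, default=0)


class SpeakerSegmentationArrangement:
    """Combines the face count with a candidate audio speaker change."""

    def process(self, start: float, end: float, num_faces: int,
                candidate_change: Optional[float]) -> SegmentResult:
        # Claim 16 idea: zero faces flags the media as potentially speech-free.
        possibly_no_speech = num_faces == 0
        # Claim 12 idea: a speaker change lies between the start and end points.
        change = None
        if candidate_change is not None and start < candidate_change < end:
            change = candidate_change
        return SegmentResult(start, end, num_faces, possibly_no_speech, change)


faces = FaceDetectionArrangement().count_faces([1, 2, 2, 1])
result = SpeakerSegmentationArrangement().process(0.0, 10.0, faces, 4.5)
print(result)
```

Keeping the two arrangements as separate objects mirrors the claim language, where each is described as logic executed by the processor; a real system would replace the precomputed counts and the caller-supplied candidate with actual video and audio analysis.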

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Inventors: Kajarekar, Sachin S.; Sen, Mainak
Primary Examiner(s): Gauthier, Gerald
Application Number: US13/969,914
Publication Number: US 20150049247A1
Time in Patent Office: 792 Days
Field of Search: 348/14.08, 348/484, 348/700, 348/169, 348/222.1, 375/240.16, 382/118, 382/115, 382/190, 382/224, 704/246, 704/270, 704/231, 704/233, 704/259, 715/811, 700/94
US Class Current: 1/1
CPC Class Codes:
G06V 40/161   Detection; Localisation; No...
H04N 7/147   Communication arrangements,...
H04S 7/303   Tracking of listener positi...
