Using speaker clustering to switch between different camera views in a video conference system
First Claim
1. A method comprising:
at a video conference endpoint including a microphone and a camera:
detecting a talker position of a talker based on audio detected by the microphone;
detecting one or more faces and face positions in video captured by the camera;
determining whether a detected talker position matches any detected face position;
if there is no match, performing speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determining whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of the determining whether the detected talker position matches the previous closeup position of the talker, framing either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
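The claimed method is a decision cascade: try to match the audio-derived talker position to a detected face; failing that, use speaker clustering to decide whether the talker is known and still at a stored closeup position. The sketch below illustrates that flow; all names, the `(x, y)` position representation, and the matching tolerance are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the claimed view-selection logic.
# Positions are (x, y) pairs; the tolerance is an assumed threshold.
POSITION_TOLERANCE = 0.5  # meters; illustrative match threshold

def positions_match(a, b, tol=POSITION_TOLERANCE):
    """True if two positions coincide within the tolerance."""
    return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol

def choose_view(talker_pos, face_positions, known_talker, previous_closeup_pos):
    """Return ('closeup', pos) or ('overview', pos) per the claimed cascade.

    talker_pos           -- position estimated from the microphone audio
    face_positions       -- positions of faces detected in the camera video
    known_talker         -- speaker clustering says this talker was heard before
    previous_closeup_pos -- stored closeup position for that talker, or None
    """
    # Talker position matches a detected face: frame a closeup there.
    if any(positions_match(talker_pos, f) for f in face_positions):
        return ("closeup", talker_pos)
    # No face match: rely on speaker clustering over past speech segments.
    # A known talker still at the stored closeup position gets that closeup.
    if known_talker and previous_closeup_pos is not None \
            and positions_match(talker_pos, previous_closeup_pos):
        return ("closeup", previous_closeup_pos)
    # Unknown talker, or a known talker who has moved: non-closeup view.
    return ("overview", talker_pos)
```

For example, a talker heard at `(1.0, 2.0)` with no visible face but a stored closeup at `(1.1, 2.2)` is reframed at the stored closeup, while a known talker detected far from any stored position falls back to the overview.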
Abstract
A video conference endpoint includes one or more cameras to capture video of different views and a microphone array to sense audio. One or more closeup views are defined. The endpoint detects faces in the captured video and active audio sources from the sensed audio. The endpoint detects any active talker whose detected face position coincides with a detected active audio source, and also uses speaker clustering to detect whether any active talker is associated with a previously stored closeup view. Based on whether an active talker is detected in any of the stored closeup views, the endpoint switches between capturing video of one of the closeup views and a best overview of the participants in the conference room.
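The abstract leans on speaker clustering to recognize a returning talker even when no face is currently visible. The patent does not specify a clustering method here, so the sketch below shows one common approach under assumed details: each speech segment is reduced to an embedding vector, and a segment is assigned to a known speaker when its cosine similarity to that speaker's centroid exceeds a threshold. All names and the threshold are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.8  # assumed; tune for the embedding model used

def cosine(u, v):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_segment(embedding, known_speakers, threshold=SIMILARITY_THRESHOLD):
    """Match a new speech segment's embedding against known speakers.

    known_speakers -- dict mapping speaker id -> centroid embedding
    Returns the best-matching speaker id, or None if the talker is new.
    """
    best_id, best_sim = None, threshold
    for spk_id, centroid in known_speakers.items():
        sim = cosine(embedding, centroid)
        if sim >= best_sim:
            best_id, best_sim = spk_id, sim
    return best_id
```

A segment matched to a known speaker id can then be looked up against that speaker's stored closeup position; an unmatched segment means a new talker, for which no closeup exists yet.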
20 Claims
1. A method comprising:
at a video conference endpoint including a microphone and a camera:
detecting a talker position of a talker based on audio detected by the microphone;
detecting one or more faces and face positions in video captured by the camera;
determining whether a detected talker position matches any detected face position;
if there is no match, performing speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determining whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of the determining whether the detected talker position matches the previous closeup position of the talker, framing either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. An apparatus comprising:
one or more cameras controllable to capture video of different views;
a microphone array to sense audio; and
a processor, coupled to the one or more cameras and the microphone array, to:
detect a talker position of a talker based on audio detected by the microphone;
detect one or more faces and face positions in video captured by the camera;
determine whether a detected talker position matches any detected face position;
if there is no match, perform speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determine whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of determining whether the detected talker position matches the previous closeup position of the talker, frame either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (12, 13, 14, 15)
16. A non-transitory processor readable medium storing instructions that, when executed by a processor, cause the processor to:
detect a talker position of a talker based on audio detected by one or more microphones;
detect one or more faces and face positions in video captured by one or more cameras;
determine whether the detected talker position matches any detected face position;
if there is no match, perform speaker clustering across a speech segment in the detected audio and speech segments in previously detected audio;
if the speaker clustering indicates the talker is known, determine whether the detected talker position matches a previous closeup position of the talker, wherein the previous closeup position is a previously detected talker position that matches a previously detected face; and
based on results of determining whether the detected talker position matches the previous closeup position of the talker, frame either a closeup camera view on the previous closeup position or a non-closeup camera view based on the detected talker position.
- View Dependent Claims (17, 18, 19, 20)