Speaker anticipation
First Claim
1. A method of anticipating a video switch to accommodate a new speaker in a video conference comprising:
analyzing, by at least one speaker anticipation model, a real time video stream captured by a camera local to a first videoconference endpoint;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server; and
distributing, via the conferencing server, the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
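The two-label training scheme in the wherein clause above can be sketched as a minimal labeling routine: every frame defaults to the pre-speaking label, and only frames that coincide with audio from the same endpoint get the speaking label. The function and variable names are illustrative assumptions, not terms from the patent.

```python
def label_frames(frames, audio_frames):
    """Assign one of two audio-based labels to each historical video frame.

    frames: iterable of frame identifiers from the historical video feed.
    audio_frames: set of frame identifiers during which the same endpoint
        produced audio.
    """
    labels = {}
    for frame in frames:
        if frame in audio_frames:
            # Frames including audio from the same endpoint: speaking.
            labels[frame] = "speaking"
        else:
            # Default label for every other frame: pre-speaking.
            labels[frame] = "pre-speaking"
    return labels
```

Frames labeled this way form the guided learning dataset the claim describes; the pre-speaking frames are what let a model learn the visual cues that precede speech.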
Abstract
Systems and methods are disclosed for anticipating a video switch to accommodate a new speaker in a video conference. A real time video stream captured by a camera local to a first videoconference endpoint is analyzed according to at least one speaker anticipation model. The speaker anticipation model predicts that a new speaker is about to speak. Video of the anticipated new speaker is sent to the conferencing server in response to a request for the video of the anticipated new speaker from the conferencing server. Video of the anticipated new speaker is then distributed to at least a second videoconference endpoint.
15 Claims
1. A method of anticipating a video switch to accommodate a new speaker in a video conference comprising:
analyzing, by at least one speaker anticipation model, a real time video stream captured by a camera local to a first videoconference endpoint;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server; and
distributing, via the conferencing server, the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (2, 3, 4, 5, 6)
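The request/response flow in claim 1 can be sketched as a small simulation: the endpoint's anticipation model flags a likely new speaker, the conferencing server requests that endpoint's video, and the server distributes it to the other endpoints. This is a hypothetical sketch; the class names, the score threshold standing in for the anticipation model, and the return values are all illustrative assumptions.

```python
class ConferencingServer:
    """Hosts the meeting and routes video between endpoints."""

    def __init__(self):
        self.endpoints = []

    def register(self, endpoint):
        self.endpoints.append(endpoint)

    def on_prediction(self, endpoint):
        # In response to the prediction, request the video of the
        # anticipated new speaker from the predicting endpoint...
        video = endpoint.send_video()
        # ...and distribute it to every other (second) endpoint.
        return [(other.name, video)
                for other in self.endpoints if other is not endpoint]


class Endpoint:
    """A videoconference endpoint with a local camera and model."""

    def __init__(self, name, server):
        self.name = name
        self.server = server
        server.register(self)

    def analyze(self, anticipation_score, threshold=0.5):
        # Stand-in for the speaker anticipation model: a score above the
        # threshold counts as "a new speaker is about to speak".
        if anticipation_score > threshold:
            return self.server.on_prediction(self)
        return []

    def send_video(self):
        return f"video-from-{self.name}"
```

In this sketch, only a confident prediction triggers the server's request, which matches the claim's ordering: the endpoint sends video in response to the server's request, not unprompted.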
7. A videoconference system for anticipating a video switch to accommodate a new speaker in a video conference, the system comprising:
a processor;
a storage device in communication with the processor;
a videoconference server in communication with the processor; and
a camera in communication with the processor,
a first videoconference endpoint participating in a multi-endpoint meeting hosted by the videoconference server, the first videoconference endpoint configured, via the storage device and the processor, to:
analyze a real time video stream captured by the camera local to the first videoconference endpoint according to at least one speaker anticipation model;
predict, by the at least one speaker anticipation model, that a new speaker is about to speak to yield a prediction, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds of the new speaker, each of the labeled video frames including one of two different labels based on audio;
send video of the new speaker to the videoconference server in response to a request for the video of the new speaker from the videoconference server; and
distribute the video of the new speaker to a second videoconference endpoint, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (8, 9, 10, 11, 12)
13. A method of anticipating a video switch to accommodate a new speaker in a video conference, the method comprising the steps of:
analyzing a real time video stream captured by a camera local to a first videoconference endpoint according to at least one speaker anticipation model;
predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds, each of the labeled video frames including one of two different labels based on audio;
sending video of the new speaker to a conferencing server in response to a request for the video of the new speaker from the conferencing server;
distributing the video of the new speaker to a second videoconference endpoint;
ranking a plurality of signals captured during a real time video conference according to a semantic representation model;
ignoring signals having a ranking below a threshold; and
sending a request for high resolution video to the first videoconference endpoint when the ranking of at least one of the plurality of signals is above the threshold, wherein each of the labeled video frames is default-labeled as a pre-speaking frame except for any of the labeled video frames including audio from a same endpoint, and each of the labeled video frames including the audio from the same endpoint is labeled as speaking frames.
View Dependent Claims (14, 15)
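The ranking steps added in claim 13 can be sketched as a filter: each captured signal is scored by a stand-in for the semantic representation model, signals ranked below the threshold are ignored, and a high-resolution video request is issued when at least one signal ranks above the threshold. The function signature and the use of a plain scoring callable are illustrative assumptions.

```python
def process_signals(signals, rank, threshold):
    """Decide whether to request high resolution video from the endpoint.

    signals: signals captured during the real time video conference.
    rank: callable mapping a signal to a ranking score (stand-in for the
        semantic representation model).
    threshold: rankings at or below this value are ignored.
    """
    ranked = [(signal, rank(signal)) for signal in signals]
    # Ignore signals having a ranking below the threshold.
    kept = [(signal, score) for signal, score in ranked if score > threshold]
    # Request high resolution video when any ranking is above the threshold.
    request_high_res = bool(kept)
    return kept, request_high_res
```

A usage sketch: `process_signals(["nod", "lean-in"], model.score, 0.5)` would return the above-threshold signals along with the flag that drives the server's request to the first videoconference endpoint.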