SPEAKER ANTICIPATION

US 20180376108A1
Filed: 07/11/2017
Published: 12/27/2018
Est. Priority Date: 06/23/2017
Status: Active Grant

Abstract

Systems and methods are disclosed for anticipating a video switch to accommodate a new speaker in a video conference. A real time video stream captured by a camera local to a first videoconference endpoint is analyzed according to at least one speaker anticipation model. The speaker anticipation model predicts that a new speaker is about to speak. Video of the anticipated new speaker is sent to the conferencing server in response to a request for the video of the anticipated new speaker from the conferencing server. Video of the anticipated new speaker is distributed to at least a second videoconference endpoint.

29 Citations

20 Claims

1. A method of anticipating a video switch to accommodate a new speaker in a video conference comprising:
- analyzing a real time video stream captured by a camera local to a first videoconference endpoint according to at least one speaker anticipation model;
  
  predicting, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds;
  
  sending video of the anticipated new speaker to the conferencing server in response to a request for the video of the anticipated new speaker from the conferencing server; and
  
  distributing the video of the anticipated new speaker to a second videoconference endpoint.
- Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the first video conference endpoint is specific to a single conference participant, and the video conference is a web meeting, and wherein the at least one speaker anticipation model has been trained by a guided learning dataset from historical video feeds including a single conference participant.
  - 3. The method of claim 1, wherein the video frames being labeled as speaking frames when the video is accompanied by audio from the same endpoint and the video frames being labeled as pre-speaking frames when the video frame is not accompanied by audio from the same endpoint but precedes the frames labeled as speaking frames.
  - 4. The method of claim 3, wherein the at least one speaker anticipation model has been derived by:
    - training a first machine learning algorithm on the guided learning dataset, wherein the first machine learning algorithm analyzes static frames to identify visual speaker anticipation cues; and
      
      providing the output of the first machine learning algorithm as an input to a second machine learning algorithm, wherein the second machine learning algorithm analyzes sequences of the static frames and sequences of the visual speaker anticipation cues, the output of the second machine learning algorithm being the at least one speaker anticipation model.
  - 5. The method of claim 4, comprising:
    - analyzing the video data consisting of a real time video stream captured by the camera local to the first video conference endpoint to label frames of the real time video stream as speaking frames and pre-speaking frames; and
      
      applying a third machine learning algorithm to analyze the labeled real time video frames to update the at least one speaker anticipation model.
  - 6. The method of claim 1, wherein the first video conference endpoint is a video conference room system that may include a plurality of conference participants, and wherein the at least one speaker anticipation model is a semantic representation model that has been created by a machine learning algorithm that has analyzed a plurality of signals including at least one of the video data, audio data, and conference participant in-room location data, from historical videoconferences taking place in a specific meeting room.
  - 7. The method of claim 6, comprising:
    - ranking the plurality of signals captured during a real time video conference according to the semantic representation model;
      
      ignoring signals having a ranking below a threshold; and
      
      sending a request for high resolution video to the first video conference endpoint when the ranking of at least one of the plurality of signals is above a threshold.
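
The ranking and thresholding steps recited in claim 7 can be illustrated with a minimal Python sketch. This is not the patented implementation. The `Signal` class, the numeric `score` field, the `triage_signals` function, and the choice to treat a score exactly equal to the threshold as "not ignored but not sufficient to trigger a request" are all assumptions made for illustration; the claim does not say how rankings are computed or represented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Signal:
    """Hypothetical container for one captured signal."""
    source: str   # e.g. "video", "audio", "location" (signal types named in claim 6)
    score: float  # assumed numeric ranking produced by the semantic representation model


def triage_signals(signals: List[Signal], threshold: float) -> Tuple[List[Signal], bool]:
    """Rank signals, ignore those below the threshold, and decide on a request.

    Returns the retained signals (highest score first) and a flag saying
    whether high resolution video should be requested from the endpoint.
    """
    # Rank: highest score first.
    ranked = sorted(signals, key=lambda s: s.score, reverse=True)
    # Ignore signals whose ranking falls below the threshold.
    kept = [s for s in ranked if s.score >= threshold]
    # Request high resolution video when at least one ranking is above the threshold.
    request_high_res = any(s.score > threshold for s in kept)
    return kept, request_high_res
```

For example, with scores of 0.2 (audio), 0.9 (video), and 0.5 (location) and a threshold of 0.5, the sketch keeps the video and location signals and sets the request flag because the video score exceeds the threshold.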

8. At least one non-transitory computer readable medium comprising instructions that when executed cause at least one computing device to:
- receive, from a first video conference endpoint participating in a video conference, a prediction that a new speaker is about to speak at the first video conference endpoint;
  
  determine an allocation of media bandwidth distributed to the first video conference endpoint and at least a second video conference endpoint participating in the video conference, wherein the allocation of media bandwidth of the first video conference endpoint is increased based on the strength of the prediction; and
  
  request, from the first video conference endpoint, upgraded video of the new speaker according to the allocation.
- Dependent Claims (9, 10, 11, 12, 13)
  - 9. The at least one non-transitory computer-readable medium of claim 8, wherein the prediction is based on a speaker anticipation model that was created according to instructions to:
    - analyze static frames from historical video feeds to identify visual speaker anticipation cues using a first machine learning algorithm;
      
      analyze sequences of frames from the historical video feeds, and sequences of the visual speaker anticipation cues by a second machine learning algorithm;
      
      based on the first machine learning algorithm and the second machine learning algorithm, determine the speaker anticipation model; and
      
      provide the speaker anticipation model to the first video conference endpoint.
  - 10. The at least one non-transitory computer-readable medium of claim 9, wherein the instructions cause the at least one computing device to:
    - after providing the speaker anticipation model to the first video conference endpoint, receive analyzed real time video frames from the first video conference endpoint, the real time video frames having been analyzed according to the speaker anticipation model; and
      
      based on receiving the analyzed real time video frames, update the speaker anticipation model.
  - 11. The at least one non-transitory computer-readable medium of claim 8, wherein the first video conference endpoint is a video conference room system that may include a plurality of conference participants, and wherein the speaker anticipation model is a semantic representation model created by a machine learning algorithm that analyzes a plurality of signals including at least one of the video data, audio data, and conference participant in-room location data, from historical videoconferences taking place in a specific meeting room.
  - 12. The at least one non-transitory computer-readable medium of claim 8, wherein predicting that a new speaker is about to speak includes instructions to:
    - collect video, audio, and gesture recognition to perform a predictive analysis;
      
      allow camera feed focus in real time; and
      
      learn which actions are significant to focus camera attention on.
  - 13. The at least one non-transitory computer-readable medium of claim 8, wherein the instructions cause the at least one computing device to:
    - determine content of the allocated video based at least in part on the prediction, wherein the prediction determines at least one of framing the new speaker or framing multiple new speakers at the same time.
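
Claim 8 recites increasing the first endpoint's share of media bandwidth based on the strength of the prediction. One way such an allocation could work is sketched below in Python. The weighting scheme, the function name `allocate_bandwidth`, the `max_boost` parameter, and the assumption that prediction strength is a value between 0 and 1 are all hypothetical; the claim does not specify a formula.

```python
from typing import Dict, List


def allocate_bandwidth(
    total_kbps: float,
    endpoints: List[str],
    predicted_endpoint: str,
    strength: float,
    max_boost: float = 2.0,
) -> Dict[str, float]:
    """Split total_kbps across endpoints, favoring the predicted speaker's endpoint.

    Every endpoint starts with weight 1.0. The endpoint that reported the
    prediction gains extra weight proportional to the prediction strength
    (assumed to lie in [0, 1]); shares are then normalized to total_kbps.
    """
    if predicted_endpoint not in endpoints:
        raise ValueError("predicted_endpoint must be one of endpoints")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")

    weights = {name: 1.0 for name in endpoints}
    # A stronger prediction yields a larger share for the predicted endpoint.
    weights[predicted_endpoint] += max_boost * strength

    total_weight = sum(weights.values())
    return {name: total_kbps * w / total_weight for name, w in weights.items()}
```

With two endpoints sharing 3000 kbps, a strength of 0 leaves an even split, while a strength of 1.0 with `max_boost=1.0` gives the predicted endpoint twice the weight of the other. The server would then request upgraded video from that endpoint according to its new share.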

14. A videoconference system for anticipating a video switch to accommodate a new speaker in a video conference, the system comprising:
- a first videoconference endpoint participating in a multi-endpoint meeting hosted by a videoconference server, the first videoconference endpoint configured to:
  
  analyze a real time video stream captured by a camera local to the first videoconference endpoint according to at least one speaker anticipation model;
  
  predict, by the at least one speaker anticipation model, that a new speaker is about to speak, the at least one speaker anticipation model trained by a guided learning dataset from historical video feeds derived from a series of labeled video frames from the historical video feeds;
  
  send video of the anticipated new speaker to the videoconference server in response to a request for the video of the anticipated new speaker from the videoconference server; and
  
  distribute the video of the anticipated new speaker to a second videoconference endpoint.
- Dependent Claims (15, 16, 17, 18, 19, 20)
  - 15. The first videoconference endpoint of claim 14, wherein the at least one speaker anticipation model has been trained by a guided learning dataset from historical video feeds that has been derived from a series of labeled video frames from the historical video feeds, the video frames being labeled as speaking frames when the video is accompanied by audio from the same endpoint and the video frames being labeled as pre-speaking frames when the video frame is not accompanied by audio from the same endpoint but precedes the frames labeled as speaking frames.
  - 16. The first videoconference endpoint of claim 15, wherein the at least one speaker anticipation model is further derived by:
    - training a first machine learning algorithm on the guided learning dataset; and
      
      providing the output of the first machine learning algorithm as an input to a second machine learning algorithm, the output of the second machine learning algorithm being the at least one speaker anticipation model.
  - 17. The first videoconference endpoint of claim 16, wherein the first machine learning algorithm analyzes static frames to identify visual speaker anticipation cues.
  - 18. The first videoconference endpoint of claim 17, wherein the second machine learning algorithm analyzes sequences of the static frames and sequences of the visual speaker anticipation cues.
  - 19. The first videoconference endpoint of claim 14, wherein predicting that a new speaker is about to speak further comprises an attention mechanism associated with a conference room that comprises:
    - collecting video, audio, and gesture recognition to perform a predictive analysis;
      
      allowing camera feed focus in real time; and
      
      learning which actions are significant to focus camera attention on.
  - 20. The first videoconference endpoint of claim 14, further comprising:
    - determining content of the distributed video based at least in part on the prediction; and
      
      wherein the prediction determines at least one of framing the anticipated new speaker or framing multiple anticipated new speakers at the same time.
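
The frame labeling rule described in claims 3 and 15 (speaking frames are accompanied by audio from the same endpoint; pre-speaking frames have no audio but precede speaking frames) can be sketched in Python as follows. This is an illustrative reading only. The function name `label_frames`, the `pre_window` parameter bounding how many silent frames before an audio onset count as pre-speaking, and the extra "other" label for remaining silent frames are assumptions that the claims do not specify.

```python
from typing import List


def label_frames(has_audio: List[bool], pre_window: int = 3) -> List[str]:
    """Label each video frame from a per-frame audio-presence flag.

    has_audio[i] is True when frame i is accompanied by audio from the
    same endpoint. Silent frames within pre_window frames before an
    audio onset are labeled "pre-speaking".
    """
    # Start with the direct rule: audio present means a speaking frame.
    labels = ["speaking" if audio else "other" for audio in has_audio]

    for i, audio in enumerate(has_audio):
        # An onset is a frame with audio that follows a frame without audio.
        is_onset = audio and i > 0 and not has_audio[i - 1]
        if not is_onset:
            continue
        # Walk back over the preceding window and relabel silent frames.
        for j in range(max(0, i - pre_window), i):
            if not has_audio[j]:
                labels[j] = "pre-speaking"
    return labels
```

For the input `[False, False, False, False, True, True, False]` with `pre_window=2`, the two silent frames immediately before the onset become pre-speaking frames, the two frames with audio become speaking frames, and the rest are labeled "other". Labeled sequences of this kind are what the claims call a guided learning dataset.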

Current Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Original Assignee: Cisco Technology, Inc. (Cisco Systems, Inc.)
Inventors: Bright-Thomas, Paul; Buckles, Nathan; Griffin, Keith; Chen, Eric; Kesavan, Manikandan; Nedeltchev, Plamen; Latapie, Hugo Mike; Fenoglio, Enzo

Granted Patent: US 10,477,148 B2

CPC Class Codes

G06F 18/24143   Distances to neighbourhood ...

G06V 10/454   Integrating the filters int...

G06V 10/82   using neural networks

G06V 20/40   in video content extracting...

G06V 20/41   Higher-level, semantic clus...

G06V 30/19173   Classification techniques

G06V 40/174   Facial expression recognition

G10L 15/1815   Semantic context, e.g. disa...

G10L 25/57   for processing of video sig...

H04N 7/147   Communication arrangements,...

H04N 7/15   Conference systems

H04N 7/152   Multipoint control units th...
