Method and apparatus for focus-of-attention control
Abstract
Disclosed are methods for automatically generating commands to transform a video sequence based on information regarding speaking participants derived from the audio and video signals. The audio stream is analyzed to detect individual speakers and the video is optionally analyzed to detect lip movement to determine a probability that a detected participant is speaking. Commands are then generated to transform the video stream consistent with the identified speaker.
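The abstract's pipeline (a per-participant speaking probability from audio, optionally refined by lip-movement analysis of the video, fused to pick the current speaker) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `Participant` class, the fusion weights, and the 0.5 threshold are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    audio_prob: float  # speaking probability from audio analysis (assumed given)
    video_prob: float  # speaking probability from lip-movement analysis (assumed given)

def fuse_speaking_probability(p: Participant, audio_weight: float = 0.6) -> float:
    """Combine audio and video speaking evidence; the weighting is illustrative."""
    return audio_weight * p.audio_prob + (1 - audio_weight) * p.video_prob

def select_speaker(participants, threshold=0.5):
    """Return the most likely current speaker, or None if nobody clears the threshold."""
    best = max(participants, key=fuse_speaking_probability)
    return best if fuse_speaking_probability(best) >= threshold else None

participants = [
    Participant("alice", audio_prob=0.9, video_prob=0.8),
    Participant("bob", audio_prob=0.2, video_prob=0.1),
]
speaker = select_speaker(participants)
print(speaker.name)  # alice
```

With these numbers, alice's fused probability is 0.86 versus bob's 0.16, so alice is selected as the speaker driving the subsequent video transformation.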
20 Claims
1. A method for virtual camera control implemented by one or more computing devices, comprising:
acquiring at one or more computing devices a media stream having an audio component and a video component;
processing the video component to detect one or more video participants;
processing the video component to determine a video speaking state indicating a probability that at least one of the one or more detected video participants is currently speaking;
processing the audio component to detect one or more audio participants;
processing the audio component to determine an audio speaking state indicating a probability that at least one of the one or more detected audio participants is currently speaking;
identifying a person and a speaking state, associated with the person, based on the video speaking state and the audio speaking state; and
applying at least one video transformation to the video component based at least in part on the identified person and speaking state.
Dependent claims: 2, 3, 4, 5, 6.
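One plausible form of the "video transformation" applied to the identified speaker is a virtual pan/zoom crop centered on that person. The function below is a hypothetical sketch under that assumption; the face bounding box, zoom factor, and edge clamping are illustrative, not taken from the claims.

```python
def virtual_camera_crop(frame_w, frame_h, face_box, zoom=2.0):
    """Compute a crop rectangle (x, y, w, h) centered on the speaker's face box.

    face_box is (x, y, w, h) in pixels; the crop keeps the frame's aspect ratio
    and is clamped so it never extends past the frame edges.
    """
    cx = face_box[0] + face_box[2] / 2  # face center, horizontal
    cy = face_box[1] + face_box[3] / 2  # face center, vertical
    crop_w, crop_h = frame_w / zoom, frame_h / zoom
    x = min(max(cx - crop_w / 2, 0), frame_w - crop_w)
    y = min(max(cy - crop_h / 2, 0), frame_h - crop_h)
    return (int(x), int(y), int(crop_w), int(crop_h))

print(virtual_camera_crop(1920, 1080, (900, 400, 120, 160)))  # (480, 210, 960, 540)
```

Scaling the resulting crop back to full resolution simulates a camera panning and zooming toward the active speaker without any physical camera motion.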
7. An apparatus for virtual camera control implemented by one or more computing devices, comprising:
a memory; and
a processor operative to retrieve instructions from the memory and execute them to:
acquire at one or more computing devices a media stream having an audio component and a video component;
process the video component to detect one or more video participants;
process the video component to determine a video speaking state indicating a probability that at least one of the one or more detected video participants is currently speaking;
process the audio component to detect one or more audio participants;
process the audio component to determine an audio speaking state indicating a probability that at least one of the one or more detected audio participants is currently speaking;
identify a person and a speaking state, associated with the person, based on the video speaking state and the audio speaking state; and
apply at least one video transformation to the video component based at least in part on the identified person and speaking state.
Dependent claims: 8, 9, 10, 11, 12.
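The apparatus claim pairs stored instructions with a processor that applies the transformation whenever the identified speaking state is active. A toy sketch of that control structure follows; the class name, the callable transform, and the per-frame speaking flags are all illustrative assumptions.

```python
class VirtualCameraController:
    """Sketch of the claimed apparatus: a transformation (the 'instructions')
    held by the object and applied per frame when a speaker is identified."""

    def __init__(self, transform):
        self.transform = transform  # e.g. a crop/zoom function applied to a frame

    def process(self, frames, speaking_states):
        """Apply the transform to each frame whose speaking state is active."""
        return [
            self.transform(frame) if speaking else frame
            for frame, speaking in zip(frames, speaking_states)
        ]

# Illustrative use with strings standing in for frames:
controller = VirtualCameraController(lambda frame: frame.upper())
print(controller.process(["a", "b"], [True, False]))  # ['A', 'b']
```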
13. A method for virtual camera control implemented by one or more computing devices, comprising:
acquiring at one or more computing devices a media stream having an audio component and a video component;
processing the video component to detect one or more video participants;
processing the video component to detect a location of at least one of the one or more video participants;
processing the audio component to detect a location of at least one of one or more audio participants;
processing the audio component to determine an audio speaking state for the at least one of the one or more audio participants, the audio speaking state indicating a probability that the at least one of the one or more audio participants is currently speaking;
identifying a location of a person and a speaking state, associated with the person, based on the audio speaking state and the location of the at least one of the one or more audio participants, and the location of the at least one of the one or more video participants; and
applying at least one video transformation to the video component based at least in part on the identified location of the person and the speaking state.
Dependent claims: 14, 15, 16, 17, 18, 19, 20.
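Claim 13 fuses an audio-derived location with the locations of video-detected participants. One common way to do this, used here purely as an illustrative assumption, is to project an audio direction-of-arrival angle into image coordinates and match it to the nearest detected face.

```python
def match_audio_to_video(audio_angle_deg, face_centers, frame_w, fov_deg=90.0):
    """Map an audio direction-of-arrival angle to the nearest detected face center.

    Assumes a linear mapping in which angles spanning +/- fov_deg/2 cover the
    frame width; audio_angle_deg = 0 points along the camera axis. All names
    and the mapping itself are illustrative.
    """
    # Convert the audio angle to a horizontal pixel coordinate.
    x_audio = frame_w / 2 + (audio_angle_deg / fov_deg) * frame_w
    # Pick the face whose horizontal position best matches the audio direction.
    return min(face_centers, key=lambda c: abs(c[0] - x_audio))

faces = [(400, 300), (1400, 320)]
print(match_audio_to_video(-15.0, faces, frame_w=1920))  # (400, 300)
```

The matched face location then drives the transformation of claim 13's final step, e.g. a crop centered on that position.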
Specification