Method and apparatus for audio/image speaker detection and locator
Abstract
A method and apparatus for a video conferencing system using an array of two microphones and a stationary camera to automatically locate a speaker and electronically manipulate the video image to produce the effect of a movable pan tilt zoom (“PTZ”) camera. Computer vision algorithms are used to detect, locate, and track people in the field of view of a wide-angle, stationary camera. The estimated acoustic delay obtained from a microphone array, consisting of only two horizontally spaced microphones, is used to select the person speaking. The system can also detect possible ambiguities, in which case it can respond in a fail-safe way; for example, it can zoom out to include all the speakers located at the same horizontal position.
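The selection step described in the abstract can be illustrated with a minimal sketch. It converts an estimated inter-microphone delay into a horizontal bearing under a far-field assumption, then lists the tracked people whose horizontal angle matches that bearing. The function names, the person identifiers, the 343 m/s speed of sound, the 0.2 m microphone spacing, and the 5 degree tolerance are all assumptions made for illustration; none of them comes from the patent text.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def delay_to_bearing(delay_s, mic_spacing_m):
    """Far-field bearing in degrees (0 = straight ahead) from a two-microphone delay."""
    # sin(theta) = c * tau / d; clamp so measurement noise cannot leave [-1, 1].
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay_s / mic_spacing_m))
    return math.degrees(math.asin(ratio))


def select_speakers(bearing_deg, people, tolerance_deg=5.0):
    """Return the ids of tracked people whose horizontal angle matches the bearing.

    `people` maps a hypothetical person id to the horizontal angle (degrees)
    reported by the vision tracker. More than one match means two horizontally
    spaced microphones cannot tell the candidates apart, so the caller should
    fall back to framing all of them.
    """
    return sorted(
        pid for pid, angle in people.items()
        if abs(angle - bearing_deg) <= tolerance_deg
    )


if __name__ == "__main__":
    tracked = {"A": -30.0, "B": 2.0, "C": 3.5}
    bearing = delay_to_bearing(0.0, 0.2)  # zero delay: source straight ahead
    matches = select_speakers(bearing, tracked)
    print(matches, "ambiguous" if len(matches) > 1 else "unique")
```

With the sample values, two people fall within the tolerance, which is the ambiguous case in which the abstract suggests zooming out to include everyone at that horizontal position.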
16 Claims
1. A video conferencing system comprising:
an image pickup device for generating image signals representative of an image;
an audio pickup device for generating audio signals representative of sound from an audio source; and
a multimodal integration architecture system for processing said image signals and said audio signals to determine a direction of the audio source relative to a reference point.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method comprising the steps of:
generating, at an image pickup device, image signals representative of an image;
generating, at an audio pickup device, audio signals representative of sound from an audio source;
processing the image signals and the audio signals to determine a direction of the audio source relative to a reference point;
manipulating the image signals to produce refined image signals; and
outputting said refined image signals.
Dependent claims: 11, 12, 13, 14, 15
16. A video conferencing system comprising:
two microphones for generating audio signals representative of sound from a speaker;
a video camera for generating video signals representative of a video image;
an electronic pan tilt zoom system for manipulating video images to produce the visual effects of panning, tilting, and/or zooming;
a processor for processing the video signals and the audio signals to determine a direction of a speaker relative to a reference point and supplying control signals to the electronic pan tilt zoom system for producing images that include the speaker in the field of view of the camera, the control signals being generated based on the determined direction of the speaker; and
a transmitter for transmitting audio and video signals for video conferencing.
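The electronic pan tilt zoom element of claim 16 can be pictured as choosing a crop window inside the wide-angle frame. The following is a minimal sketch, assuming the speaker is available as a pixel bounding box from the vision tracker. The function names, the 4:3 output aspect ratio, and the 1.5 margin factor are hypothetical choices for illustration; this section of the patent does not say how the window is computed.

```python
def crop_window(box, frame_w, frame_h, aspect=4 / 3, margin=1.5):
    """Return (left, top, width, height) of a crop that frames `box`.

    `box` is (left, top, width, height) in pixels. The crop is centered on the
    box, enlarged by `margin`, forced to the given aspect ratio, and shifted
    so that it stays inside the frame.
    """
    bx, by, bw, bh = box
    cx, cy = bx + bw / 2, by + bh / 2
    # Grow the box by the margin, then widen or heighten it to match the aspect ratio.
    w, h = bw * margin, bh * margin
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect
    # Never request more pixels than the frame has (the full zoom-out case).
    scale = min(1.0, frame_w / w, frame_h / h)
    w, h = w * scale, h * scale
    # Clamp the window position so it stays inside the frame.
    left = min(max(cx - w / 2, 0), frame_w - w)
    top = min(max(cy - h / 2, 0), frame_h - h)
    return (round(left), round(top), round(w), round(h))


def union_box(boxes):
    """Smallest box containing all given boxes, for framing several candidates."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


if __name__ == "__main__":
    # Single speaker in a 640x480 frame.
    print(crop_window((300, 200, 40, 60), 640, 480))
    # Two candidates that cannot be told apart: frame both.
    both = union_box([(100, 100, 50, 50), (300, 120, 50, 80)])
    print(crop_window(both, 640, 480))
```

Panning and tilting correspond to moving the window, and zooming corresponds to changing its size before the crop is scaled to the output resolution. Passing the union of several boxes gives the zoom-out behavior for ambiguous cases.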
Specification