Locating an audio source
Abstract
A system, such as a video conferencing system, is provided which includes an image pickup device, an audio pickup device, and an audio source locator. The image pickup device generates image signals representative of an image, while the audio pickup device generates audio signals representative of sound from an audio source, such as a speaking person. The audio source locator processes the image signals and audio signals to determine a direction of the audio source relative to a reference point. The system can further determine a location of the audio source relative to the reference point. The reference point can be a camera. The system can use the direction or location information to frame a proper camera shot which would include the audio source.
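The abstract does not say how the audio based direction is computed. As a minimal sketch, assuming a two-microphone pair, a far-field source, and a measured arrival-time difference (none of which are stated in this listing), the bearing could be estimated as follows. The names `bearing_from_delay` and `SPEED_OF_SOUND_M_S` are hypothetical.

```python
import math

# Assumed speed of sound in air at roughly room temperature.
SPEED_OF_SOUND_M_S = 343.0


def bearing_from_delay(delay_s: float, mic_spacing_m: float) -> float:
    """Estimate the source bearing in degrees from the arrival-time difference.

    The bearing is measured from the broadside of a two-microphone pair
    (0 degrees means the source is straight ahead of the pair).
    """
    if mic_spacing_m <= 0:
        raise ValueError("mic_spacing_m must be positive")
    # Far-field model: path difference = spacing * sin(bearing).
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    # Clamp so that measurement noise cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

A zero delay maps to a source straight ahead of the pair, and larger delays map to bearings further off axis, up to 90 degrees.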
64 Claims
1. A system comprising:
an image pickup device generating image signals representative of an image;
an audio pickup device generating audio signals representative of sound from an audio source;
an audio based locator for processing the audio signals to determine a direction of the audio source relative to a reference point;
a video based locator including a video face location module that processes the image signals for pixels having flesh tone colors to identify an object in the direction of the audio source; and
a pointing control that steers the image pickup device to frame the object in the direction of the audio source.
Dependent claims: 2-29, 62
a positioning device for positioning said image pickup device, wherein the audio source locator supplies control signals to the positioning device for positioning the image pickup device, the control signals including signals, based on the audio based direction, for causing said positioning device panning said image pickup device and signals, based on the video based location, for tilting said image pickup device.
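The split described above, panning from the audio based direction and tilting from the video based location, can be sketched as below. This is an illustration only: the linear mapping from pixel row to tilt angle, and the names `ControlSignals` and `control_signals`, are assumptions and are not taken from the claims.

```python
from dataclasses import dataclass


@dataclass
class ControlSignals:
    pan_deg: float   # horizontal rotation, taken from the audio based direction
    tilt_deg: float  # vertical rotation, derived from the video based location


def control_signals(audio_bearing_deg: float, face_row_px: float,
                    frame_height_px: int, vertical_fov_deg: float) -> ControlSignals:
    """Build pan and tilt commands from separate audio and video estimates."""
    # Offset of the face from the frame centre, as a fraction of frame height.
    # Image rows grow downward, so a face above centre gives a positive offset.
    offset = (frame_height_px / 2 - face_row_px) / frame_height_px
    # Assumed linear approximation: a fraction of the frame maps to the
    # same fraction of the vertical field of view.
    tilt = offset * vertical_fov_deg
    return ControlSignals(pan_deg=audio_bearing_deg, tilt_deg=tilt)
```

For example, a face detected a quarter of the frame above centre with a 40 degree vertical field of view would produce a 10 degree upward tilt, while the pan command passes the audio bearing through unchanged.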
20. The system of claim 1 wherein the video based locator determines a video based location of the image by determining, in part or in whole, a contour of the image.
21. The system of claim 20 wherein the video based locator uses a parameter in detecting the contour of the image, wherein changing the parameter in one direction increases a likelihood of detecting contours of images and changing that parameter in another direction decreases the likelihood, and the video based locator changes the parameter, when detecting the contour of the image, to increase or decrease the likelihood.
22. The system of claim 21 wherein the video based locator determines a noise level wherein an increase in the noise level decreases the likelihood of detecting contours of persons in images, wherein the video based locator changes the parameter based on the noise level.
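Claims 21 and 22 describe a parameter that controls how readily contours are detected and that is changed based on a noise level. One possible reading, sketched here under assumptions, treats the parameter as a gradient threshold that scales with the measured noise. The functions `edge_threshold` and `contour_pixels`, the `gain` and `floor` defaults, and the one-dimensional gradient test are all hypothetical.

```python
def edge_threshold(noise_level: float, gain: float = 3.0, floor: float = 4.0) -> float:
    """Pick a gradient threshold from the noise level (assumed rule).

    A lower threshold makes contour detection more likely; a higher one
    makes it less likely.
    """
    return max(floor, gain * noise_level)


def contour_pixels(row: list, threshold: float) -> list:
    """Return indices in one image row where the intensity step exceeds the threshold."""
    return [i for i in range(1, len(row)) if abs(row[i] - row[i - 1]) > threshold]


# Usage: the same row yields fewer contour pixels as the noise estimate grows.
row = [10, 12, 11, 60, 62, 61, 10]
quiet = contour_pixels(row, edge_threshold(1.0))
noisy = contour_pixels(row, edge_threshold(20.0))
```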
23. The system of claim 20 wherein the video based locator detects a region representing a moving person and determines, in part or in whole, a contour of an image of the moving person.
24. The system of claim 1 further comprising a memory unit for storing a video based location of an image in a previous one of the frames of video,
wherein the audio based locator correlates the audio based direction of the audio source with the stored video based location of the image in the previous one of the frames of video to determine whether the image in the previous one of the frames of video corresponds to the audio source, and the audio based locator uses the audio based direction in producing control signals for the image pickup device, only if the audio based locator determines that the image in the previous one of the frames of video corresponds to the audio source.
25. The system of claim 1 wherein the audio based locator detects a plurality of audio sources and uses a parameter to determine whether to validate at least one of the plurality of audio sources to use in producing control signals for the image pickup device, wherein changing the parameter in one direction increases a likelihood of the audio based locator validating said at least one of the plurality of audio sources and changing that parameter in another direction decreases the likelihood, and
wherein the audio based locator correlates the audio based direction of the audio source with the stored video based location of the image in one of the frames of video to determine whether the image in the one of the frames of video corresponds to the audio source, and if the image in the one of the frames of video corresponds to the audio source, the audio based locator changes the parameter in the one direction.
26. The system of claim 25 wherein if the image in the one of the frames of video fails to correspond to the audio source, the audio based locator changes the parameter in the other direction.
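The feedback loop in claims 25 and 26 can be illustrated with a small sketch. It assumes the parameter is a confidence threshold (lower means audio detections are validated more readily) and that "corresponds" means the two bearings agree within a tolerance. Both assumptions, along with the names `image_matches_audio` and `update_validation_threshold` and every default value, are hypothetical.

```python
def image_matches_audio(audio_bearing_deg: float, video_bearing_deg: float,
                        tolerance_deg: float = 5.0) -> bool:
    """Assumed correspondence test: the two bearings agree within a tolerance."""
    return abs(audio_bearing_deg - video_bearing_deg) <= tolerance_deg


def update_validation_threshold(threshold: float, matches: bool,
                                step: float = 0.05,
                                lo: float = 0.1, hi: float = 0.9) -> float:
    """Move the validation threshold after comparing audio with video.

    A match lowers the threshold (validation becomes more likely);
    a mismatch raises it (validation becomes less likely).
    """
    new = threshold - step if matches else threshold + step
    # Keep the parameter inside a fixed working range.
    return max(lo, min(hi, new))
```

The clamp keeps repeated matches or mismatches from driving the parameter to a value where every detection, or no detection, would be validated.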
27. The system of claim 1 wherein the audio based locator correlates the audio based direction of the audio source with the video based location of the image in one of the frames of video to determine whether the image corresponds to the audio source, and
if the audio based locator determines that the image fails to correspond to the audio source, the audio based locator causes an adjustment in the field of view of said image pickup device to include, in the field of view, the audio source and the video based location of the image in the one of the frames of video.
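One way to picture the adjustment in claim 27 is to centre the camera between the two bearings and widen the view until both fit. The sketch below assumes both positions are expressed as horizontal bearings in degrees and adds a fixed margin on each side. The function `framing_for_both` and its `margin_deg` default are hypothetical.

```python
def framing_for_both(audio_bearing_deg: float, video_bearing_deg: float,
                     margin_deg: float = 10.0) -> tuple:
    """Return (pan_deg, field_of_view_deg) that covers both bearings."""
    # Point the camera midway between the audio source and the video location.
    pan = (audio_bearing_deg + video_bearing_deg) / 2
    # Widen the field of view to span both, plus a margin on each side.
    fov = abs(audio_bearing_deg - video_bearing_deg) + 2 * margin_deg
    return pan, fov
```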
28. The system of claim 1 wherein the location is characterized by the direction of the audio source relative to the reference point and a distance, determined by the audio source locator, from the reference point to the audio source.
29. The system of claim 28 wherein the image signals represent frames of video images,
the audio based locator determines an audio based distance from the reference point to the audio source based on the audio signals, the video based locator determines a video based distance from the reference point to the audio source based on an image of the audio source in one of the frames of video, and the audio source locator determines the distance based on the audio based distance and the video based distance.
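Claim 29 does not say how the two distance estimates are obtained or combined. As a hedged sketch, the video based distance below uses a pinhole-camera relation with an assumed real face width, and the combination is a plain weighted average. The functions `video_distance_m` and `fuse_distance`, the 0.15 m face width, and the default weight are illustrative assumptions.

```python
def video_distance_m(face_width_px: float, focal_length_px: float,
                     real_face_width_m: float = 0.15) -> float:
    """Pinhole-camera estimate: distance = focal length * real width / image width."""
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_m / face_width_px


def fuse_distance(audio_m: float, video_m: float, audio_weight: float = 0.5) -> float:
    """Combine the two estimates with a weighted average (assumed rule)."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    return audio_weight * audio_m + (1.0 - audio_weight) * video_m
```

The weight could be shifted toward whichever estimate is considered more reliable in a given room; that choice is outside what the claim states.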
62. The system of claim 1 wherein the video face location module further processes the image signals for motion pixels.
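Combining a flesh tone test with a motion test, as claim 62 describes, might look like the sketch below. The RGB rule is a simple heuristic chosen for illustration and is not the color test used by the patent. The motion test is plain frame differencing. All names and thresholds are hypothetical.

```python
def is_flesh_tone(r: int, g: int, b: int) -> bool:
    """Assumed RGB heuristic for skin-like colors (illustrative only)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)


def is_motion(prev_px: tuple, curr_px: tuple, threshold: int = 30) -> bool:
    """A pixel counts as moved if its summed channel change exceeds the threshold."""
    return sum(abs(a - b) for a, b in zip(prev_px, curr_px)) > threshold


def candidate_mask(prev_frame: list, curr_frame: list) -> list:
    """Mark pixels that are both flesh toned in the current frame and changed since the previous one."""
    return [[is_flesh_tone(*c) and is_motion(p, c) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_frame, curr_frame)]


# Usage: frames are lists of rows, each row a list of (r, g, b) tuples.
skin, blue = (200, 140, 120), (20, 30, 200)
mask = candidate_mask([[skin, blue]], [[skin, skin]])
```

In this usage example the first pixel is skin colored but static, and the second pixel changed from blue to skin colored, so only the second is marked.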
30. A method comprising the steps of:
generating, at an image pickup device, image signals representative of an image;
generating audio signals representative of sound from an audio source;
processing the audio signals to determine a direction of the audio source relative to a reference point;
processing the image signals to identify an object in the direction of the audio source;
processing the image signals for pixels having flesh tone colors; and
steering the image pickup device to frame the object in the direction of the audio source.
Dependent claims: 31-54, 63
detecting the speaking person based on the audio signals, detecting images of the faces of a plurality of persons based on the video signals, and correlating the detected images to the speaking person to detect the image of the face of the speaking person.
34. The method of claim 32 further comprising the step of determining the direction of the face relative to the reference point.
35. The method of claim 32 wherein detecting an image of the face of the speaking person in one of the frames of video includes detecting a region representing a moving face.
36. The method of claim 32 wherein detecting an image of the face of the speaking person in one of the frames of video includes detecting a region having flesh tone colors in the frames of video.
37. The method of claim 36 wherein detecting an image of the face of the speaking person in one of the frames of video includes determining whether size of the region having flesh tone colors corresponds to a pre-selected size, the pre-selected size representing size of a pre-selected standard face.
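The size comparison in claim 37 can be sketched as a tolerance check of a region's bounding box against a pre-selected standard face size. The function `matches_standard_face` and the 50 percent tolerance are assumptions made for illustration.

```python
def matches_standard_face(width_px: float, height_px: float,
                          std_width_px: float, std_height_px: float,
                          tolerance: float = 0.5) -> bool:
    """Check whether a flesh tone region is close in size to a standard face.

    The region passes if each dimension is within `tolerance` (as a
    fraction) of the corresponding standard dimension.
    """
    return (abs(width_px - std_width_px) <= tolerance * std_width_px
            and abs(height_px - std_height_px) <= tolerance * std_height_px)
```

A region much wider than the standard, such as a flesh colored wall panel, would fail this check even though its pixels pass a color test.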
38. The method of claim 36 wherein detecting an image of the face of the speaking person in one of the frames of video includes determining the region having flesh tone colors does not correspond to an image of a face, if the region having flesh tone colors corresponds to a flesh tone colored object.
39. The method of claim 30 wherein processing the image signals and audio signals further includes determining the direction based on the audio based direction and the video based location of the object.
40. The method of claim 30 wherein processing the image signals and audio signals further includes
determining an offset of the video based location of the object from a pre-determined reference point in said one of the frames of video, and modifying the audio based direction, based on the offset, to determine the direction.
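The offset correction in claim 40 can be sketched as follows, assuming the pre-determined reference point is the horizontal centre of the frame, the camera is already pointed along the audio based direction, and pixel offsets map linearly to angle. The function `corrected_direction` and these modelling choices are hypothetical.

```python
def corrected_direction(audio_bearing_deg: float, object_col_px: float,
                        frame_width_px: int, horizontal_fov_deg: float) -> float:
    """Refine the audio based direction using the object's offset in the frame."""
    # Offset of the object from the frame centre, in pixels (right is positive).
    offset_px = object_col_px - frame_width_px / 2
    # Assumed linear mapping from pixel offset to angular offset.
    offset_deg = offset_px / frame_width_px * horizontal_fov_deg
    return audio_bearing_deg + offset_deg
```

With a 640 pixel wide frame and a 60 degree field of view, an object found 160 pixels right of centre shifts the direction by 15 degrees.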
41. The method of claim 30 wherein processing the image signals and audio signals further includes modifying, based on a previously determined offset of a video based location of an image in a previous one of the frames of video from a pre-determined reference point, the audio based direction to determine the direction.
42. The method of claim 30 wherein the object is a speaking person that moves relative to the reference point, and wherein processing the image signals and audio signals further includes:
detecting the movement of the speaking person, and causing, in response to the movement, an increase in the field of view of the image pickup device.
43. The method of claim 30 wherein processing the image signals and audio signals further includes:
correlating the audio based direction to a video based location of an image in one of the frames of video, and modifying the audio based direction, based on the results of said correlation, to determine the direction.
44. The method of claim 30 wherein processing the image signals and audio signals further includes using a previously determined direction of an audio source based on the audio signals and a previously determined video based location of an image of a face of a non-speaker person in a previous one of the frames of video to cause an adjustment in the field of view of the image pickup device to include, in the field of view, the audio source and previously determined video based location.
45. The method of claim 30 further comprising the step of supplying control signals, based on the audio based direction, for panning said image pickup device and signals, based on the video based location, for tilting said image pickup device.
46. The method of claim 45 wherein determining a video based location of the object includes determining, in part or in whole, a contour of the image.
47. The method of claim 46 wherein a parameter is used in detecting the contour of the image, wherein changing the parameter in one direction increases a likelihood of detecting contours of images and changing that parameter in another direction decreases the likelihood, the method further comprising the step of:
changing the parameter, when detecting the contour of the image, to increase or decrease the likelihood.
48. The method of claim 47 further comprising the steps of
determining a noise level, wherein an increase in the noise level decreases the likelihood of detecting contours of persons in images, and changing the parameter based on the noise level.
49. The method of claim 45 wherein determining a video based location of the object includes:
detecting a region representing a moving person, and determining, in part or in whole, a contour of an image of the moving person.
50. The method of claim 30 wherein processing the image signals and audio signals further includes:
correlating the audio based direction of the audio source with a video based location of an image in a frame of video to determine whether the image in the frame of video corresponds to the audio source, and using the audio based direction in producing control signals for the image pickup device, only if the image in the frame of video is determined to correspond to the audio source.
51. The method of claim 30 wherein processing the image signals and audio signals further includes:
detecting a plurality of audio sources, using a parameter to determine whether to validate at least one of the plurality of audio sources to use in producing control signals for the image pickup device, wherein changing the parameter in one direction increases a likelihood of validating said at least one of the plurality of audio sources and changing that parameter in another direction decreases the likelihood, correlating the audio based direction of the audio source with a video based location of an object in a previous one of the frames of video to determine whether the object in the previous one of the frames of video corresponds to the audio source, and changing the parameter in the one direction, if the object in the previous one of the frames of video corresponds to the audio source.
52. The method of claim 32 wherein processing the image signals and audio signals further includes:
correlating the audio based direction of the audio source with a video based location of an object in a previous one of the frames of video to determine whether the object corresponds to the audio source, and if the object fails to correspond to the audio source, causing an adjustment in the field of view of said image pickup device to include, in the field of view, the audio source and the video based location of the object in the previous one of the frames of video.
53. The method of claim 32 wherein the location is characterized by the direction of the audio source relative to the reference point and a distance, determined by the audio source locator, from the reference point to the audio source.
54. The method of claim 53 wherein the image signals represent frames of video images, the method further comprising:
determining an audio based distance from the reference point to the audio source based on the audio signals, determining a video based distance from the reference point to the audio source based on an image of the audio source in one of the frames of video; and
determining the distance based on the audio based distance and the video based distance.
63. The method of claim 30 wherein identification of regions or segments in a frame which may contain the object is based on detecting pixels which have flesh tone colors and which represent pixels which have moved.
55. A method for validating a detected speaker, comprising the steps of:
detecting audio signals representative of sound from an audio source;
processing the detected audio signals to determine a direction of the audio source;
comparing the processed audio signals with processed previous audio signals that were previously detected in a direction proximate to the direction of the processed audio signals;
determining that the processed audio signals include speech;
generating, at an image pickup device, image signals representative of a framed image obtained in the direction of the processed audio signals and including the detected speaker;
determining movement of the detected speaker based on the framed image and motion pixels; and
if a cross-correlation value between the processed audio signals and the processed previous audio signals is greater than a predetermined threshold value, verifying the detected speaker.
Dependent claims: 56-61, 64
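The final step of claim 55 compares a cross-correlation value against a predetermined threshold. The claim does not define how that value is computed, so the sketch below assumes a zero-lag normalized cross-correlation over two equal-rate sample sequences. The functions `normalized_cross_correlation` and `verify_speaker` and the 0.8 threshold are hypothetical.

```python
import math


def normalized_cross_correlation(a: list, b: list) -> float:
    """Zero-lag normalized cross-correlation of two sequences, in [-1, 1]."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    # A constant sequence has no variance; treat it as uncorrelated.
    return num / den if den else 0.0


def verify_speaker(current: list, previous: list, threshold: float = 0.8) -> bool:
    """Verify the detected speaker when the correlation exceeds the threshold."""
    return normalized_cross_correlation(current, previous) > threshold
```

Because the measure is normalized, a louder or quieter repetition of the same signal still correlates strongly, while an inverted or unrelated signal does not.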
Specification