Method and apparatus for visual sensing of humans for active public interfaces
Abstract
An active public user interface in a computerized kiosk senses humans visually using movement and color to detect changes in the environment indicating the presence of people. Interaction spaces are defined and the system records an initial model of its environment, which is updated over time to reflect the addition or subtraction of inanimate objects and to compensate for lighting changes. The system develops models of the moving objects and is thereby able to track people as they move about the interaction spaces. A stereo camera system further enhances the system's ability to sense location and movement. The kiosk presents audio and visual feedback in response to what it “sees.”
44 Claims
-
1. A computerized interface for interacting with people, comprising:
a camera providing video input representing a region of an arbitrary physical environment as a sequence of images organized with respect to distance from the interface;
means responsive to the video input for detecting changing distance between the interface and a person in the region from the sequence of images to determine intention of the person and hence identify the person as a target for interaction; and
audio visual means for rendering audio and video communication directed to the detected person in a manner such that the detected person engages in human-like informational conversant communication with the interface, content of the rendered audio and video communication changing as a function of change in detected distance between the interface and person. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
2. The interface of claim 1, further comprising:
means for determining a velocity of the person in the region, said velocity including rate and direction of travel; and
wherein a content of the communication depends on the direction of travel of the person with respect to the interface.
-
4. The interface of claim 1, further comprising:
means for determining a velocity of the person in the region, said velocity including rate and direction of travel; and
means for determining whether to communicate based on the direction of travel of the person with respect to the interface.
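The velocity determination recited in the claims above (rate plus direction of travel, used to decide whether and what to communicate) can be sketched as follows. This is an editor's minimal illustration, not the patented implementation; the function names, the two-position finite-difference estimate, and the fixed interface position are all assumptions:

```python
import math

def velocity(prev_pos, cur_pos, dt):
    """Return (speed, heading) from two (x, y) positions dt seconds apart."""
    dx = cur_pos[0] - prev_pos[0]
    dy = cur_pos[1] - prev_pos[1]
    speed = math.hypot(dx, dy) / dt
    heading = math.atan2(dy, dx)  # radians; 0 points along the +x axis
    return speed, heading

def approaching(prev_pos, cur_pos, interface_pos=(0.0, 0.0)):
    """True if the person's distance to the interface is decreasing."""
    d_prev = math.dist(prev_pos, interface_pos)
    d_cur = math.dist(cur_pos, interface_pos)
    return d_cur < d_prev
```

A kiosk could, for example, suppress its greeting when `approaching(...)` is false, matching the claim's "determining whether to communicate based on the direction of travel."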
-
5. The interface of claim 1 wherein the audiovisual means includes a display system displaying an image of a head including eyes and mouth with lips, the display system directing an orientation of the head and a gaze of the eyes at the detected person while rendering the information synchronized to movement of the lips so that the head appears to look at and talk to the person.
-
6. The interface of claim 1, further comprising:
means for determining a distance position and an orientation of the person in the region relative to a position of the camera.
-
7. The interface of claim 6, wherein the audiovisual means communicates information having a content dependent upon the determined distance position and the determined orientation of the person in the region.
-
8. The interface of claim 1, further comprising:
a memory, coupled to the means for detecting, the memory storing data representing a three-dimensional model of the physical environment for determining a position of the person in the region relative to objects represented in the three-dimensional model.
-
9. The interface of claim 8, wherein the audiovisual means communicates information having a content dependent upon the determined position of the person.
-
10. The interface of claim 1 wherein the sequence of images includes a reference image and a target image, each image being defined by pixels, the pixels of the reference image having a one-to-one correspondence to the pixels of the target image;
and further comprising:
means for comparing the reference image to the target image to identify a group of adjacent pixels in the reference image that are different from the corresponding pixels in the target image, the identified group of pixels representing the person.
-
11. The interface of claim 10 wherein the means for comparing compares an intensity of each pixel of the reference image to an intensity of each corresponding pixel in the target image, and the means for detecting detects the presence of the person in the region when the intensities of at least a pre-defined number of the pixels of the reference image differ from the intensities of the corresponding pixels of the target image.
-
12. The interface of claim 10, further comprising:
means for blending the target image with the reference image to generate a new reference image when less than a pre-defined number of the pixels of the reference image differ from the corresponding pixels of the target image.
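The comparison-and-blending scheme of claims 10-12 amounts to classic background subtraction with an adaptive reference image. A minimal sketch, with assumed threshold constants and plain nested-list grayscale images (all names and values are the editor's, not the patent's):

```python
DIFF_THRESHOLD = 30      # per-pixel intensity difference counted as "changed" (assumed)
MIN_CHANGED_PIXELS = 4   # changed-pixel count that signals a person (assumed)
BLEND = 0.1              # fraction of the target folded into the reference (assumed)

def changed_pixels(reference, target):
    """Coordinates of pixels whose intensity differs beyond the threshold."""
    return [(r, c)
            for r, row in enumerate(reference)
            for c, ref_px in enumerate(row)
            if abs(ref_px - target[r][c]) > DIFF_THRESHOLD]

def detect_or_update(reference, target):
    """Return (person_detected, new_reference)."""
    changed = changed_pixels(reference, target)
    if len(changed) >= MIN_CHANGED_PIXELS:
        return True, reference  # enough pixels differ: a person is present
    # Few pixels changed: blend the target into the reference (claim 12),
    # so the model tracks gradual lighting changes and moved objects.
    new_ref = [[(1 - BLEND) * ref_px + BLEND * target[r][c]
                for c, ref_px in enumerate(row)]
               for r, row in enumerate(reference)]
    return False, new_ref
```

Grouping the returned coordinates into connected regions, as claim 10 recites, would then isolate the group of adjacent pixels representing the person.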
-
13. The interface of claim 1, further comprising:
a second camera spaced apart from the other camera, the second camera providing video input representing the region as a second sequence of images.
-
14. The interface of claim 13, further comprising:
means for determining an approximate three-dimensional position of the person in the region from the sequences of images of the cameras.
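Claims 13-14 recover an approximate three-dimensional position from two spaced-apart cameras. A textbook rectified-stereo sketch (assumed camera parameters and function name; the patent does not specify this formulation): depth follows from disparity as Z = f·B/d.

```python
def stereo_position(x_left, x_right, y, focal_px, baseline_m):
    """Approximate (X, Y, Z) in metres from a matched point in a rectified pair.

    x_left / x_right: horizontal pixel coordinate of the same scene point in
    the left / right image; y: shared vertical pixel coordinate.
    """
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("point must have positive disparity")
    z = focal_px * baseline_m / disparity   # depth from disparity
    x = x_left * z / focal_px               # back-project to metres
    y_m = y * z / focal_px
    return x, y_m, z
```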
-
15. The interface of claim 1, further comprising:
a second camera spaced apart from the other camera, the second camera providing video input representing a different region of the physical environment than the region represented by the video input produced by the other camera, the cameras concurrently providing video input representing the regions of the physical environment.
-
16. The interface of claim 1 wherein the means for detecting detects respective decreasing distances of individual people of a plurality of persons in the region from the sequence of images.
-
17. The interface of claim 16, wherein the audiovisual means communicates information in turn with each individual of the plurality of detected persons such that the interface engages in informational communication with the plurality of detected persons.
-
18. The interface of claim 16 wherein the sequence of images includes a reference image and a target image, each image being defined by pixels, the pixels of the reference image having a one-to-one correspondence to the pixels of the target image;
and further comprising:
means for comparing the reference image to the target image to identify a plurality of groups of adjacent pixels in the reference image that are different from the corresponding pixels in the target image, each identified group of pixels representing one of the plurality of detected persons.
-
19. The interface of claim 18, further comprising:
means for determining a distribution of colors in each of the group of pixels, each color distribution uniquely identifying one of the plurality of persons.
-
20. The interface of claim 19, further comprising:
means for concurrently tracking respective movements of each person independently in the region by the color distribution that uniquely identifies that person.
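Claims 19-20 distinguish and track multiple people by a per-person color distribution. One common way to realize this (an editor's sketch with assumed bin counts and helper names, not necessarily the patent's method) is a coarse hue histogram per detected blob, matched frame to frame by histogram intersection:

```python
def color_histogram(pixels, bins=8):
    """Normalized histogram of hue values in [0, 360)."""
    hist = [0] * bins
    for hue in pixels:
        hist[int(hue * bins / 360) % bins] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def match_person(blob_hist, known_people):
    """Name of the tracked person whose stored histogram matches best."""
    return max(known_people,
               key=lambda name: histogram_intersection(blob_hist,
                                                       known_people[name]))
```

Because each person's clothing yields a distinctive distribution, the histograms serve as the unique identifiers that let the tracker follow each person independently.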
-
21. The interface of claim 1 wherein the region is partitioned into a plurality of sub-regions in which movement of the person can be independently detected and tracked, said sub-regions being organized according to distance from the interface.
-
22. The interface of claim 21, wherein the audiovisual means directs the communication primarily to persons that move into a predetermined one of the sub-regions.
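The distance-ordered sub-regions of claims 21-22 can be sketched as a simple zone table. The boundaries and labels below are illustrative assumptions; the claims leave the partitioning unspecified:

```python
# Sub-regions ordered by distance from the kiosk (metres; assumed values).
ZONES = [(0.0, 1.0, "engage"),            # close enough for conversation
         (1.0, 3.0, "greet"),             # approaching; attract attention
         (3.0, float("inf"), "observe")]  # track only

def zone_for(distance_m):
    """Map a person's distance from the interface to a sub-region label."""
    for near, far, label in ZONES:
        if near <= distance_m < far:
            return label
    raise ValueError("distance must be non-negative")
```

Directing communication "primarily to persons that move into a predetermined one of the sub-regions" then reduces to acting only when `zone_for(...)` returns the chosen label.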
-
23. A computerized interface for human-like informational conversant communications with people comprising:
a camera providing video input representing a region of an arbitrary physical environment as a sequence of images organized with respect to distance from the interface; and
means responsive to the video input for determining intent of a person in the region, said means including a behavior module to aid in active engagement of human-like informational conversant communications with a person detected to be changing in distance from the interface, said means rendering audio and video information directed at the detected person in the region from the sequence of images in a manner that engages the person in a conversation, said information changing in content as a function of change in detected distance between the interface and the detected person. - View Dependent Claims (24)
24. The interface of claim 23, further comprising:
means for determining a distance position and an orientation of the detected person, the distance position and orientation of the person being relative to a position of the camera.
-
25. A kiosk, comprising:
means for providing video input representing a region of an arbitrary physical environment to detect a presence of an object in the region;
computer means responsive to the input for determining whether position of the detected object in the region is changing in distance relative to a position of the kiosk and hence determining intent of the detected object; and
means for rendering audio and video information directed at the detected object so as to engage the detected object in active informational conversant communication with the kiosk, a content of the rendered audio and video information depending upon the determined distance of the detected object relative to the kiosk.
-
26. A computerized method for interacting with people, comprising the steps of:
representing a region of an arbitrary physical environment as a sequence of images organized with respect to distance from the interface;
detecting a presence of and changing distance between a person and the interface in the region from the sequence of images to determine intent of the person and hence identify the person as a target for interaction; and
rendering visual and audio information directed at the detected person in a manner that engages the detected person in a conversation, said rendering using a behavior module to produce information which communicates in a manner that presents human-like informational conversant communications with the detected person, the information changing as a function of change in detected distance between the person and the interface. - View Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43)
27. The method of claim 26, further comprising the steps of:
displaying an image of a head including eyes and a mouth with lips; and
directing an orientation of the head and a gaze of the eyes at the detected person while rendering the audio information synchronized to lip movement so that the head appears to look at and talk to the person.
-
29. The method of claim 26 wherein the region is measured by a camera;
and further comprising the step of:
determining a position of the person in the region relative to a position of the camera.
-
30. The method of claim 29, wherein the step of rendering audio and visual information includes information having a content dependent upon the determined position of the person in the region.
-
31. The method of claim 26, further comprising the step of:
determining a position of the person in the region relative to a pre-defined position of an object in the physical environment.
-
32. The method of claim 31, wherein the step of rendering audio and visual information includes information having a content dependent upon the determined position of the person in the region.
-
33. The method of claim 26 wherein the sequence of images includes a reference image and a target image, each image being defined by pixels, the pixels of the reference image having a one-to-one spatial correspondence to the pixels of the target image;
and further comprising the steps of:
comparing the reference image to the target image to identify a group of adjacent pixels in the reference image that are different from the corresponding pixels in the target image, the identified group of pixels representing the person.
-
34. The method of claim 33 wherein the step of comparing compares an intensity of each pixel of the reference image to an intensity of each corresponding pixel in the target image and the step of detecting detects the presence of the person in the region when the intensities of at least a pre-defined number of the pixels of the reference image differ from the intensities of the corresponding pixels of the target image.
-
35. The method of claim 33, further comprising the step of:
blending the target image with the reference image to generate a new reference image when less than a pre-defined number of the pixels of the reference image differ from the corresponding pixels of the target image.
-
36. The method of claim 26, further comprising the steps of:
measuring the region as a second sequence of images; and
determining from the sequences of images an approximate three-dimensional position of the person in the region.
-
37. The method of claim 26, further comprising the step of:
detecting a plurality of persons in the region in the sequence of images.
-
38. The method of claim 37, wherein the step of rendering audio and visual information includes directing the information, in turn, at each detected person in the region.
-
39. The method of claim 37 wherein the sequence of images includes a reference image and a target image, each image being defined by pixels, the pixels of the reference image having a one-to-one correspondence to the pixels of the target image;
and further comprising the step of:
comparing the reference image to the target image to identify a plurality of groups of adjacent pixels in the reference image that are different from the corresponding pixels in the target image, each identified group of pixels representing one of the plurality of detected persons.
-
40. The method of claim 39, further comprising the step of:
determining a distribution of colors in each of the group of pixels, each color distribution uniquely identifying one of the plurality of persons.
-
41. The method of claim 40, further comprising the step of:
concurrently tracking movements of each person independently in the region by the color distribution that uniquely identifies that person.
-
42. The method of claim 26, further comprising the steps of:
determining a velocity of the person moving in the region, the velocity including a direction of travel; and
wherein the step of rendering audio and visual information includes information having a content dependent on the direction of travel of the person.
-
43. The method of claim 26, further comprising the steps of:
determining a velocity of the person moving in the region; and
determining whether to render audio and video information based on the velocity of the person.
-
44. A method for interacting with an animate object, comprising the steps of:
measuring a region of an arbitrary physical environment to detect a presence of an animate object in the region;
determining a change in relative distance of the detected object in the region; and
rendering audio and video information directed at the detected object using a behavior module so as to engage the detected object in informational conversant communication, content of the rendered audio and video information depending upon the determined change in relative distance of the detected object.
-
Specification