Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images
First Claim
1. A method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene;
predetermining a fixed framework of the scene as to the boundaries of the scene and selected fixed points of reference within the scene, the fixed framework and fixed reference points potentially but not necessarily coinciding with landmark objects in the scene, if any such landmark objects exist;
creating from the captured video, in consideration of the predetermined fixed framework, a full three-dimensional model of the scene, the three-dimensional model being distinguished in that three-dimensional occurrences in the scene are incorporated into the model even if they have not been pre-identified to the model;
producing from the three-dimensional model a video representation of the scene that is in accordance with the desired perspective of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires;
wherein the representation is called immersive telepresence because, since the scene is presented as the viewer desires, it appears to the viewer that the viewer is immersed in the scene;
wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video.
Abstract
Immersive video, or television, images of a real-world scene are synthesized, including on demand and/or in real time, linked to any of a particular perspective on the scene, or to an object or event in the scene. Synthesis is in accordance with user-specified parameters of presentation, including presentations that are any of panoramic, magnified, stereoscopic, or possessed of motional parallax. The image synthesis is based on computerized video processing--called "hypermosaicing"--of multiple video perspectives on the scene. In hypermosaicing, a knowledge database contains information about the scene, for example scene geometry, shapes and behaviors of objects in the scene, and/or internal and/or external camera calibration models. Multiple video cameras, each at a different spatial location, produce multiple two-dimensional video images of the scene. A viewer/user specifies one or more viewing criteria at a viewer interface. A computer, typically one or more engineering-workstation-class computers or better, includes in software and/or hardware (i) a video data analyzer for detecting and for tracking scene objects and their locations, (ii) an environmental model builder combining multiple scene images to build a 3D dynamic model recording scene objects and their instant spatial locations, (iii) a viewer criterion interpreter, and (iv) a visualizer for generating from the 3D model, in accordance with the viewing criteria, one or more selectively synthesized 2D video image(s) of the scene.
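The four computer components named in the abstract (video data analyzer, environmental model builder, viewer criterion interpreter, visualizer) can be pictured as a simple dataflow. The sketch below is a hypothetical illustration of that flow only; every class and function name, the stubbed object detection, and the string-based "rendering" are assumptions made for this example, not the patented implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneModel:
    """3-D dynamic environmental model: scene objects and their instant locations."""
    objects: dict = field(default_factory=dict)   # name -> (x, y, z)

def analyze(frames):
    """Video data analyzer: detect and track scene objects (detection stubbed)."""
    return {"ball": (1.0, 2.0, 0.5)}

def build_model(detections, knowledge_db):
    """Environmental model builder: fold detections into the 3-D model."""
    model = SceneModel()
    model.objects.update(detections)
    return model

def interpret(criterion, model):
    """Viewer criterion interpreter: map a viewer criterion to view parameters."""
    return {"look_at": model.objects[criterion["track"]]}

def visualize(model, view_params):
    """Visualizer: synthesize a 2-D view from the 3-D model (rendering stubbed)."""
    return "2D view centered on %s" % (view_params["look_at"],)

frames = ["camera-1 frame", "camera-2 frame"]   # placeholder multi-camera input
knowledge_db = {"geometry": "stadium"}          # placeholder scene knowledge
model = build_model(analyze(frames), knowledge_db)
image = visualize(model, interpret({"track": "ball"}, model))
```

The point of the sketch is the ordering: analysis and model building happen independently of any viewer, and the viewer's criterion only selects how the already-built 3-D model is rendered.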
46 Claims
1. A method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene;
predetermining a fixed framework of the scene as to the boundaries of the scene and selected fixed points of reference within the scene, the fixed framework and fixed reference points potentially but not necessarily coinciding with landmark objects in the scene, if any such landmark objects exist;
creating from the captured video, in consideration of the predetermined fixed framework, a full three-dimensional model of the scene, the three-dimensional model being distinguished in that three-dimensional occurrences in the scene are incorporated into the model even if they have not been pre-identified to the model;
producing from the three-dimensional model a video representation of the scene that is in accordance with the desired perspective of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires;
wherein the representation is called immersive telepresence because, since the scene is presented as the viewer desires, it appears to the viewer that the viewer is immersed in the scene;
wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video.
- View Dependent Claims (2)
3. A method of immersive telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from each of a multiplicity of different spatial perspectives on the scene;
creating from the captured video a full three-dimensional model of the scene;
producing from the three-dimensional model a video representation of the scene that is in accordance with the desired perspective of a viewer of the scene, thus immersive telepresence because the viewer can view the scene as if immersed therein, and as if present at the scene, all in accordance with his/her desires;
wherein the representation is called immersive telepresence because, since the scene is presented as the viewer desires, it appears to the viewer that the viewer is immersed in the scene;
wherein the viewer-desired perspective on the scene, and the video representation in accordance with this viewer-desired perspective, need not be in accordance with any of the captured video;
wherein the video representation is in accordance with the position and direction of the viewer's eyes and head, and exhibits motional parallax;
wherein motional parallax is, normally and conventionally, a three-dimensional effect in which different views of the scene are produced as the viewer changes position, even should the viewer have but one eye, making the viewer's brain comprehend that the viewed scene is three-dimensional.
4. A method of telepresence, being a video representation of being at a real-world scene that is other than the instant scene of the viewer, the method comprising:
capturing video of a real-world scene from a multiplicity of different spatial perspectives on the scene;
creating from the captured video a full three-dimensional model of the scene;
producing from the three-dimensional model a video representation of the scene responsive to a predetermined criterion selected from among criteria including an object in the scene and an event in the scene, thus interactive telepresence because the presentation to the viewer is interactive in response to the criterion;
wherein the video presentation of the scene in accordance with the criterion need not be in accordance with any of the captured video.
- View Dependent Claims (5, 6, 7)
8. An immersive video system for presenting video images of a real-world scene in accordance with a predetermined criterion, the system comprising:
a knowledge database containing information about the spatial framework of the real-world scene;
multiple video sources, each at a different spatial location, for producing multiple two-dimensional video images of a real-world scene, each at a different spatial perspective;
a viewer interface at which a prospective viewer of the scene may specify a criterion relative to which the viewer wishes to view the scene;
a computer, receiving the multiple two-dimensional video images of the scene from the multiple video sources and the viewer-specified criterion from the viewer interface, the computer operating in accordance with the spatial framework of the knowledge database and including a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene, within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects as recorded in the dynamic environmental model, in order to produce parameters of perspective on the scene, and a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene; and
a video display, receiving the particular two-dimensional video image of the scene from the computer, for displaying this particular two-dimensional video image of the real-world scene to the viewer as that particular view of the scene which satisfies the viewer-specified criterion.
9. An immersive video system for presenting video images of a real-world scene in accordance with a predetermined criterion, the system comprising:
multiple video sources, each at a different spatial location, for producing multiple two-dimensional video images of a real-world scene, each at a different spatial perspective;
a knowledge database containing information about the real-world scene regarding at least two of the geometry of the real-world scene, potential shapes of objects in the real-world scene, dynamic behaviors of objects in the real-world scene, and a camera calibration model;
a viewer interface at which a prospective viewer of the scene may specify a criterion relative to which the viewer wishes to view the scene;
a computer, receiving the multiple two-dimensional video images of the scene from the multiple video sources and the viewer-specified criterion from the viewer interface, the computer operating in consideration of the knowledge database and including a video data analyzer for detecting and for tracking objects of potential interest and their locations in the scene, an environmental model builder for combining multiple individual video images of the scene to build a three-dimensional dynamic model of the environment of the scene, within which three-dimensional dynamic environmental model potential objects of interest in the scene are recorded along with their instant spatial locations, a viewer criterion interpreter for correlating the viewer-specified criterion with the objects of interest in the scene, and with the spatial locations of these objects as recorded in the dynamic environmental model, in order to produce parameters of perspective on the scene, and a visualizer for generating, from the three-dimensional dynamic environmental model in accordance with the parameters of perspective, a particular two-dimensional video image of the scene; and
a video display, receiving the particular two-dimensional video image of the scene from the computer, for displaying this particular two-dimensional video image of the real-world scene to the viewer as that particular view of the scene which satisfies the viewer-specified criterion.
- View Dependent Claims (10, 11)
12. An improvement to the method of video mosaicing, which video mosaicing method uses video frames from the video stream of a single video camera panning a scene, or, equivalently, the video frames from each of multiple video cameras each of which images only a part of the scene, in order to produce a larger video scene image than any single video frame from any single video camera,
the improved method being directed to generating a spatial-temporally coherent and consistent three-dimensional video mosaic from multiple individual video streams arising from each of multiple video cameras, each of which is imaging at least a part of the scene from a perspective that is at least in part different from other ones of the multiple video cameras, the improved method being called video hypermosaicing, the video hypermosaicing method being applied to scenes where at least a portion of the scene from the perspective of at least one camera is static, which limitation is only to say that absolutely everything in every part of the scene as imaged by each of the multiple video cameras cannot be simultaneously in dynamic motion, the video hypermosaicing comprising:
accumulating and storing, as a priori information, the static portion of the scene as a CSG/CAD model of the scene; and
processing, in consideration of the CSG/CAD model of the scene, only the dynamic portions of the scene from the multiple video streams of the multiple video cameras so as to develop a spatial-temporally coherent and consistent three-dimensional video mosaic of the scene;
wherein the processing of static portions of the scene is bypassed;
wherein bypassing processing of the static portions of the scene reduces the complexity of processing the scene.
- View Dependent Claims (13, 14)
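The central idea of claim 12 is that the static portion of the scene is stored a priori (as a CSG/CAD model) so that only the dynamic portions need per-frame processing. The snippet below is a minimal sketch of one way such a bypass could look, assuming a simple per-pixel difference against a precomputed static background; the patent's actual mechanism is the CSG/CAD model, not this stub.

```python
def dynamic_pixels(frame, background, threshold=10):
    """Return coordinates of pixels that differ from the static background model.

    Only these pixels would be forwarded for 3-D processing; pixels matching
    the stored static model are bypassed, reducing processing complexity.
    """
    dyn = []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if abs(value - background[y][x]) > threshold:
                dyn.append((x, y))
    return dyn

background = [[0, 0, 0], [0, 0, 0]]    # a priori static-scene model (stub)
frame      = [[0, 120, 0], [0, 0, 0]]  # one moving object appears at (1, 0)
moving = dynamic_pixels(frame, background)
```

With this separation, a fully static frame costs only the comparison pass, and per-frame 3-D reconstruction work scales with the number of moving pixels rather than the whole image.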
15. A method of composing arbitrary new video vistas on a scene from multiple video streams of the scene derived from different spatial perspectives on the scene, the method called video hypermosaicing because it transcends the generation of a two-dimensional video mosaic by video mosaicing and instead generates a spatial-temporally coherent and consistent three-dimensional video mosaic from multiple individual video streams arising from each of multiple video cameras, each of which is imaging at least a part of the scene from a perspective that is at least in part different from other ones of the multiple video cameras, the video hypermosaicing composing method comprising:
receiving multiple video streams on a scene, each of which streams comprises multiple pixels in a vista coordinate system V: {(x_v, y_v, z_v)};
finding, for each pixel (x_v, y_v) with depth d_v(x_v, y_v) on the vista, the corresponding pixel point (x_w, y_w, z_w) in a model, or world, coordinate system W: {(x_w, y_w, z_w)} by using the depth value of the pixel, to wit [x_w y_w z_w 1]^T = M_v · [x_v y_v z_v 1]^T;
projecting the found corresponding pixel point onto each of a plurality of camera image planes c of a camera coordinate system C: {(x_c, y_c, z_c)} by [x_c y_c z_c 1]^T = M_c^-1 · [x_w y_w z_w 1]^T, where M_c is the 4x4 homogeneous transformation matrix representing the transformation between c and the world coordinate system, in order to produce camera coordinate pixel points (x_c, y_c, z_c) ∀c;
testing said camera coordinate pixel points (x_c, y_c, z_c) ∀c for occlusion from view by comparing z_c with the depth value for the found corresponding pixel point, so as to produce several candidates that could be used for the pixel of the vista;
evaluating each candidate view c_v by criteria, to wit, first computing the angle A subtended at the object point (x_w, y_w, z_w) by the candidate camera position and the vista position, by use of the cosine formula A = arccos((b^2 + c^2 - a^2)/(2bc)), and then computing the distance of the object point (x_w, y_w, z_w) from camera window coordinate (x_c, y_c), which is the depth value d_c(x_c, y_c);
evaluating each candidate view by an evaluation criterion e_cv = f(A, B·d_c(x_c, y_c)), where B is a small number; and
repeating the receiving, the finding, the projecting, the testing and the evaluating for an instance of time of each video frame, assuming a stationary viewpoint.
- View Dependent Claims (16)
17. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three-dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging, in multiple video cameras each at a different spatial location, multiple two-dimensional video images of a real-world scene, each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene into a three-dimensional model of the scene;
receiving in the computer, from a prospective viewer of the scene, a viewer-specified criterion relative to which the viewer wishes to view the scene;
synthesizing, in the computer from the three-dimensional model in accordance with the received viewer criterion, a stereoscopic two-dimensional image that is without exact correspondence to any of the images of the real-world scene that are imaged by any of the multiple video cameras; and
displaying in a video display the particular stereoscopic two-dimensional image of the real-world scene to the viewer.
- View Dependent Claims (18, 19)
20. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three-dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging, in multiple video cameras each at a different spatial location, multiple two-dimensional video images of a real-world scene, each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene so as to generate a three-dimensional model of the scene in which model objects in the scene are identified;
receiving in the computer, from a prospective viewer of the scene, a viewer-specified criterion of a selected object in the scene that the viewer wishes to particularly view;
synthesizing, in the computer from the three-dimensional model in accordance with the received viewer criterion, a particular stereoscopic two-dimensional image of the selected object in the scene; and
displaying to the viewer in a video display the particular stereoscopic image of the scene showing the viewer-selected object.
- View Dependent Claims (21, 22, 23)
24. A method of presenting a particular stereoscopic two-dimensional video image of a real-world three-dimensional scene to a viewer in accordance with a criterion supplied by the viewer, the method comprising:
imaging, in multiple video cameras each at a different spatial location, multiple two-dimensional video images of a real-world scene, each at a different spatial perspective;
combining in a computer the multiple two-dimensional images of the scene so as to generate a three-dimensional model of the scene in which model events in the scene are identified;
receiving in the computer, from a prospective viewer of the scene, a viewer-specified criterion of a selected event in the scene that the viewer wishes to particularly view;
synthesizing, in the computer from the three-dimensional model in accordance with the received viewer criterion, a particular stereoscopic two-dimensional image of the selected event in the scene; and
displaying to the viewer in a video display the particular stereoscopic image of the scene showing the viewer-selected event.
- View Dependent Claims (25)
26. A method of synthesizing a stereoscopic virtual video image from real video images obtained by multiple real video cameras, the method comprising:
storing in a video image database the real two-dimensional video images of a scene from each of a multiplicity of real video cameras;
creating in a computer, from the multiplicity of stored two-dimensional video images, a three-dimensional video database containing a three-dimensional video image of the scene, the three-dimensional video database being characterized in that the three-dimensional location of objects in the scene is within the database; and
synthesizing a two-dimensional stereoscopic virtual video image of the scene from the three-dimensional video database;
wherein the synthesizing is facilitated because the three-dimensional spatial positions of all objects depicted in the stereoscopic virtual video image are known from their positions within the three-dimensional video database, it being a mathematical transform to present a two-dimensional stereoscopic video image when the three-dimensional positions of the objects depicted in the image are known.
- View Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35)
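The "mathematical transform" that claim 26 alludes to can be illustrated with a pinhole model: once an object's 3-D position is known from the three-dimensional video database, left- and right-eye views are simply two projections from horizontally offset virtual eye positions. The projection model, object position, and interpupillary distance below are illustrative assumptions for this sketch.

```python
def project(point, eye_x, focal=1.0):
    """Pinhole projection of a 3-D point from a virtual camera at (eye_x, 0, 0)."""
    x, y, z = point
    return (focal * (x - eye_x) / z, focal * y / z)

objects = {"player": (0.0, 1.0, 4.0)}   # position from the 3-D video database (stub)
ipd = 0.065                             # assumed interpupillary distance, metres

# Two virtual eyes, offset by half the interpupillary distance each way.
left  = {name: project(p, -ipd / 2) for name, p in objects.items()}
right = {name: project(p, +ipd / 2) for name, p in objects.items()}

# Horizontal disparity encodes depth: focal * ipd / z for this model.
disparity = left["player"][0] - right["player"][0]
```

Because every depicted object's 3-D position is in the database, no correspondence search between real camera images is needed; the stereo pair falls out of two matrix-style projections.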
36. A computerized method for presenting video images including a real-world scene, the method comprising:
constructing a three-dimensional environmental model containing both static and dynamic elements of the real-world scene;
producing multiple video streams showing two-dimensional images of the real-world scene from differing spatial positions;
identifying static and dynamic portions of each of the multiple video streams;
warping at least some of the corresponding portions of the multiple video streams onto the three-dimensional environmental model as reconstructed three-dimensional objects, wherein at least some image portions that are represented two-dimensionally in a single video stream assume a three-dimensional representation; and
synthesizing, from the three-dimensional environmental model containing the three-dimensional objects, a two-dimensional video image that is without equivalence to any of the two-dimensional images that are within the multiple video streams.
- View Dependent Claims (37, 38, 39, 40, 41, 42, 43, 44, 45)
46. A computer system, receiving multiple video images of views on a real-world scene, for synthesizing a video image of the scene which synthesized image is not identical to any of the multiple received video images, the system comprising:
an information base containing a geometry of the real-world scene, shapes and dynamic behaviors expected from moving objects in the scene, plus internal and external camera calibration models on the scene;
video data analyzer means for detecting and for tracking objects of potential interest in the scene, and the locations of these objects;
three-dimensional environmental model builder means for recording the detected and tracked objects at their proper locations in a three-dimensional model of the scene, the recording being in consideration of the information base;
viewer interface means, responsive to a viewer of the scene, for receiving a viewer selection of a desired view on the scene, which desired view need not be identical to any view that is within any of the multiple received video images; and
visualizer means for generating from the three-dimensional model of the scene, in accordance with the received desired view, a video image of the scene that shows the scene from the desired view.
Specification