Method for generating spatial-temporally consistent depth map sequences based on convolution neural networks

Abstract
A method for generating spatial-temporally consistent depth map sequences based on convolutional neural networks, for the 2D-to-3D conversion of film and television works, includes steps of: 1) collecting a training set, wherein each training sample thereof includes a sequence of continuous RGB images and a corresponding depth map sequence; 2) processing each image sequence in the training set with spatial-temporal consistency superpixel segmentation, and establishing a spatial similarity matrix and a temporal similarity matrix; 3) establishing the convolution neural network, including a single-superpixel depth regression network and a spatial-temporal consistency conditional random field loss layer; 4) training the convolution neural network; and 5) recovering the depth maps of an RGB image sequence of unknown depth through forward propagation with the trained convolution neural network. The method avoids the strong dependence of clue-based depth recovery methods on scenario assumptions, as well as the inter-frame discontinuity between depth maps generated by conventional neural networks.
5 Claims
 1. (canceled)
2. A method for generating spatial-temporally consistent depth map sequences based on convolution neural networks, comprising steps of:
1) collecting a training set, wherein each training sample of the training set comprises a continuous RGB (red, green, blue) image sequence of m frames, and a corresponding depth map sequence;
2) processing each image sequence in the training set with spatial-temporal consistency superpixel segmentation, and establishing a spatial similarity matrix S^{(s)} and a temporal similarity matrix S^{(t)};
3) building a convolution neural network structure, wherein the convolution neural network comprises a single-superpixel depth regression network with a parameter W, and a spatial-temporal consistency conditional random field loss layer with a parameter α;
4) training the convolution neural network established in the step 3) with the continuous RGB image sequence and the corresponding depth map sequence in the training set, so as to obtain the parameter W and the parameter α; and
5) recovering a depth map sequence of a depth-unknown RGB image sequence through forward propagation with the trained convolution neural network;
wherein the step 2) specifically comprises steps of:
(2.1) processing the continuous RGB image sequence in the training set with the spatial-temporal consistency superpixel segmentation, wherein an input sequence is marked as I=[I_{1}, . . . , I_{m}], where I_{t} is the t-th frame of the m frames in total; the m frames are respectively divided into n_{1}, . . . , n_{m} superpixels by the spatial-temporal consistency superpixel segmentation, while a corresponding relation between all superpixels in a later frame and the superpixels corresponding to a same object in a former frame is generated; the whole image sequence comprises n=Σ_{t=1}^{m} n_{t} superpixels; marking a real depth at a gravity center of each superpixel p as d_{p}, and defining a ground-truth depth vector of the n superpixels as d=[d_{1}; . . . ; d_{n}];
(2.2) establishing the spatial similarity matrix S^{(s)} of the n superpixels, wherein S^{(s)} is an n×n matrix; S_{pq}^{(s)} represents a similarity relationship of a superpixel p and a superpixel q in one frame, where:

S_{pq}^{(s)}=exp(−∥c_{p}−c_{q}∥^{2}/γ), if the superpixels p and q are adjacent in one frame; S_{pq}^{(s)}=0, otherwise.

Dependent claims: 3, 4, 5.
Specification
This is a U.S. National Stage application under 35 U.S.C. § 371 of the International Application PCT/CN2016/112811, filed Dec. 29, 2016.
The present invention relates to the field of stereoscopic videos in computer vision, and more particularly to a method for generating spatial-temporally consistent depth map sequences based on convolution neural networks.
The basic principle of stereoscopic video is to superimpose two images with horizontal parallax; viewers view them with the left and right eyes respectively through stereoscopic glasses, which generates stereoscopic perception. Stereoscopic videos render an immersive three-dimensional view and are deeply welcomed by consumers. However, as the popularity of 3D video hardware continues to rise, a shortage of 3D video content has followed. Direct shooting with a 3D camera is expensive and difficult to post-process, and is usually used only in large-budget movies. Therefore, 2D-to-3D conversion technology for film and television works is an effective way to solve the shortage of film sources; it not only greatly expands the subject matter and the number of three-dimensional films, but also allows some classic films and television works to return to the screen.
Since the left and right parallax in a stereoscopic video is directly related to the depth of each pixel, obtaining the depth map corresponding to each frame of the video is the key to 2D-to-3D conversion technology. Depth maps can be obtained by manually annotating depth values for each frame of the video, but at a very high cost. There are also semi-automatic depth map generation methods, wherein the depth maps of some key frames in the video are drawn manually, and a computer propagates these depth maps to adjacent frames by propagation algorithms. Although these methods save time to some extent, they still require heavy manual operation when dealing with large-scale 2D-to-3D conversion of films and television works.
In contrast, fully automatic depth recovery methods save the most labor cost. Some algorithms recover depth maps with specific rules by using depth cues such as motion, focus, occlusion, or shadow, but are usually applicable only to specific scenes. For example, the structure-from-motion (SfM) approach is able to recover the depth of a static scene shot by a moving camera, according to the cue that nearer objects display larger relative movement between consecutive frames than objects farther away; however, this type of method is not applicable when the object moves or the camera is still. Focus-based depth recovery methods can restore the depth of images with a shallow depth of field, but are less effective in the case of a large depth of field. Movies and television works usually contain a variety of scenes, so such cue-based depth recovery methods are difficult to apply universally.
A convolutional neural network is a kind of deep neural network especially suited to images. It consists of basic units such as convolution layers, activation layers, pooling layers and loss layers, which can simulate complex functions from an image input x to a specific output y; this approach dominates many computer vision problems such as image classification and image segmentation. In recent years, some methods have adopted convolution neural networks for the depth estimation problem, using a large amount of data to learn the mapping from RGB image input to depth map output. Depth recovery based on convolution neural networks does not depend on any scene assumption, and thus has good universality and high recovery accuracy; it therefore has great potential in the 2D-to-3D conversion of video works. However, conventional methods are all based on single-image optimization in training, ignoring the continuity between frames. If such methods are used to restore the depth of an image sequence, the depth maps of adjacent frames will be significantly inconsistent, resulting in flashing of the synthesized virtual views, which seriously affects the user perception. In addition, inter-frame continuity also provides important clues for depth recovery, which are simply ignored in conventional methods.
For overcoming the defects of conventional technologies, an object of the present invention is to provide a method for generating spatial-temporally consistent depth map sequences based on a convolution neural network, wherein the continuity of the RGB images and the depth maps in the time domain is introduced into the convolution neural network, and multi-frame images are jointly optimized during training, so as to generate temporally continuous depth maps and improve the accuracy of the depth recovery.
Accordingly, in order to accomplish the above object, the present invention provides a method for generating spatial-temporally consistent depth map sequences based on convolution neural networks, comprising steps of:
1) collecting a training set, wherein each training sample of the training set comprises a continuous RGB (red, green, blue) image sequence of m frames, and a corresponding depth map sequence;
2) processing each image sequence in the training set with spatial-temporal consistency superpixel segmentation, and establishing a spatial similarity matrix S^{(s)} and a temporal similarity matrix S^{(t)};
3) building a convolution neural network structure, wherein the convolution neural network comprises a single-superpixel depth regression network with a parameter W, and a spatial-temporal consistency conditional random field loss layer with a parameter α;
4) training the convolution neural network established in the step 3) with the continuous RGB image sequences and the corresponding depth map sequences in the training set, so as to obtain the parameter W and the parameter α; and
5) recovering a depth map sequence of an RGB image sequence with unknown depth through forward propagation of the trained convolution neural network.
Preferably, the step 2) specifically comprises steps of:
(2.1) processing the continuous RGB image sequence in the training set with the spatial-temporal consistency superpixel segmentation, wherein an input sequence is marked as I=[I_{1}, . . . , I_{m}], where I_{t} is the t-th frame of the m frames in total; the m frames are respectively divided into n_{1}, . . . , n_{m} superpixels by the spatial-temporal consistency superpixel segmentation, while a corresponding relation between all superpixels in a later frame and the superpixels corresponding to a same object in a former frame is generated; the whole image sequence comprises n=Σ_{t=1}^{m} n_{t} superpixels; marking the real depth at the gravity center of each superpixel p as d_{p}, and defining a ground-truth depth vector of the n superpixels as d=[d_{1}; . . . ; d_{n}];
(2.2) establishing the spatial similarity matrix S^{(s)} of the n superpixels, wherein S^{(s)} is an n×n matrix; S_{pq}^{(s)} represents the similarity relationship of a superpixel p and a superpixel q in one frame:

S_{pq}^{(s)}=exp(−∥c_{p}−c_{q}∥^{2}/γ), if the superpixels p and q are adjacent in one frame; S_{pq}^{(s)}=0, otherwise;

wherein c_{p} and c_{q} are color histogram features of the superpixel p and the superpixel q, and γ is a manually determined parameter which is set to the median of ∥c_{p}−c_{q}∥^{2} of adjacent superpixels; and
(2.3) establishing the temporal similarity matrix S^{(t)} of the n superpixels, wherein S^{(t)} is an n×n matrix; S_{pq}^{(t)} represents the similarity relation of a superpixel p and a superpixel q in different frames:

S_{pq}^{(t)}=1, if the superpixels p and q correspond to a same object in adjacent frames; S_{pq}^{(t)}=0, otherwise;

wherein the corresponding relation between the superpixels of adjacent frames is obtained by the spatial-temporal consistency superpixel segmentation of the step (2.1).
Preferably, in the step 3), the convolution neural network comprises the single-superpixel depth regression network and the spatial-temporal consistency conditional random field loss layer; wherein the step 3) specifically comprises steps of:
(3.1) building the single-superpixel depth regression network, which comprises the first 31 layers of a VGG16 network, a superpixel pooling layer and three fully connected layers, wherein the superpixel pooling layer performs average pooling within the spatial extent of each superpixel; the input of the network is the continuous RGB images of the m frames, and the output of the network is an n-dimensional vector z=[z_{1}, . . . , z_{n}], in which the p-th element z_{p} is the estimated depth value, without considering any constraint, of the superpixel p of the continuous RGB image sequence after the spatial-temporal consistency superpixel segmentation; the convolution neural network parameter to be learned is W; and
(3.2) using the output z=[z_{1}, . . . , z_{n}] of the single-superpixel depth regression network obtained in the step (3.1), the real depth vector d=[d_{1}; . . . ; d_{n}] of the superpixels obtained in the step (2.1), the spatial similarity matrix S^{(s)} obtained in the step (2.2), and the temporal similarity matrix S^{(t)} obtained in the step (2.3) as the input of the spatial-temporal consistency conditional random field loss layer; wherein the conditional probability function of the spatial-temporal consistency conditional random field is:

P(d|I)=exp(−E(d,I))/Z(I)

wherein Z(I) is a normalization term, and the energy function E(d,I) is defined as:

E(d,I)=Σ_{p∈N}(d_{p}−z_{p})^{2}+Σ_{(p,q)∈S}α^{(s)}S_{pq}^{(s)}(d_{p}−d_{q})^{2}+Σ_{(p,q)∈T}α^{(t)}S_{pq}^{(t)}(d_{p}−d_{q})^{2}

wherein the first term Σ_{p∈N}(d_{p}−z_{p})^{2} of the energy function refers to the difference between the estimated value and the real value of a single superpixel; the second term Σ_{(p,q)∈S}α^{(s)}S_{pq}^{(s)}(d_{p}−d_{q})^{2} is a spatial consistency constraint, which means the depths will be similar if the superpixels p and q are adjacent in one frame with similar colors (i.e., S_{pq}^{(s)} is larger); the third term Σ_{(p,q)∈T}α^{(t)}S_{pq}^{(t)}(d_{p}−d_{q})^{2} is a temporal consistency constraint, which means the depths will be similar if the superpixels p and q refer to a same object in adjacent frames (i.e., S_{pq}^{(t)}=1); the matrix form of the energy function is:

E(d,I)=d^{T}Ld−2z^{T}d+z^{T}z

wherein:

L=I_{n}+D−M
M=α^{(s)}S^{(s)}+α^{(t)}S^{(t)}

wherein S^{(s)} is the spatial similarity matrix obtained in the step (2.2) and S^{(t)} is the temporal similarity matrix obtained in the step (2.3); α^{(s)} and α^{(t)} are two parameters to be learned; I_{n} is an n×n unit matrix; D is a diagonal matrix with D_{pp}=Σ_{q}M_{pq};
wherein, since the energy function is quadratic in d, the conditional probability has the closed form:

P(d|I)=(|L|^{1/2}/π^{n/2})exp(−d^{T}Ld+2z^{T}d−z^{T}L^{−1}z)

wherein L^{−1} is the inverse matrix of L, and |L| is the determinant of the matrix L;
therefore, the loss function is defined as the negative logarithm of the conditional probability function:

J=−log P(d|I)=d^{T}Ld−2z^{T}d+z^{T}L^{−1}z−(1/2)log|L|+(n/2)log π
Preferably, in the step 4), training the convolution neural network specifically comprises steps of:
(4.1) optimizing the parameters W, α^{(s)} and α^{(t)} with stochastic gradient descent, wherein for each iteration, the parameters are updated as:

W←W−lr·∂J/∂W, α^{(s)}←α^{(s)}−lr·∂J/∂α^{(s)}, α^{(t)}←α^{(t)}−lr·∂J/∂α^{(t)}

wherein lr is the learning rate;
(4.2) calculating the partial derivative of the loss function J with respect to the parameter W with:

∂J/∂W=(∂J/∂z)(∂z/∂W), where ∂J/∂z=2(L^{−1}z−d)

wherein ∂z/∂W is calculated with backward propagation of the convolution neural network layer by layer; and
(4.3) respectively calculating the partial derivatives of the loss function J with respect to the parameters α^{(s)} and α^{(t)} as:

∂J/∂α^{(s)}=d^{T}A^{(s)}d−z^{T}L^{−1}A^{(s)}L^{−1}z−(1/2)Tr(L^{−1}A^{(s)})
∂J/∂α^{(t)}=d^{T}A^{(t)}d−z^{T}L^{−1}A^{(t)}L^{−1}z−(1/2)Tr(L^{−1}A^{(t)})

wherein Tr(⋅) represents the trace of a matrix; A^{(s)} is the partial derivative of the matrix L with respect to α^{(s)} and A^{(t)} is the partial derivative of the matrix L with respect to α^{(t)}, which are calculated with:

A_{pq}^{(s)}=−S_{pq}^{(s)}+δ(p=q)Σ_{q}S_{pq}^{(s)}
A_{pq}^{(t)}=−S_{pq}^{(t)}+δ(p=q)Σ_{q}S_{pq}^{(t)}

wherein δ(p=q) equals 1 when p=q, and 0 otherwise.
Preferably, in the step 5), recovering the depth of the depth-unknown RGB image sequence specifically comprises steps of:
(5.1) processing the RGB image sequence with the spatial-temporal consistency superpixel segmentation, and calculating the spatial similarity matrix S^{(s)} and the temporal similarity matrix S^{(t)};
(5.2) applying forward propagation to the RGB image sequence with the trained convolution neural network, so as to obtain the single-superpixel network output z;
(5.3) calculating the depth output d̂=[d̂_{1}; . . . ; d̂_{n}] with the spatial-temporal consistency constraint by:

d̂=L^{−1}z

wherein the matrix L is calculated as in the step (3.2); d̂_{p} represents the estimated depth value of a superpixel p in the RGB image sequence; and
(5.4) applying d̂_{p} to the corresponding position of the corresponding frame of the superpixel p for obtaining the depth maps of the m frames.
Beneficial effects of the present invention are as follows.
First, in contrast to clue-based depth recovery methods, the present invention uses convolution neural networks to learn the function mapping from RGB images to depth maps, which is independent of particular assumptions about the scene.
Second, compared with single-frame-optimizing convolutional neural network depth recovery methods, the present invention adds the spatial-temporal consistency constraint and jointly optimizes multi-frame images by constructing the spatial-temporal consistency conditional random field loss layer, which is able to output spatial-temporally consistent depth maps, so as to avoid inter-frame jumps of the depth map.
Third, compared with conventional depth recovery methods based on convolution neural networks, the present invention adds the spatial-temporal consistency constraint, so as to improve the accuracy of the depth recovery.
The present invention has been compared with conventional methods, such as that of Eigen, David, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, 2014, through the public data set NYU Depth v2 and the inventors' data set LYB 3DTV. Results show that the method of the present invention can significantly improve the time-domain continuity of the recovered depth maps, and thereby improve the accuracy of the depth estimation.
Referring to the drawings, the present invention will be further illustrated.
According to a preferred embodiment, the method comprises steps of:
1) collecting a training set, wherein each training sample of the training set comprises a continuous RGB (red, green, blue) image sequence of m frames, and a corresponding depth map sequence;
2) using the method presented in Chang, Jason, et al., "A video representation using temporal superpixels," CVPR 2013, to process each image sequence in the training set with spatial-temporal consistency superpixel segmentation, and establishing a spatial similarity matrix S^{(s)} and a temporal similarity matrix S^{(t)};
3) building a convolution neural network structure, wherein the convolution neural network comprises a single-superpixel depth regression network with a parameter W, and a spatial-temporal consistency conditional random field loss layer with a parameter α;
4) training the convolution neural network established in the step 3) with the continuous RGB image sequences and the corresponding depth map sequences in the training set, so as to obtain the parameter W and the parameter α; and
5) recovering a depth map sequence of an RGB image sequence with unknown depth through forward propagation of the trained convolution neural network.
According to the embodiment, the step 2) specifically comprises steps of:
(2.1) processing the continuous RGB image sequence in the training set with the spatial-temporal consistency superpixel segmentation, wherein an input sequence is marked as I=[I_{1}, . . . , I_{m}], where I_{t} is the t-th frame of the m frames in total; the m frames are respectively divided into n_{1}, . . . , n_{m} superpixels by the spatial-temporal consistency superpixel segmentation, while a corresponding relation between all superpixels in a later frame and the superpixels corresponding to a same object in a former frame is generated; the whole image sequence comprises n=Σ_{t=1}^{m} n_{t} superpixels; marking the real depth at the gravity center of each superpixel p as d_{p}, and defining a ground-truth depth vector of the n superpixels as d=[d_{1}; . . . ; d_{n}];
(2.2) establishing the spatial similarity matrix S^{(s)} of the n superpixels, wherein S^{(s)} is an n×n matrix; S_{pq}^{(s)} represents the similarity relationship of a superpixel p and a superpixel q in one frame:

S_{pq}^{(s)}=exp(−∥c_{p}−c_{q}∥^{2}/γ), if the superpixels p and q are adjacent in one frame; S_{pq}^{(s)}=0, otherwise;

wherein c_{p} and c_{q} are color histogram features of the superpixel p and the superpixel q, and γ is a manually determined parameter which is set to the median of ∥c_{p}−c_{q}∥^{2} of adjacent superpixels; and
(2.3) establishing the temporal similarity matrix S^{(t)} of the n superpixels, wherein S^{(t)} is an n×n matrix; S_{pq}^{(t)} represents the similarity relation of a superpixel p and a superpixel q in different frames:

S_{pq}^{(t)}=1, if the superpixels p and q correspond to a same object in adjacent frames; S_{pq}^{(t)}=0, otherwise;

wherein the corresponding relation between the superpixels of adjacent frames is obtained by the spatial-temporal consistency superpixel segmentation of the step (2.1).
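The similarity matrices of the steps (2.2) and (2.3) can be sketched as follows. This is an illustrative NumPy sketch rather than part of the claimed method; the list-based `adjacency` and `correspondences` inputs are hypothetical simplifications of what the superpixel segmentation of the step (2.1) would produce.

```python
import numpy as np

def spatial_similarity(colors, adjacency):
    """S^(s): exp(-||c_p - c_q||^2 / gamma) for superpixel pairs that are
    adjacent in the same frame, 0 elsewhere; gamma is set to the median
    squared color distance over adjacent pairs, as in step (2.2)."""
    n = len(colors)
    sq_dists = [float(np.sum((colors[p] - colors[q]) ** 2)) for p, q in adjacency]
    gamma = np.median(sq_dists)
    S = np.zeros((n, n))
    for (p, q), sq in zip(adjacency, sq_dists):
        S[p, q] = S[q, p] = np.exp(-sq / gamma)
    return S

def temporal_similarity(n, correspondences):
    """S^(t): 1 for superpixel pairs covering the same object in adjacent
    frames, 0 elsewhere, as in step (2.3)."""
    S = np.zeros((n, n))
    for p, q in correspondences:
        S[p, q] = S[q, p] = 1.0
    return S
```

Here `colors` would hold the color histogram feature c_{p} of each superpixel, and `adjacency`/`correspondences` the pair lists produced by the segmentation; both matrices are symmetric by construction.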
Preferably, in the step 3), the convolution neural network comprises the single-superpixel depth regression network and the spatial-temporal consistency conditional random field loss layer; wherein the step 3) specifically comprises steps of:
(3.1) building the single-superpixel depth regression network, which comprises the first 31 layers of a VGG16 network, a superpixel pooling layer and three fully connected layers, wherein the superpixel pooling layer performs average pooling within the spatial extent of each superpixel; the input of the network is the continuous RGB images of the m frames, and the output of the network is an n-dimensional vector z=[z_{1}, . . . , z_{n}], in which the p-th element z_{p} is the estimated depth value, without considering any constraint, of the superpixel p of the continuous RGB image sequence after the spatial-temporal consistency superpixel segmentation; the convolution neural network parameter to be learned is W; and
(3.2) using the output z=[z_{1}, . . . , z_{n}] of the single-superpixel depth regression network obtained in the step (3.1), the real depth vector d=[d_{1}; . . . ; d_{n}] of the superpixels obtained in the step (2.1), the spatial similarity matrix S^{(s)} obtained in the step (2.2), and the temporal similarity matrix S^{(t)} obtained in the step (2.3) as the input of the spatial-temporal consistency conditional random field loss layer; wherein the conditional probability function of the spatial-temporal consistency conditional random field is:

P(d|I)=exp(−E(d,I))/Z(I)

wherein Z(I) is a normalization term, and the energy function E(d,I) is defined as:

E(d,I)=Σ_{p∈N}(d_{p}−z_{p})^{2}+Σ_{(p,q)∈S}α^{(s)}S_{pq}^{(s)}(d_{p}−d_{q})^{2}+Σ_{(p,q)∈T}α^{(t)}S_{pq}^{(t)}(d_{p}−d_{q})^{2}

wherein the first term Σ_{p∈N}(d_{p}−z_{p})^{2} of the energy function refers to the difference between the estimated value and the real value of a single superpixel; the second term Σ_{(p,q)∈S}α^{(s)}S_{pq}^{(s)}(d_{p}−d_{q})^{2} is a spatial consistency constraint, which means the depths will be similar if the superpixels p and q are adjacent in one frame with similar colors (i.e., S_{pq}^{(s)} is larger); the third term Σ_{(p,q)∈T}α^{(t)}S_{pq}^{(t)}(d_{p}−d_{q})^{2} is a temporal consistency constraint, which means the depths will be similar if the superpixels p and q refer to a same object in adjacent frames (i.e., S_{pq}^{(t)}=1); the matrix form of the energy function is:

E(d,I)=d^{T}Ld−2z^{T}d+z^{T}z

wherein:

L=I_{n}+D−M
M=α^{(s)}S^{(s)}+α^{(t)}S^{(t)}

wherein S^{(s)} is the spatial similarity matrix obtained in the step (2.2) and S^{(t)} is the temporal similarity matrix obtained in the step (2.3); α^{(s)} and α^{(t)} are two parameters to be learned; I_{n} is an n×n unit matrix; D is a diagonal matrix with D_{pp}=Σ_{q}M_{pq};
wherein, since the energy function is quadratic in d, the conditional probability has the closed form:

P(d|I)=(|L|^{1/2}/π^{n/2})exp(−d^{T}Ld+2z^{T}d−z^{T}L^{−1}z)

wherein L^{−1} is the inverse matrix of L, and |L| is the determinant of the matrix L;
therefore, the loss function is defined as the negative logarithm of the conditional probability function:

J=−log P(d|I)=d^{T}Ld−2z^{T}d+z^{T}L^{−1}z−(1/2)log|L|+(n/2)log π
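The matrix L and the loss layer of the step (3.2) can be sketched as follows. This is an illustrative dense NumPy sketch, not part of the claimed method; it assumes the similarity matrices are already built, and it uses the standard Gaussian-CRF negative log-likelihood J = d^T L d − 2 z^T d + z^T L^{-1} z − (1/2) log|L| + (n/2) log π.

```python
import numpy as np

def build_L(S_s, S_t, alpha_s, alpha_t):
    """L = I + D - M, with M = a_s * S^(s) + a_t * S^(t) and D the
    diagonal matrix holding the row sums of M."""
    M = alpha_s * S_s + alpha_t * S_t
    D = np.diag(M.sum(axis=1))
    return np.eye(S_s.shape[0]) + D - M

def crf_loss(L, z, d):
    """Negative log-likelihood of the spatial-temporal consistency CRF."""
    n = len(z)
    Linv_z = np.linalg.solve(L, z)      # L^{-1} z, without forming the inverse
    _, logdet = np.linalg.slogdet(L)    # log|L|
    return (d @ L @ d - 2 * z @ d + z @ Linv_z
            - 0.5 * logdet + 0.5 * n * np.log(np.pi))
```

Since M is non-negative and symmetric, D − M is positive semi-definite, so L = I + D − M is positive definite and the linear solve and log-determinant are well defined.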
Preferably, in the step 4), training the convolution neural network specifically comprises steps of:
(4.1) optimizing the parameters W, α^{(s)} and α^{(t)} with stochastic gradient descent, wherein for each iteration, the parameters are updated as:

W←W−lr·∂J/∂W, α^{(s)}←α^{(s)}−lr·∂J/∂α^{(s)}, α^{(t)}←α^{(t)}−lr·∂J/∂α^{(t)}

wherein lr is the learning rate;
(4.2) calculating the partial derivative of the loss function J with respect to the parameter W with:

∂J/∂W=(∂J/∂z)(∂z/∂W), where ∂J/∂z=2(L^{−1}z−d)

wherein ∂z/∂W is calculated with backward propagation of the convolution neural network layer by layer; and
(4.3) respectively calculating the partial derivatives of the loss function J with respect to the parameters α^{(s)} and α^{(t)} as:

∂J/∂α^{(s)}=d^{T}A^{(s)}d−z^{T}L^{−1}A^{(s)}L^{−1}z−(1/2)Tr(L^{−1}A^{(s)})
∂J/∂α^{(t)}=d^{T}A^{(t)}d−z^{T}L^{−1}A^{(t)}L^{−1}z−(1/2)Tr(L^{−1}A^{(t)})

wherein Tr(⋅) represents the trace of a matrix; A^{(s)} is the partial derivative of the matrix L with respect to α^{(s)} and A^{(t)} is the partial derivative of the matrix L with respect to α^{(t)}, which are calculated with:

A_{pq}^{(s)}=−S_{pq}^{(s)}+δ(p=q)Σ_{q}S_{pq}^{(s)}
A_{pq}^{(t)}=−S_{pq}^{(t)}+δ(p=q)Σ_{q}S_{pq}^{(t)}

wherein δ(p=q) equals 1 when p=q, and 0 otherwise.
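The analytic gradient of the step (4.3) can be verified against a finite-difference approximation of the loss. The sketch below is illustrative only (it assumes the Gaussian-CRF loss stated above and recomputes L inside each function for clarity); it is not the patented implementation.

```python
import numpy as np

def loss_J(S_s, S_t, a_s, a_t, z, d):
    """J = d^T L d - 2 z^T d + z^T L^{-1} z - 1/2 log|L| + n/2 log(pi)."""
    n = len(z)
    M = a_s * S_s + a_t * S_t
    L = np.eye(n) + np.diag(M.sum(axis=1)) - M
    _, logdet = np.linalg.slogdet(L)
    return (d @ L @ d - 2 * z @ d + z @ np.linalg.solve(L, z)
            - 0.5 * logdet + 0.5 * n * np.log(np.pi))

def grad_alpha_s(S_s, S_t, a_s, a_t, z, d):
    """dJ/da^(s) = d^T A d - (L^{-1}z)^T A (L^{-1}z) - 1/2 Tr(L^{-1} A),
    with A = dL/da^(s) = diag(row sums of S^(s)) - S^(s)."""
    n = len(z)
    M = a_s * S_s + a_t * S_t
    L = np.eye(n) + np.diag(M.sum(axis=1)) - M
    A = np.diag(S_s.sum(axis=1)) - S_s
    u = np.linalg.solve(L, z)
    return d @ A @ d - u @ A @ u - 0.5 * np.trace(np.linalg.solve(L, A))
```

A central difference (J(α+h) − J(α−h))/(2h) should match the analytic value closely; such a check is a common way to validate hand-derived CRF gradients before training.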
Preferably, in the step 5), recovering the depth of the depth-unknown RGB image sequence specifically comprises steps of:
(5.1) processing the RGB image sequence with the spatial-temporal consistency superpixel segmentation, and calculating the spatial similarity matrix S^{(s)} and the temporal similarity matrix S^{(t)};
(5.2) applying forward propagation to the RGB image sequence with the trained convolution neural network, so as to obtain the single-superpixel network output z;
(5.3) calculating the depth output d̂=[d̂_{1}; . . . ; d̂_{n}] with the spatial-temporal consistency constraint by:

d̂=L^{−1}z

wherein the matrix L is calculated as in the step (3.2); d̂_{p} represents the estimated depth value of a superpixel p in the RGB image sequence; and
(5.4) applying d̂_{p} to the corresponding position of the corresponding frame of the superpixel p for obtaining the depth maps of the m frames.
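The inference step (5.3) amounts to one linear solve. The following is an illustrative NumPy sketch (not the patented implementation); it assumes the similarity matrices and the learned parameters α^{(s)}, α^{(t)} are available.

```python
import numpy as np

def infer_depths(S_s, S_t, a_s, a_t, z):
    """Step (5.3): d_hat = L^{-1} z, computed as a linear-system solve
    rather than by forming the matrix inverse explicitly."""
    n = len(z)
    M = a_s * S_s + a_t * S_t
    L = np.eye(n) + np.diag(M.sum(axis=1)) - M
    return np.linalg.solve(L, z)
```

When both α parameters are zero, L reduces to the identity and the output equals the network prediction z; larger α^{(s)} pulls the depths of similar adjacent superpixels toward each other, which is exactly the smoothing effect the constraint is meant to provide.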
The present invention has been compared with several existing methods on the public data set NYU Depth v2 and the inventors' data set LYB 3DTV. In detail, NYU Depth v2 comprises 795 training scenes and 654 testing scenes, wherein each scene comprises 30 frames of continuous RGB images and their corresponding depth maps. LYB 3DTV comprises scenes from the TV series "Nirvana in Fire", wherein 5124 frames from 60 scenes and their manually annotated depth maps are used as a training set, and 1278 frames from 20 scenes and their manually annotated depth maps are used as a testing set. The depth recovery accuracy of the method of the present invention is compared with those of the following methods:
1. Depth Transfer: Karsch, Kevin, Ce Liu, and Sing Bing Kang. "Depth Transfer: Depth extraction from video using non-parametric sampling." IEEE Transactions on Pattern Analysis and Machine Intelligence 36.11 (2014): 2144-2158.
2. Discrete-Continuous CRF: Liu, Miaomiao, Mathieu Salzmann, and Xuming He. "Discrete-continuous depth estimation from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
3. Multi-scale CNN: Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in Neural Information Processing Systems, 2014.
4. 2D-DCNF: Liu, Fayao, et al. "Learning depth from single monocular images using deep convolutional neural fields." IEEE Transactions on Pattern Analysis and Machine Intelligence.
The results show that the accuracy of the method of the present invention is improved relative to the comparative methods, while the inter-frame jump during depth map recovery is significantly reduced.