Single-Image Depth Maps

I. Background

3D images and videos have become more prevalent; unfortunately, the process of making 3D content is still very complex and costly. Depth can be recovered either from binocular cues, such as stereo correspondence, or monocular cues, such as shading, perspective distortion, motion and texture.

Typically, depth is recovered by capturing multiple images under controlled lighting conditions and using photogrammetry to stitch them together and infer scene structure. In computer graphics, depth usually takes one of two representations: depth maps and point clouds.

Example of depth representation: Point cloud mesh of Stanford bunny

There have been various techniques for automatic 2D-to-3D conversion. Depth map estimation from 2D image sources plays an important role in this process.


A (2D) depth map image gives the “depth”, or “z” (spatial), information of a scene in the real world: the intensity value of each pixel represents the distance of that point from the viewpoint.

"Depth Map: Nearer is darker

Introduction

Initially I was trying to implement a paper of the same title, Single-Image Depth Map Estimation Using Blur Information [3]. The paper presents a novel approach for depth map estimation from a single image using edge blur information. The literature has shown that observers consistently use the blur of a boundary as a cue to relative depth, and edge blur can resolve the near/far ambiguity inherent in depth-from-focus computations.

The steps in the paper are as follows:

(1) The blur amount at each edge is calculated from the gradient magnitude ratio of the input and re-blurred images (a short code sketch follows this list).

(2) The Canny edge detection algorithm is used, tuned so that edges of different magnitudes are estimated uniformly.
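A rough sketch of these two steps: re-blur the input with a known Gaussian, compare gradient magnitudes at Canny edge pixels, and convert the ratio into a blur amount. The re-blur width and the relation σ = σ₀/√(R² − 1) follow the defocus-map literature and may not match the paper's exact formulation; file names and constants are illustrative.

```matlab
% Sketch of the gradient-magnitude-ratio blur estimate at edge pixels.
I      = im2double(rgb2gray(imread('input.jpg')));   % hypothetical input image
sigma0 = 1.0;                                        % re-blur standard deviation (assumed)
Ir     = imgaussfilt(I, sigma0);                     % re-blurred copy of the input

[Gx,  Gy ] = imgradientxy(I);                        % gradients of the input
[Gxr, Gyr] = imgradientxy(Ir);                       % gradients of the re-blurred image
R = hypot(Gx, Gy) ./ max(hypot(Gxr, Gyr), eps);      % gradient magnitude ratio

edges  = edge(I, 'canny');                           % blur is only estimated at edges
valid  = edges & (R > 1);                            % the ratio exceeds 1 at a defocused edge
sigmaB = zeros(size(I));
sigmaB(valid) = sigma0 ./ sqrt(R(valid).^2 - 1);     % per-edge blur amount (sparse map)
```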

Canny Edge Detection

Canny edge detection is a multi-step algorithm that detects edges while suppressing noise at the same time.

(1) Smooth the image with a Gaussian filter to reduce noise and unwanted details and textures.

\[ g(m,n)=G_{\sigma}(m,n)*f(m,n) \] where \[ G_{\sigma}(m,n)=\frac{1}{2\pi\sigma^2}\exp\left(-\frac{m^2+n^2}{2\sigma^2}\right) \]
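In MATLAB (which I used for prototyping) this step is a single convolution; the kernel size and σ below are example choices:

```matlab
% Step (1): smooth f with a Gaussian G_sigma to obtain g
f = im2double(rgb2gray(imread('input.jpg')));      % hypothetical input
sigma = 1.4;                                       % example smoothing scale
G = fspecial('gaussian', 2*ceil(3*sigma) + 1, sigma);
g = imfilter(f, G, 'replicate');                   % g = G_sigma * f
```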

(2) Compute the gradient of $g(m,n)$ using any of the gradient operators (Roberts, Sobel, Prewitt, etc.) to get: \[ M(m,n)=\sqrt{g_m^2(m,n)+g_n^2(m,n)} \]

and

\[ \theta(m,n)=\tan^{-1}[g_n(m,n)/g_m(m,n)] \] Threshold M: \[ M_T(m,n)=\left\{ \begin{array}{ll} M(m,n) & \mbox{if $M(m,n)>T$} \\ 0 & \mbox{otherwise} \end{array} \right. \]

where T is chosen so that all edge elements are kept while most of the noise is suppressed.
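Continuing the sketch (g is the smoothed image from the previous snippet; the threshold here is an arbitrary fraction of the maximum response):

```matlab
% Step (2): Sobel gradients, magnitude, direction, and global threshold T
[gm, gn] = imgradientxy(g, 'sobel');   % partial derivatives g_m, g_n
M     = hypot(gm, gn);                 % gradient magnitude M(m,n)
theta = atan2(gn, gm);                 % gradient direction theta(m,n)
T  = 0.1 * max(M(:));                  % example choice of T
MT = M .* (M > T);                     % M_T: keep strong responses, zero the rest
```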

(3) Suppress non-maxima pixels in the edges in $M_T$ obtained above to thin the edge ridges (as the edges might have been broadened in step 1). To do so, check whether each non-zero $M_T(m,n)$ is greater than its two neighbors along the gradient direction $\theta(m,n)$. If so, keep $M_T(m,n)$ unchanged; otherwise, set it to 0.
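A direct (unoptimised) sketch of this suppression step, continuing from MT and theta above; the diagonal-neighbour convention is one of several equivalent choices:

```matlab
% Step (3): non-maximum suppression along the gradient direction
MNMS = MT;
ang  = mod(rad2deg(theta), 180);                 % fold direction into [0, 180)
[nr, nc] = size(MT);
for r = 2:nr-1
    for c = 2:nc-1
        if MT(r,c) == 0, continue; end
        if ang(r,c) < 22.5 || ang(r,c) >= 157.5  % ~horizontal gradient
            n1 = MT(r, c-1);   n2 = MT(r, c+1);
        elseif ang(r,c) < 67.5                   % ~45-degree gradient
            n1 = MT(r-1, c+1); n2 = MT(r+1, c-1);
        elseif ang(r,c) < 112.5                  % ~vertical gradient
            n1 = MT(r-1, c);   n2 = MT(r+1, c);
        else                                     % ~135-degree gradient
            n1 = MT(r-1, c-1); n2 = MT(r+1, c+1);
        end
        if ~(MT(r,c) > n1 && MT(r,c) > n2)       % not a local maximum -> suppress
            MNMS(r,c) = 0;
        end
    end
end
```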

(4) Threshold the previous result with two different thresholds $\tau_1$ and $\tau_2$ (where $\tau_1<\tau_2$) to obtain two binary images $T_1$ and $T_2$. Note that $T_2$, with the greater $\tau_2$, has less noise and fewer false edges but greater gaps between edge segments when compared to $T_1$ with the smaller $\tau_1$.

(5) Link edge segments in $T_2$ to form continuous edges. To do so, trace each segment in $T_2$ to its end and then search its neighbors in $T_1$ for any edge segment that can bridge the gap, until reaching another edge segment in $T_2$.
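One compact way to realise the double threshold and the linking in MATLAB is morphological reconstruction: a weak ($T_1$) edge pixel survives only if it is 8-connected to a strong ($T_2$) pixel. The $\tau$ values are example choices:

```matlab
% Steps (4)-(5): double thresholding and hysteresis-style edge linking
tau1 = 0.05 * max(MNMS(:));           % low threshold  -> T1 (more edges, more noise)
tau2 = 0.15 * max(MNMS(:));           % high threshold -> T2 (cleaner, but with gaps)
T1 = MNMS > tau1;
T2 = MNMS > tau2;
% Keep every T1 pixel connected to some T2 pixel; segments from T1 bridge the
% gaps between T2 segments, which is the linking behaviour described above.
edgeMap = imreconstruct(T2, T1, 8);
```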

One propagation approach explored in related work is diffusion of the sparse estimates,

$$\frac{\partial u}{\partial t} = \nabla \cdot (D \nabla u)$$

with a symmetric diffusion tensor

$$D = \begin{pmatrix} D_{11} & D_{12} \\ D_{12} & D_{22} \end{pmatrix}$$

The final depth map can then be obtained by propagating the estimated information from the edges to the entire image using cross-bilateral filtering, which preserves edges while acting as a noise-reducing smoothing filter.

Example of cross-bilateral filtering: (left) original sample image, (right) result of cross-bilateral filtering
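A brute-force sketch of such propagation (the function name, window radius, sigmas, and the wrap-around border handling via circshift are my own simplifications, not the paper's implementation):

```matlab
% Cross-bilateral propagation of a sparse per-edge estimate to the full image.
% guide:     the original grey image (provides the range weights)
% sparseMap: per-edge blur/depth values, zero where unknown
function dense = crossBilateralPropagate(sparseMap, guide, radius, sigmaS, sigmaR)
    [H, W] = size(guide);
    num = zeros(H, W);
    den = zeros(H, W);
    for dr = -radius:radius
        for dc = -radius:radius
            spatialW = exp(-(dr^2 + dc^2) / (2*sigmaS^2));       % spatial weight
            shifted  = circshift(sparseMap, [dr dc]);            % neighbouring samples
            shiftedG = circshift(guide,     [dr dc]);            % neighbouring guide values
            rangeW   = exp(-(guide - shiftedG).^2 / (2*sigmaR^2));
            known    = circshift(sparseMap > 0, [dr dc]);        % only spread known samples
            w   = spatialW .* rangeW .* known;
            num = num + w .* shifted;
            den = den + w;
        end
    end
    dense = num ./ max(den, eps);      % weighted average of nearby known estimates
end
```

Calling it as `crossBilateralPropagate(sigmaB, I, 7, 5, 0.1)` would spread the sparse sigmaB map from the earlier sketch across the whole frame while respecting intensity edges in I.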

Results

Their results for real images demonstrate the effectiveness of this method in providing a feasible depth map for 2D-to-3D conversion that generates comfortable stereo viewing.

I initially attempted to implement the aforementioned paper, but I was not getting the expected results.

Output from trying to implement the paper [3]
(from left to right) input image, output image, desired image

I used the algorithm from the original paper to estimate the blur map, but I altered the map propagation after reading different papers.

However, the original paper had an important takeaway: depth can be recovered from a single defocused image captured by an uncalibrated camera, without any prior information.

Ideally, if you know the intrinsic parameters of the camera, you can recover absolute depth from a single image.
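For example, under a thin-lens model (this is one common form from the defocus literature, not necessarily the exact expression used in the paper), the blur-circle diameter $c$ of a point at distance $d$ depends only on the focus distance $d_f$, the focal length $f_0$ and the f-number $N$:

\[ c=\frac{|d-d_f|}{d}\,\frac{f_0^2}{N\,(d_f-f_0)} \]

so a blur estimate at an edge can, in principle, be inverted to an absolute distance once those parameters are known.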

Parallel Processing

I sought to make an optimized version of the existing algorithms that can run in real time.

The paper mentions that it takes about one minute to process an image of size 800x600, and about 30 s for an image of size 400x300.

Therefore, a GPU/parallel implementation of these algorithms was needed. This was done by dividing the image into a minimum of four parts, running each part on a different worker, and combining the results after processing (using the MATLAB parallel pool).
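A sketch of that split (the strip count, overlap and the per-strip routine processStrip are placeholders for the actual pipeline described above):

```matlab
% Split the image into four horizontal strips (with a small overlap so edge and
% blur estimates do not break at the seams), process each strip on its own
% worker, then stitch the cores of the strips back together.
if isempty(gcp('nocreate')), parpool(4); end       % open a pool with 4 workers
I = im2double(rgb2gray(imread('input.jpg')));      % hypothetical input frame
n = 4;  overlap = 16;                              % example strip count and overlap
nRows  = size(I, 1);
bounds = round(linspace(0, nRows, n + 1));
out = cell(1, n);
parfor k = 1:n
    r1 = max(bounds(k) + 1 - overlap, 1);          % padded strip boundaries
    r2 = min(bounds(k+1) + overlap, nRows);
    res = processStrip(I(r1:r2, :));               % hypothetical per-strip pipeline
    c1 = bounds(k) + 1 - r1 + 1;                   % core of the strip (no overlap)
    c2 = c1 + (bounds(k+1) - bounds(k)) - 1;
    out{k} = res(c1:c2, :);
end
depthMap = vertcat(out{:});                        % recombine into the full map
```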

(from left to right) output after parallel processing
inverted output
Comparison to original results: (left) my result, (right) literature result
Histogram of my results
Histogram of original results
Highlighted image difference

This shows that, overall, the depth estimation worked for the bird, with the exception of background features.

For this exercise I used the MATLAB Parallel Computing Toolbox, but it is not sufficient for real-time use: about 25 seconds are required to process a single frame.

After it has been run once, subsequent runs take less time (about 10 s).

Future Improvements

In order to implement depth mapping successfully, some compromises may be needed; for example, extracting depth from two images rather than from a single image.
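For a rectified stereo pair with baseline $B$, focal length $f$ and per-pixel disparity $d$, depth follows directly from triangulation:

\[ Z=\frac{fB}{d} \]

so capturing a second image replaces the blur inversion with a simple, well-understood computation.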

Since the computational cost was partly addressed in this exercise by parallel processing, a logical next step would be to apply machine learning to further refine the output depth map.

MATLAB should only be used for prototyping, not for shipping; a production version would most usefully be implemented in Java (for Android).


References