Real-time Disparity Map Reconstruction with On-board FPGA by Semi-global Matching and Weighted Least Square Filtering

Pengfei Wang, Zhi Gao, Hailong Qin, Myo Tun Aung, Xudong Cheng, Mingjie Lao, Feng Lin, Swee Huat Rodney Teo
1. Temasek Laboratories, National University of Singapore
2. Roadstar.ai
3. Carnegie Mellon University

ABSTRACT

In this paper, we propose a real-time disparity map estimation framework on FPGA, combined with an effective post-processing method. Given the input stream of stereo image pairs, a semi-global matching based framework is implemented on the FPGA to estimate the disparity in real-time. The generated disparity map is refined with a weighted least square (WLS) filtering method. The experiments show that the disparity map can be reconstructed in real-time. In addition, the weighted least square filtering based post-processing can significantly improve the accuracy of the disparity map and remove large estimation errors.

1 INTRODUCTION

Stereo Vision, resulting in the knowledge of depth information in a scene, is of great importance in the field of robotics vision. Generally speaking, if the two images are rectified, matching pixels reside on corresponding horizontal lines. For each pixel $p_l(x, y)$ in the left image, its possible matching pixel is $p_r(x - d, y)$, where $d$ is the disparity for the matched pixels. The searching range of disparity is $[0, d_{max}]$, where $d_{max}$ is the maximum disparity, as shown in Fig. 1. The disparity map is composed by the disparity value. Once we get the disparity map, the depth of scenery can be calculated by a simple equation $z = fb/d$, where $b$ is the baseline, $f$ is the focal length. The algorithms for disparity estimation can be generally classified into three categories according to the matching cost, local method, global method and semi-global method. Block matching is one of the typical kind of local method, the matching cost is a summation of the difference within a small neighbourhood of each pixel in the left and right image. Local methods are often sensitive to noise. Global methods treat disparity estimation as a multi-label problem and construct a 2D graph optimized by algorithms of graph cut [1] or belief propagation [2, 3, 4]. Semi-global methods [5] are also used to approximate the NP-hard 2D graph as independent scan-lines and leverages dynamic programming to aggregate the matching cost. Global algorithms typically do not perform an aggregation step, but rather seek a disparity assignment (step 3) that minimizes a global cost function that combines data (step 1) and smoothness terms. While the 2D-optimization can be shown to be NP-hard for common classes of smoothness functions [6], dynamic programming can find the global minimum for independent scanlines in polynomial time. Semi-global method is our choice for disparity estimation as it has both high computation efficiency and accuracy, compared to local methods and global methods.

Nowadays, there exists many algorithms to build accurate correspondence between a pair of images. Besides, dedicated hardware platforms such as FPGAs and GPUs should be utilized to realize real-time processing through speeding up stereo vision systems. Some researches focus on stereo matching algorithms which use local methods or semi-global methods accelerated by GPUs [7]. [5] reached 12 fps at 450 × 375 resolution with 64 disparities and [8] achieved 8 fps for 320 × 240 pixel images with 64 disparity levels. FPGA costs less power and its memory hierarchies as well as processing units could be configured according to user needs [9], thus FPGA outperforms GPU to some extents. [10] achieved 60 fps at 1024 × 768 pixel stereo images by merging cross-based cost aggregation and mini-census transform. In [11], mini-census adaptive support region stereo matching algorithm was applied and the experiments in this paper shown that 47.6 fps for 1920 × 1080 with a disparity range of 256 and 129 fps for 1024 × 768 with 128 disparity could be attained respectively. [9] combined cost aggregation and fast locally consistent dense stereo methods. By combining the two aforementioned methods and testing on Xilinx Virtex-6 FPGAs, it reached 507.9 fps for 640 × 480 pixel images. Although this paper got low error rate with high frequency,它 could not obtain good results with high-definition image. 1600 × 1200 pixel images with 128 disparity levels at 42 fps was achieved in [12], whose test results were evaluated on Altera Stratix-V FPGAs.
2 Proposed Method

Considering the balance between accuracy and efficiency, we take semi-global method [5] as our framework for disparity estimation. The problem of disparity estimation can be modelled as an energy minimization problem, as shown in Equation 1.

\[
E(D) = \sum_{p} (C(p, D_p) + \sum_{q \in N_p} P_1 T[|D_p - D_q| = 1] + \sum_{q \in N_p} P_2 T[|D_p - D_q| > 1])
\]

(1)

where, \(N_p\) is the neighbour of \(p\), and \(P_1\) and \(P_2\) is the penalty for disparity variation for neighbouring pixels. The energy consists of pixel-wise matching cost, smoothness cost, and edge-preserving cost. This is an NP-hard problem, and it can be approximated by minimization along 4 individual scan-lines with dynamic programming. We also add a WLS filter as the post-processing to improve the performance after the disparity map is estimated. The whole algorithm consists of the following steps:

- Median filtering is applied to filter the speckle noise such as salt and pepper noise.
- Matching cost for each pixel is calculated by census transform and hamming distance.
- The cost of the pixels are aggregated along scan-lines with dynamic programming.
- For each pixel, the disparity with the smallest matching cost is selected with a winner-take-all strategy.
- Weighted least square filter based post-processing is applied to improve the result of disparity matching.

All the steps except the post-processing are implemented on FPGA.

2.1 Algorithm for Disparity Estimation

The input left and right images are rectified and fed into a 3 × 3 median filter. Census transform is applied on both the left and right images, as in Equation 2. The matching cost between \(p_l(x, y)\) and \(p_r(x - d, y)\) is determined as the hamming distance (Fig. 2) between the two binary vectors obtained by census transform, as in Equation 3. In order to enlarge the size of the neighbour without obviously increasing computation time, we implement a sparse census transform pattern with a window of size 9 × 9, in which only 16 pixels are counted in the census transform (Fig. 3). The matching cost is aggregated as Equation 4. After cost aggregation, the disparity with the minimal cost is selected for each pixel.

\[
\xi(p, p') = \begin{cases} 
1, & \text{if } I(p') < I(p) \\
0, & \text{otherwise}
\end{cases}
\]

(2)

\[
C(p, d_p) = \bigotimes_{p' \in N(p)} \xi(p, p')
\]

(3)

\[
L_r(p, d) = \min(L_r(p - r, d - 1) + P_1, L_r(p - r, d + 1) + P_1, \min_i L_r(p - r, i) + P_2) - \min_k L_r(p - r, k)
\]

(4)

2.2 FPGA Implementation for Disparity Estimation

The algorithm of the disparity estimation was implemented on a Xilinx ZYNQ XC7Z045 FPGA. The window size of median filter is 3 × 3. The window size for the census transform is 9 × 9. Pixels slide through 9 × 9 shift register.
representing the neighborhood and loaded into census computation block. The Hamming Distance is calculated through an XOR operation followed by count of ones, the summation of set bits. The census transform of left image is stored in a $d_{\text{max}}$ 16-bit shift register and XOR with the census transform of the right image. At each clock cycle, cost for all disparity levels are generated and stored in output BRAM.

For the step of cost aggregation, the 4 directions are split into two paths, forward pass and reverse path, as shown in Fig. 4. The cost calculation for each direction is performed using Equation 4. It concurrently computes the aggregation cost for all disparity levels and three paths. The implementation of the elementary block of the aggregation cost computation is shown in Fig. 5. The block is replicated $d_{\text{max}}$ times for each 3 paths, resulting $3 \times d_{\text{max}}$ blocks to process concurrently. The bottom multiplexer selects the output between the initial cost for the beginning of a path, and the currently computed value along the path.

After the cost aggregation, the disparity with the minimal cost for each pixel is to be selected. A serial tree structure comparator is applied for disparity selection. The implementation uses $d_{\text{max}}/2$ comparators with computation complexity of $O(d_{\text{max}})$, as shown in Fig. 6.

Fig. 7 shows the time schedule for each module in the processing of one frame. Data load, census transform, and hamming distance are pipe-lined. The generated cost by hamming distance is buffered in external DDR memory. The forward pass starts after DDR buffering. The reverse pass and depth map generation starts after forward pass completed.

2.3 Post-processing by Weighted Least Square Filtering

The accuracy of the disparity estimation is often suffered from extreme scenario, such as texture-less region, overexposure, repetitive structure, etc. In order to improve the accuracy of the disparity, post-processing is always necessary. We take the weighted least square filtering [13] for post-processing, for its good performance of edge preserving smoothing. The objective of filtering the disparity can be
expressed as minimizing Equation 5.

\[
\sum_p ((D_p' - D_p)^2 + \lambda (a_{x,p}(I)(\frac{\partial D}{\partial x})_p^2 + a_{y,p}(I)(\frac{\partial D}{\partial y})_p^2)) \quad (5)
\]

\(a_{x,p}(I)\) and \(a_{y,p}(I)\) are the smoothness weights as defined in Equation 6 as [14]:

\[
a_{x,p}(I) = \left( \left| \frac{\partial l}{\partial x}(p) \right| \right)^{\alpha} + \epsilon \right)^{-1}
\]
\[
a_{y,p}(I) = \left( \left| \frac{\partial l}{\partial y}(p) \right| \right)^{\alpha} + \epsilon \right)^{-1}
\]

where \(l\) is the log-luminance channel of the guidance image \(I\), the parameter \(\alpha\) determines the edge sharpness, while \(\epsilon\) is a small constant, for example, 0.0001.

3 EXPERIMENTS

The disparity map estimation algorithm is implemented on Xilinx ZYNQ ZC7045 FPGA, and the post-processing is implemented on ARM processor. The whole system is shown in Fig. 8.

In order to test the performance of the algorithm, we have carried out experiments on Middlebury dataset and self-collected data. The sample result of the image "Teddy" from the dataset is shown in Fig. 9. The result shows that the WLS filter can improve the accuracy of the estimated disparity.

For the self-collected data (Fig. 10), the result of the estimated disparity by semi-global matching suffers stride effect, especially at the right-top corner with overexposure, the disparity is totally wrong as it is a large texture-less region. The weighted least square filter can remove the stride effect, makes the disparity level more clear, and reduce the error caused by the overexposure.

4 CONCLUSION

In this paper, we have implemented a Semi-global disparity estimation algorithm on FPGA, and improve the disparity map through weighted least square filtering. In the disparity estimation, the sparse census transform has two folds, large window makes the descriptor more robust to noise, and the computation load increases not too much. The WLS filtering based can improve the accuracy of the estimated disparity.
map, and reduces some defects, such as error caused by over-exposure.

REFERENCES


