
Multi-Level Context Ultra-Aggregation for Stereo Matching

Guang-Yu Nie§, Ming-Ming Chengß, Yun Liuß, Zhengfa Liang¶, Deng-Ping Fanß, Yue Liu§ℑ, Yongtian Wang§ℑ

§ Beijing Institute of Technology  ß TKLNDST, CS, Nankai University
¶ National Key Laboratory of Science and Technology on Blind Signal Processing
ℑ AICFVE, Beijing Film Academy


Abstract

Exploiting multi-level context information for cost volume computation can improve the performance of learning-based stereo matching methods. In recent years, 3-D Convolutional Neural Networks (3-D CNNs) have shown their advantages in regularizing the cost volume, but they are limited by the unary features learned for matching cost computation. Existing methods only use features from plain convolution layers or a simple aggregation of multi-level features to calculate the cost volume, which is insufficient because stereo matching requires discriminative features to identify corresponding pixels in rectified stereo image pairs. In this paper, we propose a unary feature descriptor using multi-level context ultra-aggregation (MCUA), which encapsulates all convolutional features into a more discriminative representation via intra- and inter-level feature combination. Specifically, a child module that takes low-resolution images as input captures larger context information, and this larger context information from each layer is densely connected to the main branch of the network. MCUA makes good use of multi-level features with richer context and performs the image-to-image prediction holistically. We apply our MCUA scheme to cost volume calculation and test it on PSM-Net. We also evaluate our method on the Scene Flow and KITTI 2012/2015 stereo datasets. Experimental results show that our method outperforms state-of-the-art methods by a notable margin and effectively improves the accuracy of stereo matching.

Paper

  • Multi-Level Context Ultra-Aggregation for Stereo Matching. Guang-Yu Nie, Ming-Ming Cheng, Yun Liu, Zhengfa Liang, Deng-Ping Fan, Yue Liu, Yongtian Wang. IEEE CVPR, 2019. [pdf][supplementary materials][bib][ppt]

Source Code

(A new framework based on PyTorch is coming soon).

Motivation

We improve the discriminative ability of the unary features used for matching cost calculation by introducing the Multi-Level Context Ultra-Aggregation (MCUA) scheme, which combines features at the shallowest, smallest scale with those at deeper, larger scales using only “shallow” skip connections. Besides the intra-level combination inspired by DenseNets and Deep Layer Aggregation, MCUA contains an independent child module that introduces an inter-level combination scheme.

Method

As illustrated in Fig. 1, the MCUA scheme allows each stage to receive the features from all previous stages and passes its outputs through all subsequent stages. Branch (a) is the backbone, while branch (b) is the independent child module. Each colored block represents the feature map generated by one stage, while each green block denotes the receptive field of the next stage.

Figure 1. Illustration of the MCUA scheme.

Intra-level Combination

The intra-level combination fuses the feature maps within each group: dense connections, drawn as dashed lines in Fig. 1, are applied between every two stages.
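As a concrete illustration, below is a minimal PyTorch sketch of such a densely connected group. The class name, channel sizes, number of stages, and layer composition are our own illustrative assumptions, not the paper's exact configuration; only the DenseNet-style connection pattern follows the description above.

import torch
import torch.nn as nn

class DenselyConnectedGroup(nn.Module):
    # Intra-level combination: each stage receives the concatenated feature
    # maps of all previous stages (DenseNet-style dense connections).
    def __init__(self, in_channels=32, growth=32, num_stages=4):
        super().__init__()
        self.stages = nn.ModuleList()
        channels = in_channels
        for _ in range(num_stages):
            self.stages.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True)))
            channels += growth  # the next stage sees all earlier outputs

    def forward(self, x):
        features = [x]
        for stage in self.stages:
            out = stage(torch.cat(features, dim=1))  # dense connection
            features.append(out)
        return torch.cat(features, dim=1)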

Inter-level Combination

As shown in Fig. 1, we use an independent child module, represented by the solid colored lines, to introduce the inter-level aggregation. The child module first applies an average pooling operation, P0, to halve the size of the input, and then uses four stages (i.e., F0, …, F3) to learn unary features. Each of these four stages shares the same internal architecture as the first group of the backbone, and the parameters of corresponding layers are tied. Large receptive fields are usually only achieved at the deep stages of a network; with the independent child module, large receptive fields are obtained already at shallow stages.
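A minimal PyTorch sketch of this idea follows. The name ChildModule and the assumption that the stages can be applied sequentially are ours; the weight tying (by reusing the backbone's modules directly) and the halving pooling P0 follow the description above.

import torch.nn as nn
import torch.nn.functional as F

class ChildModule(nn.Module):
    # Inter-level combination: run the backbone's first-group stages (with
    # tied weights) on a half-resolution copy of the input, so that shallow
    # stages obtain receptive fields as large as those of deeper stages on
    # the full-resolution branch.
    def __init__(self, backbone_stages):
        super().__init__()
        # Reusing the backbone's nn.Modules directly ties the parameters.
        self.stages = backbone_stages  # e.g., an nn.ModuleList holding F0..F3

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # P0: halve the input size
        outputs = []
        for stage in self.stages:  # F0, ..., F3
            x = stage(x)
            outputs.append(x)  # each output is densely fed to the main branch
        return outputs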

Stereo Matching Method

Figure 2. Diagrammatic sketch of our proposed network (the EMCUA Network).

The EMCUA Network is built on PSM-Net by applying the MCUA scheme to the matching cost calculation architecture and adding a residual module at the end. A pair of stereo images (i.e., Left, Right) passes through the network for disparity prediction (i.e., Output3).

As shown in Fig. 2, the MCUA Network contains three hourglass networks, each of which generates a disparity map. All three outputs are used to calculate the loss during training, and the last output is used for testing. The output of the third hourglass network is taken as an initial disparity map. To refine the foreground of this initial prediction, a residual module is added at the end of the network: it first generates a residual map and then combines it with the initial disparity map by element-wise summation to obtain the final output, i.e., Output3. The whole network is named the EMCUA Network, which differs from MCUA only in its last output.
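The text above specifies only the residual map and the element-wise summation; the sketch below is a hypothetical PyTorch rendering in which the residual is predicted from the left image concatenated with the initial disparity (our assumption), with arbitrary layer sizes.

import torch
import torch.nn as nn

class ResidualRefinement(nn.Module):
    # Predict a residual map and add it to the initial disparity map
    # (element-wise summation) to obtain the refined output, i.e., Output3.
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1))

    def forward(self, left_image, init_disparity):
        # left_image: (B, 3, H, W); init_disparity: (B, 1, H, W)  [our assumption]
        x = torch.cat([left_image, init_disparity], dim=1)
        residual = self.net(x)
        return init_disparity + residual  # element-wise summation -> Output3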

Performance

We test our proposed model on three datasets (i.e., the Scene Flow dataset and the KITTI 2012/2015 datasets) and compare it with state-of-the-art architectures.

Performance on KITTI2015/2012 Datasets

EMCUA achieves an overall three-pixel error of 2.09%/1.64% on the KITTI 2015/2012 datasets, a 9.9%/13.2% decrease compared with PSM-Net, while MCUA achieves 2.14%/1.70%, a 7.8%/10.1% decrease compared with PSM-Net. The results show that both EMCUA and MCUA outperform the state-of-the-art method (i.e., SegStereo) and that the performance gain mainly comes from the MCUA scheme. Furthermore, as shown in Tab. 2, EMCUA has an overall three-pixel error of 4.27%/1.66% on the foreground/background regions of the KITTI 2015 dataset, a 2.5%/1.8% decrease compared with MCUA. This shows that the residual module mainly improves the accuracy on foreground regions.
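For reference, a minimal sketch of the three-pixel error computation. This follows the standard KITTI 2015 D1 definition, in which a pixel counts as erroneous only if its error also exceeds 5% of the ground-truth disparity; the function name and signature are ours.

import torch

def three_pixel_error(pred, gt, valid):
    # KITTI 2015 D1 metric: a pixel is erroneous if its disparity error
    # exceeds both 3 px and 5% of the ground-truth disparity.
    # (KITTI 2012 uses the plain > 3 px threshold.)
    err = (pred - gt).abs()
    bad = (err > 3.0) & (err > 0.05 * gt.abs())
    return bad[valid].float().mean().item() * 100.0  # percentage of bad pixels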

Performance on Scene Flow Datasets

As shown in Tab. 4, the end-point error of MCUA is 0.56 pixels, roughly a 50% reduction compared with PSM-Net, outperforming the state-of-the-art approaches. As shown by the blue boxes in the figure below, the ultra-aggregation scheme helps the model learn robust context information and predict disparities accurately, especially for overlapped objects.
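The end-point error is simply the mean absolute disparity difference over valid ground-truth pixels; a one-function sketch (inputs are assumed to be torch tensors, with valid a boolean mask):

def end_point_error(pred, gt, valid):
    # Mean absolute disparity difference (in pixels) over valid GT pixels.
    return (pred - gt).abs()[valid].mean().item()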

Contact

Email:  gyunie@outlook.com
Skype: live:guyuneeee

Comments
Todd Qi

Hello, amazing work!
When will the source code be released?

MM Cheng

This project was finished while GuangYu Nie was visiting our lab. The copyright of the source code belongs to Beijing Institute of Technology. You may contact the corresponding author to ask whether they are willing to share the source code.

MM Cheng

You may also be interested in another open source stereo estimation method from us: https://jwbian.net/sc-sfmlearner