
Improving Convolutional Networks with Self-calibrated Convolutions

Jiang-Jiang Liu1*, Qibin Hou2*, Ming-Ming Cheng1, Changhu Wang3, Jiashi Feng2

1CS, Nankai University      2NUS      3ByteDance AI Lab


Figure 1. Schematic illustration of the proposed self-calibrated convolutions. As can be seen, in self-calibrated convolutions, the original filters are separated into four portions, each responsible for a different functionality. This makes self-calibrated convolutions quite different from traditional convolutions or grouped convolutions, which are performed in a homogeneous way.

1. Abstract

Recent advances on CNNs are mostly devoted to designing more complex architectures to enhance their representation learning capacity. In this paper, we consider improving the basic convolutional feature transformation process of CNNs without tuning the model architectures. To this end, we present a novel self-calibrated convolution that explicitly expands fields-of-view of each convolutional layer through internal communications and hence enriches the output features. In particular, unlike the standard convolutions that fuse spatial and channel-wise information using small kernels (e.g., 3 × 3), our self-calibrated convolution adaptively builds long-range spatial and inter-channel dependencies around each spatial location through a novel self-calibration operation. Thus, it can help CNNs generate more discriminative representations by explicitly incorporating richer information. Our self-calibrated convolution design is simple and generic, and can be easily applied to augment standard convolutional layers without introducing extra parameters and complexity. Extensive experiments demonstrate that when applying our self-calibrated convolution into different backbones, the baseline models can be significantly improved in a variety of vision tasks, including image recognition, object detection, instance segmentation, and keypoint detection, with no need to change network architectures. We hope this work could provide future research with a promising way of designing novel convolutional feature transformation for improving convolutional networks.
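
To make the self-calibration operation concrete, below is a minimal PyTorch sketch of an SC-Conv block. It is a sketch under the following assumptions: the k1–k4 names mirror the four filter portions shown in Figure 1, the pooling rate is r = 4 as in the paper, and BatchNorm layers, activations, and stride handling from the released implementation are omitted for brevity. See the official repository for the exact version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCConv(nn.Module):
    """Minimal sketch of a self-calibrated convolution (even channel count assumed)."""

    def __init__(self, channels, pooling_r=4):
        super().__init__()
        c = channels // 2  # the input is split into two channel halves
        # k1: plain 3x3 convolution applied to the second half
        self.k1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        # k2: gathers context in a downsampled (r x r average-pooled) space
        self.k2 = nn.Sequential(
            nn.AvgPool2d(kernel_size=pooling_r, stride=pooling_r),
            nn.Conv2d(c, c, 3, padding=1, bias=False),
        )
        # k3/k4: transform the calibrated half
        self.k3 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.k4 = nn.Conv2d(c, c, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        # calibration weights: sigmoid of the identity plus upsampled low-res context
        attn = torch.sigmoid(x1 + F.interpolate(self.k2(x1), size=x1.shape[2:]))
        y1 = self.k4(self.k3(x1) * attn)  # self-calibrated branch
        y2 = self.k1(x2)                  # plain convolution branch
        return torch.cat([y1, y2], dim=1)
```

As a quick sanity check, `SCConv(64)(torch.randn(1, 64, 56, 56))` returns a tensor of the same shape; with r = 4 the calibration weights are computed on 14×14 maps and upsampled back to 56×56, which is how the block enlarges the field-of-view without extra parameters relative to a standard convolution of the same width.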

2. Paper

Improving Convolutional Networks with Self-Calibrated Convolutions, Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Changhu Wang, Jiashi Feng, IEEE CVPR, 2020. (*Equal contribution) [pdf|project|bib|code]

@inproceedings{liu2020scnet,
 title={Improving Convolutional Networks with Self-Calibrated Convolutions},
 author={Jiang-Jiang Liu and Qibin Hou and Ming-Ming Cheng and Changhu Wang and Jiashi Feng},
 booktitle={IEEE CVPR},
 year={2020},
}

3. Applications

Update:

  • 2020.5.15
    • The pretrained model of SCNet50_v1d is released, improving ImageNet top-1 accuracy by more than 2% over the original SCNet-50 (80.47 vs. 77.81)!
    • On other applications such as object detection and instance segmentation, SCNet50_v1d achieves performance comparable to our original SCNet-101.
    • Due to limited GPU resources, the pretrained model of SCNet101_v1d, together with results on more applications, will be released later.

3.1 Classification

The SC-Conv module can be used as a drop-in replacement for the bottleneck block with no other modification. Source code is available at https://github.com/MCG-NKU/SCNet .
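
A minimal usage sketch for the pretrained classification models, assuming the repository exposes a scnet50 constructor in a scnet module (check the repository README for the exact entry points):

```python
import torch
from scnet import scnet50  # module/constructor names assumed from the repository

# load the released ImageNet weights and run a forward pass
model = scnet50(pretrained=True)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # standard ImageNet input size
print(logits.shape)  # expected: torch.Size([1, 1000])
```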

Model        | #Params | MAdds | FLOPs | top-1 error | top-5 error | Link 1      | Link 2
SCNet-50     | 25.56M  | 4.0G  | 7.9G  | 22.19       | 6.08        | GoogleDrive | BaiduYun (pwd: 95p5)
SCNet-50_v1d | 25.58M  | 4.7G  | 9.4G  | 19.53       | 4.68        | GoogleDrive | BaiduYun (pwd: hmmt)
SCNet-101    | 44.57M  | 7.2G  | 14.4G | 21.06       | 5.75        | GoogleDrive | BaiduYun (pwd: 38oh)

Table 1. Performance of image classification on the ImageNet dataset (error rates in %).

3.2 Object detection

We use the Faster R-CNN architecture with a feature pyramid network (FPN) as the baseline and adopt the widely used mmdetection framework to run all our experiments.
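
For illustration only, swapping the backbone in an mmdetection config might look like the sketch below. The 'SCNet' backbone type, the base config name, and the checkpoint path are all hypothetical: they assume the SCNet backbone has been registered with mmdetection, and config field names vary across mmdetection versions.

```python
# Hypothetical mmdetection config sketch: Faster R-CNN + FPN with an SCNet-50 backbone.
# 'SCNet' is a placeholder for a backbone registered in mmdetection's BACKBONES registry.
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'  # baseline config; name varies by version

model = dict(
    pretrained='scnet50_checkpoint.pth',  # placeholder path to the released weights
    backbone=dict(
        type='SCNet',  # hypothetical registered backbone name
        depth=50,
    ),
)
```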

Backbone     | AP   | AP50 | AP75 | APS  | APM  | APL
ResNet-50    | 37.6 | 59.4 | 40.4 | 21.9 | 41.2 | 48.4
SCNet-50     | 40.8 | 62.7 | 44.5 | 24.4 | 44.8 | 53.1
SCNet-50_v1d | 41.8 | 62.9 | 45.5 | 24.8 | 45.3 | 54.8
ResNet-101   | 39.9 | 61.2 | 43.5 | 23.5 | 43.9 | 51.7
SCNet-101    | 42.0 | 63.7 | 45.5 | 24.4 | 46.3 | 54.6

Table 2. Performance of object detection on the COCO dataset.

3.3 Instance segmentation

We use the Mask R-CNN architecture with a feature pyramid network (FPN) as the baseline and adopt the widely used mmdetection framework to run all our experiments.

Backbone     | AP   | AP50 | AP75 | APS  | APM  | APL
ResNet-50    | 35.0 | 56.5 | 37.4 | 18.3 | 38.2 | 48.3
SCNet-50     | 37.2 | 59.9 | 39.5 | 17.8 | 40.3 | 54.2
SCNet-50_v1d | 38.5 | 60.6 | 41.3 | 20.8 | 42.0 | 52.6
ResNet-101   | 36.7 | 58.6 | 39.3 | 19.3 | 40.3 | 50.9
SCNet-101    | 38.4 | 61.0 | 41.0 | 18.2 | 41.6 | 56.6

Table 3. Performance of instance segmentation on the COCO dataset.

3.4 Human keypoint detection

We use Simple Baselines as the baseline method for human keypoint detection. In the test phase, we adopt a Faster R-CNN object detector with a detection AP of 56.4 for the ‘person’ category on the COCO val2017 set.

Backbone   | Scale   | AP   | AP50 | AP75 | AP (M) | AP (L)
ResNet-50  | 256×192 | 70.6 | 88.9 | 78.2 | 67.2   | 77.4
SCNet-50   | 256×192 | 72.1 | 89.4 | 79.8 | 69.0   | 78.7
ResNet-50  | 384×288 | 71.9 | 89.2 | 78.6 | 67.7   | 79.6
SCNet-50   | 384×288 | 74.4 | 89.7 | 81.4 | 70.7   | 81.7
ResNet-101 | 256×192 | 71.6 | 88.9 | 79.3 | 68.5   | 78.2
SCNet-101  | 256×192 | 72.6 | 89.4 | 80.4 | 69.4   | 79.4
ResNet-101 | 384×288 | 73.9 | 89.6 | 80.5 | 70.3   | 81.1
SCNet-101  | 384×288 | 74.8 | 89.6 | 81.8 | 71.2   | 81.9

Table 4. Performance of human keypoint detection on the COCO dataset.