Res2Net: A New Multi-scale Backbone Architecture
Online demo: http://mc.nankai.edu.cn/res2net-det
Shanghua Gao 1, Ming-Ming Cheng 1, Kai Zhao 1, Xin-Yu Zhang 1, Ming-Hsuan Yang 2, Philip Torr 3
1 TKLNDST, CS, Nankai University  2 UC Merced  3 University of Oxford

1. Abstract
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g. ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g. CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e. object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
Source Code and pre-trained model: https://github.com/Res2Net
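To make the architecture concrete, below is a minimal, self-contained PyTorch sketch of the hierarchical split-and-connect pattern inside a single Res2Net bottleneck, simplified to the stride-1, identity-shortcut case. This is an illustration, not the authors' reference implementation (see the repository above for that):

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Sketch of a Res2Net bottleneck: split the 1x1-conv output into
    `scale` groups of `width` channels and connect them hierarchically."""

    def __init__(self, channels, width=26, scale=4):
        super().__init__()
        self.scale = scale
        self.width = width
        self.conv1 = nn.Conv2d(channels, width * scale, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(width * scale)
        # One 3x3 conv per split except the first, which passes through
        # unchanged (feature reuse at the smallest receptive field).
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1, bias=False)
             for _ in range(scale - 1)])
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(width) for _ in range(scale - 1)])
        self.conv3 = nn.Conv2d(width * scale, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        xs = torch.split(out, self.width, dim=1)  # `scale` groups of `width` channels
        ys = [xs[0]]  # y1 = x1: no conv on the first split
        for i in range(1, self.scale):
            # Hierarchical residual-like connection: each split receives the
            # previous split's output before its own 3x3 conv, so later
            # splits see progressively larger receptive fields.
            inp = xs[i] if i == 1 else xs[i] + ys[-1]
            ys.append(self.relu(self.bns[i - 1](self.convs[i - 1](inp))))
        out = self.bn3(self.conv3(torch.cat(ys, dim=1)))
        return self.relu(out + x)  # standard residual shortcut
```

For example, `Res2NetBottleneck(256, width=26, scale=4)` maps an `(N, 256, H, W)` tensor to the same shape while mixing information across the four width-26 splits at increasing receptive-field sizes.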
2. Paper
- Res2Net: A New Multi-scale Backbone Architecture, Shang-Hua Gao#, Ming-Ming Cheng#*, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr, IEEE TPAMI, 43(2):652-662, 2021. [pdf | code | project | PPT | bib | Chinese translation | LaTeX]
3. Applications
Res2Net has proven useful in almost all computer vision applications we have tried so far. If you find it useful in your application and want to share it with others, please contact us to add a link on this project page.
News:
- 2020.10.20 The PaddlePaddle version of Res2Net achieves 85.13% top-1 accuracy on ImageNet: PaddlePaddle Res2Net.
- 2020.8.21 An online demo for detection and segmentation using Res2Net is released: http://mc.nankai.edu.cn/res2net-det
- 2020.7.29 The training code of Res2Net on ImageNet is released at https://github.com/Res2Net/Res2Net-ImageNet-Training (non-commercial use only).
- 2020.6.1 Res2Net is now in the official model zoo of the new deep learning framework Jittor.
- 2020.5.21 Res2Net is now one of the basic backbones in the MMDetection v2 framework: https://github.com/open-mmlab/mmdetection. Using MMDetection v2 with Res2Net achieves better performance at a lower computational cost.
- 2020.5.11 Res2Net achieves about a 2% performance gain on panoptic segmentation based on detectron2, without bells and whistles.
- 2020.3.14 The Res2Net backbone allows the latest interactive segmentation method to significantly reduce the number of required user interactions compared with the best reported results.
- 2020.2.24 Our Res2Net_v1b achieves better detection performance on the popular mmdetection platform, outperforming the previous best results achieved with the HRNet backbone while consuming only about 50% of the parameters and computation!
- 2020.2.21 Pretrained models of Res2Net_v1b improve ImageNet top-1 accuracy by more than 2% compared with the TPAMI version of Res2Net!
3.1 Classification
The Res2Net module can replace the bottleneck block in existing backbones with no other modification.
We have integrated the Res2Net module into several state-of-the-art backbone networks: ResNet, ResNeXt, DLA, SE-Net, and Big-Little Net. Source code for these backbone models is available at https://github.com/gasvn/Res2Net . In the table below, a name such as Res2Net-50-26w-4s denotes Res2Net-50 with base width w = 26 and scale s = 4; a brief usage sketch follows the table.
Model | #Params | GFLOPs | Top-1 error (%) | Top-5 error (%) | Link |
---|---|---|---|---|---|
Res2Net-50-48w-2s | 25.29M | 4.2 | 22.68 | 6.47 | OneDrive |
Res2Net-50-26w-4s | 25.70M | 4.2 | 22.01 | 6.15 | OneDrive |
Res2Net-50-14w-8s | 25.06M | 4.2 | 21.86 | 6.14 | OneDrive |
Res2Net-50-26w-6s | 37.05M | 6.3 | 21.42 | 5.87 | OneDrive |
Res2Net-50-26w-8s | 48.40M | 8.3 | 20.80 | 5.63 | OneDrive |
Res2Net-101-26w-4s | 45.21M | 8.1 | 20.81 | 5.57 | OneDrive |
Res2NeXt-50 | 24.67M | 4.2 | 21.76 | 6.09 | OneDrive |
Res2Net-DLA-60 | 21.15M | 4.2 | 21.53 | 5.80 | OneDrive |
Res2NeXt-DLA-60 | 17.33M | 3.6 | 21.55 | 5.86 | OneDrive |
Res2Net-v1b-50 | 25.72M | 4.5 | 19.73 | 4.96 | Link |
Res2Net-v1b-101 | 45.23M | 8.3 | 18.77 | 4.64 | Link |
The download link from Baidu Disk is now available. (Baidu Disk password: vbix)
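As a quick-start illustration, the sketch below assumes the `res2net.py` module from https://github.com/gasvn/Res2Net is importable and exposes a `res2net50_26w_4s` constructor (the name mirrors the Res2Net-50-26w-4s entry above; check the repository for the exact API):

```python
import torch
# Assumption: res2net.py from https://github.com/gasvn/Res2Net is on the
# Python path and provides a res2net50_26w_4s(pretrained=...) constructor.
from res2net import res2net50_26w_4s

model = res2net50_26w_4s(pretrained=True)  # loads ImageNet-pretrained weights
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # standard ImageNet input size
print(logits.shape)  # expected: torch.Size([1, 1000])
```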
3.2 Pose estimation

We use Simple Baselines as the baseline method for pose estimation. Source code is available at https://github.com/gasvn/Res2Net-Pose-Estimation .
Results on COCO val2017, using a person detector with 56.4 human AP on COCO val2017.
Arch | Input size | AP | AP .5 | AP .75 | AP (M) | AP (L) |
---|---|---|---|---|---|---|
pose_resnet_50 | 256×192 | 0.704 | 0.886 | 0.783 | 0.671 | 0.772 |
pose_res2net_50 | 256×192 | 0.737 | 0.925 | 0.814 | 0.708 | 0.782 |
pose_resnet_101 | 256×192 | 0.714 | 0.893 | 0.793 | 0.681 | 0.781 |
pose_res2net_101 | 256×192 | 0.744 | 0.926 | 0.826 | 0.720 | 0.785 |
3.3 Instance segmentation
We use Mask R-CNN, built on maskrcnn-benchmark, as the baseline method for instance segmentation and object detection. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .

Performance on instance segmentation:
Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
ResNet-50 | 64w | 33.9 | 55.2 | 36.0 | 14.8 | 36.0 | 50.9 |
ResNet-50 | 48w×2s | 34.2 | 55.6 | 36.3 | 14.9 | 36.8 | 50.9 |
Res2Net-50 | 26w×4s | 35.6 | 57.6 | 37.6 | 15.7 | 37.9 | 53.7 |
Res2Net-50 | 18w×6s | 35.7 | 57.5 | 38.1 | 15.4 | 38.1 | 53.7 |
Res2Net-50 | 14w×8s | 35.3 | 57.0 | 37.5 | 15.6 | 37.5 | 53.4 |
ResNet-101 | 64w | 35.5 | 57.0 | 37.9 | 16.0 | 38.2 | 52.9 |
Res2Net-101 | 26w×4s | 37.1 | 59.4 | 39.4 | 16.6 | 40.0 | 55.6 |
3.4 Object detection
As in Sec. 3.3, we use Mask R-CNN built on maskrcnn-benchmark as the baseline. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .
Performance on object detection:
Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
ResNet-50 | 64w | 37.5 | 58.4 | 40.3 | 20.6 | 40.1 | 49.7 |
ResNet-50 | 48w×2s | 38.0 | 58.9 | 41.3 | 20.5 | 41.0 | 49.9 |
Res2Net-50 | 26w×4s | 39.6 | 60.9 | 43.1 | 22.0 | 42.3 | 52.8 |
Res2Net-50 | 18w×6s | 39.9 | 60.9 | 43.3 | 21.8 | 42.8 | 53.7 |
Res2Net-50 | 14w×8s | 39.1 | 60.2 | 42.1 | 21.7 | 41.7 | 52.8 |
ResNet-101 | 64w | 39.6 | 60.6 | 43.2 | 22.0 | 43.2 | 52.4 |
Res2Net-101 | 26w×4s | 41.8 | 62.6 | 45.6 | 23.4 | 45.5 | 55.6 |
3.5 Salient object detection

We use PoolNet (CVPR 2019) as the baseline method for salient object detection. Source code is available at https://github.com/gasvn/Res2Net-PoolNet .
Results on salient object detection datasets, without joint training with edge supervision. Models are trained on DUTS-TR.
Backbone | ECSSD | PASCAL-S | DUT-O | HKU-IS | SOD | DUTS-TE |
---|---|---|---|---|---|---|
– | MaxF & MAE | MaxF & MAE | MaxF & MAE | MaxF & MAE | MaxF & MAE | MaxF & MAE |
VGG | 0.936 & 0.047 | 0.857 & 0.078 | 0.817 & 0.058 | 0.928 & 0.035 | 0.859 & 0.115 | 0.876 & 0.043 |
ResNet-50 | 0.940 & 0.042 | 0.863 & 0.075 | 0.830 & 0.055 | 0.934 & 0.032 | 0.867 & 0.100 | 0.886 & 0.040 |
Res2Net-50 | 0.947 & 0.036 | 0.871 & 0.070 | 0.837 & 0.052 | 0.936 & 0.031 | 0.885 & 0.096 | 0.892 & 0.037 |
3.6 Semantic segmentation

3.7 Detection benchmark (mmdetection)
Comparison of backbones on the mmdetection object detection benchmark:
Backbone | Params. | GFLOPs | box AP |
---|---|---|---|
R-101-FPN | 60.52M | 283.14 | 39.4 |
X-101-64x4d-FPN | 99.25M | 440.36 | 41.3 |
HRNetV2p-W48 | 83.36M | 459.66 | 41.5 |
Res2Net-101 | 61.18M | 293.68 | 42.3 |
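For readers who want to reproduce this, switching an MMDetection v2 config to a Res2Net-101 backbone might look like the sketch below. The field names (`type='Res2Net'`, `scales`, `base_width`), the base config path, and the checkpoint identifier reflect our reading of MMDetection v2's conventions and should be verified against the mmdetection repository:

```python
# Hedged sketch of an MMDetection v2 config with a Res2Net-101 backbone.
# The base config path and the open-mmlab checkpoint name are assumptions;
# see https://github.com/open-mmlab/mmdetection for the official configs.
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'
model = dict(
    backbone=dict(
        type='Res2Net',    # Res2Net backbone registered in MMDetection v2
        depth=101,
        scales=4,          # the "s" in 26w-4s
        base_width=26,     # the "w" in 26w-4s
        pretrained='open-mmlab://res2net101_v1d_26w_4s'))
```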
3.8 Vectorized road extraction

3.9 Interactive image segmentation


3.10 Tumor segmentation on CT scans (from Sun et al. 2019)

3.11 Person Re-ID (from Cao et al.)

3.12 Single-stage object detection (from Chen et al.)

3.13 Depth prediction (from Weida Yang)

3.14 Semantic image to photo-realistic image translation

3.15 Res2NetPlus for solar panel detector
A solar panel detector for satellite imagery, developed by Less Wright, who found Res2Net-50 to offer both greater accuracy (+5%) and steadier training. See also his blog or a Chinese translation of it.
3.16 Speaker Verification
Zhou et al. (IEEE SLT 2021) found that ResNeXt and Res2Net significantly outperform the conventional ResNet model for speaker verification. The Res2Net model achieved superior performance, reducing the EER by 18.5% relative. Experiments on two further internal test sets with mismatched conditions confirmed that the ResNeXt and Res2Net architectures generalize well to noisy environments and segment-length variations.

3.17 Protein Structure Prediction
