
Res2Net: A New Multi-scale Backbone Architecture

Online Demo

Shanghua Gao 1, Ming-Ming Cheng 1, Kai Zhao 1, Xin-Yu Zhang 1, Ming-Hsuan Yang 2, Philip Torr 3

1 TKLNDST, CS, Nankai University      2 UC Merced      3 University of Oxford

Figure 1. We propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, BigLittleNet, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models.

1. Abstract

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g. ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g. CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e. object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.

Source code and pre-trained models: https://github.com/Res2Net

2. Paper

  1. Res2Net: A New Multi-scale Backbone Architecture, Shang-Hua Gao#, Ming-Ming Cheng#*, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr, IEEE TPAMI, 43(2):652-662, 2021. [pdf | code | project | PPT | bib | Chinese translation | LaTeX]

3. Applications

Res2Net has proven useful in almost all computer vision applications we have tried so far. If you find it useful in your application and want to share it with others, please contact us to add a link on this project page.

News

  • 2020.10.20 The PaddlePaddle version of Res2Net achieves 85.13% top-1 acc. on ImageNet: PaddlePaddle Res2Net.
  • 2020.8.21 An online demo for detection and segmentation using Res2Net is released: http://mc.nankai.edu.cn/res2net-det
  • 2020.7.29 The training code of Res2Net on ImageNet is released: https://github.com/Res2Net/Res2Net-ImageNet-Training (non-commercial use only)
  • 2020.6.1 Res2Net is now in the official model zoo of the new deep learning framework Jittor.
  • 2020.5.21 Res2Net is now one of the basic backbones in the MMDetection v2 framework: https://github.com/open-mmlab/mmdetection. Using MMDetection v2 with Res2Net achieves better performance at less computational cost.
  • 2020.5.11 Res2Net achieves about a 2% performance gain on panoptic segmentation based on detectron2, with no tricks.
  • 2020.3.14 The Res2Net backbone allows the latest interactive segmentation method to significantly reduce the number of required user interactions compared with the best reported results.
  • 2020.2.24 Our Res2Net_v1b achieves better detection performance on the popular mmdetection platform, outperforming the previous best results achieved with the HRNet backbone while consuming only about 50% of the parameters and computation!
  • 2020.2.21 Pretrained models of Res2Net_v1b bring more than 2% improvement in ImageNet top-1 acc. compared with the TPAMI version of Res2Net!

3.1 Classification

The Res2Net module can replace the bottleneck block of a ResNet-style network with no other modification.

We have implemented the Res2Net module in many state-of-the-art backbone networks: ResNet, ResNeXt, DLA, SE-Net, and Big-Little Net. Source code for these backbone models is available at https://github.com/gasvn/Res2Net .
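In PyTorch terms, the block looks like the following minimal sketch, which follows the structure of the released code; stride and downsample handling are omitted for brevity, so it assumes a non-downsampling block where input and output channels match. `baseWidth` and `scale` correspond to the 'w' and 's' in model names such as Res2Net-50-26w-4s.

```python
import math
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Minimal sketch of the Res2Net block (stride/downsample omitted)."""
    expansion = 4

    def __init__(self, inplanes, planes, baseWidth=26, scale=4):
        super().__init__()
        # Channel width of each of the `scale` feature groups; this is the
        # D = int(math.floor(planes * (baseWidth / 64.0))) from the repo.
        width = int(math.floor(planes * (baseWidth / 64.0)))
        self.width, self.scale = width, scale
        self.conv1 = nn.Conv2d(inplanes, width * scale, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width * scale)
        # One 3x3 conv per group, except the last group which passes through.
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1))
        self.bns = nn.ModuleList(nn.BatchNorm2d(width) for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(width * scale, planes * self.expansion,
                               kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        # Split channels into `scale` groups and process them hierarchically:
        # group i receives the output of group i-1 before its own 3x3 conv.
        spx = torch.split(out, self.width, dim=1)
        ys = []
        for i in range(self.scale - 1):
            sp = spx[i] if i == 0 else sp + spx[i]
            sp = self.relu(self.bns[i](self.convs[i](sp)))
            ys.append(sp)
        ys.append(spx[-1])  # the last group is concatenated unprocessed
        out = self.bn3(self.conv3(torch.cat(ys, dim=1)))
        return self.relu(out + x)  # residual add; assumes matching shapes
```

For instance, `Bottle2neck(256, 64)` mirrors a non-downsampling stage-2 block of the 26w×4s setting: each group's 3×3 convolution is applied after adding the previous group's output, so later groups see increasingly large receptive fields.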

| Model | #Params | GFLOPs | top-1 err. (%) | top-5 err. (%) | Link |
|---|---|---|---|---|---|
| Res2Net-50-48w-2s | 25.29M | 4.2 | 22.68 | 6.47 | OneDrive |
| Res2Net-50-26w-4s | 25.70M | 4.2 | 22.01 | 6.15 | OneDrive |
| Res2Net-50-14w-8s | 25.06M | 4.2 | 21.86 | 6.14 | OneDrive |
| Res2Net-50-26w-6s | 37.05M | 6.3 | 21.42 | 5.87 | OneDrive |
| Res2Net-50-26w-8s | 48.40M | 8.3 | 20.80 | 5.63 | OneDrive |
| Res2Net-101-26w-4s | 45.21M | 8.1 | 20.81 | 5.57 | OneDrive |
| Res2NeXt-50 | 24.67M | 4.2 | 21.76 | 6.09 | OneDrive |
| Res2Net-DLA-60 | 21.15M | 4.2 | 21.53 | 5.80 | OneDrive |
| Res2NeXt-DLA-60 | 17.33M | 3.6 | 21.55 | 5.86 | OneDrive |
| Res2Net-v1b-50 | 25.72M | 4.5 | 19.73 | 4.96 | Link |
| Res2Net-v1b-101 | 45.23M | 8.3 | 18.77 | 4.64 | Link |

Download links via Baidu Disk are also available (password: vbix).
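As a usage sketch, a pretrained model can be loaded as a drop-in classifier. This assumes `res2net.py` from https://github.com/gasvn/Res2Net is on the Python path; the entry-point name follows that repo and may differ in other ports.

```python
import torch
from res2net import res2net50_26w_4s  # entry point from the gasvn/Res2Net repo

model = res2net50_26w_4s(pretrained=True)  # downloads ImageNet weights
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # standard ImageNet input size
print(logits.shape)  # torch.Size([1, 1000])
```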

3.2 Pose estimation

Pose estimation requires localizing person keypoints in challenging, uncontrolled conditions. The task involves simultaneously detecting people and localizing their keypoints.

We use Simple Baselines as the baseline method for Pose Estimation. Source code is available at https://github.com/gasvn/Res2Net-Pose-Estimation .

Results on COCO val2017, using a person detector with an AP of 56.4 on COCO val2017.

| Arch | Input size | AP | AP .5 | AP .75 | AP (M) | AP (L) |
|---|---|---|---|---|---|---|
| pose_resnet_50 | 256×192 | 0.704 | 0.886 | 0.783 | 0.671 | 0.772 |
| pose_res2net_50 | 256×192 | 0.737 | 0.925 | 0.814 | 0.708 | 0.782 |
| pose_resnet_101 | 256×192 | 0.714 | 0.893 | 0.793 | 0.681 | 0.781 |
| pose_res2net_101 | 256×192 | 0.744 | 0.926 | 0.826 | 0.720 | 0.785 |

3.3 Instance segmentation

We use Mask R-CNN, as implemented in maskrcnn-benchmark, as the baseline method for instance segmentation and object detection. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .

Instance segmentation is the combination of object detection and semantic segmentation. It requires not only the correct detection of objects with various sizes in an image but also the precise segmentation of each object.

Performance on Instance segmentation:

| Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 64w | 33.9 | 55.2 | 36.0 | 14.8 | 36.0 | 50.9 |
| ResNet-50 | 48w×2s | 34.2 | 55.6 | 36.3 | 14.9 | 36.8 | 50.9 |
| Res2Net-50 | 26w×4s | 35.6 | 57.6 | 37.6 | 15.7 | 37.9 | 53.7 |
| Res2Net-50 | 18w×6s | 35.7 | 57.5 | 38.1 | 15.4 | 38.1 | 53.7 |
| Res2Net-50 | 14w×8s | 35.3 | 57.0 | 37.5 | 15.6 | 37.5 | 53.4 |
| ResNet-101 | 64w | 35.5 | 57.0 | 37.9 | 16.0 | 38.2 | 52.9 |
| Res2Net-101 | 26w×4s | 37.1 | 59.4 | 39.4 | 16.6 | 40.0 | 55.6 |

3.4 Object detection

As in Sec. 3.3, we use Mask R-CNN from maskrcnn-benchmark as the baseline. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .

Performance on Object detection:

| Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 64w | 37.5 | 58.4 | 40.3 | 20.6 | 40.1 | 49.7 |
| ResNet-50 | 48w×2s | 38.0 | 58.9 | 41.3 | 20.5 | 41.0 | 49.9 |
| Res2Net-50 | 26w×4s | 39.6 | 60.9 | 43.1 | 22.0 | 42.3 | 52.8 |
| Res2Net-50 | 18w×6s | 39.9 | 60.9 | 43.3 | 21.8 | 42.8 | 53.7 |
| Res2Net-50 | 14w×8s | 39.1 | 60.2 | 42.1 | 21.7 | 41.7 | 52.8 |
| ResNet-101 | 64w | 39.6 | 60.6 | 43.2 | 22.0 | 43.2 | 52.4 |
| Res2Net-101 | 26w×4s | 41.8 | 62.6 | 45.6 | 23.4 | 45.5 | 55.6 |

3.5 Salient object detection

Precisely locating the salient object regions in an image requires an understanding of both large-scale context information for the determination of object saliency, as well as small-scale features to localize object boundaries accurately.

We use PoolNet (CVPR 2019) as the baseline method for salient object detection. Source code is available at https://github.com/gasvn/Res2Net-PoolNet .

Results on salient object detection datasets, without joint training with edge supervision. Models are trained on DUTS-TR. Each cell reports MaxF & MAE.

| Backbone | ECSSD | PASCAL-S | DUT-O | HKU-IS | SOD | DUTS-TE |
|---|---|---|---|---|---|---|
| vgg | 0.936 & 0.047 | 0.857 & 0.078 | 0.817 & 0.058 | 0.928 & 0.035 | 0.859 & 0.115 | 0.876 & 0.043 |
| resnet50 | 0.940 & 0.042 | 0.863 & 0.075 | 0.830 & 0.055 | 0.934 & 0.032 | 0.867 & 0.100 | 0.886 & 0.040 |
| res2net50 | 0.947 & 0.036 | 0.871 & 0.070 | 0.837 & 0.052 | 0.936 & 0.031 | 0.885 & 0.096 | 0.892 & 0.037 |

3.6 Semantic segmentation

Semantic segmentation results of DeepLab v3+ using ResNet/Res2Net as the backbone model.
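The training details are in the linked code; as a rough, untested sketch of the standard dilated-backbone recipe (not necessarily the exact configuration used here), the last two stages of Res2Net-50 can be converted to stride 1 with dilated 3×3 convolutions, giving an output-stride-8 feature map for DeepLab v3+. Layer names assume the torchvision-style layout of the released Res2Net code.

```python
import torch.nn as nn
from res2net import res2net50_26w_4s  # assumed layout: layer1..layer4

def res2net50_output_stride_8(pretrained=True):
    net = res2net50_26w_4s(pretrained=pretrained)
    for layer, d in ((net.layer3, 2), (net.layer4, 4)):
        for m in layer.modules():
            # Remove spatial downsampling (strided convs and the block's avg-pool)...
            if isinstance(m, (nn.Conv2d, nn.AvgPool2d)) and m.stride in (2, (2, 2)):
                m.stride = (1, 1) if isinstance(m, nn.Conv2d) else 1
            # ...and compensate with dilation so receptive fields are preserved.
            if isinstance(m, nn.Conv2d) and m.kernel_size == (3, 3):
                m.dilation, m.padding = (d, d), (d, d)
    return net
```

Because no parameter shapes change, the original ImageNet-pretrained checkpoint loads unmodified.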

3.7 Detection benchmark (mmdetection)

| Backbone | Params | GFLOPs | box AP |
|---|---|---|---|
| R-101-FPN | 60.52M | 283.14 | 39.4 |
| X-101-64x4d-FPN | 99.25M | 440.36 | 41.3 |
| HRNetV2p-W48 | 83.36M | 459.66 | 41.5 |
| Res2Net-101 | 61.18M | 293.68 | 42.3 |
Comparison of Faster R-CNN based detection. The Res2Net-based method achieves better results with significantly less computation and a smaller memory footprint. See more results for Mask R-CNN, Cascade R-CNN, Cascade Mask R-CNN, and Hybrid Task Cascade in the mmdetection benchmark.
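For reference, a minimal mmdetection v2 config fragment that swaps the Faster R-CNN backbone for Res2Net-101 might look as follows. The field names follow mmdetection's Res2Net support, but treat the base-config path and the checkpoint key as assumptions and check the repo's configs/res2net directory for the shipped values.

```python
# Hedged sketch: Faster R-CNN with a Res2Net-101 backbone in mmdetection v2;
# everything else is inherited from the stock R-50 config.
_base_ = '../faster_rcnn/faster_rcnn_r50_fpn_2x_coco.py'  # path is an assumption

model = dict(
    pretrained='open-mmlab://res2net101_v1d_26w_4s',  # ImageNet weights key (assumed)
    backbone=dict(
        type='Res2Net',   # registered backbone name in mmdetection v2
        depth=101,
        scales=4,         # the 's' in 26w×4s
        base_width=26))   # the 'w' in 26w×4s
```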

3.8 Vectorized road extraction

Vectorized road extraction from Tan et al., CVPR 2020.

3.9 Interactive image segmentation

Interactive image segmentation from Lin et al., CVPR 2020. To reach a given accuracy, the new method requires nearly half as many user interactions as the previous strongest method!

3.10 Tumor segmentation on CT scans (from Sun et al. 2019)

Tumor segmentation on CT scans. From: Sun et al. 2019.

3.11 Person Re-ID (from Cao et al.)

Cao et al. use Res2Net to significantly boost the performance of Re-ID applications.

3.12 Single-stage object detection (from Chen et al.)

Chen et al. use Res2Net for one-stage object detection for CPU-only devices.

3.13 Depth prediction (from Weida Yang)

Weida Yang uses Res2Net to obtain impressive depth prediction results.

3.14 Semantic image to photo-realistic image translation

SemanticGAN from Liu et al. 2020.

3.15 Res2NetPlus for solar panel detector

A solar panel detector for satellite imagery, developed by Less Wright, who found Res2Net-50 to give both higher accuracy (+5%) and steadier training. See also his blog or a Chinese translation of Wright's blog.

3.16 Speaker Verification

Zhou et al. (IEEE SLT 2021) found that ResNeXt and Res2Net significantly outperform the conventional ResNet model for speaker verification; the Res2Net model achieved superior performance, reducing the EER by 18.5% relative. Experiments on two further internal test sets with mismatched conditions confirmed the generalization of the ResNeXt and Res2Net architectures to noisy environments and segment-length variations.

3.17 Protein Structure Prediction

Su et al. (Advanced Science 2021) found that Res2Net achieves superior performance for protein structure prediction.

Comments
freshkfm

Hello authors, I recently came across this paper and, after reading it carefully, greatly admire it. May I ask how this novel idea came about? I have also read the FractalNet paper, roughly contemporaneous with DenseNet, which is somewhat similar to this scale idea, except that there is no interaction between the channels. Also, the cardinality in ResNeXt can be implemented with group convolution, but training becomes much slower; how can that be optimized? Thanks.

Gao Shanghua

Hello. The initial idea of Res2Net came from asking how to extract multi-scale features efficiently and at low cost. Through a grouping mechanism similar to ResNeXt, we process the features of the different groups within the same level hierarchically at a fine granularity, achieving a combinatorial explosion of receptive-field combinations at low computational overhead. FractalNet and DenseNet fuse features at a coarse, cross-block granularity, and each subsequent node heavily reuses features from earlier layers, which causes feature redundancy and repeated computation, hence relatively high computational complexity and limited parameter efficiency. Intuitively, under the dense connections of FractalNet and DenseNet every intermediate node receives all preceding features, whereas Res2Net fuses features hierarchically after grouping, so not every intermediate node receives all the preceding feature information.

Group convolution appears slow because after grouping we can use more channels under the same parameter and computation budget, but due to optimization limitations of current hardware, I/O rather than computation becomes the new bottleneck. On customized devices and newer hardware this problem will be resolved.

czq

I am a beginner in deep learning and would like to ask about the DeepLab implementation with Res2Net. Are the strides of the last two stages also set to 1? And regarding pretrained weights: when the strides of the last two stages are changed for the DeepLab implementation, can the original pretrained model still be used?

czq

Thank you very much for your answer.

afei
Dear authors, hello. I am new to image recognition and was deeply inspired by reading Res2Net, but there is one thing I don't understand: what does D refer to in D = int(math.floor(planes * (baseWidth / 64.0)))?
freshkfm

I have been reading this paper recently as well. You should refer to the ResNeXt paper here: D should be the number of convolution filters in each group, and the C mentioned below is the cardinality, i.e., the number of groups.

solauky

May I ask whether the v1b version has been integrated into mmdetection yet? Thanks!

solauky

In mmdetection I can only see the v1d version? I ran a comparison and it could not beat HRNet-W40; I had assumed the comparison numbers listed here were measured with v1b.

solauky

It may be related to the scale distribution of my samples. I tested on a building dataset with a high proportion of small objects. With a 2x schedule, Res2Net-101's AP came out about 1-2 points lower than HRNet-W40, though with lower GPU memory usage and training time than HRNet. I then tried 152+dnv, which was still somewhat lower; and once the network was deepened, GPU memory usage and training time became higher than HRNet.

Wang Shiqin

May I ask: for the salient object detection task, Res2Net addresses the multi-scale problem, but how large a range of scales does it cover? Can Res2Net solve the small-object problem?

kj172

Have you compared Res2Net with ResNet-34 on segmentation performance? Which one is better?

MM Cheng

Updated semantic segmentation results are available on our code page; on both mmdetection and detectron2 it achieves performance clearly better than other methods.

Kelvin

Hi there, may I know how you perform the addition of the separated channels before K3, K4, etc.?

Thank you.

Kelvin

Thank you so much, Mr. Gao. I apologize for my silly question; I did not realise that padding was needed in the convolutional operations. Take care! Appreciate the prompt reply!

ezra

Can the FLOPs of these models be computed precisely? Is there a convenient tool for this?
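One convenient option is a third-party profiler such as thop; a minimal sketch, assuming `pip install thop` and the repo's `res2net.py` on the path (note that thop reports multiply-accumulates, which many papers call FLOPs):

```python
import torch
from thop import profile  # third-party profiler, pip install thop
from res2net import res2net50_26w_4s

model = res2net50_26w_4s(pretrained=False)
dummy = torch.randn(1, 3, 224, 224)  # one ImageNet-sized input
macs, params = profile(model, inputs=(dummy,))
print(f"MACs: {macs / 1e9:.2f}G, Params: {params / 1e6:.2f}M")
```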

kururu

Why are the features split into groups? Is there any rationale for this?

kururu

Thanks for the prompt reply. The hierarchical connections within a layer strengthen the network's multi-scale representation, but in terms of merely increasing the number of features, it seems to only enhance the original features. Is there a more detailed theoretical explanation? From the perspective of kernel scales, I don't think any new features are introduced.

kururu

Thanks for the explanation. Could you elaborate more concretely on what "increasing the richness of receptive fields" means?

TJM

A question about efficiency: this architecture clearly increases the model's width, so will it take longer to run than ResNet-50?