Res2Net: A New Multi-scale Backbone Architecture
Shanghua Gao 1, Ming-Ming Cheng 1, Kai Zhao 1, Xin-Yu Zhang 1, Ming-Hsuan Yang 2, Philip Torr 3
1TKLNDST, CS, Nankai University 2UC, Merced 3University of Oxford
1. Abstract
Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g. ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g. CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e. object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
Source Code and pre-trained model: https://github.com/Res2Net
2. Paper
- Res2Net: A New Multi-scale Backbone Architecture, Shang-Hua Gao#, Ming-Ming Cheng#*, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr, IEEE TPAMI, 43(2):652-662, 2021. [pdf | code | project | PPT | bib | Chinese translation | LaTeX]
3. Applications
Res2Net has proven useful in almost all computer vision applications we have tried so far. If you find it useful in your application and want to share it with others, please contact us to add a link on this project page.
News:
- 2020.10.20 PaddlePaddle version Res2Net achieves 85.13% top-1 acc. on ImageNet: PaddlePaddle Res2Net.
- 2020.8.21 An online demo for detection and segmentation using Res2Net is released: http://mc.nankai.edu.cn/res2net-det
- 2020.7.29 The training code of Res2Net on ImageNet is released: https://github.com/Res2Net/Res2Net-ImageNet-Training (non-commercial use only)
- 2020.6.1 Res2Net is now in the official model zoo of the new deep learning framework Jittor.
- 2020.5.21 Res2Net is now one of the basic backbones in the MMDetection v2 framework https://github.com/open-mmlab/mmdetection. Using MMDetection v2 with Res2Net achieves better performance at lower computational cost.
- 2020.5.11 Res2Net achieves about a 2% performance gain on panoptic segmentation based on detectron2, with no tricks.
- 2020.3.14 The Res2Net backbone allows the latest interactive segmentation method to significantly reduce the number of required user interactions compared with the best reported results.
- 2020.2.24 Our Res2Net_v1b achieves better detection performance on the popular mmdetection platform, outperforming the previous best results achieved by the HRNet backbone while using only about 50% of the parameters and computation!
- 2020.2.21 Pretrained models of Res2Net_v1b are released, with more than 2% improvement on ImageNet top-1 acc. compared with the PAMI version of Res2Net!
3.1 Classification
The Res2Net module can replace the bottleneck block with no other modifications, as the sketch below illustrates.
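Below is a minimal PyTorch sketch of the hierarchical split-and-add scheme from the paper (y1 = x1, yi = Ki(xi + y(i-1))); the class name `Res2NetBottleneck` and its defaults are ours for illustration, and the official implementation at https://github.com/gasvn/Res2Net differs in details such as stride handling and which split is the identity branch.

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Minimal sketch of a Res2Net block with scale s and width w per split."""

    def __init__(self, channels, width=26, scale=4):
        super().__init__()
        self.scale = scale
        mid = width * scale
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)  # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(mid)
        # one 3x3 conv per split, except the first (identity) split
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False)
            for _ in range(scale - 1))
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(width) for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(mid, channels, 1, bias=False)  # 1x1 expand
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        splits = torch.chunk(out, self.scale, dim=1)  # s splits of w channels
        ys = [splits[0]]                              # y1 = x1 (identity)
        for i, (conv, bn) in enumerate(zip(self.convs, self.bns)):
            inp = splits[i + 1] if i == 0 else splits[i + 1] + ys[-1]
            ys.append(self.relu(bn(conv(inp))))       # yi = Ki(xi + y_{i-1})
        out = self.bn3(self.conv3(torch.cat(ys, dim=1)))
        return self.relu(out + x)                     # residual connection

# e.g. Res2NetBottleneck(256)(torch.randn(1, 256, 56, 56)) keeps the shape.
```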
We have integrated the Res2Net module into many state-of-the-art backbone networks: ResNet, ResNeXt, DLA, SE-Net, and Big-Little Net. Source code for these backbone models is available at https://github.com/gasvn/Res2Net .
Model | #Params | GFLOPs | Top-1 err. (%) | Top-5 err. (%) | Link |
---|---|---|---|---|---|
Res2Net-50-48w-2s | 25.29M | 4.2 | 22.68 | 6.47 | OneDrive |
Res2Net-50-26w-4s | 25.70M | 4.2 | 22.01 | 6.15 | OneDrive |
Res2Net-50-14w-8s | 25.06M | 4.2 | 21.86 | 6.14 | OneDrive |
Res2Net-50-26w-6s | 37.05M | 6.3 | 21.42 | 5.87 | OneDrive |
Res2Net-50-26w-8s | 48.40M | 8.3 | 20.80 | 5.63 | OneDrive |
Res2Net-101-26w-4s | 45.21M | 8.1 | 20.81 | 5.57 | OneDrive |
Res2NeXt-50 | 24.67M | 4.2 | 21.76 | 6.09 | OneDrive |
Res2Net-DLA-60 | 21.15M | 4.2 | 21.53 | 5.80 | OneDrive |
Res2NeXt-DLA-60 | 17.33M | 3.6 | 21.55 | 5.86 | OneDrive |
Res2Net-v1b-50 | 25.72M | 4.5 | 19.73 | 4.96 | Link |
Res2Net-v1b-101 | 45.23M | 8.3 | 18.77 | 4.64 | Link |
Download links via Baidu Disk are now also available (password: vbix).
3.2 Pose estimation
We use Simple Baselines as the baseline method for pose estimation. Source code is available at https://github.com/gasvn/Res2Net-Pose-Estimation .
Results on COCO val2017, using a person detector with human AP of 56.4.
Arch | Input size | AP | AP .5 | AP .75 | AP (M) | AP (L) |
---|---|---|---|---|---|---|
pose_resnet_50 | 256×192 | 0.704 | 0.886 | 0.783 | 0.671 | 0.772 |
pose_res2net_50 | 256×192 | 0.737 | 0.925 | 0.814 | 0.708 | 0.782 |
pose_resnet_101 | 256×192 | 0.714 | 0.893 | 0.793 | 0.681 | 0.781 |
pose_res2net_101 | 256×192 | 0.744 | 0.926 | 0.826 | 0.720 | 0.785 |
3.3 Instance segmentation
We use Mask R-CNN, implemented in maskrcnn-benchmark, as the baseline method for instance segmentation and object detection. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .
Performance on instance segmentation:
Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
ResNet-50 | 64w | 33.9 | 55.2 | 36.0 | 14.8 | 36.0 | 50.9 |
ResNet-50 | 48w×2s | 34.2 | 55.6 | 36.3 | 14.9 | 36.8 | 50.9 |
Res2Net-50 | 26w×4s | 35.6 | 57.6 | 37.6 | 15.7 | 37.9 | 53.7 |
Res2Net-50 | 18w×6s | 35.7 | 57.5 | 38.1 | 15.4 | 38.1 | 53.7 |
Res2Net-50 | 14w×8s | 35.3 | 57.0 | 37.5 | 15.6 | 37.5 | 53.4 |
ResNet-101 | 64w | 35.5 | 57.0 | 37.9 | 16.0 | 38.2 | 52.9 |
Res2Net-101 | 26w×4s | 37.1 | 59.4 | 39.4 | 16.6 | 40.0 | 55.6 |
3.4 Object detection
As in Sec. 3.3, we use Mask R-CNN implemented in maskrcnn-benchmark as the baseline for object detection. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .
Performance on object detection:
Backbone | Setting | AP | AP50 | AP75 | APs | APm | APl |
---|---|---|---|---|---|---|---|
ResNet-50 | 64w | 37.5 | 58.4 | 40.3 | 20.6 | 40.1 | 49.7 |
ResNet-50 | 48w×2s | 38.0 | 58.9 | 41.3 | 20.5 | 41.0 | 49.9 |
Res2Net-50 | 26w×4s | 39.6 | 60.9 | 43.1 | 22.0 | 42.3 | 52.8 |
Res2Net-50 | 18w×6s | 39.9 | 60.9 | 43.3 | 21.8 | 42.8 | 53.7 |
Res2Net-50 | 14w×8s | 39.1 | 60.2 | 42.1 | 21.7 | 41.7 | 52.8 |
ResNet-101 | 64w | 39.6 | 60.6 | 43.2 | 22.0 | 43.2 | 52.4 |
Res2Net-101 | 26w×4s | 41.8 | 62.6 | 45.6 | 23.4 | 45.5 | 55.6 |
3.5 Salient object detection
We use PoolNet (CVPR 2019) as the baseline method for salient object detection. Source code is available at https://github.com/gasvn/Res2Net-PoolNet .
Results on salient object detection datasets, without joint training with edge supervision. Models are trained on DUTS-TR.
Backbone | ECSSD (MaxF / MAE) | PASCAL-S (MaxF / MAE) | DUT-O (MaxF / MAE) | HKU-IS (MaxF / MAE) | SOD (MaxF / MAE) | DUTS-TE (MaxF / MAE) |
---|---|---|---|---|---|---|
vgg | 0.936 / 0.047 | 0.857 / 0.078 | 0.817 / 0.058 | 0.928 / 0.035 | 0.859 / 0.115 | 0.876 / 0.043 |
resnet50 | 0.940 / 0.042 | 0.863 / 0.075 | 0.830 / 0.055 | 0.934 / 0.032 | 0.867 / 0.100 | 0.886 / 0.040 |
res2net50 | 0.947 / 0.036 | 0.871 / 0.070 | 0.837 / 0.052 | 0.936 / 0.031 | 0.885 / 0.096 | 0.892 / 0.037 |
3.6 Semantic segmentation
3.7 Detection benchmark (mmdetection)
Backbone | Params. | GFLOPs | box AP |
---|---|---|---|
R-101-FPN | 60.52M | 283.14 | 39.4 |
X-101-64x4d-FPN | 99.25M | 440.36 | 41.3 |
HRNetV2p-W48 | 83.36M | 459.66 | 41.5 |
Res2Net-101 | 61.18M | 293.68 | 42.3 |
3.8 Vectorized road extraction
3.9 Interactive image segmentation
3.10 Tumor segmentation on CT scans (from Sun et al. 2019)
3.11 Person Re-ID (from Cao et al.)
3.12 Single-stage object detection (from Chen et al.)
3.13 Depth prediction (from Weida Yang)
3.14 Semantic image to photo-realistic image translation
3.15 Res2NetPlus for solar panel detection
A solar panel detector from satellite imagery, developed by Less Wright, who found Res2Net50 to deliver both greater accuracy (+5%) and steadier training. See also his blog or a Chinese translation of it.
3.16 Speaker Verification
Zhou et al. (IEEE SLT 2021) found that ResNeXt and Res2Net significantly outperform the conventional ResNet model. The Res2Net model achieved superior performance, reducing the EER by 18.5% relative. Experiments on two other internal test sets with mismatched conditions further confirmed the generalization of the ResNeXt and Res2Net architectures to noisy environments and segment-length variations.
Hello authors, I recently came across this paper and was greatly impressed after reading it carefully. May I ask how this novel idea originated? Also, I previously read the FractalNet paper, roughly contemporary with densenet, which is somewhat similar to this scale idea, except that there is no interaction between the channels. In addition, the cardinality in resnext can be implemented with group_convolution, but training becomes much slower; how can this be optimized? Thanks.
Hi, the original idea of res2net came from asking how to extract multi-scale features efficiently and at low cost. Using a group-splitting mechanism similar to resnext, we process the features of different groups within the same layer in a hierarchical, progressive manner at a fine-grained level, achieving a combinatorial explosion of receptive fields at low computational cost. FractalNet and densenet fuse features at a coarse-grained, cross-block level, and each subsequent node heavily reuses features from earlier layers, which leads to feature redundancy and repeated computation, hence relatively high computational complexity and limited parameter efficiency. Intuitively, under the dense connections in fractalnet and densenet, every intermediate node receives all preceding features, whereas res2net fuses features hierarchically after grouping, so not every intermediate node receives all preceding feature information.
Group conv becomes slower because, after grouping, we can use more channels under the same parameter and computation budget, but due to optimization issues on current hardware, I/O rather than computation becomes the new bottleneck, which makes group conv appear slow. On customized devices and newer hardware this problem will be resolved.
I am a beginner in deep learning and would like to ask about the deeplab implementation with res2net. Is the stride of the last two stages also set to 1? Regarding the pretrained weights: when implementing deeplab and changing the stride of the last two stages, can the original pretrained model still be used?
Same as resnet, the last two stages also use stride 1. After changing the stride, the original pretrained model is still used (a minimal sketch follows below).
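To illustrate the above, a minimal sketch assuming a torchvision-style backbone whose last two stages are named `layer3` and `layer4` (as in the Res2Net reference code); the function name is ours, and in the v1b variant the downsample path also contains a stride-2 average pooling, handled by the last branch:

```python
import torch.nn as nn

def to_stride1_dilated(model):
    """Convert the last two stages to stride 1 with dilation (DeepLab-style).

    Weight shapes are unchanged, so the original ImageNet-pretrained
    state dict can still be loaded as-is.
    """
    for stage_name, dilation in (("layer3", 2), ("layer4", 4)):
        for m in getattr(model, stage_name).modules():
            if isinstance(m, nn.Conv2d):
                if m.stride == (2, 2):
                    m.stride = (1, 1)                  # drop the downsampling
                if m.kernel_size == (3, 3):
                    m.dilation = (dilation, dilation)  # keep receptive field
                    m.padding = (dilation, dilation)
            elif isinstance(m, (nn.AvgPool2d, nn.MaxPool2d)) and m.stride == 2:
                m.stride = 1                           # v1b downsample pooling
    return model
```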
Thank you very much for your answer.
I have also been reading this paper recently. You should refer to the resnext paper here: d should denote the number of filters per group, and the C below is the cardinality, i.e. the number of groups.
May I ask whether the v1b version has been integrated into mmdetection yet? Thanks!
The version currently integrated into mmdetection is v1b.
I only see a v1d version in mmdetection? I ran a comparison and it could not beat HRNet40; I had assumed the listed comparison numbers were measured with v1b.
v1d is just the name mmdetection gave it. As for "I ran a comparison and it could not beat HRNet40": I am not sure about the hrnet40 configuration; can you reproduce the res2net and hrnet results under the mmdetection framework? The results I measured earlier (i.e. the results reported on mmdetection) show that res2net performs somewhat better than hrnet40 with far fewer parameters.
https://github.com/open-mmlab/mmdetection/tree/master/configs/hrnet
https://github.com/open-mmlab/mmdetection/tree/master/configs/res2net
It may be related to the scale distribution of the samples. I tested on a building dataset with a high proportion of small objects. With a 2x schedule, res2net-101 came out about 1-2 AP points lower than HRNet-W40, but used less GPU memory and training time than HRNet. I later tried 152+dnv as well, and it was still somewhat lower. Moreover, once the network got deeper, GPU memory and training time actually exceeded HRNet's.
For the salient object detection task, Res2Net addresses the multi-scale problem. How large a scale range does it cover? Can Res2Net also handle the small-object problem?
Res2Net can strengthen its representation of large-scale objects by increasing the number of scales. As for small objects, Res2Net keeps one branch as a direct (identity) connection, which preserves its ability to handle small objects. On the salient object detection task, replacing ResNet with Res2Net effectively improves detection accuracy without any extra computational cost.
Have you compared this res2net with resnet34 on segmentation performance? Which one is better?
Updated semantic segmentation results are available on our code homepage; on both mmdetection and detectron2, Res2Net clearly outperforms the other methods.
Hi there, may I know how you do the addition of the separated channels before K3, K4, etc.?
Thank you.
The channel numbers of the different splits are equal, so we can simply add them; see the shape sketch below. You can refer to more details in: https://github.com/Res2Net/Res2Net-PretrainedModels/blob/master/res2net_v1b.py
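To make this concrete, a tiny shape-level sketch for the 26w × 4s setting (names and numbers are just for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 104, 56, 56)            # output of the 1x1 conv, 26w x 4s
s1, s2, s3, s4 = torch.chunk(x, 4, dim=1)  # four equal 26-channel splits
conv = nn.Conv2d(26, 26, 3, padding=1)     # padding=1 preserves the 56x56 size
y2 = conv(s2)                              # (1, 26, 56, 56)
y3 = conv(s3 + y2)                         # identical shapes, so '+' is valid
```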
Thank you so much, Mr. Gao. I apologize for my silly question; I did not realise that padding was needed in the convolutional operations. Take care! Appreciate the prompt reply!
Can the FLOPs of these models be computed precisely? Is there a convenient tool?
I suggest searching GitHub for "pytorch flops"; you will find many measurement tools (see the sketch below for one example).
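For example, a minimal sketch using the third-party thop package (`pip install thop`); torchvision's resnet50 here is only a stand-in for whichever model you want to measure:

```python
import torch
from thop import profile
from torchvision.models import resnet50  # stand-in; substitute your Res2Net

model = resnet50()
macs, params = profile(model, inputs=(torch.randn(1, 3, 224, 224),))
print(f"GMACs: {macs / 1e9:.2f}, Params: {params / 1e6:.2f}M")
```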
Why group the features? Is there any justification?
Group conv and even depthwise conv have already demonstrated that feature grouping helps performance. On top of feature grouping, this paper adds hierarchical connections within a layer, further strengthening the network's multi-scale representation ability.
Thanks for the prompt reply. Intra-layer hierarchical connections enhance the network's multi-scale representation ability, but in terms of feature count this seems to merely enhance the original features; is there a more detailed theoretical explanation? From the perspective of kernel scale, it seems no new features are introduced.
Our goal is to increase the network's multi-scale representation ability, i.e. the richness of its receptive fields. You are also welcome to improve this work from the angle of adding new features.
Thanks for the clarification. Could you explain "increasing the richness of receptive fields" more concretely?
A feature passing through a bottleneck block has only two receptive-field possibilities, 3×3 and 1×1. In a res2net module, at every step a feature either passes through the next 3×3 conv or is concatenated directly with the other features, i.e. two possibilities per step; thanks to the hierarchical structure, the number of possible receptive fields grows exponentially, which is what we mean by increasing the richness of receptive fields (see the sketch below).
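A small sketch of this counting argument (our illustration, not code from the paper): within one module, split i has passed through i stacked 3×3 convs, so a single block already exposes `scale` distinct receptive-field sizes, and stacking blocks multiplies the combinations.

```python
def split_receptive_fields(scale):
    """Receptive field of split i after i stacked 3x3 convs: 1 + 2*i."""
    return [1 + 2 * i for i in range(scale)]

print(split_receptive_fields(4))  # [1, 3, 5, 7] vs. a single 3 in a bottleneck
```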
Regarding efficiency: this architecture apparently increases the model's width, so would it take more time than resnet50?
The computation of a convolution is proportional to (input channels × output channels). Compared with the bottleneck structure, the res2net structure decomposes one 3×3 conv with large input/output channel counts into several convs with much smaller channel counts, so it does not actually widen the model. On the contrary, thanks to this efficient connectivity, it can substantially improve performance at the same computational cost, or substantially reduce computation while matching resnet's performance (see the arithmetic sketch below). Several follow-up works have applied the res2net structure to lightweight segmentation and detection tasks, all with high efficiency.
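A back-of-the-envelope sketch of this decomposition, using the first-stage channel counts of ResNet-50 (one 64-channel 3×3 conv) versus Res2Net-50-26w×4s (three 26-channel 3×3 convs); per output pixel, conv multiply-adds scale with k² × C_in × C_out:

```python
bottleneck_3x3 = 9 * 64 * 64        # one wide 3x3 conv in a ResNet-50 block
res2net_3x3 = 3 * 9 * 26 * 26       # three narrow 3x3 convs in 26w x 4s
print(bottleneck_3x3, res2net_3x3)  # 36864 vs 18252 mult-adds per pixel
```

The 1×1 convs grow somewhat in exchange (64 vs. 104 internal channels), which is where the saved budget goes, keeping the overall cost comparable (~4.2 GFLOPs in the classification table above).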