Res2Net: A New Multi-scale Backbone Architecture

Shanghua Gao 1, Ming-Ming Cheng 1, Kai Zhao 1, Xin-Yu Zhang 1, Ming-Hsuan Yang 2, Philip Torr 3

1 TKLNDST, CS, Nankai University      2 UC Merced      3 University of Oxford

Figure 1. We propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g., ResNet, ResNeXt, BigLittleNet, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models.

1. Abstract

Representing features at multiple scales is of great importance for numerous vision tasks. Recent advances in backbone convolutional neural networks (CNNs) continually demonstrate stronger multi-scale representation ability, leading to consistent performance gains on a wide range of applications. However, most existing methods represent the multi-scale features in a layerwise manner. In this paper, we propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer. The proposed Res2Net block can be plugged into the state-of-the-art backbone CNN models, e.g. ResNet, ResNeXt, and DLA. We evaluate the Res2Net block on all these models and demonstrate consistent performance gains over baseline models on widely-used datasets, e.g. CIFAR-100 and ImageNet. Further ablation studies and experimental results on representative computer vision tasks, i.e. object detection, class activation mapping, and salient object detection, further verify the superiority of the Res2Net over the state-of-the-art baseline methods.
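As a rough illustration of how the hierarchical connections widen the range of receptive fields, the sketch below tracks the largest receptive field reached by each split inside one block. It assumes the formulation given in the paper (the input is split into s groups x_1..x_s, with y_1 = x_1, y_2 = K_2(x_2), and y_i = K_i(x_i + y_{i-1}) for i > 2) and assumes every K_i is a stride-1 3×3 convolution. The function name is hypothetical and this is not the authors' code.

```python
def res2net_receptive_fields(scale, kernel_size=3):
    """Largest receptive field (pixels per side) of each split output y_i.

    Assumed formulation, with s = scale splits x_1..x_s:
        y_1 = x_1
        y_2 = K_2(x_2)
        y_i = K_i(x_i + y_{i-1})   for 2 < i <= s
    Each K_i is taken to be a stride-1 kernel_size x kernel_size conv.
    Fields are measured relative to the split input of the block.
    """
    if scale < 2:
        raise ValueError("this sketch assumes at least two splits")
    fields = [1]   # y_1 is x_1 passed through unchanged
    prev = 1       # largest receptive field among the inputs of the next K_i
    for _ in range(2, scale + 1):
        # A stride-1 k x k conv widens the receptive field by k - 1.
        prev = prev + (kernel_size - 1)
        fields.append(prev)
    return fields


if __name__ == "__main__":
    for s in (2, 4, 8):
        print(f"scale={s}: receptive fields per split = {res2net_receptive_fields(s)}")
```

Under these assumptions, a block with four splits mixes features with receptive fields of 1, 3, 5, and 7 pixels in a single output, whereas a plain bottleneck with one 3×3 convolution produces only a 3-pixel field.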

Source Code and pre-trained model: https://github.com/Res2Net

2. Paper

  1. Res2Net: A New Multi-scale Backbone Architecture, Shang-Hua Gao#, Ming-Ming Cheng#*, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr, IEEE TPAMI, 43(2):652-662, 2021. [pdf | code | project | PPT | bib | Chinese translation | LaTeX]

3. Applications

Res2Net has proven useful in almost every computer vision application we have tried so far. If you find it useful in your own applications and want to share the results, please contact us and we will add a link on this project page.

News

  • 2020.10.20 The PaddlePaddle version of Res2Net achieves 85.13% top-1 accuracy on ImageNet: PaddlePaddle Res2Net.
  • 2020.8.21 An online demo for detection and segmentation using Res2Net is released: http://mc.nankai.edu.cn/res2net-det
  • 2020.7.29 The training code of Res2Net on ImageNet is released: https://github.com/Res2Net/Res2Net-ImageNet-Training (non-commercial use only)
  • 2020.6.1 Res2Net is now in the official model zoo of the new deep learning framework Jittor.
  • 2020.5.21 Res2Net is now one of the basic backbones in the MMDetection v2 framework: https://github.com/open-mmlab/mmdetection. Using MMDetection v2 with Res2Net achieves better performance at lower computational cost.
  • 2020.5.11 Res2Net achieves about a 2% performance gain on panoptic segmentation, based on detectron2 with no tricks.
  • 2020.3.14 The Res2Net backbone allows the latest interactive segmentation method to significantly reduce the number of required user interactions compared with the best reported results.
  • 2020.2.24 Our Res2Net_v1b achieves better detection performance on the popular mmdetection platform, outperforming the previous best results achieved by the HRNet backbone while using only about 50% of the parameters and computation!
  • 2020.2.21 Pretrained models of Res2Net_v1b offer more than 2% improvement in ImageNet top-1 accuracy over the PAMI version of Res2Net!

3.1 Classification

The Res2Net module can replace the bottleneck block with no other modification.

We have integrated the Res2Net module into many state-of-the-art backbone networks: ResNet, ResNeXt, DLA, SE-Net, and Big-Little Net. Source code for these backbone models is available at https://github.com/gasvn/Res2Net .
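To give a feel for why the swap leaves model size nearly unchanged, here is a minimal sketch that counts convolution weights in a standard bottleneck and in a Res2Net-style block. It assumes the layout 1×1 conv, then 3×3 conv(s), then 1×1 conv, counts only bias-free convolution weights (no BatchNorm, no shortcut projection), and assumes that a block with s splits uses s - 1 separate 3×3 convolutions because the first split is passed through unchanged. The helper names are hypothetical; the width formula is the one quoted in the comments on this page.

```python
import math


def conv_params(c_in, c_out, kernel):
    """Weight count of a bias-free kernel x kernel convolution."""
    return c_in * c_out * kernel * kernel


def bottleneck_params(inplanes, planes, expansion=4):
    """Standard bottleneck: 1x1 reduce, one 3x3, 1x1 expand."""
    return (conv_params(inplanes, planes, 1)
            + conv_params(planes, planes, 3)
            + conv_params(planes, planes * expansion, 1))


def res2net_block_params(inplanes, planes, base_width=26, scale=4, expansion=4):
    """Res2Net-style block (assumed layout): the 3x3 stage is split into
    `scale` groups of `width` channels, and all but the first get a 3x3 conv."""
    width = int(math.floor(planes * (base_width / 64.0)))
    return (conv_params(inplanes, width * scale, 1)
            + (scale - 1) * conv_params(width, width, 3)
            + conv_params(width * scale, planes * expansion, 1))


if __name__ == "__main__":
    # One block of the first stage, after the stage input has been expanded to 256 channels.
    print("bottleneck (64w):   ", bottleneck_params(256, 64))
    print("Res2Net (26w x 4s): ", res2net_block_params(256, 64))
```

Under these assumptions the two blocks come out at 69,632 and 71,500 convolution weights respectively, which is in line with the similar #Params figures reported for the 50-layer models.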

model               #Params  GFLOPs  top-1 error (%)  top-5 error (%)  Link
Res2Net-50-48w-2s   25.29M   4.2     22.68            6.47             OneDrive
Res2Net-50-26w-4s   25.70M   4.2     22.01            6.15             OneDrive
Res2Net-50-14w-8s   25.06M   4.2     21.86            6.14             OneDrive
Res2Net-50-26w-6s   37.05M   6.3     21.42            5.87             OneDrive
Res2Net-50-26w-8s   48.40M   8.3     20.80            5.63             OneDrive
Res2Net-101-26w-4s  45.21M   8.1     20.81            5.57             OneDrive
Res2NeXt-50         24.67M   4.2     21.76            6.09             OneDrive
Res2Net-DLA-60      21.15M   4.2     21.53            5.80             OneDrive
Res2NeXt-DLA-60     17.33M   3.6     21.55            5.86             OneDrive
Res2Net-v1b-50      25.72M   4.5     19.73            4.96             Link
Res2Net-v1b-101     45.23M   8.3     18.77            4.64             Link

The download link from Baidu Disk is now available. (Baidu Disk password: vbix)

3.2 Pose estimation

Pose estimation requires localizing person keypoints in challenging, uncontrolled conditions. The task involves simultaneously detecting people and localizing their keypoints.

We use Simple Baselines as the baseline method for pose estimation. Source code is available at https://github.com/gasvn/Res2Net-Pose-Estimation .

Results on COCO val2017, using a person detector with a human AP of 56.4 on COCO val2017:

Arch              Input size  AP     AP .5  AP .75  AP (M)  AP (L)
pose_resnet_50    256×192     0.704  0.886  0.783   0.671   0.772
pose_res2net_50   256×192     0.737  0.925  0.814   0.708   0.782
pose_resnet_101   256×192     0.714  0.893  0.793   0.681   0.781
pose_res2net_101  256×192     0.744  0.926  0.826   0.720   0.785

3.3 Instance segmentation

We use Mask R-CNN as the baseline method for instance segmentation and object detection, built on the maskrcnn-benchmark codebase. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .

Instance segmentation is the combination of object detection and semantic segmentation. It requires not only the correct detection of objects with various sizes in an image but also the precise segmentation of each object.

Performance on Instance segmentation:

Backbone     Setting  AP    AP50  AP75  APs   APm   APl
ResNet-50    64w      33.9  55.2  36.0  14.8  36.0  50.9
Res2Net-50   48w×2s   34.2  55.6  36.3  14.9  36.8  50.9
Res2Net-50   26w×4s   35.6  57.6  37.6  15.7  37.9  53.7
Res2Net-50   18w×6s   35.7  57.5  38.1  15.4  38.1  53.7
Res2Net-50   14w×8s   35.3  57.0  37.5  15.6  37.5  53.4
ResNet-101   64w      35.5  57.0  37.9  16.0  38.2  52.9
Res2Net-101  26w×4s   37.1  59.4  39.4  16.6  40.0  55.6

3.4 Object detection

We use Mask R-CNN as the baseline method for instance segmentation and object detection, built on the maskrcnn-benchmark codebase. Source code is available at https://github.com/gasvn/Res2Net-maskrcnn .

Performance on Object detection:

Backbone     Setting  AP    AP50  AP75  APs   APm   APl
ResNet-50    64w      37.5  58.4  40.3  20.6  40.1  49.7
Res2Net-50   48w×2s   38.0  58.9  41.3  20.5  41.0  49.9
Res2Net-50   26w×4s   39.6  60.9  43.1  22.0  42.3  52.8
Res2Net-50   18w×6s   39.9  60.9  43.3  21.8  42.8  53.7
Res2Net-50   14w×8s   39.1  60.2  42.1  21.7  41.7  52.8
ResNet-101   64w      39.6  60.6  43.2  22.0  43.2  52.4
Res2Net-101  26w×4s   41.8  62.6  45.6  23.4  45.5  55.6

3.5 Salient object detection

Precisely locating the salient object regions in an image requires both large-scale context information, to determine object saliency, and small-scale features, to localize object boundaries accurately.

We use PoolNet (CVPR 2019) as the baseline method for salient object detection. Source code is available at https://github.com/gasvn/Res2Net-PoolNet .

Results on salient object detection datasets, without joint training with edge detection. Models are trained on DUTS-TR.

Backbone   ECSSD          PASCAL-S       DUT-O          HKU-IS         SOD            DUTS-TE
           MaxF & MAE     MaxF & MAE     MaxF & MAE     MaxF & MAE     MaxF & MAE     MaxF & MAE
vgg        0.936 & 0.047  0.857 & 0.078  0.817 & 0.058  0.928 & 0.035  0.859 & 0.115  0.876 & 0.043
resnet50   0.940 & 0.042  0.863 & 0.075  0.830 & 0.055  0.934 & 0.032  0.867 & 0.100  0.886 & 0.040
res2net50  0.947 & 0.036  0.871 & 0.070  0.837 & 0.052  0.936 & 0.031  0.885 & 0.096  0.892 & 0.037

3.6 Semantic segmentation

Semantic segmentation results of DeepLab v3+ using ResNet/Res2Net as the backbone model.

3.7 Detection benchmark (mmdetection) tasks

Backbone         Params.  GFLOPs  box AP
R-101-FPN        60.52M   283.14  39.4
X-101-64x4d-FPN  99.25M   440.36  41.3
HRNetV2p-W48     83.36M   459.66  41.5
Res2Net-101      61.18M   293.68  42.3

Comparison of Faster R-CNN based detection. The Res2Net based method achieves the best box AP, with significantly less computation and a smaller memory footprint than the X-101-64x4d-FPN and HRNetV2p-W48 backbones. See the mmdetection benchmark for more results with Mask R-CNN, Cascade R-CNN, Cascade Mask R-CNN, and Hybrid Task Cascade.

3.8 Vectorized road extraction

Vectorized road extraction from Tan et al. (CVPR 2020).

3.9 Interactive image segmentation

Interactive image segmentation from Lin et al. (CVPR 2020). To reach a given accuracy, the new method requires nearly half as many user interactions as the previous best method!

3.10 Tumor segmentation on CT scans (from Sun et al. 2019)

Tumor segmentation on CT scans. From: Sun et al. 2019.

3.11 Person Re-ID (from Cao et al.)

Cao et al. use Res2Net to significantly boost performance of ReID applications.

3.12 Single-stage object detection (from Chen et al.)

Chen et al. use Res2Net for one-stage object detection on CPU-only devices.

3.13 Depth prediction (from Weida Yang)

Weida Yang uses Res2Net to obtain impressive depth prediction results.

3.14 Semantic image to photo-realistic image translation

SemanticGAN from Liu et al. 2020.

3.15 Res2NetPlus for solar panel detector

A solar panel detector for satellite imagery, developed by Less Wright, who found that Res2Net50 gave both higher accuracy (+5%) and steadier training. See also his blog, or a Chinese translation of it.

3.16 Speaker Verification

Zhou et al. (IEEE SLT 2021) found that ResNeXt and Res2Net significantly outperform the conventional ResNet model on speech data. The Res2Net model performed best, reducing the EER by 18.5% relative. Experiments on two other internal test sets with mismatched conditions further confirmed that the ResNeXt and Res2Net architectures generalize well to noisy environments and to variations in segment length.

3.17 Protein Structure Prediction

Su et al. (Advanced Science 2021) found that Res2Net achieves superior performance for protein structure prediction.


29 Comments
freshkfm

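Hello authors, I recently came across this paper and, after reading it carefully, I admire it greatly. How did such a novel idea come about? I also read the FractalNet paper a while ago, from roughly the same period as DenseNet. It has a somewhat similar notion of scale, except that its channels do not interact with one another. In addition, the cardinality in ResNeXt can be implemented with group convolution, but training speed drops considerably. How can this be optimized? Thanks.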

Gao Shanghua

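Hello. The original idea behind Res2Net came from asking how to extract multi-scale features efficiently and at low cost. Using a grouping mechanism similar to ResNeXt, we process the features of different groups within the same layer in a hierarchical, progressive manner at a fine-grained level, which yields a combinatorial explosion of receptive fields at low computational cost. FractalNet and DenseNet fuse features at a coarse granularity, across blocks, and each subsequent node makes heavy use of features from earlier layers. This leads to feature redundancy and repeated computation, so the computational complexity is relatively high and parameter efficiency is limited. Intuitively, under the dense connections in FractalNet and DenseNet every intermediate node receives all preceding features, whereas Res2Net fuses features hierarchically after grouping, so not every intermediate node receives all preceding feature information.

Group conv is slower because, after grouping, we can use more channels under the same parameter and computation budget. With the way current hardware is optimized, IO rather than computation then becomes the new bottleneck, so group conv appears slow. This problem will be resolved on customized devices and newer hardware.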
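An illustrative sketch of the budget argument in the reply above (not part of the original reply; the layer sizes and helper names are hypothetical). It compares a dense 3×3 convolution with a grouped one that has the same number of weights, and shows that the grouped layer reads and writes twice as many activation values for the same amount of arithmetic.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a bias-free (optionally grouped) convolution."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return (c_in // groups) * c_out * kernel * kernel


def activation_elems(channels, height, width):
    """Number of values in one feature map of shape (channels, height, width)."""
    return channels * height * width


if __name__ == "__main__":
    h = w = 56  # hypothetical feature-map size, kept the same for both layers

    dense = conv_params(64, 64, 3)                # 64 -> 64 channels, no groups
    grouped = conv_params(128, 128, 3, groups=4)  # 128 -> 128 channels, 4 groups

    # Same weights, and same multiply-accumulates since both run on h x w outputs.
    print("weights:", dense, grouped)
    print("MACs:   ", dense * h * w, grouped * h * w)

    # The grouped layer moves twice as many activations in and out.
    print("activations in+out:",
          2 * activation_elems(64, h, w),
          2 * activation_elems(128, h, w))
```

With equal arithmetic but double the memory traffic, the grouped layer does less computation per value moved, which is one way to read the remark that IO rather than computation becomes the bottleneck.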

czq

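I am a newcomer to deep learning and would like to ask about the DeepLab implementation of Res2Net. Is the stride of the last two layers also changed to 1? And regarding pretrained weights: when the stride of the last two layers is changed for the DeepLab implementation, do you still use the original pretrained model?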

czq

Thank you very much for your answer.

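afei

Dear author, hello. I am a beginner in image recognition, and reading Res2Net was very inspiring, but there is one thing I do not understand and would like to ask you about: what does D refer to in D = int(math.floor(planes * (baseWidth / 64.0)))?

freshkfm

I have been reading this paper recently too. Here you need to refer to the ResNeXt paper: D should be the number of convolution filters per group, and the C that follows is the cardinality, i.e. the number of groups.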
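A small worked example of the formula quoted above may help. It is a sketch only: the function name is hypothetical, and the assumption that the middle stage of a block has D × C channels in total follows the reading given in the reply (D channels per group, C groups), with the scale s playing the role of C in Res2Net.

```python
import math


def group_width(planes, base_width):
    """D in the quoted formula: channels per group for a stage with `planes` base channels."""
    return int(math.floor(planes * (base_width / 64.0)))


if __name__ == "__main__":
    # Res2Net-50-26w-4s style setting: base_width=26, 4 groups (scale).
    for planes in (64, 128, 256, 512):
        d = group_width(planes, base_width=26)
        print(f"planes={planes}: D={d}, total middle channels D*4={d * 4}")

    # ResNeXt-style setting with base_width=4 and cardinality C=32.
    d = group_width(64, base_width=4)
    print(f"ResNeXt-style: D={d}, D*C={d * 32}")
```

The constant 64 is the width of the first stage in a standard ResNet, so base_width is the per-group width at that stage and D scales up proportionally in deeper stages.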

solauky

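Has the v1b version not been integrated into mmdetection yet? Thanks.

solauky

So far I can only see the v1d version in mmdetection? I ran a comparison and it could not beat HRNet40. I had assumed the comparison numbers listed here were measured with v1b.

solauky

It may have something to do with the scale distribution of the samples. I tested on a building dataset in which small objects make up a large share. With the 2x schedule, Res2Net-101 ends up roughly 1 to 2 AP points below HRNet-W40, but uses less GPU memory and training time than HRNet. I later tried 152+dnv as well, and it was still somewhat lower. Moreover, once the network gets deeper, it actually uses more GPU memory and training time than HRNet.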

Wang Shiqin

For the salient object detection task, Res2Net addresses the multi-scale problem. How wide a range of scales does it cover? Can Res2Net handle small objects?

kj172

Have you compared Res2Net with ResNet34 on segmentation performance? Which one is better?

MM Cheng

Updated semantic segmentation results are available on our code homepage. On both mmdetection and detectron2, Res2Net performs clearly better than other methods.

Kelvin

Hi there, may I know how you do the addition of the separated channels before K3, K4, and so on?

Thank you.

Kelvin

Thank you so much Mr. Gao. I apologize for my silly question, as I did not realise that there was a padding needed in the convolutional operations. Take care! Appreciate the prompt reply!

ezra

Can the FLOPs of these models be computed precisely? Are there any convenient tools?
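For convolution layers the count can be worked out exactly from the layer shapes. The sketch below is not the authors' tooling: it counts multiply-accumulate operations (MACs) for one convolution using the standard output-size formula, with hypothetical helper names. Whether the GFLOPs column on this page counts MACs or twice that number is not stated here, so treat the convention as an assumption when comparing numbers.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(c_in, c_out, kernel, out_h, out_w, groups=1):
    """Multiply-accumulates of a (optionally grouped) convolution, ignoring bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return out_h * out_w * c_out * (c_in // groups) * kernel * kernel


if __name__ == "__main__":
    # Illustrative layer: a 7x7, stride-2 stem conv on a 224x224 RGB image.
    out = conv_output_size(224, kernel=7, stride=2, padding=3)
    macs = conv_macs(3, 64, 7, out, out)
    print(f"output {out}x{out}, {macs / 1e9:.3f} GMACs")
```

Summing this quantity over every convolution (plus the final fully connected layer) gives the total for a network; layers such as BatchNorm and ReLU contribute comparatively little and are often left out.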

kururu

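Why split the features into groups? Is there any basis for this?

kururu

Thank you for the prompt answer. The hierarchical connections within a layer strengthen the network's multi-scale representation ability, but in terms of the number of features this seems only to enhance the original features. Is there a more detailed theoretical explanation? It seems to me that, in terms of convolution kernel scale, no new features are added.

kururu

Thank you for clearing that up. Could you explain in more detail what you mean by increasing the richness of receptive fields?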

TJM

A question about efficiency: this architecture obviously increases the width of the model, so will it take longer to run than ResNet50?