Revisiting Video Saliency Prediction in the Deep Learning Era

Wenguan Wang  Jianbing Shen1  Jianwen Xie2  Ming-Ming Cheng3  Haibin Ling4  Ali Borji5

1BLIIT, Beijing Institute of Technology    2Hikvision Research     3CCCE, Nankai University     4Temple University     5Markable AI

Abstract

Visual attention in static images has recently attracted considerable research interest. However, predicting visual attention in general, dynamic scenes has rarely been explored. In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during free viewing of dynamic scenes, which has long been needed in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types, and background complexity. Existing video saliency datasets lack the variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in terms of scalability, diversity, and difficulty, and is expected to boost video saliency modeling. Second, we propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model (ACLNet, Attentive CNN-LSTM Network) against state-of-the-art saliency models on three large-scale datasets (i.e., DHF1K, Hollywood2, UCF sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms other competitors and has a fast processing speed (10fps; including all steps on one GPU).

Paper

  1. Revisiting Video Saliency Prediction in the Deep Learning Era, Wenguan Wang, Jianbing Shen, Jianwen Xie, Ming-Ming Cheng, Haibin Ling, Ali Borji, IEEE TPAMI, 2021. [pdf] [bib] [project page] [official version] [source code]
  2. Revisiting Video Saliency: A Large-scale Benchmark and a New Model, Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, Ali Borji, IEEE CVPR, 2018. [pdf] [source code] [bib] [project page]

Related Project

  • Shifting More Attention to Video Salient Object Detection, Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, Jianbing Shen. IEEE CVPR, 2019, Oral presentation, Best Paper Finalist, Accept rate: 0.87% [45/5160] [project page | bib | official version | Chinese version | pdf] [poster | oral ppt | oral video | Code | Results | DAVSOD Dataset (Baidu [fetch code: ivzo] | Google)]

DHF1K Dataset

Our dataset contains 1000 annotated videos, split into 600 training (001.AVI-600.AVI), 100 validation (601.AVI-700.AVI), and 300 testing (701.AVI-1000.AVI) videos. The annotations for the training and validation sets are released, but the annotations for the testing set are held out for benchmarking. Detailed instructions for submitting results can be found here.
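
The split above can be written down as a small lookup. This is only a minimal sketch: the index ranges and the file-name pattern come from the paragraph above, while the names `SPLITS`, `video_name`, and `split_of` are hypothetical helpers introduced here for illustration and are not part of the released code.

```python
# Split boundaries as stated in the dataset description (1-based video indices).
SPLITS = {
    "train": range(1, 601),    # 001.AVI-600.AVI
    "val": range(601, 701),    # 601.AVI-700.AVI
    "test": range(701, 1001),  # 701.AVI-1000.AVI (annotations held out)
}


def video_name(index: int) -> str:
    """Return the zero-padded file name, e.g. 7 -> '007.AVI'."""
    return f"{index:03d}.AVI"


def split_of(index: int) -> str:
    """Return the split ('train', 'val' or 'test') that a video index belongs to."""
    for name, indices in SPLITS.items():
        if index in indices:
            return name
    raise ValueError(f"video index {index} is outside 1..1000")
```

For example, `split_of(650)` returns `"val"`, and `[video_name(i) for i in SPLITS["val"]]` lists the 100 validation file names.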

@ARTICLE{wang2019revisiting, 
    author={Wenguan Wang and Jianbing Shen and Jianwen Xie and Ming-Ming Cheng and Haibin Ling and Ali Borji}, 
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
    title={Revisiting Video Saliency Prediction in the Deep Learning Era}, 
    year={2019},  
}
@inproceedings{wang2018revisiting,
    title={Revisiting Video Saliency: A Large-scale Benchmark and a New Model},
    author={Wang, Wenguan and Shen, Jianbing and Guo, Fang and Cheng, Ming-Ming and Borji, Ali},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year={2018},
}

Contact

If you have any questions, please email us at <wenguanwang.ai@gmail.com>.

Evaluation Code

https://github.com/wenguanwang/DHF1K/blob/master/ACL-evaluation.rar

DHF1K video saliency leaderboard

Notes: DLM: Deep Learning Models. D/S: Dynamic (D) or Static (S) Models. Python is the default programming language when standard deep learning toolkits (e.g., TensorFlow, Caffe, or Theano) are used. Our default testing environment uses one Titan X GPU and a 4.0GHz Intel CPU.

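| Method | AUC-J | SIM | s-AUC | CC | NSS | Implementation | Size (MB) | Time (s) | DLM | D/S |
|---|---|---|---|---|---|---|---|---|---|---|
| SalFoM | 0.9222 | 0.4208 | 0.7352 | 0.5692 | 3.3536 | PyTorch | 1574 | 0.6 | | D |
| TMFI | 0.9153 | 0.4068 | 0.7306 | 0.5461 | 3.1463 | PyTorch | 234 | 0.033 | | D |
| THTD-Net | 0.9152 | 0.4062 | 0.7296 | 0.5479 | 3.1385 | PyTorch | 220 | 0.08 | | D |
| STSANet | 0.9125 | 0.3829 | 0.7227 | 0.5288 | 3.0103 | PyTorch | 643 | 0.035 (one Titan Xp GPU and 3.2GHz Intel CPU) | | D |
| TSFP-Net | 0.9116 | 0.3921 | 0.7230 | 0.5168 | 2.9665 | PyTorch | 58.4 | 0.011 | | D |
| VSFT | 0.9109 | 0.4109 | 0.7200 | 0.5185 | 2.9773 | PyTorch | 71.4 | 0.04 | | D |
| HD2S | 0.908 | 0.406 | 0.700 | 0.503 | 2.812 | PyTorch | 116 | 0.03 | | D |
| ViNet | 0.908 | 0.381 | 0.729 | 0.511 | 2.872 | PyTorch | 124 | 0.016 | | D |
| UNISAL | 0.901 | 0.390 | 0.691 | 0.490 | 2.776 | PyTorch | 15.5 | 0.009 | | D&S |
| SalSAC | 0.896 | 0.357 | 0.697 | 0.479 | 2.673 | PyTorch | 93.5 | 0.02 | | D |
| TASED-Net | 0.895 | 0.361 | 0.712 | 0.470 | 2.667 | PyTorch | 82 | 0.06 | | D |
| STRA-Net | 0.895 | 0.355 | 0.663 | 0.458 | 2.558 | Tensorflow | 641 | 0.02 | | D |
| SalEMA | 0.890 | 0.466 | 0.667 | 0.449 | 2.574 | PyTorch | 364 | 0.01 | | D |
| ACLNet | 0.890 | 0.315 | 0.601 | 0.434 | 2.354 | Tensorflow | 250 | 0.02 | | D |
| SalGAN | 0.866 | 0.262 | 0.709 | 0.370 | 2.043 | Theano | 130 | 0.02 | | S |
| DVA | 0.860 | 0.262 | 0.595 | 0.358 | 2.013 | Caffe | 96 | 0.1 | | S |
| SALICON | 0.857 | 0.232 | 0.590 | 0.327 | 1.901 | Caffe | 117 | 0.5 | | S |
| DeepVS | 0.856 | 0.256 | 0.583 | 0.344 | 1.911 | Tensorflow | 344 | 0.05 | | D |
| Deep-Net | 0.855 | 0.201 | 0.592 | 0.331 | 1.775 | Caffe | 103 | 0.08 | | S |
| Two-stream | 0.834 | 0.197 | 0.581 | 0.325 | 1.632 | Caffe | 315 | 20 | | D |
| UVA-Net | 0.833 | 0.241 | 0.582 | 0.307 | 1.536 | - | - | 1/2588 | | |
| Shallow-Net | 0.833 | 0.182 | 0.529 | 0.295 | 1.509 | Theano | 2500 | 0.1 | | S |
| GBVS | 0.828 | 0.186 | 0.554 | 0.283 | 1.474 | C | | 2.7 | | S |
| Fang et al. | 0.819 | 0.198 | 0.537 | 0.273 | 1.539 | Matlab | | 147 | | D |
| ITTI | 0.774 | 0.162 | 0.553 | 0.233 | 1.207 | Matlab | | 0.9 | | S |
| Rudoy et al. | 0.769 | 0.214 | 0.501 | 0.285 | 1.498 | Matlab | | 180 | | D |
| Hou et al. | 0.726 | 0.167 | 0.545 | 0.150 | 0.847 | Matlab | | 0.7 | | D |
| AWS-D | 0.703 | 0.157 | 0.513 | 0.174 | 0.940 | Matlab | | 9 | | D |
| PQFT | 0.699 | 0.139 | 0.562 | 0.137 | 0.749 | Matlab | | 1.2 | | D |
| OBDL | 0.638 | 0.171 | 0.500 | 0.117 | 0.495 | Matlab | | 0.8 | | D |
| Seo et al. | 0.635 | 0.142 | 0.499 | 0.070 | 0.334 | Matlab | | 2.3 | | D |
| MCSDM | 0.591 | 0.110 | 0.500 | 0.047 | 0.247 | Matlab | | 15 | | D |
| MSM-SM | 0.582 | 0.143 | 0.500 | 0.058 | 0.245 | Matlab | | 8 | | D |
| PIM-ZEN | 0.552 | 0.095 | 0.498 | 0.062 | 0.280 | Matlab | | 43 | | D |
| PIM-MCS | 0.551 | 0.094 | 0.499 | 0.053 | 0.242 | Matlab | | 10 | | D |
| MAM | 0.551 | 0.108 | 0.500 | 0.041 | 0.214 | Matlab | | 778 | | D |
| PMES | 0.545 | 0.093 | 0.502 | 0.055 | 0.237 | Matlab | | 579 | | D |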
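
To make the NSS, CC, and SIM columns concrete, the following is a minimal NumPy sketch of the commonly used definitions of these three metrics. It is an illustration under stated assumptions, not the released evaluation code: the function names, the `EPS` constant, and the exact normalization steps are choices made here, and the official implementation may differ in such details.

```python
import numpy as np

EPS = 1e-12  # assumed small constant guarding against division by zero


def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = np.asarray(saliency, dtype=np.float64)
    s = (s - s.mean()) / (s.std() + EPS)
    return float(s[np.asarray(fixations) > 0].mean())


def cc(saliency, density):
    """Linear correlation coefficient between a prediction and a fixation density map."""
    a = np.asarray(saliency, dtype=np.float64)
    b = np.asarray(density, dtype=np.float64)
    a = (a - a.mean()) / (a.std() + EPS)
    b = (b - b.mean()) / (b.std() + EPS)
    return float((a * b).mean())


def _to_distribution(x):
    """Rescale a map to [0, 1] and then normalize it to sum to one."""
    x = np.asarray(x, dtype=np.float64)
    x = (x - x.min()) / (x.max() - x.min() + EPS)
    return x / (x.sum() + EPS)


def sim(saliency, density):
    """Similarity: histogram intersection of the two maps seen as distributions."""
    return float(np.minimum(_to_distribution(saliency), _to_distribution(density)).sum())
```

Here `fixations` is a binary map of fixated pixels, while `density` is the continuous (blurred) fixation map. For all three metrics, higher values indicate a better prediction.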

Hollywood-2 video saliency leaderboard

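| Method | AUC-J | SIM | s-AUC | CC | NSS |
|---|---|---|---|---|---|
| TMFI | 0.940 | 0.607 | - | 0.739 | 4.095 |
| STSANet | 0.938 | 0.579 | - | 0.721 | 3.927 |
| VSFT | 0.936 | 0.577 | 0.811 | 0.703 | 3.916 |
| TSFP-Net | 0.936 | 0.571 | - | 0.711 | 3.910 |
| HD2S | 0.936 | 0.551 | 0.807 | 0.670 | 3.352 |
| UNISAL | 0.934 | 0.542 | 0.795 | 0.673 | 3.901 |
| ViNet | 0.930 | 0.550 | 0.813 | 0.693 | 3.73 |
| SalSAC | 0.931 | 0.529 | 0.712 | 0.670 | 3.356 |
| STRA-Net | 0.923 | 0.536 | 0.774 | 0.662 | 3.478 |
| SalEMA | 0.919 | 0.487 | 0.708 | 0.613 | 3.186 |
| TASED-Net | 0.918 | 0.507 | 0.768 | 0.646 | 3.302 |
| ACLNet | 0.913 | 0.542 | 0.757 | 0.623 | 3.086 |
| DeepVS | 0.887 | 0.356 | 0.693 | 0.446 | 2.313 |
| DVA | 0.886 | 0.372 | 0.727 | 0.482 | 2.459 |
| Deep-Net | 0.884 | 0.300 | 0.736 | 0.451 | 2.066 |
| Two-stream | 0.863 | 0.276 | 0.710 | 0.382 | 1.748 |
| Fang et al. | 0.859 | 0.272 | 0.659 | 0.358 | 1.667 |
| SALICON | 0.856 | 0.321 | 0.711 | 0.425 | 2.013 |
| Shallow-Net | 0.851 | 0.276 | 0.694 | 0.423 | 1.680 |
| GBVS | 0.837 | 0.257 | 0.633 | 0.308 | 1.336 |
| ITTI | 0.788 | 0.221 | 0.607 | 0.257 | 1.076 |
| Rudoy et al. | 0.783 | 0.315 | 0.536 | 0.302 | 1.570 |
| Hou et al. | 0.731 | 0.202 | 0.580 | 0.146 | 0.684 |
| PQFT | 0.723 | 0.201 | 0.621 | 0.153 | 0.755 |
| PMES | 0.696 | 0.180 | 0.620 | 0.177 | 0.867 |
| AWS-D | 0.694 | 0.175 | 0.637 | 0.146 | 0.742 |
| MSM-SM | 0.683 | 0.180 | 0.561 | 0.132 | 0.682 |
| PIM-ZEN | 0.670 | 0.167 | 0.598 | 0.134 | 0.667 |
| PIM-MCS | 0.663 | 0.163 | 0.570 | 0.118 | 0.584 |
| Seo et al. | 0.652 | 0.155 | 0.530 | 0.076 | 0.346 |
| PNSP-CS | 0.647 | 0.146 | 0.548 | 0.077 | 0.370 |
| OBDL | 0.640 | 0.170 | 0.541 | 0.106 | 0.462 |
| MAM | 0.630 | 0.153 | 0.562 | 0.099 | 0.494 |
| MCSDM | 0.618 | 0.147 | 0.524 | 0.067 | 0.288 |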
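
The AUC-J column refers to the Judd variant of the area under the ROC curve. The sketch below illustrates the usual idea, assuming the common definition in which the saliency values at fixated pixels serve as thresholds. The function name `auc_judd` and the tie handling are choices made for this illustration, so the scores may not match the released evaluation code exactly.

```python
import numpy as np


def auc_judd(saliency, fixations):
    """Area under the ROC curve, thresholding at saliency values of fixated pixels."""
    s = np.asarray(saliency, dtype=np.float64).ravel()
    f = np.asarray(fixations).ravel() > 0
    pos, neg = s[f], s[~f]  # saliency at fixated / non-fixated pixels
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one fixated and one non-fixated pixel")

    # One ROC point per distinct fixated saliency value, from high to low.
    thresholds = np.sort(np.unique(pos))[::-1]
    tpr = np.array([0.0] + [float((pos >= t).mean()) for t in thresholds] + [1.0])
    fpr = np.array([0.0] + [float((neg >= t).mean()) for t in thresholds] + [1.0])

    # Trapezoidal integration of the true positive rate over the false positive rate.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A map that ranks every fixated pixel above every non-fixated pixel scores 1.0, and a constant map scores 0.5, which is the chance level.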

UCF Sports video saliency leaderboard

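| Method | AUC-J | SIM | s-AUC | CC | NSS |
|---|---|---|---|---|---|
| TMFI | 0.936 | 0.565 | - | 0.707 | 3.863 |
| STSANet | 0.936 | 0.560 | - | 0.705 | 3.908 |
| SalSAC | 0.926 | 0.534 | 0.806 | 0.671 | 3.523 |
| ViNet | 0.924 | 0.522 | 0.810 | 0.673 | 3.62 |
| TSFP-Net | 0.923 | 0.561 | - | 0.685 | 3.698 |
| UNISAL | 0.918 | 0.523 | 0.775 | 0.644 | 3.381 |
| STRA-Net | 0.910 | 0.479 | 0.751 | 0.593 | 3.018 |
| SalEMA | 0.906 | 0.431 | 0.740 | 0.544 | 2.638 |
| HD2S | 0.904 | 0.507 | 0.768 | 0.604 | 3.114 |
| TASED-Net | 0.899 | 0.469 | 0.752 | 0.582 | 2.920 |
| ACLNet | 0.897 | 0.406 | 0.744 | 0.510 | 2.567 |
| DVA | 0.872 | 0.339 | 0.725 | 0.439 | 2.311 |
| DeepVS | 0.870 | 0.321 | 0.691 | 0.405 | 2.089 |
| Deep-Net | 0.861 | 0.282 | 0.719 | 0.414 | 1.903 |
| GBVS | 0.859 | 0.274 | 0.697 | 0.396 | 1.818 |
| SALICON | 0.848 | 0.304 | 0.738 | 0.375 | 1.838 |
| ITTI | 0.847 | 0.251 | 0.725 | 0.356 | 1.640 |
| Shallow-Net | 0.846 | 0.276 | 0.691 | 0.382 | 1.789 |
| Fang et al. | 0.845 | 0.307 | 0.674 | 0.395 | 1.787 |
| Two-stream | 0.832 | 0.264 | 0.685 | 0.343 | 1.753 |
| Seo et al. | 0.831 | 0.308 | 0.666 | 0.336 | 1.690 |
| PQFT | 0.825 | 0.250 | 0.722 | 0.338 | 1.780 |
| AWS-D | 0.823 | 0.228 | 0.750 | 0.306 | 1.631 |
| Hou et al. | 0.819 | 0.276 | 0.674 | 0.292 | 1.399 |
| PIM-MCS | 0.777 | 0.238 | 0.695 | 0.303 | 1.596 |
| Rudoy et al. | 0.763 | 0.271 | 0.637 | 0.344 | 1.619 |
| PIM-ZEN | 0.760 | 0.234 | 0.702 | 0.306 | 1.657 |
| OBDL | 0.759 | 0.193 | 0.634 | 0.234 | 1.382 |
| PMES | 0.756 | 0.263 | 0.714 | 0.349 | 1.788 |
| MCSDM | 0.756 | 0.228 | 0.626 | 0.230 | 1.091 |
| PNSP-CS | 0.755 | 0.210 | 0.628 | 0.218 | 1.091 |
| MSM-SM | 0.752 | 0.262 | 0.634 | 0.280 | 1.584 |
| MAM | 0.669 | 0.213 | 0.624 | 0.218 | 1.130 |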

DIEM video saliency leaderboard

Notes: Test set i: use the first 300 frames of 20 test videos for testing; Test set ii: use 17 of 20 test videos for testing; Test set iii: use all frames of all 20 test videos for testing

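| Test set | Method | AUC-J | SIM | s-AUC | CC | NSS |
|---|---|---|---|---|---|---|
| iii | TMFI | 0.920 | 0.604 | - | 0.740 | 3.031 |
| iii | STSANet | 0.905 | 0.548 | - | 0.690 | 2.787 |
| ii | TMFI | 0.921 | 0.598 | - | 0.726 | 2.956 |
| ii | STSANet | 0.906 | 0.541 | - | 0.677 | 2.721 |
| ii | TSFP-Net (Test set ii) | 0.906 | 0.527 | - | 0.651 | 2.62 |
| ii | TSFP-Net | 0.905 | 0.529 | - | 0.649 | 2.63 |
| ii | ViNet (with audio) | 0.899 | 0.498 | 0.719 | 0.632 | 2.53 |
| ii | ViNet | 0.898 | 0.483 | 0.723 | 0.626 | |
| i | TMFI | 0.916 | 0.565 | - | 0.692 | 2.955 |
| i | STSANet | 0.901 | 0.505 | - | 0.625 | 2.618 |
| i | ACLNet | 0.881 | 0.277 | 0.693 | 0.396 | 2.368 |
| i | STRA-Net | 0.870 | 0.306 | 0.678 | 0.408 | 2.452 |
| i | DVA | 0.868 | 0.237 | 0.721 | 0.386 | 2.347 |
| i | Two-stream | 0.859 | 0.256 | 0.682 | 0.366 | 2.171 |
| i | DeepVS | 0.857 | 0.238 | 0.693 | 0.371 | 2.235 |
| i | Deep-Net | 0.849 | 0.164 | 0.697 | 0.291 | 1.650 |
| i | Shallow-Net | 0.838 | 0.188 | 0.620 | 0.297 | 1.646 |
| i | Fang et al. | 0.823 | 0.167 | 0.636 | 0.251 | 1.423 |
| i | GBVS | 0.813 | 0.156 | 0.633 | 0.214 | 1.198 |
| i | SALICON | 0.793 | 0.171 | 0.674 | 0.270 | 1.650 |
| i | ITTI | 0.791 | 0.132 | 0.653 | 0.196 | 1.103 |
| i | Rudoy et al. | 0.775 | 0.150 | 0.618 | 0.260 | 1.390 |
| i | AWS-D | 0.774 | 0.150 | 0.695 | 0.216 | 1.252 |
| i | OBDL | 0.762 | 0.165 | 0.694 | 0.221 | 1.289 |
| i | Hou et al. | 0.735 | 0.142 | 0.589 | 0.128 | 0.735 |
| i | PQFT | 0.724 | 0.126 | 0.649 | 0.144 | 0.856 |
| i | Seo et al. | 0.723 | 0.130 | 0.568 | 0.116 | 0.665 |
| i | MCSDM | 0.663 | 0.105 | 0.558 | 0.084 | 0.466 |
| i | PIM-MCS | 0.662 | 0.110 | 0.607 | 0.124 | 0.709 |
| i | PIM-ZEN | 0.660 | 0.114 | 0.615 | 0.132 | 0.757 |
| i | PMES | 0.657 | 0.122 | 0.607 | 0.142 | 0.817 |
| i | PNSP-CS | 0.637 | 0.091 | 0.559 | 0.074 | 0.417 |
| i | MSM-SM | 0.619 | 0.092 | 0.571 | 0.107 | 0.624 |
| i | MAM | 0.579 | 0.089 | 0.552 | 0.072 | 0.408 |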
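
The three DIEM test-set definitions in the notes above amount to different frame selections. The sketch below expresses them as a hypothetical helper, `select_frames`, that is not part of any released code. The notes do not say which 17 videos form test set ii, so that subset must be supplied by the caller, and the sketch assumes that test set ii uses all frames of those videos.

```python
def select_frames(frame_counts, test_set, subset_ii=None):
    """Map each evaluated video to the range of frame indices used for testing.

    frame_counts: dict of video name -> number of frames (the 20 test videos).
    test_set:     "i", "ii" or "iii", following the leaderboard notes.
    subset_ii:    names of the 17 videos of test set ii (not listed in the notes).
    """
    if test_set == "i":
        # First 300 frames of every test video (fewer if the video is shorter).
        return {v: range(min(n, 300)) for v, n in frame_counts.items()}
    if test_set == "ii":
        if subset_ii is None:
            raise ValueError("test set ii needs the list of its 17 videos")
        # Assumption: all frames of the selected videos are used.
        return {v: range(n) for v, n in frame_counts.items() if v in subset_ii}
    if test_set == "iii":
        # All frames of all test videos.
        return {v: range(n) for v, n in frame_counts.items()}
    raise ValueError(f"unknown test set: {test_set!r}")
```

Because the three settings evaluate different frames, scores are only comparable between rows of the same test set.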

LEDOV video saliency leaderboard

Note: the listed results are provided by the author of LEDOV (Lai Jiang: jianglai.china@gmail.com), and the evaluation code used (https://github.com/remega/LEDOV-eye-tracking-database/tree/master/metrics) differs from ours (https://github.com/wenguanwang/DHF1K).

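| Method | AUC-J | NSS | CC | KL |
|---|---|---|---|---|
| DeepVS | 0.902 | 2.999 | 0.586 | 1.222 |
| ACLNet | 0.897 | 2.872 | 0.570 | 1.445 |
| SalGAN | 0.868 | 2.193 | 0.428 | 1.680 |
| DVA | 0.885 | 2.840 | 0.557 | 1.323 |
| SALICON | 0.851 | 2.332 | 0.437 | 1.635 |
| Sal-DCNN | 0.892 | 2.838 | 0.573 | 1.304 |
| GBVS | 0.839 | 1.541 | 0.322 | 1.824 |
| Rudoy et al. | 0.799 | 1.454 | 0.320 | 2.421 |
| AWS-D | 0.795 | 1.365 | 0.294 | 2.023 |
| PQFT | 0.699 | 0.690 | 0.140 | 2.461 |
| OBDL | 0.801 | 1.545 | 0.315 | 2.053 |
| Xu et al. | 0.827 | 1.475 | 0.382 | 1.652 |
| BMS | 0.757 | 0.979 | 0.214 | 2.225 |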
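
Unlike the other leaderboards, this one reports KL, for which lower values are better. The following is a minimal sketch of the KL divergence as it is commonly defined for saliency evaluation. It is not taken from the LEDOV metric code linked above: the function name `kl_div` and the regularization constant `EPS` are assumptions made for this illustration.

```python
import numpy as np

EPS = np.finfo(np.float64).eps  # assumed regularization constant


def kl_div(saliency, density):
    """KL divergence of the predicted map from the ground-truth fixation density."""
    p = np.asarray(saliency, dtype=np.float64)
    q = np.asarray(density, dtype=np.float64)
    p = p / (p.sum() + EPS)  # prediction as a probability distribution
    q = q / (q.sum() + EPS)  # ground truth as a probability distribution
    return float(np.sum(q * np.log(EPS + q / (p + EPS))))
```

Identical maps give a value close to 0, and the value grows as the prediction places little mass where the ground-truth density is high.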

Methods

  1. VSFT: Video Saliency Forecasting Transformer, C. Ma, H. Sun, Y. Rao, J. Zhou, J. Lu, TCSVT, 2022.
  2. STSANet: Spatio-Temporal Self-Attention Network for Video Saliency Prediction, Z. Wang, Z. Liu, G. Li, T. Zhang, L. Xu and J. Wang, TMM, 2021.
  3. TSFP-Net:  Temporal-Spatial Feature Pyramid for Video Saliency Detection, Q. Chang, S. Zhu, and L. Zhu, arXiv 2105.04213, 2021.
  4. ViNet: Diving Deep into Audio-Visual Saliency Prediction, S. Jain, P. Yarlagadda, R. Subramanian, and V. Gandhi, arxiv 2012.06170, 2020.
  5. HD2S: Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction,  G. Bellitto, F. Proietto Salanitri, S. Palazzo, F. Rundo, D. Giordano, and C. Spampinato, IJCV, 2021.
  6. UNISAL: Unified Image and Video Saliency Modeling, R. Droste, J. Jiao, and J. A. Noble, ECCV, 2020.
  7. SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-based ConvLSTM, X. Wu, Z. Wu, J. Zhang, L. Ju, and S. Wang, AAAI, 2020.
  8. UVA-Net: Ultrafast Video Attention Prediction with Coupled Knowledge Distillation, K. Fu, P. Shi, Y. Song, S. Ge, X. Lu, and J. Li, AAAI, 2020.
  9. TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection, K. Min and J. J. Corso, ICCV, 2019.
  10. STRA-Net: Video Saliency Prediction using Spatiotemporal Residual Attentive Networks, Q. Lai, W. Wang, H. Sun, and J. Shen, IEEE TIP, 2019.
  11. SalEMA: Simple vs complex temporal recurrences for video saliency prediction, P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-i-Nieto, and K. McGuinness, BMVC, 2019.
  12. ACLNet: Revisiting Video Saliency: A Large-scale Benchmark and a New Model, W.  Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji, IEEE CVPR, 2018.
  13. ITTI: A model of saliency-based visual attention for rapid scene analysis,  L. Itti, C. Koch, and E. Niebur, IEEE TPAMI, 1998.
  14. GBVS: Graph-based visual saliency, J. Harel, C. Koch, and P. Perona, NIPS, 2007.
  15. SALICON: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks, X. Huang, C. Shen, X. Boix, and Q. Zhao, IEEE ICCV, 2015.
  16. Shallow-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
  17. Deep-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
  18. DVA: Deep visual attention prediction, W. Wang and J. Shen, IEEE TIP, 2018.
  19. PQFT: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, C. Guo and L. Zhang, IEEE TIP, 2010.
  20. Seo et al.: Static and space-time visual saliency detection by self-resemblance, H. J. Seo and P. Milanfar, Journal of Vision, 2009.
  21. Rudoy et al.: Learning video saliency from human gaze using candidate selection, D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor, IEEE CVPR, 2013.
  22. Hou et al.: Dynamic visual attention: Searching for coding length increments, X. Hou and L. Zhang, NIPS, 2008.
  23. Fang et al.: Video saliency incorporating spatiotemporal cues and uncertainty weighting, Y. Fang, Z. Wang, W. Lin, and Z. Fang, IEEE TIP, 2014.
  24. OBDL: How many bits does it take for a stimulus to be salient? S. Hossein Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan, IEEE CVPR, 2015.
  25. AWS-D:  Dynamic whitening saliency, V. Leboran, A. Garcia-Diaz, X. R. Fdez-Vidal, and X. M. Pardo, IEEE TPAMI, 2017.
  26. PMES: A new perceived motion based shot content representation, Y.-F. Ma and H.-J. Zhang, ICIP, 2001.
  27. MAM:  A fast algorithm to find the region-of-interest in the compressed mpeg domain, G. Agarwal, A. Anbu, and A. Sinha, IEEE ICME, 2003.
  28. PIM-ZEN: A model of motion attention for video skimming, Y.-F. Ma and H.-J. Zhang, ICIP, 2002.
  29. PIM-MCS: Region-of-interest based compressed domain video transcoding scheme, A. Sinha, G. Agarwal, and A. Anbu, IEEE ICASSP, 2004.
  30. MCSDM: A motion attention model based rate control algorithm for h.264/avc, Z. Liu, H. Yan, L. Shen, Y. Wang, and Z. Zhang, ICCIS, 2009.
  31. MSM-SM: Salient motion detection in compressed domain, K. Muthuswamy and D. Rajan, IEEE SPL, 2013.
  32. PNSP-CS: A video saliency detection model in compressed domain, Y. Fang, W. Lin, Z. Chen, C. M. Tsai, and C. W. Lin, IEEE TCSVT, 2014.
  33. DeepVS: DeepVS: A Deep Learning Based Video Saliency Prediction Approach, L. Jiang, M. Xu, and Z. Wang, ECCV, 2018.
  34. Two-stream: Spatio-temporal saliency networks for dynamic saliency prediction, C. Bak, A. Kocak, E. Erdem, and A. Erdem, IEEE TMM, 2017.
  35. SalGAN: SalGAN: Visual Saliency Prediction with Generative Adversarial Networks, J. Pan, E. Sayrol, X. Giro-i-Nieto, C. C. Ferrer, J. Torres, K. McGuinness, and N. E. O’Connor, IEEE CVPR-workshop, 2017.
  36. Sal-DCNN: Image Saliency Prediction in Transformed Domain: A Deep Complex Neural Network Method, L. Jiang, Z. Wang, and M. Xu, AAAI, 2018.
  37. Xu et al.: Learning to detect video saliency with HEVC features, M. Xu, L. Jiang, and X. Sun, IEEE TIP, 2017.
  38. BMS: Exploiting surroundedness for saliency detection: a Boolean map approach, J. Zhang and S. Sclaroff, IEEE TPAMI, 2016.

Eye-tracking datasets

dynamic scenes:

  1. Hollywood-2
  2. UCF Sports
  3. DIEM
  4. LEDOV
  5. CRCNS
  6. SFU

static scenes:

  1. MIT300
  2. MIT1003
  3. TORONTO
  4. PASCAL-S
  5. DUT-OMRON
  6. SALICON

12 Comments

Chenming Li

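Hello, professor and testing staff! I sent my model's results on the DHF1K test set to the mailbox in the format required on GitHub (model name/video name/saliency map .png), but I have not received a reply. Are you still replying at the moment?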

Tom

Hello, teacher. As mentioned in your github of DHF1K: “Note that, for Holly-wood2 dataset, we used the split videos (each video only contains one shot), instead of the full videos.”. Can you tell us which shot segmentation algorithm is used? Thanks.

Wenguan Wang

The videos of Holly-wood2 already come with the shot split.

HuHao

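Hello, professor. I sent my results on DHF1K to the mailbox in the format model name/video name/saliency map .png, but I have not received a reply. Are submissions tested at random?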

Wenguan Wang

Please check the email address: dhf1kdataset@gmail.com

We have been replying all along.

Tom

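Professor, may I ask whether you are still replying? The emails I sent over the past two consecutive weeks have not received a response. Could you please follow up? Thank you.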

KrisCao

Excellent job.
I am wondering what does the time under DHF1K video saliency leaderboard means? Is it reference time per frame? Since certain models downsample the video in temporal domain, do you take this under consideration?