Revisiting Video Saliency Prediction in the Deep Learning Era

16/05/2018 MM Cheng

Wenguan Wang¹ Jianbing Shen¹ Jianwen Xie² Ming-Ming Cheng³ Haibin Ling⁴ Ali Borji⁵

¹BLIIT, Beijing Institute of Technology ²Hikvision Research ³CCCE, Nankai University ⁴Temple University ⁵Markable AI

Abstract

Visual attention in static images has recently attracted considerable research interest. However, predicting visual attention in general, dynamic scenes has rarely been explored. In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during dynamic scene free-viewing, which has long been needed in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye tracker. The videos span a large range of scenes, motions, object types, and background complexity. Existing video saliency datasets lack the variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in terms of scalability, diversity, and difficulty, and is expected to boost video saliency modeling. Second, we propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model (ACLNet, Attentive CNN-LSTM Network) against state-of-the-art saliency models on three large-scale datasets (i.e., DHF1K, Hollywood2, UCF Sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms other competitors and has a fast processing speed (10 fps; including all steps on one GPU).

Paper

Revisiting Video Saliency Prediction in the Deep Learning Era, Wenguan Wang, Jianbing Shen, Jianwen Xie, Ming-Ming Cheng, Haibin Ling, Ali Borji, IEEE TPAMI, 2021. [pdf] [bib] [project page] [official version] [source code]
Revisiting Video Saliency: A Large-scale Benchmark and a New Model, Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, Ali Borji, IEEE CVPR, 2018. [pdf] [source code] [bib] [project page]

Related Project

Shifting More Attention to Video Salient Object Detection, Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, Jianbing Shen. IEEE CVPR, 2019, Oral presentation, Best Paper Finalist, Accept rate: 0.87% [45/5160][project page | bib | official version | 中文版pdf ][poster | oral ppt | oral video | Code | Results | DAVSOD Dataset (Baidu [fetch code: ivzo]| Google)]

DHF1K Dataset

Our dataset contains 1000 annotated videos, split into 600 training (001.AVI-600.AVI), 100 validation (601.AVI-700.AVI) and 300 testing (701.AVI-1000.AVI) videos. The annotations for the training and validation sets are released, but the annotations of the testing set are held out for benchmarking. Detailed instructions for submitting results can be found here.
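As a minimal sketch of the split described above, the following Python helpers map a video number to its subset and file name. The function names `dhf1k_split` and `dhf1k_filename` are hypothetical and are not part of the released code; only the number ranges and the `.AVI` naming pattern come from the description above.

```python
def dhf1k_split(video_id: int) -> str:
    """Return the DHF1K subset ('train', 'val' or 'test') for a video number."""
    if not 1 <= video_id <= 1000:
        raise ValueError(f"DHF1K video numbers run from 1 to 1000, got {video_id}")
    if video_id <= 600:
        return "train"   # 001.AVI-600.AVI
    if video_id <= 700:
        return "val"     # 601.AVI-700.AVI
    return "test"        # 701.AVI-1000.AVI (annotations held out)


def dhf1k_filename(video_id: int) -> str:
    """Zero-pad to at least three digits, matching names like 001.AVI and 1000.AVI."""
    return f"{video_id:03d}.AVI"


if __name__ == "__main__":
    for vid in (1, 600, 601, 700, 701, 1000):
        print(dhf1k_filename(vid), dhf1k_split(vid))
```

Keeping the boundaries in one function makes it harder to accidentally train or tune on test videos, whose annotations are not released anyway.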

@ARTICLE{wang2019revisiting, 
    author={Wenguan Wang and Jianbing Shen and Jianwen Xie and Ming-Ming Cheng and Haibin Ling and Ali Borji},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
    title={Revisiting Video Saliency Prediction in the Deep Learning Era}, 
    year={2019},  
}
@inproceedings{wang2018revisiting,
    title={Revisiting Video Saliency: A Large-scale Benchmark and a New Model},
    author={Wang, Wenguan and Shen, Jianbing and Guo, Fang and Cheng, Ming-Ming and Borji, Ali},
    booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year={2018},
}

Contact

If you have any questions, drop us an e-mail at <wenguanwang.ai@gmail.com>.

Evaluation Code

https://github.com/wenguanwang/DHF1K/blob/master/ACL-evaluation.rar
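For orientation, three of the metrics reported in the leaderboards (NSS, CC and SIM) can be sketched in plain Python from their commonly used definitions. This is an illustrative sketch, not the official evaluation code linked above: it assumes maps are given as flat sequences of non-negative values, uses the population standard deviation, and skips any resizing or min-max normalization that an actual evaluation script may apply. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def _standardize(values: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("constant map: standard deviation is zero")
    return [(v - mean) / std for v in values]


def nss(saliency: Sequence[float], fixations: Sequence[int]) -> float:
    """Normalized Scanpath Saliency: mean standardized saliency at fixated pixels."""
    z = _standardize(saliency)
    hits = [z_i for z_i, fixated in zip(z, fixations) if fixated]
    if not hits:
        raise ValueError("fixation map contains no fixations")
    return sum(hits) / len(hits)


def cc(saliency: Sequence[float], density: Sequence[float]) -> float:
    """Pearson correlation between the prediction and a fixation density map."""
    a = _standardize(saliency)
    b = _standardize(density)
    return sum(x * y for x, y in zip(a, b)) / len(a)


def sim(saliency: Sequence[float], density: Sequence[float]) -> float:
    """Similarity: histogram intersection after scaling each map to sum to 1."""
    total_s, total_d = sum(saliency), sum(density)
    return sum(min(s / total_s, d / total_d) for s, d in zip(saliency, density))


if __name__ == "__main__":
    prediction = [0.1, 0.2, 0.6, 0.1]   # toy 4-pixel saliency map
    density = [0.0, 0.3, 0.5, 0.2]      # toy fixation density map
    fixations = [0, 0, 1, 0]            # binary fixation map
    print("NSS", nss(prediction, fixations))
    print("CC ", cc(prediction, density))
    print("SIM", sim(prediction, density))
```

Because implementation details such as the choice of standard deviation or map normalization can shift the numbers, scores computed this way should not be compared directly with leaderboard entries.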

DHF1K video saliency leaderboard

Notes: DLM: Deep Learning Model. D/S: Dynamic (D) or Static (S) model. Time (s) is the per-frame computation time. Python is the default programming language when standard deep learning toolkits (e.g. TensorFlow, Caffe, or Theano) are used. Our default testing environment uses one Titan X GPU and a 4.0GHz Intel CPU.

Method	AUC-J	SIM	s-AUC	CC	NSS	Implement.	Size (MB)	Time (s)	DLM	D/S
SalFoM	0.9222	0.4208	0.7352	0.5692	3.3536	PyTorch	1574	0.6	√	D
TMFI	0.9153	0.4068	0.7306	0.5461	3.1463	PyTorch	234	0.033	√	D
THTD-Net	0.9152	0.4062	0.7296	0.5479	3.1385	PyTorch	220	0.08	√	D
STSANet	0.9125	0.3829	0.7227	0.5288	3.0103	PyTorch	643	0.035 (one Titan Xp GPU and 3.2GHz Intel CPU)	√	D
TSFP-Net	0.9116	0.3921	0.7230	0.5168	2.9665	PyTorch	58.4	0.011	√	D
VSFT	0.9109	0.4109	0.7200	0.5185	2.9773	PyTorch	71.4	0.04	√	D
HD2S	0.908	0.406	0.700	0.503	2.812	PyTorch	116	0.03	√	D
ViNet	0.908	0.381	0.729	0.511	2.872	PyTorch	124	0.016	√	D
UNISAL	0.901	0.390	0.691	0.490	2.776	PyTorch	15.5	0.009	√	D&S
SalSAC	0.896	0.357	0.697	0.479	2.673	PyTorch	93.5	0.02	√	D
TASED-Net	0.895	0.361	0.712	0.470	2.667	PyTorch	82	0.06	√	D
STRA-Net	0.895	0.355	0.663	0.458	2.558	Tensorflow	641	0.02	√	D
SalEMA	0.890	0.466	0.667	0.449	2.574	PyTorch	364	0.01	√	D
ACLNet	0.890	0.315	0.601	0.434	2.354	Tensorflow	250	0.02	√	D
SalGAN	0.866	0.262	0.709	0.370	2.043	Theano	130	0.02	√	S
DVA	0.860	0.262	0.595	0.358	2.013	Caffe	96	0.1	√	S
SALICON	0.857	0.232	0.590	0.327	1.901	Caffe	117	0.5	√	S
DeepVS	0.856	0.256	0.583	0.344	1.911	Tensorflow	344	0.05	√	D
Deep-Net	0.855	0.201	0.592	0.331	1.775	Caffe	103	0.08	√	S
Two-stream	0.834	0.197	0.581	0.325	1.632	Caffe	315	20	√	D
UVA-Net	0.833	0.241	0.582	0.307	1.536	-	-	1/2588
Shallow-Net	0.833	0.182	0.529	0.295	1.509	Theano	2500	0.1	√	S
GBVS	0.828	0.186	0.554	0.283	1.474	C		2.7		S
Fang et al.	0.819	0.198	0.537	0.273	1.539	Matlab		147		D
ITTI	0.774	0.162	0.553	0.233	1.207	Matlab		0.9		S
Rudoy et al.	0.769	0.214	0.501	0.285	1.498	Matlab		180		D
Hou et al.	0.726	0.167	0.545	0.150	0.847	Matlab		0.7		D
AWS-D	0.703	0.157	0.513	0.174	0.940	Matlab		9		D
PQFT	0.699	0.139	0.562	0.137	0.749	Matlab		1.2		D
OBDL	0.638	0.171	0.500	0.117	0.495	Matlab		0.8		D
Seo et al.	0.635	0.142	0.499	0.070	0.334	Matlab		2.3		D
MCSDM	0.591	0.110	0.500	0.047	0.247	Matlab		15		D
MSM-SM	0.582	0.143	0.500	0.058	0.245	Matlab		8		D
PIM-ZEN	0.552	0.095	0.498	0.062	0.280	Matlab		43		D
PIM-MCS	0.551	0.094	0.499	0.053	0.242	Matlab		10		D
MAM	0.551	0.108	0.500	0.041	0.214	Matlab		778		D
PMES	0.545	0.093	0.502	0.055	0.237	Matlab		579		D
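The leaderboard rows above are tab-separated, so they can be loaded and re-ranked by any single metric with the Python standard library. The sketch below is only an illustration: the helper names `load_rows` and `rank_by` are hypothetical, and the sample holds four rows copied from the table (metric columns only). All five metrics in this table are higher-is-better.

```python
import csv
import io

# A few rows copied from the DHF1K leaderboard above (metric columns only).
SAMPLE_TSV = (
    "Method\tAUC-J\tSIM\ts-AUC\tCC\tNSS\n"
    "TMFI\t0.9153\t0.4068\t0.7306\t0.5461\t3.1463\n"
    "UNISAL\t0.901\t0.390\t0.691\t0.490\t2.776\n"
    "ACLNet\t0.890\t0.315\t0.601\t0.434\t2.354\n"
    "SalEMA\t0.890\t0.466\t0.667\t0.449\t2.574\n"
)


def load_rows(tsv_text):
    """Parse tab-separated leaderboard text into a list of dicts with float scores."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        row = {"Method": record["Method"]}
        for key, value in record.items():
            if key == "Method":
                continue
            # "-" or an empty/missing cell marks a score that was not reported.
            row[key] = float(value) if value not in ("", "-", None) else None
        rows.append(row)
    return rows


def rank_by(rows, metric):
    """Return method names sorted by one metric, best first, skipping missing scores."""
    scored = [r for r in rows if r.get(metric) is not None]
    return [r["Method"] for r in sorted(scored, key=lambda r: r[metric], reverse=True)]


if __name__ == "__main__":
    rows = load_rows(SAMPLE_TSV)
    for metric in ("NSS", "SIM"):
        print(metric, rank_by(rows, metric))
```

Ranking by different metrics can reorder methods; in this sample, SalEMA has the highest SIM while TMFI has the highest NSS, which is why the table reports all five scores.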

Hollywood-2 video saliency leaderboard

Method	AUC-J	SIM	s-AUC	CC	NSS
TMFI	0.940	0.607	-	0.739	4.095
STSANet	0.938	0.579	-	0.721	3.927
VSFT	0.936	0.577	0.811	0.703	3.916
TSFP-Net	0.936	0.571	-	0.711	3.910
HD2S	0.936	0.551	0.807	0.670	3.352
UNISAL	0.934	0.542	0.795	0.673	3.901
ViNet	0.930	0.550	0.813	0.693	3.73
SalSAC	0.931	0.529	0.712	0.670	3.356
STRA-Net	0.923	0.536	0.774	0.662	3.478
SalEMA	0.919	0.487	0.708	0.613	3.186
TASED-Net	0.918	0.507	0.768	0.646	3.302
ACLNet	0.913	0.542	0.757	0.623	3.086
DeepVS	0.887	0.356	0.693	0.446	2.313
DVA	0.886	0.372	0.727	0.482	2.459
Deep-Net	0.884	0.300	0.736	0.451	2.066
Two-stream	0.863	0.276	0.710	0.382	1.748
Fang et al.	0.859	0.272	0.659	0.358	1.667
SALICON	0.856	0.321	0.711	0.425	2.013
Shallow-Net	0.851	0.276	0.694	0.423	1.680
GBVS	0.837	0.257	0.633	0.308	1.336
ITTI	0.788	0.221	0.607	0.257	1.076
Rudoy et al.	0.783	0.315	0.536	0.302	1.570
Hou et al.	0.731	0.202	0.580	0.146	0.684
PQFT	0.723	0.201	0.621	0.153	0.755
PMES	0.696	0.180	0.620	0.177	0.867
AWS-D	0.694	0.175	0.637	0.146	0.742
MSM-SM	0.683	0.180	0.561	0.132	0.682
PIM-ZEN	0.670	0.167	0.598	0.134	0.667
PIM-MCS	0.663	0.163	0.570	0.118	0.584
Seo et al.	0.652	0.155	0.530	0.076	0.346
PNSP-CS	0.647	0.146	0.548	0.077	0.370
OBDL	0.640	0.170	0.541	0.106	0.462
MAM	0.630	0.153	0.562	0.099	0.494
MCSDM	0.618	0.147	0.524	0.067	0.288

UCF Sports video saliency leaderboard

Method	AUC-J	SIM	s-AUC	CC	NSS
TMFI	0.936	0.565	-	0.707	3.863
STSANet	0.936	0.560	-	0.705	3.908
SalSAC	0.926	0.534	0.806	0.671	3.523
ViNet	0.924	0.522	0.810	0.673	3.62
TSFP-Net	0.923	0.561	-	0.685	3.698
UNISAL	0.918	0.523	0.775	0.644	3.381
STRA-Net	0.910	0.479	0.751	0.593	3.018
SalEMA	0.906	0.431	0.740	0.544	2.638
HD2S	0.904	0.507	0.768	0.604	3.114
TASED-Net	0.899	0.469	0.752	0.582	2.920
ACLNet	0.897	0.406	0.744	0.510	2.567
DVA	0.872	0.339	0.725	0.439	2.311
DeepVS	0.870	0.321	0.691	0.405	2.089
Deep-Net	0.861	0.282	0.719	0.414	1.903
GBVS	0.859	0.274	0.697	0.396	1.818
SALICON	0.848	0.304	0.738	0.375	1.838
ITTI	0.847	0.251	0.725	0.356	1.640
Shallow-Net	0.846	0.276	0.691	0.382	1.789
Fang et al.	0.845	0.307	0.674	0.395	1.787
Two-stream	0.832	0.264	0.685	0.343	1.753
Seo et al.	0.831	0.308	0.666	0.336	1.690
PQFT	0.825	0.250	0.722	0.338	1.780
AWS-D	0.823	0.228	0.750	0.306	1.631
Hou et al.	0.819	0.276	0.674	0.292	1.399
PIM-MCS	0.777	0.238	0.695	0.303	1.596
Rudoy et al.	0.763	0.271	0.637	0.344	1.619
PIM-ZEN	0.760	0.234	0.702	0.306	1.657
OBDL	0.759	0.193	0.634	0.234	1.382
PMES	0.756	0.263	0.714	0.349	1.788
MCSDM	0.756	0.228	0.626	0.230	1.091
PNSP-CS	0.755	0.210	0.628	0.218	1.091
MSM-SM	0.752	0.262	0.634	0.280	1.584
MAM	0.669	0.213	0.624	0.218	1.130

DIEM video saliency leaderboard

Notes: Test set i: use the first 300 frames of 20 test videos for testing; Test set ii: use 17 of 20 test videos for testing; Test set iii: use all frames of all 20 test videos for testing

Test set	Method	AUC-J	SIM	s-AUC	CC	NSS
iii	TMFI	0.920	0.604	-	0.740	3.031
iii	STSANet	0.905	0.548	-	0.690	2.787
ii	TMFI	0.921	0.598	-	0.726	2.956
ii	STSANet	0.906	0.541	-	0.677	2.721
ii	TSFP-Net (Test set ii)	0.906	0.527	-	0.651	2.62
ii	TSFP-Net	0.905	0.529	-	0.649	2.63
ii	ViNet (with audio)	0.899	0.498	0.719	0.632	2.53
ii	ViNet	0.898	0.483	0.723	0.626
i	TMFI	0.916	0.565	-	0.692	2.955
i	STSANet	0.901	0.505	-	0.625	2.618
i	ACLNet	0.881	0.277	0.693	0.396	2.368
i	STRA-Net	0.870	0.306	0.678	0.408	2.452
i	DVA	0.868	0.237	0.721	0.386	2.347
i	Two-stream	0.859	0.256	0.682	0.366	2.171
i	DeepVS	0.857	0.238	0.693	0.371	2.235
i	Deep-Net	0.849	0.164	0.697	0.291	1.650
i	Shallow-Net	0.838	0.188	0.620	0.297	1.646
i	Fang et al.	0.823	0.167	0.636	0.251	1.423
i	GBVS	0.813	0.156	0.633	0.214	1.198
i	SALICON	0.793	0.171	0.674	0.270	1.650
i	ITTI	0.791	0.132	0.653	0.196	1.103
i	Rudoy et al.	0.775	0.150	0.618	0.260	1.390
i	AWS-D	0.774	0.150	0.695	0.216	1.252
i	OBDL	0.762	0.165	0.694	0.221	1.289
i	Hou et al.	0.735	0.142	0.589	0.128	0.735
i	PQFT	0.724	0.126	0.649	0.144	0.856
i	Seo et al.	0.723	0.130	0.568	0.116	0.665
i	MCSDM	0.663	0.105	0.558	0.084	0.466
i	PIM-MCS	0.662	0.110	0.607	0.124	0.709
i	PIM-ZEN	0.660	0.114	0.615	0.132	0.757
i	PMES	0.657	0.122	0.607	0.142	0.817
i	PNSP-CS	0.637	0.091	0.559	0.074	0.417
i	MSM-SM	0.619	0.092	0.571	0.107	0.624
i	MAM	0.579	0.089	0.552	0.072	0.408

LEDOV video saliency leaderboard

Note: the listed results are provided by the author of LEDOV (Lai Jiang: jianglai.china@gmail.com), and the used evaluation codes (https://github.com/remega/LEDOV-eye-tracking-database/tree/master/metrics) are different from ours (https://github.com/wenguanwang/DHF1K).

Method	AUC-J	NSS	CC	KL
DeepVS	0.902	2.999	0.586	1.222
ACLNet	0.897	2.872	0.570	1.445
SalGAN	0.868	2.193	0.428	1.680
DVA	0.885	2.840	0.557	1.323
SALICON	0.851	2.332	0.437	1.635
Sal-DCNN	0.892	2.838	0.573	1.304
GBVS	0.839	1.541	0.322	1.824
Rudoy et al.	0.799	1.454	0.320	2.421
AWS-D	0.795	1.365	0.294	2.023
PQFT	0.699	0.690	0.140	2.461
OBDL	0.801	1.545	0.315	2.053
Xu et al.	0.827	1.475	0.382	1.652
BMS	0.757	0.979	0.214	2.225

Methods

VSFT: Video Saliency Forecasting Transformer, C. Ma, H. Sun, Y. Rao, J. Zhou, J. Lu, TCSVT, 2022.
STSANet: Spatio-Temporal Self-Attention Network for Video Saliency Prediction, Z. Wang, Z. Liu, G. Li, T. Zhang, L. Xu and J. Wang, TMM, 2021.
TSFP-Net: Temporal-Spatial Feature Pyramid for Video Saliency Detection, Q. Chang, S. Zhu, and L. Zhu, arXiv 2105.04213, 2021.
ViNet: Diving Deep into Audio-Visual Saliency Prediction, S. Jain, P. Yarlagadda, R. Subramanian, and V. Gandhi, arXiv 2012.06170, 2020.
HD2S: Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction, G. Bellitto, F. Proietto Salanitri, S. Palazzo, F. Rundo, D. Giordano, and C. Spampinato, IJCV, 2021.
UNISAL: Unified Image and Video Saliency Modeling, R. Droste, J. Jiao, and J. A. Noble, ECCV, 2020.
SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-based ConvLSTM, X. Wu, Z. Wu, J. Zhang, L. Ju, and S. Wang, AAAI, 2020.
UVA-Net: Ultrafast Video Attention Prediction with Coupled Knowledge Distillation, K. Fu, P. Shi, Y. Song, S. Ge, X. Lu, and J. Li, AAAI, 2020.
TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection, K. Min and J. J. Corso, ICCV, 2019.
STRA-Net: Video Saliency Prediction using Spatiotemporal Residual Attentive Networks, Q. Lai, W. Wang, H. Sun, and J. Shen, IEEE Transactions on Image Processing, 2019.
SalEMA: Simple vs complex temporal recurrences for video saliency prediction, P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-i-Nieto, and K. McGuinness, BMVC, 2019.
ACLNet: Revisiting Video Saliency: A Large-scale Benchmark and a New Model, W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji, IEEE CVPR, 2018.
ITTI: A model of saliency-based visual attention for rapid scene analysis, L. Itti, C. Koch, and E. Niebur, IEEE TPAMI, 1998.
GBVS: Graph-based visual saliency, J. Harel, C. Koch, and P. Perona, NIPS, 2007.
SALICON: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks, X. Huang, C. Shen, X. Boix, and Q. Zhao, IEEE ICCV, 2015.
Shallow-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
Deep-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
DVA: Deep visual attention prediction, W. Wang and J. Shen, IEEE Transactions on Image Processing, 2018.
PQFT: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, C. Guo and L. Zhang, IEEE TIP, 2010.
Seo et al.: Static and space-time visual saliency detection by self-resemblance, H. J. Seo and P. Milanfar, Journal of Vision, 2009.
Rudoy et al.: Learning video saliency from human gaze using candidate selection, D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor, IEEE CVPR, 2013.
Hou et al.: Dynamic visual attention: Searching for coding length increments, X. Hou and L. Zhang, NIPS, 2008.
Fang et al.: Video saliency incorporating spatiotemporal cues and uncertainty weighting, Y. Fang, Z. Wang, W. Lin, and Z. Fang, IEEE TIP, 2014.
OBDL: How many bits does it take for a stimulus to be salient? S. Hossein Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan, IEEE CVPR, 2015.
AWS-D: Dynamic whitening saliency, V. Leboran, A. Garcia-Diaz, X. R. Fdez-Vidal, and X. M. Pardo, IEEE TPAMI, 2017.
PMES: A new perceived motion based shot content representation, Y.-F. Ma and H.-J. Zhang, ICIP, 2001.
MAM: A fast algorithm to find the region-of-interest in the compressed mpeg domain, G. Agarwal, A. Anbu, and A. Sinha, IEEE ICME, 2003.
PIM-ZEN: A model of motion attention for video skimming, Y.-F. Ma and H.-J. Zhang, ICIP, 2002.
PIM-MCS: Region-of-interest based compressed domain video transcoding scheme, A. Sinha, G. Agarwal, and A. Anbu, IEEE ICASSP, 2004.
MCSDM: A motion attention model based rate control algorithm for h.264/avc, Z. Liu, H. Yan, L. Shen, Y. Wang, and Z. Zhang, ICCIS, 2009.
MSM-SM: Salient motion detection in compressed domain, K. Muthuswamy and D. Rajan, IEEE SPL, 2013.
PNSP-CS: A video saliency detection model in compressed domain, Y. Fang, W. Lin, Z. Chen, C. M. Tsai, and C. W. Lin, IEEE TCSVT, 2014.
DeepVS: DeepVS: A Deep Learning Based Video Saliency Prediction Approach, L. Jiang, M. Xu, and Z. Wang, ECCV, 2018.
Two-stream: Spatio-temporal saliency networks for dynamic saliency prediction, C. Bak, A. Kocak, E. Erdem, and A. Erdem, IEEE TMM, 2017.
SalGAN: SalGAN: Visual Saliency Prediction with Generative Adversarial Networks, J. Pan, E. Sayrol, X. Giro-i-Nieto, C. C. Ferrer, J. Torres, K. McGuinness, N. E. O’Connor, IEEE CVPR-workshop, 2017.
Sal-DCNN: Image Saliency Prediction in Transformed Domain: A Deep Complex Neural Network Method, L. Jiang, Z. Wang, and M. Xu, AAAI, 2018.
Xu et al.: Learning to detect video saliency with HEVC features, M. Xu, L. Jiang, X. Sun, IEEE TIP, 2017.
BMS: Exploiting surroundedness for saliency detection: a Boolean map approach, J. Zhang and S. Sclaroff, IEEE TPAMI, 2016.

Eye-tracking datasets

dynamic scenes:

static scenes:

12 thoughts on “Revisiting Video Saliency Prediction in the Deep Learning Era”

Chenming Li

22/02/2024 at 16:42

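Hello, professor and testing staff! I sent my model's results on the DHF1K test set to the e-mail address, in the format required on GitHub (model name/video name/saliency map .png), but I have not received any reply. Are submissions still being answered?
- MM Cheng (Post author)
  
  24/03/2024 at 12:32
  
  We have already replied to the earlier e-mail. We are also considering automating the process as much as possible.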
Tom

18/10/2022 at 09:24

Hello, teacher. As mentioned in your github of DHF1K: “Note that, for Holly-wood2 dataset, we used the split videos (each video only contains one shot), instead of the full videos.”. Can you tell us which shot segmentation algorithm is used? Thanks.
- Wenguan Wang
  
  07/12/2022 at 11:42
  
  The videos of Holly-wood2 already come with the shot split.
HuHao

02/04/2021 at 17:13

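Hello, professor. I sent my results on DHF1K to the e-mail address in the format model name/video name/saliency map .png, but I have not received any reply. Is the testing carried out at random?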
- Wenguan Wang
  
  14/05/2021 at 12:42
  
  Please check the e-mail address: dhf1kdataset@gmail.com
  
  We have been replying all along.
  - Tom
    
    17/08/2023 at 11:21
    
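    Professor, are you still replying? I have sent e-mails for the past two consecutive weeks without receiving an answer. Could you please follow up? Thank you.
    - MM Cheng (Post author)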
      
      21/08/2023 at 09:00
      
      Thanks for the reminder. We will check what went wrong.
KrisCao

31/07/2020 at 09:55

Excellent job.
I am wondering what the time in the DHF1K video saliency leaderboard means. Is it the inference time per frame? Since certain models downsample the video in the temporal domain, do you take this into consideration?
- MM Cheng (Post author)
  
  04/08/2020 at 10:51
  
  Thanks for your interest.
  Yes.
  It is the per frame computation time.
  The frame rate does not influence the per-frame speed. In addition, according to our benchmarking settings, the results submitted to our evaluation server should be generated with a unified frame rate.
  –from Wenguan Wang