Revisiting Video Saliency Prediction in the Deep Learning Era
Wenguan Wang¹, Jianbing Shen¹, Jianwen Xie², Ming-Ming Cheng³, Haibin Ling⁴, Ali Borji⁵
¹BLIIT, Beijing Institute of Technology  ²Hikvision Research  ³CCCE, Nankai University  ⁴Temple University  ⁵Markable AI
Abstract
Visual attention in static images has recently attracted a lot of research interest. However, predicting visual attention in general dynamic scenes has rarely been touched. In this work, we contribute to video saliency research in two ways. First, we introduce a new benchmark for predicting human eye movements during dynamic scene free-viewing, which has long been needed in this field. Our dataset, named DHF1K (Dynamic Human Fixation), consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using eye-tracking equipment. The videos span a large range of scenes, motions, object types, and background complexity. Existing video saliency datasets lack variety and generality of common dynamic scenes and fall short in covering challenging situations in unconstrained environments. In contrast, DHF1K makes a significant leap in terms of scalability, diversity, and difficulty, and is expected to boost video saliency modeling. Second, we propose a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, allowing the LSTM to focus on learning more flexible temporal saliency representations across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. We thoroughly examine the performance of our model, ACLNet (Attentive CNN-LSTM Network), against state-of-the-art saliency models on three large-scale datasets (i.e., DHF1K, Hollywood-2, UCF Sports). Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that our model outperforms its competitors and runs fast (10 fps, including all steps, on one GPU).
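The released ACLNet implementation is in TensorFlow (see the source-code links below). For readers who only want the gist of the design sketched in the abstract, here is a minimal PyTorch illustration of the general idea (per-frame CNN features, a static-attention branch, a ConvLSTM for temporal saliency); the backbone choice (VGG-16 conv4_3 features), hidden size, and gating details are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch (assumptions throughout; NOT the released ACLNet): per-frame CNN
# features are reweighted by a static-attention map, then a ConvLSTM models
# temporal dynamics and a 1x1 readout predicts each frame's saliency map.
import torch
import torch.nn as nn
import torchvision

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution produces all four gates at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, hid_ch=256):
        super().__init__()
        # VGG-16 features up to conv4_3 (512 channels, stride 8) -- an assumed backbone
        self.backbone = torchvision.models.vgg16(weights=None).features[:23]
        self.attention = nn.Sequential(           # static saliency-like attention branch
            nn.Conv2d(512, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())
        self.lstm = ConvLSTMCell(512, hid_ch)
        self.readout = nn.Conv2d(hid_ch, 1, 1)    # per-frame saliency prediction

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        h = c = None
        maps = []
        for t in range(T):
            feat = self.backbone(clip[:, t])      # (B, 512, h, w)
            att = self.attention(feat)            # (B, 1, h, w), values in [0, 1]
            feat = feat * (1.0 + att)             # residual attention gating
            if h is None:                         # lazily initialize the LSTM state
                h = feat.new_zeros(B, self.lstm.hid_ch, *feat.shape[-2:])
                c = torch.zeros_like(h)
            h, c = self.lstm(feat, (h, c))
            maps.append(torch.sigmoid(self.readout(h)))
        return torch.stack(maps, dim=1)           # (B, T, 1, h, w)

# Example: a 4-frame clip at 224x224 -> four 28x28 saliency maps
model = AttentiveCNNLSTM()
print(model(torch.randn(1, 4, 3, 224, 224)).shape)  # torch.Size([1, 4, 1, 28, 28])
```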
Paper
- Revisiting Video Saliency Prediction in the Deep Learning Era, Wenguan Wang, Jianbing Shen, Jianwen Xie, Ming-Ming Cheng, Haibin Ling, Ali Borji, IEEE TPAMI, 2021. [pdf] [bib] [project page] [official version] [source code]
- Revisiting Video Saliency: A Large-scale Benchmark and a New Model, Wenguan Wang, Jianbing Shen, Fang Guo, Ming-Ming Cheng, Ali Borji, IEEE CVPR, 2018. [pdf] [source code] [bib] [project page]
Related Project
- Shifting More Attention to Video Salient Object Detection, Deng-Ping Fan, Wenguan Wang, Ming-Ming Cheng, Jianbing Shen, IEEE CVPR, 2019, Oral presentation, Best Paper Finalist (45/5160, 0.87%). [project page | bib | official version | Chinese PDF] [poster | oral ppt | oral video | Code | Results | DAVSOD Dataset (Baidu [fetch code: ivzo] | Google)]
DHF1K Dataset
Our dataset contains 1,000 annotated videos, split into 600 training videos (001.AVI-600.AVI), 100 validation videos (601.AVI-700.AVI), and 300 testing videos (701.AVI-1000.AVI). The annotations for the training and validation sets are released, but the annotations of the testing set are held out for benchmarking. Detailed instructions for submitting results can be found here.
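For convenience, a hypothetical helper (not part of the official toolkit) is sketched below for writing per-frame predictions in the model name/video name/saliency map (.png) layout mentioned in the submission instructions and in the Q&A at the end of this page; the zero-padded, 1-indexed frame naming used here is an assumption, so follow the official instructions when submitting.

```python
# Hypothetical submission helper (assumption: 1-indexed, zero-padded frame names);
# the official submission instructions are authoritative.
import os
import cv2
import numpy as np

def save_predictions(model_name, video_name, saliency_maps, out_root="submission"):
    """Write float saliency maps (values in [0, 1], one per frame) as 8-bit PNGs
    under <out_root>/<model_name>/<video_name>/."""
    out_dir = os.path.join(out_root, model_name, video_name)
    os.makedirs(out_dir, exist_ok=True)
    for idx, smap in enumerate(saliency_maps, start=1):
        png = (255 * np.clip(smap, 0.0, 1.0)).astype(np.uint8)
        cv2.imwrite(os.path.join(out_dir, "%04d.png" % idx), png)

# Example: dummy predictions for three frames of test video 701.AVI
save_predictions("MyModel", "701", [np.random.rand(360, 640) for _ in range(3)])
```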
Citation
@ARTICLE{wang2019revisiting,
  author  = {Wenguan Wang and Jianbing Shen and Jianwen Xie and Ming-Ming Cheng and Haibin Ling and Ali Borji},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title   = {Revisiting Video Saliency Prediction in the Deep Learning Era},
  year    = {2019},
}

@inproceedings{wang2018revisiting,
  title     = {Revisiting Video Saliency: A Large-scale Benchmark and a New Model},
  author    = {Wang, Wenguan and Shen, Jianbing and Guo, Fang and Cheng, Ming-Ming and Borji, Ali},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2018},
}
Contact
If you have any questions, drop us an e-mail at <wenguanwang.ai@gmail.com>.
Evaluation Code
https://github.com/wenguanwang/DHF1K/blob/master/ACL-evaluation.rar
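The archive above contains the official evaluation code. As a quick sanity check only, below is a minimal NumPy sketch of three of the leaderboard metrics (NSS, CC, SIM) under their standard definitions; the AUC variants and any preprocessing (resizing, blurring, etc.) are handled by the official code and may differ from this sketch.

```python
# Minimal NumPy sketch of NSS, CC, and SIM under their standard definitions;
# use the official evaluation code above for leaderboard-comparable numbers.
import numpy as np

def nss(saliency_map, fixation_map):
    """Normalized Scanpath Saliency: mean of the standardized saliency values
    at fixated pixels (fixation_map is a binary array of fixation locations)."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    return float(s[fixation_map.astype(bool)].mean())

def cc(saliency_map, gt_density):
    """Linear Correlation Coefficient between the prediction and the
    ground-truth fixation density map."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    g = (gt_density - gt_density.mean()) / (gt_density.std() + 1e-12)
    return float((s * g).mean())

def sim(saliency_map, gt_density):
    """Similarity (histogram intersection): both maps are normalized to sum to
    one, then the element-wise minimum is summed."""
    s = saliency_map / (saliency_map.sum() + 1e-12)
    g = gt_density / (gt_density.sum() + 1e-12)
    return float(np.minimum(s, g).sum())
```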
DHF1K video saliency leaderboard
Notes: DLM: Deep Learning Model. D/S: Dynamic (D) or Static (S) model. Python is the default programming language when standard deep learning toolkits (e.g., TensorFlow, Caffe, or Theano) are used. Our default testing environment uses one Titan X GPU and a 4.0 GHz Intel CPU.
Method | AUC-J | SIM | s-AUC | CC | NSS | Implement. | Size (MB) | Time (s) | DLM | D/S |
---|---|---|---|---|---|---|---|---|---|---|
SalFoM | 0.9222 | 0.4208 | 0.7352 | 0.5692 | 3.3536 | PyTorch | 1574 | 0.6 | √ | D |
TMFI | 0.9153 | 0.4068 | 0.7306 | 0.5461 | 3.1463 | PyTorch | 234 | 0.033 | √ | D |
THTD-Net | 0.9152 | 0.4062 | 0.7296 | 0.5479 | 3.1385 | PyTorch | 220 | 0.08 | √ | D |
STSANet | 0.9125 | 0.3829 | 0.7227 | 0.5288 | 3.0103 | PyTorch | 643 | 0.035 (one Titan Xp GPU and 3.2GHz Intel CPU) | √ | D |
TSFP-Net | 0.9116 | 0.3921 | 0.7230 | 0.5168 | 2.9665 | PyTorch | 58.4 | 0.011 | √ | D |
VSFT | 0.9109 | 0.4109 | 0.7200 | 0.5185 | 2.9773 | PyTorch | 71.4 | 0.04 | √ | D |
HD2S | 0.908 | 0.406 | 0.700 | 0.503 | 2.812 | PyTorch | 116 | 0.03 | √ | D |
ViNet | 0.908 | 0.381 | 0.729 | 0.511 | 2.872 | PyTorch | 124 | 0.016 | √ | D |
UNISAL | 0.901 | 0.390 | 0.691 | 0.490 | 2.776 | PyTorch | 15.5 | 0.009 | √ | D&S |
SalSAC | 0.896 | 0.357 | 0.697 | 0.479 | 2.673 | PyTorch | 93.5 | 0.02 | √ | D |
TASED-Net | 0.895 | 0.361 | 0.712 | 0.470 | 2.667 | PyTorch | 82 | 0.06 | √ | D |
STRA-Net | 0.895 | 0.355 | 0.663 | 0.458 | 2.558 | Tensorflow | 641 | 0.02 | √ | D |
SalEMA | 0.890 | 0.466 | 0.667 | 0.449 | 2.574 | PyTorch | 364 | 0.01 | √ | D |
ACLNet | 0.890 | 0.315 | 0.601 | 0.434 | 2.354 | Tensorflow | 250 | 0.02 | √ | D |
SalGAN | 0.866 | 0.262 | 0.709 | 0.370 | 2.043 | Theano | 130 | 0.02 | √ | S |
DVA | 0.860 | 0.262 | 0.595 | 0.358 | 2.013 | Caffe | 96 | 0.1 | √ | S |
SALICON | 0.857 | 0.232 | 0.590 | 0.327 | 1.901 | Caffe | 117 | 0.5 | √ | S |
DeepVS | 0.856 | 0.256 | 0.583 | 0.344 | 1.911 | Tensorflow | 344 | 0.05 | √ | D |
Deep-Net | 0.855 | 0.201 | 0.592 | 0.331 | 1.775 | Caffe | 103 | 0.08 | √ | S |
Two-stream | 0.834 | 0.197 | 0.581 | 0.325 | 1.632 | Caffe | 315 | 20 | √ | D |
UVA-Net | 0.833 | 0.241 | 0.582 | 0.307 | 1.536 | - | - | 1/2588 | | |
Shallow-Net | 0.833 | 0.182 | 0.529 | 0.295 | 1.509 | Theano | 2500 | 0.1 | √ | S |
GBVS | 0.828 | 0.186 | 0.554 | 0.283 | 1.474 | C | - | 2.7 | | S |
Fang et al. | 0.819 | 0.198 | 0.537 | 0.273 | 1.539 | Matlab | - | 147 | | D |
ITTI | 0.774 | 0.162 | 0.553 | 0.233 | 1.207 | Matlab | - | 0.9 | | S |
Rudoy et al. | 0.769 | 0.214 | 0.501 | 0.285 | 1.498 | Matlab | - | 180 | | D |
Hou et al. | 0.726 | 0.167 | 0.545 | 0.150 | 0.847 | Matlab | - | 0.7 | | D |
AWS-D | 0.703 | 0.157 | 0.513 | 0.174 | 0.940 | Matlab | - | 9 | | D |
PQFT | 0.699 | 0.139 | 0.562 | 0.137 | 0.749 | Matlab | - | 1.2 | | D |
OBDL | 0.638 | 0.171 | 0.500 | 0.117 | 0.495 | Matlab | - | 0.8 | | D |
Seo et al. | 0.635 | 0.142 | 0.499 | 0.070 | 0.334 | Matlab | - | 2.3 | | D |
MCSDM | 0.591 | 0.110 | 0.500 | 0.047 | 0.247 | Matlab | - | 15 | | D |
MSM-SM | 0.582 | 0.143 | 0.500 | 0.058 | 0.245 | Matlab | - | 8 | | D |
PIM-ZEN | 0.552 | 0.095 | 0.498 | 0.062 | 0.280 | Matlab | - | 43 | | D |
PIM-MCS | 0.551 | 0.094 | 0.499 | 0.053 | 0.242 | Matlab | - | 10 | | D |
MAM | 0.551 | 0.108 | 0.500 | 0.041 | 0.214 | Matlab | - | 778 | | D |
PMES | 0.545 | 0.093 | 0.502 | 0.055 | 0.237 | Matlab | - | 579 | | D |
Hollywood-2 video saliency leaderboard
Method | AUC-J | SIM | s-AUC | CC | NSS |
---|---|---|---|---|---|
TMFI | 0.940 | 0.607 | - | 0.739 | 4.095 |
STSANet | 0.938 | 0.579 | - | 0.721 | 3.927 |
VSFT | 0.936 | 0.577 | 0.811 | 0.703 | 3.916 |
TSFP-Net | 0.936 | 0.571 | - | 0.711 | 3.910 |
HD2S | 0.936 | 0.551 | 0.807 | 0.670 | 3.352 |
UNISAL | 0.934 | 0.542 | 0.795 | 0.673 | 3.901 |
ViNet | 0.930 | 0.550 | 0.813 | 0.693 | 3.73 |
SalSAC | 0.931 | 0.529 | 0.712 | 0.670 | 3.356 |
STRA-Net | 0.923 | 0.536 | 0.774 | 0.662 | 3.478 |
SalEMA | 0.919 | 0.487 | 0.708 | 0.613 | 3.186 |
TASED-Net | 0.918 | 0.507 | 0.768 | 0.646 | 3.302 |
ACLNet | 0.913 | 0.542 | 0.757 | 0.623 | 3.086 |
DeepVS | 0.887 | 0.356 | 0.693 | 0.446 | 2.313 |
DVA | 0.886 | 0.372 | 0.727 | 0.482 | 2.459 |
Deep-Net | 0.884 | 0.300 | 0.736 | 0.451 | 2.066 |
Two-stream | 0.863 | 0.276 | 0.710 | 0.382 | 1.748 |
Fang et al. | 0.859 | 0.272 | 0.659 | 0.358 | 1.667 |
SALICON | 0.856 | 0.321 | 0.711 | 0.425 | 2.013 |
Shallow-Net | 0.851 | 0.276 | 0.694 | 0.423 | 1.680 |
GBVS | 0.837 | 0.257 | 0.633 | 0.308 | 1.336 |
ITTI | 0.788 | 0.221 | 0.607 | 0.257 | 1.076 |
Rudoy et al. | 0.783 | 0.315 | 0.536 | 0.302 | 1.570 |
Hou et al. | 0.731 | 0.202 | 0.580 | 0.146 | 0.684 |
PQFT | 0.723 | 0.201 | 0.621 | 0.153 | 0.755 |
PMES | 0.696 | 0.180 | 0.620 | 0.177 | 0.867 |
AWS-D | 0.694 | 0.175 | 0.637 | 0.146 | 0.742 |
MSM-SM | 0.683 | 0.180 | 0.561 | 0.132 | 0.682 |
PIM-ZEN | 0.670 | 0.167 | 0.598 | 0.134 | 0.667 |
PIM-MCS | 0.663 | 0.163 | 0.570 | 0.118 | 0.584 |
Seo et al. | 0.652 | 0.155 | 0.530 | 0.076 | 0.346 |
PNSP-CS | 0.647 | 0.146 | 0.548 | 0.077 | 0.370 |
OBDL | 0.640 | 0.170 | 0.541 | 0.106 | 0.462 |
MAM | 0.630 | 0.153 | 0.562 | 0.099 | 0.494 |
MCSDM | 0.618 | 0.147 | 0.524 | 0.067 | 0.288 |
UCF Sports video saliency leaderboard
Method | AUC-J | SIM | s-AUC | CC | NSS |
---|---|---|---|---|---|
TMFI | 0.936 | 0.565 | - | 0.707 | 3.863 |
STSANet | 0.936 | 0.560 | - | 0.705 | 3.908 |
SalSAC | 0.926 | 0.534 | 0.806 | 0.671 | 3.523 |
ViNet | 0.924 | 0.522 | 0.810 | 0.673 | 3.62 |
TSFP-Net | 0.923 | 0.561 | - | 0.685 | 3.698 |
UNISAL | 0.918 | 0.523 | 0.775 | 0.644 | 3.381 |
STRA-Net | 0.910 | 0.479 | 0.751 | 0.593 | 3.018 |
SalEMA | 0.906 | 0.431 | 0.740 | 0.544 | 2.638 |
HD2S | 0.904 | 0.507 | 0.768 | 0.604 | 3.114 |
TASED-Net | 0.899 | 0.469 | 0.752 | 0.582 | 2.920 |
ACLNet | 0.897 | 0.406 | 0.744 | 0.510 | 2.567 |
DVA | 0.872 | 0.339 | 0.725 | 0.439 | 2.311 |
DeepVS | 0.870 | 0.321 | 0.691 | 0.405 | 2.089 |
Deep-Net | 0.861 | 0.282 | 0.719 | 0.414 | 1.903 |
GBVS | 0.859 | 0.274 | 0.697 | 0.396 | 1.818 |
SALICON | 0.848 | 0.304 | 0.738 | 0.375 | 1.838 |
ITTI | 0.847 | 0.251 | 0.725 | 0.356 | 1.640 |
Shallow-Net | 0.846 | 0.276 | 0.691 | 0.382 | 1.789 |
Fang et al. | 0.845 | 0.307 | 0.674 | 0.395 | 1.787 |
Two-stream | 0.832 | 0.264 | 0.685 | 0.343 | 1.753 |
Seo et al. | 0.831 | 0.308 | 0.666 | 0.336 | 1.690 |
PQFT | 0.825 | 0.250 | 0.722 | 0.338 | 1.780 |
AWS-D | 0.823 | 0.228 | 0.750 | 0.306 | 1.631 |
Hou et al. | 0.819 | 0.276 | 0.674 | 0.292 | 1.399 |
PIM-MCS | 0.777 | 0.238 | 0.695 | 0.303 | 1.596 |
Rudoy et al. | 0.763 | 0.271 | 0.637 | 0.344 | 1.619 |
PIM-ZEN | 0.760 | 0.234 | 0.702 | 0.306 | 1.657 |
OBDL | 0.759 | 0.193 | 0.634 | 0.234 | 1.382 |
PMES | 0.756 | 0.263 | 0.714 | 0.349 | 1.788 |
MCSDM | 0.756 | 0.228 | 0.626 | 0.230 | 1.091 |
PNSP-CS | 0.755 | 0.210 | 0.628 | 0.218 | 1.091 |
MSM-SM | 0.752 | 0.262 | 0.634 | 0.280 | 1.584 |
MAM | 0.669 | 0.213 | 0.624 | 0.218 | 1.130 |
DIEM video saliency leaderboard
Notes: Test set (i) uses the first 300 frames of the 20 test videos; Test set (ii) uses 17 of the 20 test videos; Test set (iii) uses all frames of all 20 test videos.
Test set | Method | AUC-J | SIM | s-AUC | CC | NSS |
---|---|---|---|---|---|---|
iii | TMFI | 0.920 | 0.604 | - | 0.740 | 3.031 |
iii | STSANet | 0.905 | 0.548 | - | 0.690 | 2.787 |
ii | TMFI | 0.921 | 0.598 | - | 0.726 | 2.956 |
ii | STSANet | 0.906 | 0.541 | - | 0.677 | 2.721 |
ii | TSFP-Net (Test set ii) | 0.906 | 0.527 | - | 0.651 | 2.62 |
ii | TSFP-Net | 0.905 | 0.529 | - | 0.649 | 2.63 |
ii | ViNet (with audio) | 0.899 | 0.498 | 0.719 | 0.632 | 2.53 |
ii | ViNet | 0.898 | 0.483 | 0.723 | 0.626 | |
i | TMFI | 0.916 | 0.565 | - | 0.692 | 2.955 |
i | STSANet | 0.901 | 0.505 | - | 0.625 | 2.618 |
i | ACLNet | 0.881 | 0.277 | 0.693 | 0.396 | 2.368 |
i | STRA-Net | 0.870 | 0.306 | 0.678 | 0.408 | 2.452 |
i | DVA | 0.868 | 0.237 | 0.721 | 0.386 | 2.347 |
i | Two-stream | 0.859 | 0.256 | 0.682 | 0.366 | 2.171 |
i | DeepVS | 0.857 | 0.238 | 0.693 | 0.371 | 2.235 |
i | Deep-Net | 0.849 | 0.164 | 0.697 | 0.291 | 1.650 |
i | Shallow-Net | 0.838 | 0.188 | 0.620 | 0.297 | 1.646 |
i | Fang et al. | 0.823 | 0.167 | 0.636 | 0.251 | 1.423 |
i | GBVS | 0.813 | 0.156 | 0.633 | 0.214 | 1.198 |
i | SALICON | 0.793 | 0.171 | 0.674 | 0.270 | 1.650 |
i | ITTI | 0.791 | 0.132 | 0.653 | 0.196 | 1.103 |
i | Rudoy et al. | 0.775 | 0.150 | 0.618 | 0.260 | 1.390 |
i | AWS-D | 0.774 | 0.150 | 0.695 | 0.216 | 1.252 |
i | OBDL | 0.762 | 0.165 | 0.694 | 0.221 | 1.289 |
i | Hou et al. | 0.735 | 0.142 | 0.589 | 0.128 | 0.735 |
i | PQFT | 0.724 | 0.126 | 0.649 | 0.144 | 0.856 |
i | Seo et al. | 0.723 | 0.130 | 0.568 | 0.116 | 0.665 |
i | MCSDM | 0.663 | 0.105 | 0.558 | 0.084 | 0.466 |
i | PIM-MCS | 0.662 | 0.110 | 0.607 | 0.124 | 0.709 |
i | PIM-ZEN | 0.660 | 0.114 | 0.615 | 0.132 | 0.757 |
i | PMES | 0.657 | 0.122 | 0.607 | 0.142 | 0.817 |
i | PNSP-CS | 0.637 | 0.091 | 0.559 | 0.074 | 0.417 |
i | MSM-SM | 0.619 | 0.092 | 0.571 | 0.107 | 0.624 |
i | MAM | 0.579 | 0.089 | 0.552 | 0.072 | 0.408 |
LEDOV video saliency leaderboard
Note: the listed results are provided by the author of LEDOV (Lai Jiang: jianglai.china@gmail.com), and the evaluation code used (https://github.com/remega/LEDOV-eye-tracking-database/tree/master/metrics) differs from ours (https://github.com/wenguanwang/DHF1K).
Method | AUC-J | NSS | CC | KL |
---|---|---|---|---|
DeepVS | 0.902 | 2.999 | 0.586 | 1.222 |
ACLNet | 0.897 | 2.872 | 0.570 | 1.445 |
SalGAN | 0.868 | 2.193 | 0.428 | 1.680 |
DVA | 0.885 | 2.840 | 0.557 | 1.323 |
SALICON | 0.851 | 2.332 | 0.437 | 1.635 |
Sal-DCNN | 0.892 | 2.838 | 0.573 | 1.304 |
GBVS | 0.839 | 1.541 | 0.322 | 1.824 |
Rudoy et al. | 0.799 | 1.454 | 0.320 | 2.421 |
AWS-D | 0.795 | 1.365 | 0.294 | 2.023 |
PQFT | 0.699 | 0.690 | 0.140 | 2.461 |
OBDL | 0.801 | 1.545 | 0.315 | 2.053 |
Xu et al. | 0.827 | 1.475 | 0.382 | 1.652 |
BMS | 0.757 | 0.979 | 0.214 | 2.225 |
Methods
- VSFT: Video Saliency Forecasting Transformer, C. Ma, H. Sun, Y. Rao, J. Zhou, J. Lu, TCSVT, 2022.
- STSANet: Spatio-Temporal Self-Attention Network for Video Saliency Prediction, Z. Wang, Z. Liu, G. Li, T. Zhang, L. Xu and J. Wang, TMM, 2021.
- TSFP-Net: Temporal-Spatial Feature Pyramid for Video Saliency Detection, Q. Chang, S. Zhu, and L. Zhu, arXiv:2105.04213, 2021.
- ViNet: Diving Deep into Audio-Visual Saliency Prediction, S. Jain, P. Yarlagadda, R. Subramanian, and V. Gandhi, arXiv:2012.06170, 2020.
- HD2S: Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction, G. Bellitto, F. Proietto Salanitri, S. Palazzo, F. Rundo, D. Giordano, and C. Spampinato, IJCV, 2021.
- UNISAL: Unified Image and Video Saliency Modeling, R. Droste, J. Jiao, and J. A. Noble, ECCV, 2020.
- SalSAC: A Video Saliency Prediction Model with Shuffled Attentions and Correlation-based ConvLSTM, X. Wu, Z. Wu, J. Zhang, L. Ju, and S. Wang, AAAI, 2020.
- UVA-Net: Ultrafast Video Attention Prediction with Coupled Knowledge Distillation, K. Fu, P. Shi, Y. Song, S. Ge, X. Lu, and J. Li, AAAI, 2020.
- TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection, K. Min and J. J. Corso, ICCV, 2019.
- STRA-Net: Video Saliency Prediction using Spatiotemporal Residual Attentive Networks, Q. Lai, W. Wang, H. Sun, and J. Shen, IEEE TIP, 2019.
- SalEMA: Simple vs. complex temporal recurrences for video saliency prediction, P. Linardos, E. Mohedano, J. J. Nieto, N. E. O’Connor, X. Giro-i-Nieto, and K. McGuinness, BMVC, 2019.
- ACLNet: Revisiting Video Saliency: A Large-scale Benchmark and a New Model, W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji, IEEE CVPR, 2018.
- ITTI: A model of saliency-based visual attention for rapid scene analysis, L. Itti, C. Koch, and E. Niebur, IEEE TPAMI, 1998.
- GBVS: Graph-based visual saliency, J. Harel, C. Koch, and P. Perona, NIPS, 2007.
- SALICON: SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks, X. Huang, C. Shen, X. Boix, and Q. Zhao, IEEE ICCV, 2015.
- Shallow-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
- Deep-Net: Shallow and deep convolutional networks for saliency prediction, J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, IEEE CVPR, 2016.
- DVA: Deep visual attention prediction, W. Wang and J. Shen, IEEE TIP, 2018.
- PQFT: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression, C. Guo and L. Zhang, IEEE TIP, 2010.
- Seo et al.: Static and space-time visual saliency detection by self-resemblance, H. J. Seo and P. Milanfar, Journal of Vision, 2009.
- Rudoy et al.: Learning video saliency from human gaze using candidate selection, D. Rudoy, D. B. Goldman, E. Shechtman, and L. Zelnik-Manor, IEEE CVPR, 2013.
- Hou et al.: Dynamic visual attention: Searching for coding length increments, X. Hou and L. Zhang, NIPS, 2008.
- Fang et al.: Video saliency incorporating spatiotemporal cues and uncertainty weighting, Y. Fang, Z. Wang, W. Lin, and Z. Fang, IEEE TIP, 2014.
- OBDL: How many bits does it take for a stimulus to be salient? S. Hossein Khatoonabadi, N. Vasconcelos, I. V. Bajic, and Y. Shan, IEEE CVPR, 2015.
- AWS-D: Dynamic whitening saliency, V. Leboran, A. Garcia-Diaz, X. R. Fdez-Vidal, and X. M. Pardo, IEEE TPAMI, 2017.
- PMES: A new perceived motion based shot content representation, Y.-F. Ma and H.-J. Zhang, ICIP, 2001.
- MAM: A fast algorithm to find the region-of-interest in the compressed mpeg domain, G. Agarwal, A. Anbu, and A. Sinha, IEEE ICME, 2003.
- PIM-ZEN: A model of motion attention for video skimming, Y.-F. Ma and H.-J. Zhang, ICIP, 2002.
- PIM-MCS: Region-of-interest based compressed domain video transcoding scheme, A. Sinha, G. Agarwal, and A. Anbu, IEEE ICASSP, 2004.
- MCSDM: A motion attention model based rate control algorithm for H.264/AVC, Z. Liu, H. Yan, L. Shen, Y. Wang, and Z. Zhang, ICCIS, 2009.
- MSM-SM: Salient motion detection in compressed domain, K. Muthuswamy and D. Rajan, IEEE SPL, 2013.
- PNSP-CS: A video saliency detection model in compressed domain, Y. Fang, W. Lin, Z. Chen, C. M. Tsai, and C. W. Lin, IEEE TCSVT, 2014.
- DeepVS: DeepVS: A Deep Learning Based Video Saliency Prediction Approach, L. Jiang, M. Xu, and Z. Wang, ECCV, 2018.
- Two-stream: Spatio-temporal saliency networks for dynamic saliency prediction, C. Bak, A. Kocak, E. Erdem, and A. Erdem, IEEE TMM, 2017.
- SalGAN: SalGAN: Visual Saliency Prediction with Generative Adversarial Networks, J. Pan, E. Sayrol, X. Giro-i-Nieto, C. C. Ferrer, J. Torres, K. McGuinness, N. E. O’Connor, IEEE CVPR Workshop, 2017.
- Sal-DCNN: Image Saliency Prediction in Transformed Domain: A Deep Complex Neural Network Method, L. Jiang, Z. Wang, and M. Xu, AAAI, 2018.
- Xu et al.: Learning to detect video saliency with HEVC features, M. Xu, L. Jiang, and X. Sun, IEEE TIP, 2017.
- BMS: Exploiting surroundedness for saliency detection: a Boolean map approach, J. Zhang and S. Sclaroff, IEEE TPAMI, 2016.
Eye-tracking datasets
- Dynamic scenes
- Static scenes
Q&A
Q: Hello! I sent my model's results on the DHF1K test set to the submission email in the format required on GitHub (model name/video name/saliency map .png), but have not received a reply. Are submissions still being answered?
A: We already replied to the earlier email. We are also considering automating the evaluation as much as possible.
Q: Hello! As mentioned in the DHF1K GitHub repository: "Note that, for Hollywood-2 dataset, we used the split videos (each video only contains one shot), instead of the full videos." Could you tell us which shot segmentation algorithm was used? Thanks.
A: The Hollywood-2 videos already come with the shot split.
Q: Hello. I sent my DHF1K results to the email address in the required format (model name/video name/saliency map .png), but have not received a reply. Is the evaluation carried out at random intervals?
A: Please double-check the email address: dhf1kdataset@gmail.com. We have been replying to submissions all along.
Q: Are submissions still being answered? I have sent emails for two consecutive weeks without receiving any reply; could you please look into it? Thank you.
A: Thanks for the reminder; we will check what went wrong.
Q: Excellent job. I am wondering what the time in the DHF1K video saliency leaderboard means. Is it the inference time per frame? Since certain models downsample the video in the temporal domain, do you take this into consideration?
A: Thanks for your interest. Yes, it is the per-frame computation time. The frame rate does not influence the per-frame speed. In addition, according to our benchmarking settings, the results submitted to our evaluation server should be generated at a unified frame rate. -- Wenguan Wang