Research

Shifting More Attention to Video Salient Object Detection

Deng-Ping Fan1, Wenguan Wang2, Ming-Ming Cheng1, Jianbing Shen2,3

1TKLNDST, CS, Nankai University      2Inception Institute of Artificial Intelligence (IIAI)     3Beijing Institute of Technology

Abstract

The last decade has witnessed a growing interest in video salient object detection (VSOD). However, the research community has long lacked a well-established VSOD dataset that is representative of real dynamic scenes and has high-quality annotations. To address this issue, we elaborately collected a visual-attention-consistent Densely Annotated VSOD (DAVSOD) dataset, which contains 226 videos with 23,938 frames covering diverse realistic scenes, objects, instances, and motions. With corresponding real human eye-fixation data, we obtain precise ground truths. This is the first work that explicitly emphasizes the challenge of saliency shift, i.e., that the video salient object(s) may change dynamically. To further contribute a complete benchmark to the community, we systematically assess 17 representative VSOD algorithms over seven existing VSOD datasets and our DAVSOD, totaling ~84K frames (the largest scale to date). Using three popular metrics, we then present a comprehensive and insightful performance analysis. Furthermore, we propose a baseline model. It is equipped with a saliency-shift-aware convLSTM, which can efficiently capture video saliency dynamics by learning human attention-shift behavior. Extensive experiments open up promising future directions for model development and comparison.

Notion of saliency shift

Saliency shift is not represented merely as a binary signal indicating whether a shift occurs in a given frame. Since we focus on an object-level task, we change the saliency values of different objects according to the shift of human attention.
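
As a rough illustration of this labeling idea (a hypothetical helper, not the actual annotation tooling), the sketch below assigns each instance a per-frame saliency label depending on whether recorded fixation points fall inside its mask, so the set of salient objects can change from frame to frame as attention shifts. In the real DAVSOD annotation, such fixation-derived labels were additionally checked in a verification stage (cf. Figure 4).

```python
import numpy as np

def object_level_saliency(instance_masks, fixation_points):
    """Toy fixation-guided, object-level labeling for a single frame.

    instance_masks : list of (H, W) boolean arrays, one per object instance.
    fixation_points: (N, 2) integer array of (row, col) eye-fixation pixels
                     recorded for the current frame.

    An object is marked salient (1) only if at least one fixation point
    lands on it, so the labels follow the viewer's attention shift.
    """
    fixation_points = np.asarray(fixation_points, dtype=int)
    labels = []
    for mask in instance_masks:
        hit = (len(fixation_points) > 0 and
               mask[fixation_points[:, 0], fixation_points[:, 1]].any())
        labels.append(int(hit))
    return labels
```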

Figure 1: Annotation examples of our DAVSOD dataset. The rich annotations, including saliency shift, object-/instance-level ground-truths (GT), salient object numbers, scene/object categories, and camera/object motions, provide a solid foundation for VSOD task and benefit a wide range of potential applications.

Paper

Most related projects on this website

Statistics of the proposed DAVSOD

Figure 2: Statistics of the proposed DAVSOD dataset. (a) Scene/object categories. (b, c) Distribution of annotated instances and image frames, respectively. (d) Ratio distribution of the objects/instances. (e) Mutual dependencies among scene categories in (a).

Downloads

1. DAVSOD dataset.

New!!! DAVSOD-name-v2.xlsx removes the descriptions of the missing videos (0124, 0291, 0413, 0556). Some videos, e.g., 0189_1, share the same attributes as 0189. These shared videos are: 0064_1, 0066_1, 0182_1, 0189_1, 0256_1, 0616_1, 0675_1, 0345_1, 0590_2, 0318_1, 0328_1, 0590_1, 0194_1, 0321_1, 0590_3.

Note that we merged the short sequences of the training set, so it now contains only 61 sequences rather than the 90 training sequences reported in the CVPR 2019 paper.

Figure 3: Sample sequences from our dataset, with instance-level ground-truth segmentation masks and fixation maps overlaid. Please refer to the accompanying video (VisualSaliency.mp4) for a visualization of the dataset.

Figure 4: Examples of segmentations that passed or were rejected in the verification stage.
Figure 5: Example sequence illustrating the saliency shift considered in the proposed DAVSOD dataset. Unlike traditional work, which labels all salient objects (2nd row) on static frames without a dynamic, human eye-fixation-guided annotation methodology, the proposed DAVSOD dataset (5th row) is strictly annotated according to real human fixation records (3rd row) and thus reflects the real human attention mechanism during dynamic viewing.

Table 1: Statistics of previous VSOD datasets and the proposed DAVSOD dataset. From left to right: number of videos (#Vi.), number of annotated frames (#AF.), high-quality annotation (HQ), large size (> 100 videos) of the dataset (SIZE), whether the attention-shift (AS) phenomenon is considered, whether salient objects are annotated according to human fixation points (FP), whether the eye-fixation points of annotated salient objects are provided (EF), whether instance-level annotation is provided (IL), whether a video description is provided (DE), whether attribute annotation is provided (AN), and whether labeling is dense, i.e., per-frame (DL). Our dataset is the only one meeting all requirements. SegV1 and SegV2 were originally introduced to evaluate tracking algorithms and have since been widely used for video segmentation and VSOD. SegV1 is a subset of SegV2. For the penguin clip in SegV2, only the few penguins in the center are annotated as salient objects in the original ground truth, so Liu et al. [12] relabeled this sequence to generate higher-quality annotations.

2. SSAV model

  • Caffe version: https://github.com/DengPingFan/DAVSOD
  • PyTorch version: coming soon (possibly after the CVPR 2020 deadline); a generic ConvLSTM sketch is given below in the meantime.
  • TensorFlow version: coming soon (possibly after the CVPR 2020 deadline).
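
The official implementation above is in Caffe. For readers waiting on the PyTorch release, the following is only a minimal sketch of a generic ConvLSTM cell of the kind the SSAV baseline builds on; it is not the SSAV model itself and omits the saliency-shift-aware attention, for which the released Caffe code remains the reference.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell; a sketch, not the released SSAV implementation."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # A single convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):
        if state is None:
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_channels, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```

Running such a cell frame by frame over per-frame spatial features is what allows temporal saliency dynamics to be modeled; the SSAV baseline augments this recurrence with a saliency-shift-aware attention mechanism.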

3. Supplemental materials.

4. Popular Existing Datasets

Previous datasets use different formats for the original images and ground-truth maps. We reorganized them into a unified format: all original images are saved as *.jpg and indexed from zero (e.g., 00000.jpg), and the ground-truth maps are saved as *.png. The download link for all datasets (VOS_test, DAVIS, FBMS, MCL, ViSal, SegTrack-V1, SegTrack-V2, UVSD): click here.
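
Given this unified layout, the frame/ground-truth pairs of a sequence can be enumerated as sketched below; the "Imgs"/"GT" sub-folder names are assumptions for illustration, so please check the downloaded archive for the exact directory structure.

```python
import os
from glob import glob

def frame_gt_pairs(seq_dir, img_dir="Imgs", gt_dir="GT"):
    """Pair each frame (00000.jpg, 00001.jpg, ...) with its ground-truth *.png.

    The "Imgs"/"GT" sub-folder names are placeholders; adjust them to the
    actual layout of the downloaded archive.
    """
    frames = sorted(glob(os.path.join(seq_dir, img_dir, "*.jpg")))
    pairs = []
    for frame in frames:
        stem = os.path.splitext(os.path.basename(frame))[0]  # e.g. "00000"
        gt = os.path.join(seq_dir, gt_dir, stem + ".png")
        if os.path.isfile(gt):
            pairs.append((frame, gt))
    return pairs
```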

Refer to http://dpfan.net/davsod/

Note that we do not include the “penguin” sequence of the SegTrack-V2 dataset due to its inaccurate segmentation, so the SegTrack-V2 results in our benchmark cover only 13 sequences. For the VOS dataset, we only benchmark the test sequences defined by our split (training set : validation set : test set = 6 : 2 : 2).
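
For illustration only, a 6:2:2 split of a sequence list could be generated as below; to reproduce the benchmark numbers, please use the official VOS split distributed with the benchmark rather than re-sampling it.

```python
import random

def split_sequences(seq_names, ratios=(0.6, 0.2, 0.2), seed=0):
    """Illustrative random 6:2:2 train/val/test split of sequence names."""
    names = sorted(seq_names)
    random.Random(seed).shuffle(names)
    n_train = int(ratios[0] * len(names))
    n_val = int(ratios[1] * len(names))
    return (names[:n_train],
            names[n_train:n_train + n_val],
            names[n_train + n_val:])
```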

5. Papers & Codes & Results (continuously updated for the convenience of research)

Note: we spent about half a year running all of the codes, so you can download all the results directly. Please cite our paper if you use our results. Overall results: click here (Baidu|Google) (Updated: 2019-11-17)

Refer to http://dpfan.net/davsod/

6. Leaderboard

Table 2: Summary of 36 previous representative VSOD methods and the proposed SSAV model. Training Set: 10C = 10-Clips [24]. S2 = SegV2 [40]. DV = DAVIS [59]. DO = DUT-OMRON [84]. MK = MSRA10K [12]. MB = MSRA-B [51]. FS = FBMS [56]. Voc12 = PASCAL VOC2012 [16]. Basic: CRF = Conditional Random Field. SP = superpixel. SORM = self-ordinal resemblance measure. MRF = Markov Random Field. Type: T = traditional. D = deep learning. OF: whether optical flow is used. SP: whether superpixel over-segmentation is used. S-measure [18]: the range of scores over the 8 datasets in Table 3. PCT: Per-frame Computation Time (seconds). Since [3, 7, 11, 33, 44, 47, 68, 93] did not release implementations, the corresponding PCTs are taken from their papers or provided by the authors. Code: M = Matlab. Py = Python. Ca = Caffe. N/A = not available in the literature. “*” indicates CPU time.
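
For reference, the two simpler metrics commonly reported alongside the S-measure, MAE and the maximum F-measure, follow the standard definitions sketched below (with the usual β² = 0.3). This is a plain NumPy illustration, not the exact evaluation code behind the tables, and the S-measure [18] should be computed with its authors' released implementation.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and binary GT, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over evenly spaced binarization thresholds."""
    gt = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, num_thresholds):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```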

New papers:

  1. Motion Guided Attention for Video Salient Object Detection, ICCV, 2019.
  2. Semi-Supervised Video Salient Object Detection Using Pseudo-Labels, ICCV, 2019.
  3. RANet: Ranking Attention Network for Fast Video Object Segmentation, ICCV, 2019.
  4. Saliency-Aware Convolution Neural Network for Ship Detection in Surveillance Video, TCSVT, 2019, (application for ship detection)

Table 3: Benchmarking results of 17 state-of-the-art VSOD models on 7 datasets: SegV2 [8], FBMS [14], ViSal [23], MCL [7], DAVIS [16], UVSD [12], VOS [10], and the 35-sequence easy test set of the proposed DAVSOD. Note that TIMP was only tested on 9 short sequences of VOS because it cannot handle long videos. “**” indicates that the model was trained on this dataset. “-T” indicates results on the test set of this dataset. “y” indicates a deep-learning model. “score” indicates GPU time. The top three models are marked in red, blue, and green, respectively.

Figure 6: Visual comparisons with the top three deep models (MBNM [44], FGRN [41], PDBM [67]) and two classical traditional models (SFLR [8], SAGM [74]) on the proposed DAVSOD dataset. Our SSAV model captures the saliency shift successfully (from frame 1 to frame 5: cat -> [cat; box] -> cat -> box -> [cat; box]). In contrast, the other top-performing VSOD models either fail to highlight the whole salient objects (e.g., SFLR, SAGM) or only capture the moving cat (e.g., MBNM).

Acknowledgements

We would like to thank the authors of the DAVIS, DHF1K, and VOS datasets for their work. Their tremendous efforts in building these datasets have helped advance this field.

FAQs
