¹TNList, Tsinghua University ²UCL/KAUST ³Lehigh University ⁴The University of Oxford
Automatic estimation of salient object regions across images, without any prior assumption or knowledge of the contents of the corresponding scenes, enhances many computer vision and computer graphics applications. We introduce a regional contrast based salient object extraction algorithm, which simultaneously evaluates global contrast differences and spatial weighted coherence scores. The proposed algorithm is simple, efficient, naturally multi-scale, and produces full-resolution, high-quality saliency maps. These saliency maps are further used to initialize a novel iterative version of GrabCut for high quality salient object segmentation. We extensively evaluated our algorithm using traditional salient object detection datasets, as well as a more challenging Internet image dataset. Our experimental results demonstrate that our algorithm consistently outperforms existing salient object detection and segmentation methods, yielding higher precision and better recall rates. We also show that our algorithm can be used to efficiently extract salient object masks from Internet images, enabling effective sketch-based image retrieval (SBIR) via simple shape comparisons. Despite such noisy internet images, where the saliency regions are ambiguous, our saliency guided image retrieval achieves a superior retrieval rate compared with state-of-the-art SBIR methods, and additionally provides important target object region information.
- Global Contrast based Salient Region detection. Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, Shi-Min Hu. IEEE TPAMI, 2015. [Pdf] [Poster] [Bib] [CVPR 2011 version] [中文版] [C++] (#2 most cited paper in CVPR 2011)
Most related projects on this website:
- Efficient Salient Region Detection with Soft Image Abstraction. Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, Nigel Crook. IEEE International Conference on Computer Vision (IEEE ICCV), 2013. [pdf] [Project page] [bib] [latex] [official version]
- BING: Binarized Normed Gradients for Objectness Estimation at 300fps. Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, Philip H. S. Torr. IEEE International Conference on Computer Vision and Pattern Recognition (IEEE CVPR), 2014. [Project page] [pdf] [bib] (Oral, Accept rate: 5.75%)
- SalientShape: Group Saliency in Image Collections. Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Shi-Min Hu. The Visual Computer 30 (4), 443-453, 2014. [pdf] [Project page] [bib] [latex] [Official version]
1. Data
The MSRA10K benchmark dataset (a.k.a. THUS10000) comprises per-pixel ground truth annotations for 10,000 MSRA images (181 MB). Each image has an unambiguous salient object, whose region is accurately annotated with pixel-wise ground-truth labeling (13.1M). We provide saliency maps (5.3 GB, containing 170,000 images) for our methods as well as for 15 other state-of-the-art methods: FT, AIM, MSS, SEG, SeR, SUN, SWD, IM, IT, GB, SR, CA, LC, AC, and CB. Saliency segmentation results (71.3 MB) for FT, SEG, and CB are also available.
2. Windows executable
We supply a Windows MSI installer for our prototype software, which includes our implementations of FT, SR, and LC, as well as our HC, RC, and saliency cut methods.
3. C++ source code
The C++ implementation of the methods in our paper, as well as of several other state-of-the-art methods.
4. Supplemental material
Supplemental materials (647 MB), including comparisons with 15 other state-of-the-art algorithms, are now available.
Salient object detection results for images with multiple objects. We tested it on the dataset provided by the CVPR 2007 paper: “Image Segmentation by Probabilistic Bottom-Up Aggregation and Cue Integration”.
5. More results for recent methods
If you would like to share your results on our MSRA10K benchmark (to help other researchers compare with recent methods), please contact me via email (see the header image of this project page for the address). I will post your results and paper links on this page.
Comparisons with state of the art methods
Figure. Statistical comparison results of (a) different salient region detection methods, (b) their variants, and (c) object-of-interest region segmentation methods, using (i) the largest publicly available dataset and (ii) our MSRA10K dataset (to be made publicly available). We compare our HC and RC methods with 15 state-of-the-art methods: FT, AIM, MSS, SEG, SeR, SUN, SWD, IM, IT, GB, SR, CA, LC, AC, and CB. We also take a simple variable-size Gaussian model ('Gau') and the GrabCut method as baselines. (Please see our paper for detailed explanations.)
Figure. Comparison of average Fβ for different saliency segmentation methods (FT, SEG, and ours) on the THUR15K dataset, which is composed of non-selected Internet images.
Table. Average time taken to compute a saliency map for images in the MSRA10K database. (Note that we use the authors' original implementations for MSS and FT, which are not well optimized.)
Table. Comparison of average time for different saliency segmentation methods.
Figure. Saliency maps computed by different state-of-the-art methods (b-p), and by our proposed HC (q) and RC (r) methods. Most results highlight edges or are of low resolution. See also the shared data for saliency detection results on the whole MSRA10K dataset.
Figure. Sketch-based image comparison. In each group, from left to right: the first column shows images downloaded from Flickr using the corresponding keyword; the second column shows our retrieval results, obtained by comparing the user input sketch with the SaliencyCut result using the shape context measure; the third column shows the corresponding sketch-based retrieval results using SHoG.
So far, more than 2000 readers (according to email records) have requested the source code for this project, and some of them have had questions about using it. Here are some frequently asked questions (several of which are also often asked by reviewers) for new users to refer to:
Q1: I’m confused by this sentence in the paper: “In our experiments, the threshold is chosen empirically to be the threshold that gives 95% recall rate in our fixed thresholding experiments”. In almost all cases, people do not have the ground truth, so they cannot compute the recall rate. When I use your Cut application, I need to guess the threshold value to get a good cut image.
A: The recall rate is only used to evaluate the algorithm. When you use the algorithm, you typically do not need to evaluate it. The sentence explains what the fixed threshold we use typically means. In practice, when initialized with RC saliency maps, this threshold is 70, with saliency values normalized to [0,255]. This does not mean that the threshold corresponds to a recall rate of 95% for every image; rather, it empirically corresponds to a recall rate of 95% over a large number of images. So simply using the suggested threshold of 70 is fine.
Q2: I used your code to get results for the same database you used, but the results differ slightly from yours.
A: The cvtColor function in OpenCV 1.x appears to behave differently from the one in OpenCV 2.x. I suggest using a recent version. In addition, the segmentation method I used occasionally generates strange results, which lead to strange saliency maps. This happens rarely. When it does, I rerun the exe and it becomes OK. I don’t know why, but it does happen when I use the exe for the first time after compiling (very strange, maybe because of some default initializations). If someone finds the bug, please report it to me.
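To make the relationship between the fixed threshold and the recall rate concrete, here is a minimal Python sketch. It is not the released implementation: the function names, the use of `>=` for the comparison, and the choice of averaging recall over images are all assumptions made for illustration. It shows how a saliency map in [0,255] is binarized with a fixed threshold, and how such a threshold could be picked on an annotated set so that the mean recall reaches 95%.

```python
from typing import List, Sequence

Map = List[List[int]]  # saliency values in [0, 255], or a binary mask of 0/1


def binarize(saliency: Map, threshold: int) -> Map:
    """Mark pixels whose saliency is at least `threshold` as foreground (1)."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def recall(mask: Map, truth: Map) -> float:
    """Fraction of ground-truth foreground pixels covered by the mask."""
    hit = sum(m * t for mr, tr in zip(mask, truth) for m, t in zip(mr, tr))
    total = sum(t for tr in truth for t in tr)
    return hit / total if total else 1.0


def threshold_for_recall(saliency_maps: Sequence[Map],
                         truths: Sequence[Map],
                         target: float = 0.95) -> int:
    """Largest fixed threshold whose mean recall over the set reaches `target`.

    Averaging recall over images is an assumption of this sketch.
    """
    for threshold in range(255, -1, -1):
        mean = sum(recall(binarize(s, threshold), t)
                   for s, t in zip(saliency_maps, truths)) / len(saliency_maps)
        if mean >= target:
            return threshold
    return 0


# Toy 2x2 example (hypothetical values, not real data).
saliency = [[10, 80], [200, 60]]
truth = [[0, 1], [1, 1]]
mask_at_70 = binarize(saliency, 70)          # the suggested fixed threshold
chosen = threshold_for_recall([saliency], [truth])
```

Ground truth is only needed by `threshold_for_recall`, i.e. when choosing the threshold once on an annotated dataset. Applying the chosen value to a new image needs only `binarize`.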
Q3: Does your algorithm only get good results for images with a single salient object?
A: Mostly yes. As described in our paper, our method is suitable for images with an unambiguous salient object. Saliency detection methods typically have no prior knowledge about the target object, which makes the problem very difficult. Much recent research focuses on images with a single salient object. Even for this simple case, state-of-the-art algorithms may fail. This is understandable, since supervised object detection, which uses large amounts of training data and prior knowledge, also fails in many cases.
However, the value of saliency detection methods lies in their applications in many fields. Because they do not need large-scale human annotation for learning and are typically much faster than object detection methods, it is possible to automatically process a large number of images at low cost. Although many of the saliency detection results may be wrong (up to 60% for noisy Internet images) because the salient objects are ambiguous or even missing, we can still use efficient algorithms to select the good results and use them in many interesting applications such as those listed below. (Note: all of the following projects use our saliency source code, with an initial version of SaliencyCut used in our own Sketch2Photo project. A list of 500+ citations to the CVPR11 version is linked at the end of this list.)
- Unsupervised joint object discovery and segmentation in internet images, M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, in IEEE CVPR, 2013, pp. 1939–1946. (Used the proposed saliency measure and showed that saliency-based segmentation produces state-of-the-art results on co-segmentation benchmarks, without using co-segmentation!)
- Image retrieval: Sketch2Photo: Internet Image Montage. Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, Shi-Min Hu. ACM SIGGRAPH Asia. 28, 5, 124:1-10, 2009.
- SalientShape: Group Saliency in Image Collections. Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Shi-Min Hu. The Visual Computer, 2013
- PoseShop: Human Image Database Construction and Personalized Content Synthesis. Tao Chen, Ping Tan, Li-Qian Ma, Ming-Ming Cheng, Ariel Shamir, Shi-Min Hu. IEEE TVCG, 19(5), 824-837, 2013.
- Internet visual media processing: a survey with graphics and vision applications, Tao Chen, Ping Tan, Li-Qian Ma, Ming-Ming Cheng, Ariel Shamir, Shi-Min Hu. The Visual Computer, 2013, 1-13.
- Image editing: Semantic Colorization with Internet Images, Yong Sang Chia, Shaojie Zhuo, Raj Kumar Gupta, Yu-Wing Tai, Siu-Yeung Cho, Ping Tan, Stephen Lin, ACM SIGGRAPH Asia. 2011.
- View selection: Web-Image Driven Best Views of 3D Shapes. The Visual Computer, 2011. Accepted. H Liu, L Zhang, H Huang
- Image Collage: Arcimboldo-like Collage Using Internet Images. ACM SIGGRAPH Asia, 30(6), 2011. H Huang, L Zhang, HC Zhang
- Image manipulation: Data-Driven Object Manipulation in Images. Chen Goldberg, T Chen, FL Zhang, A Shamir, SM Hu. Eurographics 2012.
- Saliency For Image Manipulation, R. Margolin, L. Zelnik-Manor, and A. Tal, Computer Graphics International (CGI) 2012.
- Mobile Product Search with Bag of Hash Bits and Boundary Reranking, Junfeng He, Xianglong Liu, Tao Cheng, Jinyuan Feng, Tai-Hsu Lin, Hyunjin Chung and Shih-Fu Chang, IEEE CVPR, 2012.
- Unsupervised Object Discovery via Saliency-Guided Multiple Class Learning, Jun-Yan Zhu, Jiajun Wu, Yichen Wei, Eric Chang, and Zhuowen Tu, IEEE CVPR, 2012.
- Saliency Detection via Divergence Analysis: A Unified Perspective, ICPR 2012 (Best student paper). (The authors of this ICPR paper have derived that our formulation on global saliency has a deep connection with an information-theoretic measure, the so called Cauchy-Schwarz divergence.)
- Much more: http://scholar.google.com/scholar?cites=9026003219213417480
Q4: I’m confused about the definition of saliency. Why are the annotation formats (isolated points, binary mask regions, and bounding boxes) used by different benchmarks for evaluating saliency detection methods so different?
A: There are three different research directions in saliency detection: i) fixation prediction, ii) salient object detection, and iii) objectness estimation. They have very different research targets and very different applications. Personally, I’m mainly interested in the last two problems and will discuss them in a bit more detail.
Eye fixation models aim at predicting where humans look, i.e. a small set of fixation points. The most famous method in this area is Itti’s work in PAMI 1998. The MIT benchmark is designed for evaluating such methods.
Salient object detection, as done in this work, aims at finding the most salient object in a scene and segmenting the whole extent of that object. The output is typically a single saliency map (or figure-ground segmentation). The advantages and disadvantages are described in detail in Q3. High precision is a major focus of our work, as we can use shape-matching-based techniques to effectively select good segmentations and build robust applications on top of them. The most widely used benchmark for evaluating this problem is MSRA1000, which provides precise segmentations of 1000 salient objects in MSRA images. Our method achieves 93% precision and 90% recall on MSRA1000 (previous best reported results: 75% precision and 83% recall). Since our results on MSRA1000 are mostly comparable to the ground truth annotations, we need more challenging benchmarks. MSRA10K and THUR15K were built for this purpose.
Objectness estimation is another attractive direction. These methods aim at proposing a small set (typically 1000) of bounding boxes to improve the efficiency of the classical sliding window pipeline. High recall with a small set of bounding box proposals is a major target. PASCAL VOC is a standard dataset for evaluating this problem. Using purely bottom-up, data-driven methods to produce a single saliency map, as is done in most salient object detection models, is less likely to succeed on this very challenging dataset. State-of-the-art objectness proposal methods (PAMI12, IJCV13) achieve 90+% recall on the challenging PASCAL VOC dataset given a relatively small number (e.g. 1000) of bounding boxes, while being computationally efficient (4 seconds per image). This is especially useful for speeding up multi-class object detection, as each classifier only needs to examine a much smaller number of image windows (e.g. 1,000,000 -> 1,000).
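The recall figure quoted for objectness proposals can be illustrated with a short Python sketch. It assumes the usual PASCAL VOC convention that an object counts as found when some proposal overlaps it with an intersection-over-union (IoU) of at least 0.5. This page does not state the criterion, so the 0.5 default, the inclusive pixel coordinates, and the function names are assumptions of the sketch.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumed convention)
Box = Tuple[int, int, int, int]


def area(b: Box) -> int:
    """Number of pixels covered by a box with inclusive coordinates."""
    return (b[2] - b[0] + 1) * (b[3] - b[1] + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0]) + 1
    ih = min(a[3], b[3]) - max(a[1], b[1]) + 1
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not overlap
    inter = iw * ih
    return inter / (area(a) + area(b) - inter)


def proposal_recall(objects: List[Box], proposals: List[Box],
                    min_iou: float = 0.5) -> float:
    """Fraction of annotated objects matched by at least one proposal."""
    if not objects:
        return 1.0
    found = sum(1 for obj in objects
                if any(iou(obj, p) >= min_iou for p in proposals))
    return found / len(objects)


# Toy example with hypothetical boxes: only the first object is matched.
objects = [(0, 0, 9, 9), (20, 20, 29, 29)]
proposals = [(0, 0, 9, 4), (2, 2, 11, 11), (50, 50, 60, 60)]
rate = proposal_recall(objects, proposals)
```

With the numbers quoted above, a classifier that examines 1,000 proposals instead of 1,000,000 sliding windows does 1000 times less work per image, which is why recall at a small proposal budget is the quantity of interest.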
Q5: In nearly all of the 300+ papers citing this work, the F-Measure of the RC method used for comparison is significantly lower than that reported in this paper. Why?
A: Our salient object segmentation involves a powerful SaliencyCut method, for which we have not yet released the source code (it will be released only after the journal version has been published). The high performance of our salient object segmentation method can easily be verified by running our published binary code. When reporting the F-Measure of our method, most papers use an adaptive threshold to get segmentation results, which produces much worse results than our original version. This is reasonable to some extent and makes the comparison easier, as they don’t have access to our SaliencyCut code. Notice that our method achieves 92% F-Measure on the MSRA benchmark, and I have not yet seen any other method achieve an F-Measure better than 90% (achieved by our CVPR11 version). It’s worth mentioning that even the latest GrabCut method only achieves ‘comparable’ performance (F-Measure: 89%) on the same benchmark (see “Grabcut in One Cut, Meng Tang, Lena Gorelick, Olga Veksler, Yuri Boykov, ICCV, 2013”).
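For readers who want to reproduce the adaptive-threshold evaluation that other papers apply to our saliency maps, the following Python sketch may help. It rests on two assumptions that this page does not state: that the adaptive threshold is the per-image rule of twice the mean saliency (as introduced with the FT method), and that the F-Measure uses β² = 0.3, a weighting commonly used in the salient object detection literature. It does not implement SaliencyCut.

```python
from typing import List, Tuple

Map = List[List[float]]

BETA_SQ = 0.3  # assumed weighting; favours precision over recall


def adaptive_mask(saliency: Map) -> List[List[int]]:
    """Binarize with a per-image threshold of twice the mean saliency (assumed rule)."""
    values = [v for row in saliency for v in row]
    threshold = 2.0 * sum(values) / len(values)
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def precision_recall(mask: List[List[int]],
                     truth: List[List[int]]) -> Tuple[float, float]:
    """Pixel-wise precision and recall of a binary mask against ground truth."""
    tp = sum(m * t for mr, tr in zip(mask, truth) for m, t in zip(mr, tr))
    predicted = sum(sum(row) for row in mask)
    actual = sum(sum(row) for row in truth)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta_sq: float = BETA_SQ) -> float:
    """Weighted harmonic mean: (1 + b^2) * P * R / (b^2 * P + R)."""
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom else 0.0


# Toy 3x4 example with hypothetical values: mean saliency 40, threshold 80.
saliency = [[0, 0, 0, 0],
            [0, 200, 240, 0],
            [0, 40, 0, 0]]
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 0, 0]]
mask = adaptive_mask(saliency)
p, r = precision_recall(mask, truth)
score = f_measure(p, r)
```

In this toy case the adaptive threshold misses the weakly salient pixel that belongs to the object, so recall drops while precision stays at 1. The same saliency map can therefore give different F-Measures depending on how it is turned into a segmentation.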
Q6: The benchmarks you use all have center bias. Will this be a problem?
A: Center bias seems to be a natural bias in real-world images. In the salient object detection community, most methods try to detect the most dominant object rather than deal with complicated images in which many objects exist and occlude each other in complicated ways. Even handling only these simple (‘Flickr-like’) images is quite useful for many applications (see Q3). Even when trained on thousands of accurately labeled images, state-of-the-art object detection methods still can’t get robust results for PASCAL VOC-like images. For salient object detection algorithms, robustness can come from automatically selecting good results from thousands of images, for which we get automatic segmentation results for free (no need for training data annotation). See ‘SalientShape: Group Saliency in Image Collections’ for a dataset of unselected, automatically downloaded Flickr images (which also has clear center bias), as well as the aforementioned applications.
Links to source code of other methods
|Method|Reference|
|---|---|
|FT|R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in IEEE CVPR, 2009, pp. 1597–1604.|
|AIM|N. Bruce and J. Tsotsos, “Saliency, attention, and visual search: An information theoretic approach,” Journal of Vision, vol. 9, no. 3, pp. 5:1–24, 2009.|
|MSS|R. Achanta and S. Süsstrunk, “Saliency detection using maximum symmetric surround,” in IEEE ICIP, 2010, pp. 2653–2656.|
|SEG|E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in ECCV, 2010, pp. 366–379.|
|SeR|H. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of Vision, vol. 9, no. 12, pp. 15:1–27, 2009.|
|SUN|L. Zhang, M. Tong, T. Marks, H. Shan, and G. Cottrell, “SUN: A Bayesian framework for saliency using natural statistics,” Journal of Vision, vol. 8, no. 7, pp. 32:1–20, 2008.|
|SWD|L. Duan, C. Wu, J. Miao, L. Qing, and Y. Fu, “Visual saliency detection by spatially weighted dissimilarity,” in IEEE CVPR, 2011, pp. 473–480.|
|IM|N. Murray, M. Vanrell, X. Otazu, and C. A. Parraga, “Saliency estimation using a non-parametric low-level vision model,” in IEEE CVPR, 2011, pp. 433–440.|
|IT|L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, vol. 20, no. 11, pp. 1254–1259, 1998.|
|GB|J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in NIPS, 2007, pp. 545–552.|
|SR|X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in IEEE CVPR, 2007, pp. 1–8.|
|CA|S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in IEEE CVPR, 2010, pp. 2376–2383.|
|LC|Y. Zhai and M. Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in ACM Multimedia, 2006, pp. 815–824.|
|AC|R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, “Salient region detection and segmentation,” in IEEE ICVS, 2008, pp. 66–75.|
|CB|H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in British Machine Vision Conference, 2011, pp. 1–12.|
|LP|T. Judd, K. Ehinger, F. Durand, and A. Torralba, “Learning to predict where humans look,” in ICCV, 2009.|