• standard semantic segmentation: [1] [2] [3]

  • binary segmentation: [4]

Reference

[1] Rohan Doshi, Olga Russakovsky, “Zero-Shot Semantic Segmentation”, bachelor’s thesis.

[2] Y. Xian, S. Choudhury, Y. He, B. Schiele and Z. Akata, “SPNet: Semantic Projection Network for Zero- and Few-Label Semantic Segmentation”, CVPR, 2019.

[3] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez, “Zero-Shot Semantic Segmentation”, NeurIPS, 2019.

[4] Kato, Naoki, Toshihiko Yamasaki, and Kiyoharu Aizawa. “Zero-Shot Semantic Segmentation via Variational Mapping.” ICCV Workshops. 2019.

  • ZSL based on bounding box features

    • [1]: use background bounding boxes from background classes
    • [3]: classification loss with semantic clustering
  • End-to-end zero-shot object detection

    • [2]: extend YOLO, concatenate three feature maps to predict confidence score.
    • [4]: use a polarity loss (similar to focal loss) and an external vocabulary to enhance word vectors
    • [5]: output both classification scores and semantic embeddings
  • Feature generation

    • [6]: synthesize visual features for unseen classes

    • [7]: semantics-preserving graph propagation modules that enhance both category and region representations

Reference

[1] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, Ajay Divakaran, “Zero-Shot Object Detection”, ECCV, 2018.

[2] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama, “Zero Shot Detection”, T-CSVT, 2019.

[3] Rahman, Shafin, Salman Khan, and Fatih Porikli. “Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts.” arXiv preprint arXiv:1803.06049 (2018).

[4] Rahman, Shafin, Salman Khan, and Nick Barnes. “Polarity Loss for Zero-shot Object Detection.” arXiv preprint arXiv:1811.08982 (2018).

[5] Demirel, Berkan, Ramazan Gokberk Cinbis, and Nazli Ikizler-Cinbis. “Zero-Shot Object Detection by Hybrid Region Embedding.” arXiv preprint arXiv:1805.06157 (2018).

[6] Hayat, Nasir, et al. “Synthesizing the unseen for zero-shot object detection.” ACCV, 2020.

[7] Yan, Caixia, et al. “Semantics-preserving graph propagation for zero-shot object detection.” IEEE Transactions on Image Processing 29 (2020): 8163-8176.

Zero-shot learning focuses on the relation between visual features X, semantic embeddings A, and category labels Y. Based on the approach, existing zero-shot learning works can be roughly categorized into the following groups:

1) semantic relatedness: X->Y (measure semantic similarity between seen and unseen classes; compose classifiers for unseen classes from seen-class classifiers)

2) semantic embedding: X->A->Y (map from X to A; map from A to X; map between A and X into common space)
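The embedding route 2) can be sketched as a bilinear compatibility function f(x, y) = xᵀ W a_y: map the visual feature into the semantic space and score it against each class embedding, swapping in unseen-class embeddings at test time. A minimal sketch; the projection W and the toy attribute matrix are illustrative, not from any specific cited method:

```python
import numpy as np

def zsl_predict(x, W, class_embeddings):
    """Score image feature x against each class via the bilinear
    compatibility f(x, y) = x^T W a_y, then pick the best class."""
    scores = class_embeddings @ (W.T @ x)    # a_y^T (W^T x) = x^T W a_y
    return int(np.argmax(scores)), scores

# Toy example: 2-d visual features, 2-d attribute vectors, 3 classes.
# At test time, rows of A can be embeddings of unseen classes.
W = np.eye(2)                    # hypothetical learned projection
A = np.array([[1.0, 0.0],        # class 0
              [0.0, 1.0],        # class 1
              [1.0, 1.0]])       # class 2 (e.g. an unseen class)
x = np.array([0.9, 0.1])
pred, scores = zsl_predict(x, W, A)
```

The same scoring rule covers both mapping directions listed above: learning W amounts to mapping X into the space of A, and its transpose maps A toward X.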

Based on the setting, existing zero-shot learning works can be roughly categorized into the following groups:

1) inductive ZSL (does not use unlabeled test images in the training stage) vs. semi-supervised/transductive ZSL (uses unlabeled test images in the training stage)

2) standard ZSL (test images only from unseen categories) vs. generalized ZSL (test images from both seen and unseen categories) (novelty detection, calibrated stacking)

Ideas:

  1. Mapping: dictionary learning, metric learning, etc.

  2. Embedding: multiple embedding [1], free embedding [1], self-defined embedding [1]

  3. Application: video->object(attribute)->action [1], image->object(attribute)->scene

  4. Combination: with active learning [1] [2], online learning [1]

  5. External knowledge graph: WordNet-based [1], NELL-based [2]

  6. Deep learning: graph neural network [1], RNN [2]

  7. Generate synthetic exemplars for unseen categories: synthetic images [SP-AEN] or synthetic features [SE-ZSL] [GAZSL] [f-xGAN]

Critical Issues:

  1. generalized ZSL, why first predict seen or unseen?: As claimed in [1], since we only see labeled data from seen classes during training, the scoring functions of seen classes tend to dominate those of unseen classes. This leads to biased predictions in GZSL: new data points are aggressively classified into the seen label space S, because the seen-class classifiers are never trained on negative examples from the unseen classes.
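One common remedy for this bias is calibrated stacking: subtract a calibration factor from the seen-class scores before taking the argmax over the joint label space. A minimal sketch; the factor gamma and the toy scores are illustrative:

```python
import numpy as np

def calibrated_stacking(scores, seen_mask, gamma):
    """Reduce the scores of seen classes by a calibration factor gamma
    before taking the argmax over the joint seen+unseen label space."""
    adjusted = scores.copy()
    adjusted[seen_mask] -= gamma
    return int(np.argmax(adjusted))

# Toy example: classes 0-2 are seen, 3-4 unseen; seen scores dominate.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
seen = np.array([True, True, True, False, False])
pred_biased = calibrated_stacking(scores, seen, 0.0)  # gamma=0: seen class 0 wins
pred_cal = calibrated_stacking(scores, seen, 0.5)     # gamma=0.5: unseen class 3 wins
```

gamma is typically tuned on validation data to trade off seen-class against unseen-class accuracy.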

  2. hubness problem [1][2]: As claimed in [2], one practical effect of the ZSL domain shift is the hubness problem. Specifically, after the domain shift, a small set of “hub” test-class prototypes become the nearest (or k-nearest) neighbours of the majority of testing samples in the semantic space, while the other prototypes are nearest neighbours of no testing instances. This results in poor accuracy and highly biased predictions, with the majority of testing examples being assigned to a small minority of classes.
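Hubness is usually quantified via the k-occurrence N_k of each class prototype: how often it appears among the k nearest neighbours of the test samples. A highly skewed N_k distribution indicates hubs. A minimal NumPy sketch with illustrative toy data (one prototype deliberately placed near the test-sample mean so that it hubs):

```python
import numpy as np

def k_occurrence(test_feats, prototypes, k=1):
    """For each test sample, find its k nearest class prototypes and count
    how often each prototype occurs as a neighbour (its k-occurrence N_k)."""
    # pairwise squared Euclidean distances: (n_test, n_proto)
    d = ((test_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.bincount(nn.ravel(), minlength=len(prototypes))

# Toy example: prototype 0 sits at the mean of the test distribution,
# so it becomes the 1-NN of nearly every test sample.
rng = np.random.default_rng(1)
test = rng.standard_normal((100, 8))
protos = np.vstack([np.zeros(8), 5 + rng.standard_normal((4, 8))])
counts = k_occurrence(test, protos, k=1)   # counts[0] dominates
```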

  3. projection domain shift: what is the impact on the decision values?

Datasets:

  1. small-scale datasets: CUB, AwA, SUN, aPY, Dogs, FLO

  2. large-scale dataset: ImageNet

Survey and Resource:

  1. Recent Advances in Zero-Shot Recognition

  2. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly [code]

  3. List of papers and datasets

Other applications:

  1. zero-shot object detection
  2. zero-shot figure-ground segmentation [1]
  3. zero-shot semantic segmentation
  4. zero-shot retrieval
  5. zero-shot domain adaptation

  • Using web data for semantic segmentation:

    • [1]: crawl web images with white background and generate composite images to initialize segmentation network

    • [2]: train a segmentation network using web data to obtain rough segmentation mask

  • image-level semantic/instance segmentation: [9] [10] [11]

  • box-level semantic/instance segmentation: [3] [4] [5] [6]

  • scribble/point-level semantic segmentation: [7] [8] [12] [13] [14] [15] [16] [17] [18] [19]
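A shared ingredient of scribble/point supervision (e.g. [7] [8] [12]) is a partial cross-entropy computed only over annotated pixels, with unannotated pixels ignored (often combined with a regularizer such as normalized cut [12] or tree energy [19]). A minimal NumPy sketch; the function name and toy shapes are mine, not from any cited paper:

```python
import numpy as np

def partial_cross_entropy(logits, labels, ignore_index=-1):
    """Cross-entropy averaged only over annotated pixels (scribbles/points);
    pixels marked ignore_index contribute nothing to the loss."""
    h, w, c = logits.shape
    flat_logits = logits.reshape(-1, c)
    flat_labels = labels.reshape(-1)
    mask = flat_labels != ignore_index
    # numerically stable log-softmax over the class dimension
    z = flat_logits - flat_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(flat_labels))[mask], flat_labels[mask]]
    return -picked.mean()

# Toy example: 2x2 image, 3 classes, only two pixels carry scribble labels.
logits = np.zeros((2, 2, 3))            # uniform predictions
labels = np.array([[0, -1], [-1, 2]])   # -1 = unannotated
loss = partial_cross_entropy(logits, labels)   # = log(3) for uniform output
```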

Reference

[1] Jin, Bin, Maria V. Ortiz Segovia, and Sabine Süsstrunk. “Webly Supervised Semantic Segmentation.” CVPR, 2017.

[2] Shen, Tong, Guosheng Lin, Chunhua Shen, and Ian D. Reid. “Bootstrapping the Performance of Webly Supervised Semantic Segmentation.” CVPR, 2018.

[3] Dai, Jifeng, Kaiming He, and Jian Sun. “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation.” ICCV, 2015.

[4] Khoreva, Anna, et al. “Simple does it: Weakly supervised instance and semantic segmentation.” CVPR, 2017.

[5] Ahn, Jiwoon, Sunghyun Cho, and Suha Kwak. “Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations.” CVPR, 2019.

[6] Hsu, Cheng-Chun, et al. “Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior.” NeurIPS. 2019.

[7] Lin, Di, et al. “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation.” CVPR, 2016.

[8] Bearman, Amy, et al. “What’s the point: Semantic segmentation with point supervision.” ECCV, 2016.

[9] Zhu, Yi, et al. “Learning instance activation maps for weakly supervised instance segmentation.” CVPR, 2019.

[10] Wang, Xiang, et al. “Weakly-Supervised Semantic Segmentation by Iterative Affinity Learning.” International Journal of Computer Vision (2020): 1-14.

[11] Jo, Sanghyun, and In-Jae Yu. “Puzzle-CAM: Improved localization via matching partial and full features.” arXiv preprint arXiv:2101.11253 (2021).

[12] Tang, Meng, et al. “Normalized cut loss for weakly-supervised cnn segmentation.” CVPR, 2018.

[13] Tang, Meng, et al. “On regularized losses for weakly-supervised cnn segmentation.” ECCV, 2018.

[14] Marin, Dmitrii, et al. “Beyond gradient descent for regularized segmentation losses.” CVPR, 2019.

[15] Wang, Bin, et al. “Boundary perception guidance: A scribble-supervised semantic segmentation approach.” IJCAI, 2019.

[16] Pan, Zhiyi, et al. “Scribble-supervised semantic segmentation by uncertainty reduction on neural representation and self-supervision on neural eigenspace.” ICCV, 2021.

[17] Xu, Jingshan, et al. “Scribble-supervised semantic segmentation inference.” ICCV, 2021.

[18] Chen, Hongjun, et al. “Seminar learning for click-level weakly supervised semantic segmentation.” ICCV, 2021.

[19] Liang, Zhiyuan, et al. “Tree energy loss: Towards sparsely annotated semantic segmentation.” CVPR, 2022.

  1. webly supervised object detection [1]

  2. use a few bounding box annotations and a large number of image label annotations [2]

Reference

[1] Exploiting Web Images for Weakly Supervised Object Detection. IEEE Trans. Multimedia 21(5): 1135-1146 (2019)

[2] Ramanathan, Vignesh, Rui Wang, and Dhruv Mahajan. “DLWL: Improving Detection for Lowshot Classes with Weakly Labelled Data.” CVPR, 2020.

Closely related to weakly-supervised segmentation.

Reference

[1] Zhang, Xiaolin, et al. “Adversarial complementary learning for weakly supervised object localization.” CVPR, 2018.

[2] Zhang, Xiaolin, et al. “Self-produced guidance for weakly-supervised object localization.” ECCV, 2018.

Two problems:

  • Label noise: label flip noise (the true label belongs to another training category) and outlier noise (the sample does not belong to any training category).
  • Domain shift: domain distribution mismatch between web data and consumer data.

Solutions:

  1. label flip layer: [1] [2] [3]

  2. multi-instance learning: [4] (pixel-level attention) [5] [6] [19] (image-level attention)

  3. reweight training samples: [7] [8] [9]

  4. curriculum learning: [10] [11]

  5. bootstrapping: [12]

  6. negative learning: [18]

  7. Cyclical Training: [20]
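The label-flip layer of solution 1 ([1] [2] [3]) can be sketched as appending a label-transition matrix T to the softmax output, so the cross-entropy against noisy labels is computed through T rather than against the clean-label posterior directly. A minimal sketch; the 3-class T is illustrative:

```python
import numpy as np

def noisy_posterior(clean_probs, T):
    """Map the network's clean-label posterior to the noisy-label posterior
    via a transition matrix T, where T[i, j] = P(noisy = j | true = i)."""
    return clean_probs @ T

# Toy example: 3 classes; 20% of class-0 labels flip to class 1.
T = np.array([[0.8, 0.2, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
clean = np.array([1.0, 0.0, 0.0])      # network is sure the true class is 0
noisy = noisy_posterior(clean, T)      # probability mass moves as T dictates
```

In practice T is either estimated from a small clean set or learned jointly with the network (with constraints keeping its rows on the simplex).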

Use auxiliary clean data:

  1. active learning (select training samples to annotate): [13]

  2. reinforcement learning (learn labeling policies): [14]

  3. analogous to semi-supervised learning

    • partial data with both noisy labels and clean labels as well as partial data with only noisy labels [15] [3] [7]
    • partial data with noisy labels and partial data with clean labels [16] [17]

Datasets:

There are two types of label noise: synthetic label noise and web label noise.

Surveys:

Reference

[1] Chen, Xinlei, and Abhinav Gupta. “Webly supervised learning of convolutional networks.” ICCV, 2015.

[2] Sukhbaatar, Sainbayar, et al. “Training convolutional networks with noisy labels.” arXiv preprint arXiv:1406.2080 (2014).

[3] Xiao, Tong, et al. “Learning from massive noisy labeled data for image classification.” CVPR, 2015.

[4] Zhuang, Bohan, et al. “Attend in groups: a weakly-supervised deep learning framework for learning from web data.” CVPR, 2017.

[5] Wu, Jiajun, et al. “Deep multiple instance learning for image classification and auto-annotation.” CVPR, 2015.

[6] Ilse, Maximilian, Jakub M. Tomczak, and Max Welling. “Attention-based deep multiple instance learning.” arXiv preprint arXiv:1802.04712 (2018).

[7] Lee, Kuang-Huei, et al. “Cleannet: Transfer learning for scalable image classifier training with label noise.” CVPR, 2018.

[8] Liu, Tongliang, and Dacheng Tao. “Classification with noisy labels by importance reweighting.” T-PAMI, 2015.

[9] Misra, Ishan, et al. “Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels.” CVPR, 2016.

[10] Guo, Sheng, et al. “Curriculumnet: Weakly supervised learning from large-scale web images.” ECCV, 2018.

[11] Jiang, Lu, et al. “Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels.” arXiv preprint arXiv:1712.05055 (2017).

[12] Reed, Scott, et al. “Training deep neural networks on noisy labels with bootstrapping.” arXiv preprint arXiv:1412.6596 (2014).

[13] Krause, Jonathan, et al. “The unreasonable effectiveness of noisy data for fine-grained recognition.” ECCV, 2016.

[14] Yeung, Serena, et al. “Learning to learn from noisy web videos.” CVPR, 2017.

[15] Veit, Andreas, et al. “Learning from noisy large-scale datasets with minimal supervision.” CVPR, 2017.

[16] Xu, Zhe, et al. “Webly-supervised fine-grained visual categorization via deep domain adaptation.” T-PAMI, 2016.

[17] Li, Yuncheng, et al. “Learning from noisy labels with distillation.” ICCV, 2017.

[18] Kim, Youngdong, et al. “Nlnl: Negative learning for noisy labels.” ICCV, 2019.

[19] “MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition”, CVPR, 2019.

[20] Huang, Jinchi, et al. “O2u-net: A simple noisy label detection approach for deep neural networks.” ICCV, 2019.

Weak-shot object detection is also called cross-supervised or mixed-supervised object detection. Specifically, all categories are split into base categories and novel categories. Base categories have box-level annotations, while novel categories only have image-level annotations.

  • Transfer common objectness: [1] [3]

  • Transfer the mapping from inaccurate bounding boxes to accurate bounding boxes: [2]

Reference

[1] Zhong, Yuanyi, et al. “Boosting weakly supervised object detection with progressive knowledge transfer.” ECCV, 2020.

[2] Chen, Zitian, et al. “Cross-Supervised Object Detection.” arXiv preprint arXiv:2006.15056 (2020).

[3] Li, Yan, et al. “Mixed supervised object detection with robust objectness transfer.” IEEE transactions on pattern analysis and machine intelligence 41.3 (2018): 639-653.

Given the segmentation mask of the first frame of a video clip, predict the segmentation masks in the subsequent frames.

  1. DAVIS challenge https://davischallenge.org/ held since 2017, related papers [1] [2]

  2. YouTube-VOS: A Large-Scale Benchmark for Video Object Segmentation https://youtube-vos.org/home

  3. GyGO: an E-commerce Video Object Segmentation Dataset by Visualead https://github.com/ilchemla/gygo-dataset
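On these benchmarks, semi-supervised VOS is typically scored per frame by the region similarity J, i.e. the intersection-over-union between predicted and ground-truth masks (as in [1] [2]), usually averaged with a boundary measure F. A minimal sketch of J:

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """DAVIS-style region similarity J: intersection-over-union between the
    predicted and ground-truth binary masks of one frame."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:            # both masks empty: define J = 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

# Toy example: the prediction covers 2 of 3 ground-truth pixels plus one
# false positive, so J = 2 / 4 = 0.5.
gt = np.array([[1, 1, 0], [1, 0, 0]])
pred = np.array([[1, 1, 1], [0, 0, 0]])
j = region_similarity(pred, gt)
```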

Reference:

  1. Perazzi, Federico, et al. “A benchmark dataset and evaluation methodology for video object segmentation.” CVPR, 2016.

  2. Pont-Tuset, Jordi, et al. “The 2017 DAVIS Challenge on Video Object Segmentation.” arXiv preprint arXiv:1704.00675 (2017).

  • Track-by-Detect: MaskTrack R-CNN [1]

  • Clip-Match: VisTR [2]

  • Propose-Reduce: [3]

Reference

[1] Yang, Linjie, Yuchen Fan, and Ning Xu. “Video instance segmentation.” ICCV, 2019.

[2] Wang, Yuqing, et al. “End-to-end video instance segmentation with transformers.” CVPR, 2021.

[3] Lin, Huaijia, et al. “Video instance segmentation with a propose-reduce paradigm.” ICCV, 2021.
