Design a proxy task using unlabeled or weakly-labeled data to help the original task. Essentially, self-supervised learning is multi-task learning in which the proxy task does not rely on heavy human annotation. The open question is which annotation-free proxy task is the most effective.

Please refer to the tutorial slides [1] [2], the survey, and the paper list.

  1. image-to-image

    • image-to-image translation: colorization [1], inpainting [2], cross-channel generation [3]
    • spatial location: relative location [1], jigsaw [2], predicting rotation [3]
    • contrastive learning: instance-wise contrastive learning (e.g., MOCO), prototypical contrastive learning (clustering) [1] [2]
    • MAE: Siamese MAE
  2. video-to-image

    • temporal coherence: [1] [2] [3]
    • temporal order: [1] [2] [3]
    • unsupervised image tasks with video cues: clustering [1], optical flow prediction [1], unsupervised segmentation based on optical flow [1] [2], unsupervised depth estimation based on optical flow [2]
    • video generation [1]
    • cross-modal consistency: consistency between visual kernel and optical flow kernel [1]
  3. video-to-video: any video-to-image method can be adapted to video-to-video by averaging frame features.

    • 3D rotation [1]
    • Cubic puzzle [1]
    • video localization and classification [1]
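
The frame-averaging step mentioned above can be sketched as simple mean pooling over per-frame features; the image backbone that produces the per-frame vectors is assumed and not shown:

```python
# Sketch: lift an image-level self-supervised feature to a video-level one
# by averaging per-frame feature vectors (the simplest temporal pooling).
# In practice the vectors would come from an image backbone applied per frame.

def average_frame_features(frame_features):
    """Mean-pool a list of equal-length feature vectors into one video feature."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

video_feat = average_frame_features([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
print(video_feat)  # [3.0, 2.0]
```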

Multi-task self-supervised learning: integrate multiple proxy tasks [1] [2]

Combined with other frameworks: self-supervised GAN [1]

A recent paper [1*] claims that the best self-supervised learning method is still the earliest image inpainting model. The design of the network architecture has a significant impact on the performance of self-supervised learning methods.

SimCLR [2*] is a state-of-the-art self-supervised learning method whose performance approaches that of supervised learning.
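
SimCLR's core is the NT-Xent contrastive loss over augmented views. A minimal pure-Python sketch, assuming embeddings are arranged so that consecutive pairs (2k, 2k+1) are two augmented views of the same image; this is an illustration, not the paper's implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(embeddings, temperature=0.5):
    """NT-Xent loss over 2N embeddings; (2k, 2k+1) are positive pairs."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of i's positive partner
        # denominator: all other samples, including the positive
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i]
        pos = cosine(embeddings[i], embeddings[j]) / temperature
        # -log(exp(pos) / sum_k exp(logit_k))
        total += -pos + math.log(sum(math.exp(l) for l in logits))
    return total / n
```

Aligned positive pairs yield a lower loss than misaligned ones, which is what drives the representation toward augmentation invariance.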

Reference

[1*] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer: Revisiting Self-Supervised Visual Representation Learning. CVPR 2019.

[2*] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” arXiv preprint arXiv:2002.05709 (2020).

  1. RoI Pooling
  2. RoI Align [1]
  3. Precise RoI Pooling [2]
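
RoI Align's key idea is to avoid quantizing box coordinates by bilinearly interpolating the feature map at continuous sampling points. A simplified sketch that samples a single bilinear point per output bin; the real operator averages several sampling points per bin and applies a spatial scale:

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous coords (y, x)."""
    h, w = len(feat), len(feat[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0][x0] * (1 - dy) * (1 - dx) + feat[y0][x1] * (1 - dy) * dx
            + feat[y1][x0] * dy * (1 - dx) + feat[y1][x1] * dy * dx)

def roi_align(feat, box, out_size):
    """Pool box = (y1, x1, y2, x2) into an out_size x out_size grid,
    sampling one bilinear point at each output bin centre (no quantization)."""
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    return [[bilinear(feat, y1 + (i + 0.5) * bh, x1 + (j + 0.5) * bw)
             for j in range(out_size)] for i in range(out_size)]
```

Because every sampled value is a differentiable function of the box coordinates, gradients flow to the box as well as the features, which is what Precise RoI Pooling [2] exploits.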

Reference

  1. He, Kaiming, et al. “Mask R-CNN.” ICCV, 2017.
  2. Jiang, Borui, et al. “Acquisition of localization confidence for accurate object detection.” ECCV, 2018.

  1. detect repeated patterns [1]

  2. inpaint corrupted images with repeated patterns [2]: uses fast Fourier convolutions
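
The Fourier convolutions in [2] rest on the convolution theorem: pointwise multiplication in the frequency domain equals circular convolution in the spatial domain, so a single layer sees the whole input. A minimal 1-D illustration with a naive O(n²) DFT, for clarity only:

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_conv_fft(x, h):
    """Circular convolution via pointwise multiplication in the frequency domain."""
    return [c.real for c in idft([a * b for a, b in zip(dft(x), dft(h))])]

# Convolving with a shifted impulse circularly shifts the signal.
print(circular_conv_fft([1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0]))
```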

Reference

[1] Lettry, Louis, et al. “Repeated Pattern Detection Using CNN Activations.” WACV, 2017.

[2] Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” arXiv preprint arXiv:2109.07161 (2021).

  1. The effective receptive field is smaller than the theoretical receptive field; its fraction of the theoretical size shrinks as $\frac{1}{\sqrt{n}}$, where $n$ is the number of layers.

  2. Advanced networks (e.g., ResNet) have larger receptive fields than older networks (e.g., AlexNet). In the latest networks, the receptive field of each pixel in the last layer covers the whole image. Generally, a larger receptive field leads to higher accuracy, but it is not the only factor that influences accuracy.
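
The theoretical receptive field itself is easy to compute recursively; a minimal sketch for a plain convolution stack, ignoring padding and dilation:

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers.
    `layers` is a list of (kernel_size, stride) pairs, input to output."""
    rf, jump = 1, 1  # jump = distance between adjacent samples in input pixels
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) * jump
        jump *= s             # striding spreads subsequent samples apart
    return rf

# Five 3x3 stride-1 convs: the field grows linearly, 1 + 5 * 2 = 11.
print(receptive_field([(3, 1)] * 5))  # 11
```

Strided layers multiply the growth of later layers, which is why downsampling networks reach image-sized receptive fields with few layers.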

Fomoro: a website to calculate receptive fields.

Distill: mathematical derivations and an open-source library to compute receptive fields.

Reference

  1. Luo, Wenjie, et al. “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks.” NIPS, 2016.

  • Prompt for image-to-image translation: [1]

  • Prompt for visual grounding: [2]

References

[1] Bar, Amir, et al. “Visual Prompting via Image Inpainting.” arXiv preprint arXiv:2209.00647 (2022).

[2] Yao, Yuan, et al. “Cpt: Colorful prompt tuning for pre-trained vision-language models.” arXiv preprint arXiv:2109.11797 (2021).

Learning Using Privileged Information (LUPI), also known as SVM+, was proposed by Vapnik in [the first paper].

High-level ideas:

  • Use privileged information in the same way as for multi-view learning
  • Transfer between privileged information and primary information
  • Use privileged information to control the training process, e.g., via training uncertainty or training difficulty (training loss, noise).
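
As a hypothetical illustration of the last idea: privileged information can supply a per-sample difficulty score that reweights the training loss. The softmax weighting below is an assumption for the sketch, not a scheme from the cited papers:

```python
import math

def weighted_loss(losses, privileged_difficulty):
    """Down-weight samples that privileged information marks as hard or noisy.
    Weights are a softmax over negative difficulty (hypothetical scheme)."""
    exps = [math.exp(-d) for d in privileged_difficulty]
    z = sum(exps)
    weights = [e / z for e in exps]  # easy samples get larger weight
    return sum(w * l for w, l in zip(weights, losses))

# Equal difficulties reduce to the plain mean loss.
print(weighted_loss([1.0, 3.0], [0.0, 0.0]))  # 2.0
```

At test time the privileged difficulty is unavailable, but it is only needed during training, which is the defining property of the LUPI setting.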

Applications:

  • SVM for binary classification

    • model the slack variables: SVM+ [1]
    • model the margin: [1] [2]
    • structural SVM: [1]
    • theoretical analysis: [1] [2]
  • Gaussian process classification

  • L2 loss for classification/Hash

    • multi-labeling [1]
    • Hash ITQ [1]
  • clustering

    • clustering [1]
  • metric learning for verification/classification

  • CRF

    • probabilistic inference [1]: similar to multi-view learning, but integrates over the latent privileged-information space during testing
  • random forest

    • conditional regression forest [1]: design node splitting criterion
  • matrix factorization for collaborative filtering

  • Maximum Entropy Discrimination

  • Deep Learning

Settings:

  • multi-view + LUPI [1]
  • multi-task multi-class LUPI [1]
  • multi-instance LUPI [1]
  • active learning + LUPI [1]
  • distillation + LUPI [1]
  • domain adaptation + LUPI [1]

Extensions of non-local network [1]: [2] [3] [4] [5]

Reference

[1] Wang, Xiaolong, et al. “Non-local neural networks.” CVPR, 2018.

[2] Zhu, Zhen, et al. “Asymmetric non-local neural networks for semantic segmentation.” ICCV, 2019.

[3] Li, Xia, et al. “Expectation-maximization attention networks for semantic segmentation.” ICCV, 2019.

[4] Huang, Zilong, et al. “Ccnet: Criss-cross attention for semantic segmentation.” ICCV, 2019.

[5] Zhang, Li, et al. “Dynamic graph message passing networks.” CVPR, 2020.

  1. Transformer

  2. Large kernel: [1] [2] [3]

Reference

[1] Liu, Zhuang, et al. “A convnet for the 2020s.” CVPR, 2022.

[2] Ding, Xiaohan, et al. “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns.” CVPR, 2022.

[3] Liu, Shiwei, et al. “More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 using Sparsity.” arXiv preprint (2022).

Reference

[1] Mildenhall, Ben, et al. “Nerf: Representing scenes as neural radiance fields for view synthesis.” ECCV, 2020.

[2] Niemeyer, Michael, and Andreas Geiger. “Giraffe: Representing scenes as compositional generative neural feature fields.” CVPR, 2021.

  1. Concatenation/summation, or weighted (attention mechanism) concatenation/summation.
  2. Product of experts: P(y|x1,x2) ∝ P(y|x1)P(y|x2), with a Gaussian distribution assumption [1]
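
Under the Gaussian assumption, the (renormalised) product of two expert predictions is again Gaussian: precisions add, and the mean is the precision-weighted average. A minimal sketch:

```python
def product_of_gaussians(m1, v1, m2, v2):
    """Fuse two Gaussian experts N(m1, v1) and N(m2, v2):
    the renormalised product is N(m, v) with precisions adding."""
    v = 1.0 / (1.0 / v1 + 1.0 / v2)  # precisions (1/variance) add
    m = v * (m1 / v1 + m2 / v2)      # precision-weighted mean
    return m, v

# Two equally confident experts meet in the middle, with halved variance.
print(product_of_gaussians(0.0, 1.0, 4.0, 1.0))  # (2.0, 0.5)
```

A more confident expert (smaller variance) pulls the fused mean toward its own prediction, which is the behaviour the PoE-GAN paper [1] relies on for multimodal conditioning.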

Reference

  1. Huang, Xun, et al. “Multimodal Conditional Image Synthesis with Product-of-Experts GANs.” arXiv preprint arXiv:2112.05130 (2021).