Semi-supervised Learning
Surveys
An Overview of Deep Semi-Supervised Learning
Semi-supervised learning
MixMatch [4]: unifies data augmentation, label sharpening (which lowers prediction entropy), and mixup in a single framework.
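The sharpening and mixup steps can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the reference implementation of [4]: the function names `sharpen` and `mixup`, the temperature of 0.5, and `alpha=0.75` are assumptions chosen for the example.

```python
import random

def sharpen(probs, temperature=0.5):
    """Lower the entropy of a class distribution: p_i^(1/T) / sum_j p_j^(1/T)."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def mixup(x1, y1, x2, y2, alpha=0.75, rng=random):
    """Convex combination of two (input, label) pairs, biased toward the first."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep the mixed example closer to (x1, y1)

    def mix(a, b):
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    return mix(x1, x2), mix(y1, y2), lam

# A guessed label (e.g., an average prediction over augmented views) is
# sharpened before it is used as a training target (hypothetical values).
guess = sharpen([0.5, 0.3, 0.2])
```

With a temperature below 1, the largest class probability grows and the others shrink, which is the low-entropy effect mentioned above.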
Co-training for semi-supervised learning
Reference
[1] Deep Co-Training for Semi-Supervised Image Recognition
[2] Tri-net for Semi-Supervised Deep Learning
[3] Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition
[4] Berthelot, David, et al. “MixMatch: A holistic approach to semi-supervised learning.” arXiv preprint arXiv:1905.02249 (2019).
[5] Xie, Qizhe, et al. “Unsupervised data augmentation for consistency training.” arXiv preprint arXiv:1904.12848 (2019).
Self-supervised Learning
Design a proxy task using unlabeled or weakly-labeled data to help the original task. Essentially, self-supervised learning is multi-task learning in which the proxy task does not rely on heavy human annotation. The open question is which annotation-free proxy task is the most effective.
Please refer to the tutorial slides [1] [2], the survey, and the paper list.
image-to-image
- image-to-image translation: colorization [1], inpainting [2], cross-channel generation [3]
- spatial location: relative location [1], jigsaw [2], predicting rotation [3]
- contrastive learning: instance-wise contrastive learning (e.g., MoCo), prototypical contrastive learning (clustering) [1] [2]
- MAE: Siamese MAE
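As an illustration of how a proxy task from the list above yields labels for free, here is a minimal sketch of the rotation-prediction task: each image is rotated by a random multiple of 90 degrees and the rotation index becomes the classification target. Images are plain 2-D lists here, and the helper names `rotate90` and `make_rotation_batch` are hypothetical.

```python
import random

def rotate90(image, k):
    """Rotate a 2-D grid (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return image

def make_rotation_batch(images, rng=random):
    """Return (rotated_image, label) pairs; the label is the rotation index 0..3."""
    batch = []
    for img in images:
        k = rng.randrange(4)
        batch.append((rotate90(img, k), k))
    return batch

# Toy 2x2 "images" stand in for real data.
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
batch = make_rotation_batch(images, rng=random.Random(0))
```

A network trained to predict the label from the rotated image needs no human annotation, since the label is generated together with the input.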
video-to-image
- temporal coherence: [1] [2] [3]
- temporal order: [1] [2] [3]
- unsupervised image tasks with video clues: clustering [1], optical flow prediction [1], unsupervised segmentation based on optical flow [1] [2], unsupervised depth estimation based on optical flow [2]
- video generation [1]
- cross-modal consistency: consistency between visual kernel and optical flow kernel [1]
video-to-video: all video-to-image methods can be used for video-to-video by averaging frame features.
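A minimal sketch of the frame-averaging step, assuming each frame has already been mapped to a fixed-length feature vector by some image model (the function name `video_feature` is hypothetical):

```python
def video_feature(frame_features):
    """Average per-frame feature vectors into one clip-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    # zip(*...) groups the values of each feature dimension across frames.
    return [sum(dim) / n for dim in zip(*frame_features)]

# Two frames with 3-D features (made-up numbers).
clip = video_feature([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

Averaging discards temporal order, so it is a simple baseline rather than a way to model motion.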
Multi-task self-supervised learning: integrate multiple proxy tasks [1] [2]
Combined with other frameworks: self-supervised GAN [1]
A recent paper [1*] claims that the best self-supervised learning method is still the earliest image inpainting model. The design of the network architecture has a significant impact on the performance of self-supervised learning methods.
SimCLR [2*] was a state-of-the-art (SOTA) self-supervised learning method at its publication in 2020, with performance approaching that of supervised learning.
Reference
[1*] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer: Revisiting Self-Supervised Visual Representation Learning. CVPR 2019.
[2*] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” arXiv preprint arXiv:2002.05709 (2020).
Repeated Patterns
detect repeated patterns [1]
inpaint corrupted images with repeated patterns [2]: uses Fourier convolutions
Reference
[1] Louis Lettry, Michal Perdoch, Kenneth Vanhoey, and Luc Van Gool. Repeated pattern detection using CNN activations. In WACV, 2017.
[2] Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” arXiv preprint arXiv:2109.07161 (2021).
Receptive Field

The effective receptive field is smaller than the theoretical receptive field. It grows as $O(\sqrt{n})$ while the theoretical receptive field grows as $O(n)$, so its size relative to the theoretical one shrinks as $O\left(\frac{1}{\sqrt{n}}\right)$, with $n$ being the number of layers (Luo et al., 2016).
Newer networks (e.g., ResNet) have a larger receptive field than older networks (e.g., AlexNet). In recent networks, the theoretical receptive field of each unit in the last layer is as large as the whole image. A larger receptive field generally goes with higher accuracy, but it is not the only factor that influences accuracy.
Fomoro: a website to calculate receptive field.
Distill: mathematical derivations and open-source library to compute receptive field.
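The theoretical receptive field can also be computed by hand with the standard layer-by-layer recurrence: each layer enlarges the field by (kernel size - 1) times the current jump, where the jump is the product of all earlier strides. The sketch below is a minimal stand-alone version that ignores dilation; the function name and the layer lists are illustrative assumptions, not taken from either tool above.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Returns (receptive_field, jump) measured in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by the accumulated stride
        jump *= stride
    return rf, jump

# Three stacked 3x3 convolutions with stride 1.
rf_small, _ = receptive_field([(3, 1), (3, 1), (3, 1)])

# An 11x11 convolution with stride 4 followed by 3x3 pooling with stride 2.
rf_large, jump_large = receptive_field([(11, 4), (3, 2)])
```

Padding shifts where the receptive field is centered but does not change its size, so it is left out of the recurrence.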
Reference
- Wenjie Luo, Yujia Li, Raquel Urtasun, Richard S. Zemel: Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. NIPS, 2016.
Prompt for Vision
Privileged Information
Learning Using Privileged Information (LUPI) or SVM+ was proposed by Vapnik in [the first paper].
High-level ideas:
- Use privileged information in the same way as for multi-view learning
- Transfer between privileged information and primary information
- Use privileged information to control the training process, for example by modeling training uncertainty or training difficulty (e.g., training loss, noise).
Applications:
SVM for binary classification
Gaussian process classification
- GPC [1]
L2 loss for classification/Hash
clustering
- clustering [1]
metric learning for verification/classification
CRF
- probabilistic inference [1]: similar to multi-view learning, but integrates over the latent privileged information space during testing
random forest
- conditional regression forest [1]: designs a node-splitting criterion
matrix factorization for collaborative filtering
- PriMF [1]
Maximum Entropy Discrimination
- MED [1]
Deep Learning
- Hallucination network
- classification loss [1]
- model drop-out [1]
Settings:
Non-local Network
Extensions of non-local network [1]: [2] [3] [4] [5]
Reference
[1] Wang, Xiaolong, et al. “Non-local neural networks.” CVPR, 2018.
[2] Zhu, Zhen, et al. “Asymmetric non-local neural networks for semantic segmentation.” ICCV, 2019.
[3] Li, Xia, et al. “Expectation-maximization attention networks for semantic segmentation.” ICCV, 2019.
[4] Huang, Zilong, et al. “CCNet: Criss-cross attention for semantic segmentation.” ICCV, 2019.
[5] Zhang, Li, et al. “Dynamic graph message passing networks.” CVPR, 2020.