Newly Blog

Spatial Transformation

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

parametric transform (affine transformation, thin-plate spline transformation, etc.): STN [2], hierarchical STN [5], deformable style transfer [10]
learn distortion grid [1] [6] [7]
learn conv offset: Deformable CNN v1[3], v2[4], deformable kernel [9]
optical flow: [8]
swap disentangled geometry-relevant feature
move keypoints: TransGaGa [11]
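
To make the parametric case above more concrete, here is a minimal sketch of the grid generator and bilinear sampler used by STN-style affine warping [2]. It is pure Python on nested lists. The function names, the normalized [-1, 1] coordinate convention, and the zero padding outside the image are assumptions made for illustration, and the localization network that would predict `theta` is omitted.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a 2D list `img` at continuous pixel coords (x, y); zero outside."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                # Weight decays linearly with distance to the neighbor.
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * img[yy][xx]
    return value

def affine_warp(img, theta):
    """Warp `img` (at least 2x2) with a 2x3 affine matrix `theta`.

    `theta` maps normalized target coordinates in [-1, 1] to normalized
    source coordinates (an assumed convention for this sketch).
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalized target coordinates.
            xt = 2 * j / (w - 1) - 1
            yt = 2 * i / (h - 1) - 1
            # Source coordinates given by the affine transform.
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            # Back to pixel coordinates, then bilinear sampling.
            px = (xs + 1) * (w - 1) / 2
            py = (ys + 1) * (h - 1) / 2
            out[i][j] = bilinear_sample(img, px, py)
    return out
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` the image is returned unchanged. A nonzero last column shifts the sampling grid, and samples that fall outside the image are filled with zeros.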

Reference

[1] Recasens, Adria, et al. “Learning to zoom: a saliency-based sampling layer for neural networks.” ECCV, 2018.

[2] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” NIPS, 2015.

[3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei: Deformable Convolutional Networks. ICCV 2017.

[4] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai: Deformable ConvNets v2: More Deformable, Better Results. CoRR abs/1811.11168 (2018)

[5] Shu, Chang, et al. “Hierarchical Spatial Transformer Network.” arXiv preprint arXiv:1801.09467 (2018).

[6] Zheng, Heliang, et al. “Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition.” CVPR, 2019.

[7] Marin, Dmitrii, et al. “Efficient segmentation: Learning downsampling near semantic boundaries.” ICCV, 2019.

[8] Ren, Yurui, et al. “Deep Image Spatial Transformation for Person Image Generation.” CVPR, 2020.

[9] Gao, Hang, et al. “Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation.” arXiv preprint arXiv:1910.02940 (2019).

[10] Kim, Sunnie SY, et al. “Deformable Style Transfer.” arXiv preprint arXiv:2003.11038 (2020).

[11] Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019

Smoothness Loss

Posted on 2026-03-17 Edited on 2023-06-12 In paper note

Total Variation (TV) loss
Poisson blending loss [1] [2]
Gradient loss [1]
Laplacian loss [1] [2]
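
As a small illustration of the first item above, the following is a sketch of an anisotropic total variation penalty on a single-channel image stored as nested lists. The function name and the choice of the L1 (absolute difference) form are assumptions. Papers also use squared differences or an isotropic variant.

```python
def total_variation(img):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])  # vertical neighbor
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])  # horizontal neighbor
    return tv
```

A constant image has zero penalty, and every edge adds to the loss, which is why minimizing it encourages smooth outputs.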

Semi-supervised Learning

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Surveys

An Overview of Deep Semi-Supervised Learning

Semi-supervised learning

MixMatch [4]: unifies data augmentation, sharpening (entropy minimization), and mixup in one framework.
unsupervised data augmentation [5] [code]
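
Two of the MixMatch ingredients named above, sharpening and mixup, are small enough to sketch directly. The code below is an illustrative pure-Python version, not the authors' implementation. The function names and the default temperature are assumptions. The `max(lam, 1 - lam)` step follows the MixMatch variant of mixup, which keeps the mixed sample closer to its first argument.

```python
import random

def sharpen(probs, temperature=0.5):
    """Lower the entropy of a probability vector by raising it to the power 1/T."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def mixup(x1, x2, lam):
    """Convex combination of two vectors, biased toward x1."""
    lam = max(lam, 1.0 - lam)
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]

# Example usage: lam is typically drawn from a Beta(alpha, alpha) distribution.
lam = random.betavariate(0.75, 0.75)
mixed = mixup([1.0, 0.0], [0.0, 1.0], lam)
```

Sharpening leaves a uniform prediction unchanged and pushes any other prediction toward its most likely class.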

Co-training for semi-supervised learning

multi-view: co-training [1], tri-net [2]
multi-graph: label propagation [3]

Reference

[1] Deep Co-Training for Semi-Supervised Image Recognition

[2] Tri-net for Semi-Supervised Deep Learning

[3] Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition

[4] Berthelot, David, et al. “MixMatch: A holistic approach to semi-supervised learning.” arXiv preprint arXiv:1905.02249 (2019).

[5] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le, “Unsupervised Data Augmentation for Consistency Training.” arXiv preprint arXiv:1904.12848 (2019).

Self-supervised Learning

Posted on 2026-03-17 Edited on 2023-06-02 In paper note

Design a proxy task using unlabeled or weakly-labeled data to help the original task. Essentially, self-supervised learning is multi-task learning in which the proxy task does not rely on heavy human annotation. The open question is which annotation-free proxy task is the most effective.

Please refer to the tutorial slides [1] [2], the survey, and the paper list.

image-to-image
- image-to-image translation: colorization [1], inpainting [2], cross-channel generation [3]
- spatial location: relative location [1], jigsaw [2], predicting rotation [3]
- contrastive learning: instance-wise contrastive learning (e.g., MOCO), prototypical contrastive learning (clustering) [1] [2]
- MAE: Siamese MAE
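
As one concrete instance of the spatial-location proxy tasks listed above, rotation prediction builds its labels for free: each image is rotated by a multiple of 90 degrees and the network is trained to predict which rotation was applied. The sketch below only generates the (image, label) pairs. The function names and the clockwise convention are assumptions.

```python
def rotate90(img):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_task(img):
    """Return four (rotated image, label) pairs; label k means k * 90 degrees."""
    samples = []
    current = [list(row) for row in img]
    for k in range(4):
        samples.append((current, k))
        current = rotate90(current)
    return samples
```

A classifier trained on these pairs needs no human annotation, since the label is determined by the transformation itself.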

video-to-image
- temporal coherence: [1] [2] [3]
- temporal order: [1] [2] [3]
- unsupervised image tasks with video clues: clustering [1], optical flow prediction [1], unsupervised segmentation based on optical flow [1] [2], unsupervised depth estimation based on optical flow [2]
- video generation [1]
- cross-modal consistency: consistency between visual kernel and optical flow kernel [1]

video-to-video: all video-to-image methods can be used for video-to-video by averaging frame features.
- 3D rotation [1]
- Cubic puzzle [1]
- video localization and classification [1]

Multi-task self-supervised learning: integrate multiple proxy tasks [1] [2]

Combined with other frameworks: self-supervised GAN [1]

A recent paper [1*] claims that the best self-supervised learning method is still the earliest image inpainting model. The design of the network architecture has a significant impact on the performance of self-supervised learning methods.

SimCLR [2*] is a SOTA self-supervised learning method with performance approaching supervised learning.
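
For reference, the contrastive objective behind SimCLR-style training can be sketched as a normalized temperature-scaled cross entropy over pairs of augmented views. The version below is a minimal pure-Python illustration, not the reference implementation. The function names, the pairing convention (rows `2k` and `2k+1` are two views of the same image), and the default temperature are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(embeddings, temperature=0.5):
    """Average contrastive loss; embeddings[2k] and embeddings[2k+1] form a positive pair."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive view
        # Similarities to every other sample (the positive and all negatives).
        logits = [cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i]
        pos = cosine(embeddings[i], embeddings[j]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos
    return total / n
```

The loss is small when the two views of each image are close to each other and far from the other images in the batch.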

Reference

[1*] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer: Revisiting Self-Supervised Visual Representation Learning. CVPR 2019.

[2*] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” arXiv preprint arXiv:2002.05709 (2020).

ROI Feature

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

ROI pooling
RoIAlign (ROI alignment) [1]
Precise RoI Pooling [2]
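
The main difference between ROI pooling and RoIAlign is that the former quantizes the box and bin coordinates to integers, while the latter samples the feature map at continuous locations with bilinear interpolation. The sketch below is a simplified illustration of the RoIAlign idea, not the Mask R-CNN implementation: it takes a single sample at the center of each bin (the paper uses several sample points per bin), and the function names and the convention that integer coordinates denote pixel centers are assumptions.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly interpolate a 2D list `feat` at continuous coords (x, y), clamped to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def roi_align(feat, roi, out_size):
    """Pool `roi` = (x1, y1, x2, y2), given in continuous feature-map coordinates,
    into an out_size x out_size grid using one bilinear sample per bin."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    return [[bilinear(feat, x1 + (j + 0.5) * bin_w, y1 + (i + 0.5) * bin_h)
             for j in range(out_size)]
            for i in range(out_size)]
```

Because no coordinate is rounded, a small shift of the box produces a small change in the pooled features, which is the property RoIAlign is designed to preserve.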

Reference

[1] He, Kaiming, et al. “Mask R-CNN.” ICCV, 2017.

[2] Jiang, Borui, et al. “Acquisition of localization confidence for accurate object detection.” ECCV, 2018.

Repeated Patterns

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

detect repeated patterns [1]
inpaint corrupted images with repeated patterns [2]: uses Fourier (frequency-domain) convolutions

Reference

[1] Louis Lettry, Michal Perdoch, Kenneth Vanhoey, and Luc Van Gool. “Repeated pattern detection using CNN activations.” WACV, 2017.

[2] Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” arXiv preprint arXiv:2109.07161 (2021).

Receptive Field

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

The effective receptive field is smaller than the theoretical receptive field: relative to the theoretical one, it shrinks at a rate of $O(\frac{1}{\sqrt{n}})$, with $n$ being the number of layers.
Advanced networks (e.g., ResNet) have larger receptive fields than older networks (e.g., AlexNet). In the latest networks, the receptive field of each pixel in the last layer is as large as the whole image. Generally, a larger receptive field leads to higher accuracy, but it is not the only factor that influences accuracy.
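
The theoretical receptive field mentioned above can be computed layer by layer with the standard recurrence: each layer enlarges the receptive field by `(kernel - 1)` times the accumulated stride of the layers before it. The sketch below assumes a plain chain of convolution or pooling layers without dilation. The function name and the `(kernel_size, stride)` input format are chosen for illustration.

```python
def receptive_field(layers):
    """Theoretical receptive field of a chain of layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions with stride 1 see a 5x5 input window.
print(receptive_field([(3, 1), (3, 1)]))  # 5
```

This gives only the theoretical size. The effective receptive field studied in the reference below is concentrated on a smaller central region.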

Fomoro: a website to calculate receptive field.

Distill: mathematical derivations and open-source library to compute receptive field.

Reference

Wenjie Luo, Yujia Li, Raquel Urtasun, Richard S. Zemel: Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. NIPS, 2016.

Prompt for Vision

Posted on 2026-03-17 Edited on 2022-10-17 In paper note

Prompt for image-to-image translation: [1]
Prompt for visual grounding: [2]

References

[1] Bar, Amir, et al. “Visual Prompting via Image Inpainting.” arXiv preprint arXiv:2209.00647 (2022).

[2] Yao, Yuan, et al. “CPT: Colorful prompt tuning for pre-trained vision-language models.” arXiv preprint arXiv:2109.11797 (2021).

Privileged Information

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Learning Using Privileged Information (LUPI) or SVM+ was proposed by Vapnik in [the first paper].

High-level ideas:

Use privileged information in the same way as for multi-view learning
Transfer between privileged information and primary information
Use privileged information to control the training process, e.g., by modeling training uncertainty or training difficulty (training loss, noise).

Applications:

SVM for binary classification
- model the slack variable: SVM+ [1]
- model the margin: [1] [2]
- structural SVM: [1]
- theoretical analysis: [1] [2]
Gaussian process classification
- GPC [1]
L2 loss for classification/Hash
- multi-labeling [1]
- Hash ITQ [1]
clustering
- clustering [1]
metric learning for verification/classification
- ITML+ [1] [2]
- DML+ [1]
- OITML [1]: ordinal-based ITML
CRF
- probabilistic inference [1]: similar to multi-view learning, but integrates over the latent privileged information space during testing
random forest
- conditional regression forest [1]: design node splitting criterion
matrix factorization for collaborative filtering
- PriMF [1]
Maximum Entropy Discrimination
- MED [1]
Deep Learning
- Hallucination network
- classification loss [1]
- model drop-out [1]

Settings:

multi-view + LUPI [1]
multi-task multi-class LUPI [1]
multi-instance LUPI [1]
active learning + LUPI [1]
distillation + LUPI [1]
domain adaptation + LUPI [1]

Non-local Network

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Extensions of non-local network [1]: [2] [3] [4] [5]
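
The core non-local operation computes each output position as a weighted sum over all positions, with weights derived from pairwise feature similarity. The sketch below is a heavily simplified illustration in pure Python: it uses a softmax over raw dot products and omits the learned embeddings, the output projection, and the residual connection of the block in [1]. The function name and the list-of-vectors input format are assumptions.

```python
import math

def non_local(features):
    """features: list of N feature vectors, one per position. Returns N aggregated vectors."""
    outputs = []
    for xi in features:
        # Pairwise similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        # Weighted sum of all positions' features.
        out = [sum(w / z * xj[d] for w, xj in zip(weights, features))
               for d in range(len(xi))]
        outputs.append(out)
    return outputs
```

The cost is quadratic in the number of positions, which is the bottleneck that several of the extensions cited above aim to reduce.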

Reference

[1] Wang, Xiaolong, et al. “Non-local neural networks.” CVPR, 2018.

[2] Zhu, Zhen, et al. “Asymmetric non-local neural networks for semantic segmentation.” ICCV, 2019.

[3] Li, Xia, et al. “Expectation-maximization attention networks for semantic segmentation.” ICCV, 2019.

[4] Huang, Zilong, et al. “CCNet: Criss-cross attention for semantic segmentation.” ICCV, 2019.

[5] Zhang, Li, et al. “Dynamic graph message passing networks.” CVPR, 2020.