Depthwise convolution: does not sum over the input channels; each filter operates on a single channel. So when the number of input channels is $n_{in}$ and the number of filters per channel is $n_{filter}$, the number of output channels is $n_{in}\times n_{filter}$.
Pointwise convolution: a $1\times 1$ convolution that is fully connected across all the channels at each spatial location.
Group convolution: divide the channels into several groups and perform pointwise convolution within each group (see the PyTorch sketch after this list). Note that pointwise convolution is the special case of group convolution with only one group. Group convolution is used in ShuffleNet.
Depthwise separable convolution = depthwise convolution + pointwise convolution
Dilated convolution (atrous convolution): increases the receptive field without increasing the number of parameters; typically used for segmentation.
Deformable convolution (left) and spatial transformer network (right): both belong to irregular convolution and tweak the sampling coordinates on the input feature map. Deformable convolution learns per-location offsets, while the spatial transformer network learns a global affine transformation.
Squeeze-and-Excitation: learns a separate weight for each channel (channel attention).
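A minimal PyTorch sketch of these building blocks, using the `groups` argument of `nn.Conv2d` for the depthwise and grouped variants; the channel counts, kernel sizes, and reduction ratio are illustrative assumptions, not values from any particular network:

```python
import torch
import torch.nn as nn

n_in, multiplier = 32, 2

# Depthwise: groups = n_in, so each input channel is convolved independently.
depthwise = nn.Conv2d(n_in, n_in * multiplier, kernel_size=3,
                      padding=1, groups=n_in)

# Pointwise: 1x1 convolution, fully connected across channels.
pointwise = nn.Conv2d(n_in * multiplier, 64, kernel_size=1)

# Grouped 1x1 convolution (ShuffleNet-style): channels split into 4 groups.
grouped_pointwise = nn.Conv2d(64, 64, kernel_size=1, groups=4)

# Dilated (atrous) 3x3 convolution: dilation=2 enlarges the receptive field
# to 5x5 without adding parameters.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn a per-channel reweighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * w.view(b, c, 1, 1)     # excite: rescale each channel

x = torch.randn(1, n_in, 56, 56)
y = SEBlock(64)(dilated(grouped_pointwise(pointwise(depthwise(x)))))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```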
VAE
Advanced VAE
References
[1] Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. “Neural discrete representation learning.” arXiv preprint arXiv:1711.00937 (2017).
[2] Razavi, Ali, Aaron van den Oord, and Oriol Vinyals. “Generating diverse high-fidelity images with vq-vae-2.” Advances in neural information processing systems. 2019.
[3] Vahdat, Arash, and Jan Kautz. “Nvae: A deep hierarchical variational autoencoder.” arXiv preprint arXiv:2007.03898 (2020).
[4] Bond-Taylor, Sam, et al. “Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes.” arXiv preprint arXiv:2111.12701 (2021).
[5] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman, “MaskGIT: Masked Generative Image Transformer”, arXiv preprint arXiv:2202.04200.
Deconvolution Layer
unpooling
transpose convolution (learnable upsampling)
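A short PyTorch sketch of the two upsampling mechanisms; the shapes and layer sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Unpooling: remember the argmax indices of a max-pool, then place values
# back at those locations when upsampling (the rest stays zero).
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
down, idx = pool(x)        # 1 x 8 x 16 x 16
up = unpool(down, idx)     # 1 x 8 x 32 x 32, sparse non-zeros

# Transposed convolution: learnable upsampling; stride=2 doubles resolution.
deconv = nn.ConvTranspose2d(8, 8, kernel_size=4, stride=2, padding=1)
up_learned = deconv(down)  # 1 x 8 x 32 x 32
print(up.shape, up_learned.shape)
```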
Tricky Back Propagation
Some operations are non-differentiable, which causes difficulties for backpropagation.
sample from a distribution: reparameterization trick (Auto-Encoding Variational Bayes); see the sketch after this list
argmax: soft argmax (Gradient Descent Optimization of Smoothed Information Retrieval Metrics)
crop: two-dimensional boxcar function (Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition)
delta function: lookup table + blur kernel (Reblur2Deblur: Deblurring Videos via Self-Supervised Learning)
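A minimal PyTorch sketch of the first two tricks (reparameterization and soft argmax); the tensor shapes and the temperature value are assumptions for illustration:

```python
import torch

# Reparameterization trick: sample z ~ N(mu, sigma^2) as mu + sigma * eps,
# so gradients flow through mu and log_var while eps carries the randomness.
mu = torch.randn(4, 16, requires_grad=True)
log_var = torch.randn(4, 16, requires_grad=True)
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Soft argmax: replace the hard argmax over scores with a softmax-weighted
# average of the index positions, which is differentiable.
scores = torch.randn(4, 10, requires_grad=True)
positions = torch.arange(10, dtype=scores.dtype)
soft_idx = (torch.softmax(scores / 0.1, dim=-1) * positions).sum(dim=-1)

(z.sum() + soft_idx.sum()).backward()  # gradients reach mu, log_var, scores
print(mu.grad.shape, scores.grad.shape)
```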
Transformer
The Transformer [1] is based on multi-head self-attention and can be used for sequence-to-sequence or sequence-to-label applications.
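A minimal sketch of scaled dot-product multi-head self-attention, the core operation of the Transformer; the dimensions are illustrative assumptions, and in practice a fused implementation such as `nn.MultiheadAttention` would be used:

```python
import math
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10
head_dim = d_model // n_heads
x = torch.randn(1, seq_len, d_model)       # (batch, tokens, dim)

qkv = nn.Linear(d_model, 3 * d_model)(x)   # project tokens to Q, K, V
q, k, v = qkv.chunk(3, dim=-1)

# Split each of Q, K, V into heads: (batch, heads, tokens, head_dim).
def split_heads(t):
    return t.reshape(1, seq_len, n_heads, head_dim).transpose(1, 2)

q, k, v = map(split_heads, (q, k, v))

# Attention weights: softmax of scaled pairwise dot products between tokens.
attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
out = (attn @ v).transpose(1, 2).reshape(1, seq_len, d_model)
print(out.shape)  # torch.Size([1, 10, 64])
```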
Applications
- classification: ViT[2] RVT [12]
- object detection: DETR[3]
- person reID: [7]
- general low-level vision (super-resolution, denoising, deraining) [4] [16]
- unconditional GAN: [10]
- video understanding: [11]
- image colorization: [6] [17]
- style transfer: [15]
Combination of Transformer and CNN:
Interpretability
- beyond attention visualization [5]
Position embedding
Lightweight Transformer
- XFormer [13]
A summary of efficient Transformers
Survey
Reference
[1] Vaswani, Ashish, et al. “Attention is all you need.” NeurIPS, 2017.
[2] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
[3] Carion, Nicolas, et al. “End-to-End Object Detection with Transformers.” arXiv preprint arXiv:2005.12872 (2020).
[4] Chen, Hanting, et al. “Pre-Trained Image Processing Transformer.” arXiv preprint arXiv:2012.00364 (2020).
[5] Chefer, Hila, Shir Gur, and Lior Wolf. “Transformer Interpretability Beyond Attention Visualization.” arXiv preprint arXiv:2012.09838 (2020).
[6] Manoj Kumar, Dirk Weissenborn & Nal Kalchbrenner, “COLORIZATION TRANSFORMER”, ICLR, 2021.
[7] Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang, “TransReID: Transformer-based Object Re-Identification”, arXiv preprint arXiv:2102.04378 (2021).
[8] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L. Yuille, Yuyin Zhou, “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation”, arXiv preprint arXiv:2102.04306 (2021).
[9] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan, “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet”, arXiv preprint arXiv:2101.11986 (2021).
[10] Yifan Jiang, Shiyu Chang, Zhangyang Wang, “TransGAN: Two Transformers Can Make One Strong GAN”, arXiv preprint arXiv:2102.07074 (2021).
[11] Bertasius, Gedas, Heng Wang, and Lorenzo Torresani. “Is Space-Time Attention All You Need for Video Understanding?” ICML, 2021.
[12] Mao, Xiaofeng, et al. “Towards Robust Vision Transformer.” CVPR, 2022.
[13] Zhao, Youpeng, et al. “Lightweight Vision Transformer with Cross Feature Attention.” arXiv preprint arXiv:2207.07268 (2022).
[14] Li, Jiashi, et al. “Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios.” arXiv preprint arXiv:2207.05501 (2022).
[15] Deng, Yingying, et al. “StyTr2: Image Style Transfer with Transformers.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[16] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, Houqiang Li. “Uformer: A General U-Shaped Transformer for Image Restoration.” CVPR, 2022.
[17] Shuchen Weng, Jimeng Sun, Yu Li, Si Li, and Boxin Shi. “CT2: Colorization Transformer via Color Tokens”, ECCV, 2022.
Texture Bias
It is claimed in [1] [2] [3] that CNNs are biased towards texture, that is, a CNN tends to classify an object based on its texture instead of its shape.
In [4], it is claimed that texture bias is caused by the data augmentation approach: different data augmentation strategies can induce either texture bias or shape bias. Similarly, [5] debiases shape and texture.
Reference
[1] Geirhos, Robert, et al. “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness.” ICLR, 2019.
[2] Brochu, Francis. “Increasing Shape Bias in ImageNet-Trained Networks Using Transfer Learning and Domain-Adversarial Methods.” arXiv preprint arXiv:1907.12892 (2019).
[3] Asadi, Nader, Mehrdad Hosseinzadeh, and Mahdi Eftekhari. “Towards Shape Biased Unsupervised Representation Learning for Domain Generalization.” arXiv preprint arXiv:1909.08245 (2019).
[4] Hermann, Katherine, Ting Chen, and Simon Kornblith. “The origins and prevalence of texture bias in convolutional neural networks.” NeurIPS, 2020.
[5] Li, Yingwei, et al. “Shape-Texture Debiased Neural Network Training.” ICLR, 2021.
StyleGAN
- StyleGAN of all trades [1]
- StyleGANv1[5]
- StyleGANv2 [6]: removes the blob-shaped artifacts that resemble water droplets.
- StyleGANv3 [2]: solves the aliasing (texture sticking) issue, i.e., details that appear glued to image coordinates instead of to the surfaces of the depicted objects.
- StyleGAN-XL [3]: extends StyleGAN to large, diverse datasets.
- 3D styleGAN [4]
Image editing using StyleGAN
InsetGAN [7]
Reference
[1] Chong, Min Jin, Hsin-Ying Lee, and David Forsyth. “StyleGAN of All Trades: Image Manipulation with Only Pretrained StyleGAN.” arXiv preprint arXiv:2111.01619 (2021).
[2] Karras, Tero, et al. “Alias-free generative adversarial networks.” Thirty-Fifth Conference on Neural Information Processing Systems. 2021.
[3] Sauer, Axel, Katja Schwarz, and Andreas Geiger. “Stylegan-xl: Scaling stylegan to large diverse datasets.” arXiv preprint arXiv:2202.00273 (2022).
[4] Xiaoming Zhao, Fangchang Ma, David Güera, Zhile Ren, Alexander G. Schwing, Alex Colburn. “Generative Multiplane Images: Making a 2D GAN 3D-Aware”.
[5] Karras, Tero, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
[6] Karras, Tero, et al. “Analyzing and improving the image quality of stylegan.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
[7] Frühstück, Anna, et al. “Insetgan for full-body image generation.” CVPR, 2022.
Structural Reparameterization
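The papers below [1]-[3] share one idea: train with several parallel branches (e.g., 3×3, 1×1, identity, or asymmetric kernels) and merge them algebraically into a single convolution at inference time. A minimal sketch of the merging step, assuming a 3×3 branch plus a 1×1 branch and ignoring BatchNorm for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 8
conv3 = nn.Conv2d(c, c, 3, padding=1, bias=True)
conv1 = nn.Conv2d(c, c, 1, bias=True)

# Fold: pad the 1x1 kernel to 3x3 (value at the center tap), then add the
# kernels and biases of the two branches into a single 3x3 convolution.
fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
fused.weight.data = conv3.weight.data + F.pad(conv1.weight.data, [1, 1, 1, 1])
fused.bias.data = conv3.bias.data + conv1.bias.data

x = torch.randn(1, c, 16, 16)
ref = conv3(x) + conv1(x)
print(torch.allclose(fused(x), ref, atol=1e-5))  # True: same function, one conv
```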
References
[1] Zagoruyko, Sergey, and Nikos Komodakis. “Diracnets: Training very deep neural networks without skip-connections.” arXiv preprint arXiv:1706.00388 (2017).
[2] Ding, Xiaohan, et al. “Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks.” Proceedings of the IEEE International Conference on Computer Vision. 2019.
[3] Ding, Xiaohan, et al. “RepVGG: Making VGG-style ConvNets Great Again.” CVPR, 2021.
Statistics in Deep Learning
Reference
[1] Gao, Hang, et al. “Low-shot learning via covariance-preserving adversarial augmentation networks.” NIPS, 2018.
[2] Sun, Baochen, and Kate Saenko. “Deep coral: Correlation alignment for deep domain adaptation.” ECCV, 2016.
[3] Kumar, Abhishek, Prasanna Sattigeri, and Avinash Balakrishnan. “Variational inference of disentangled latent concepts from unlabeled observations.” ICLR, 2018.
Spatial Transformation
parametric transform (affine transformation, thin-plate spline, etc.): STN [2], hierarchical STN [5], deformable style transfer [10]; see the sketch after this list
learn convolution offsets: Deformable CNN v1 [3], v2 [4], deformable kernels [9]
optical flow: [8]
swap disentangled geometry-relevant features
move keypoints: TransGaGa [11]
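A minimal sketch of the parametric-transform branch (STN-style) in PyTorch: a small localization network predicts an affine matrix, and the feature map is warped with a differentiable sampling grid. The localization network here is an assumed toy, not the architecture from [2]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 16, 32, 32)

# Localization network: predict the 2x3 affine parameters from the features,
# initialized to the identity transform.
loc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 6))
loc[1].weight.data.zero_()
loc[1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

theta = loc(feat).view(-1, 2, 3)
grid = F.affine_grid(theta, feat.size(), align_corners=False)
warped = F.grid_sample(feat, grid, align_corners=False)  # differentiable warp
print(warped.shape)  # torch.Size([1, 16, 32, 32])
```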
Reference
[1] Recasens, Adria, et al. “Learning to zoom: a saliency-based sampling layer for neural networks.” ECCV, 2018.
[2] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” NIPS, 2015.
[3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei. “Deformable Convolutional Networks.” ICCV, 2017.
[4] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai. “Deformable ConvNets v2: More Deformable, Better Results.” arXiv preprint arXiv:1811.11168 (2018).
[5] Shu, Chang, et al. “Hierarchical Spatial Transformer Network.” arXiv preprint arXiv:1801.09467 (2018).
[6] Zheng, Heliang, et al. “Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition.” CVPR, 2019.
[7] Marin, Dmitrii, et al. “Efficient segmentation: Learning downsampling near semantic boundaries.” ICCV, 2019.
[8] Ren, Yurui, et al. “Deep Image Spatial Transformation for Person Image Generation.”, CVPR, 2020.
[9] Gao, Hang, et al. “Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation.” arXiv preprint arXiv:1910.02940 (2019).
[10] Kim, Sunnie SY, et al. “Deformable Style Transfer.” arXiv preprint arXiv:2003.11038 (2020).
[11] Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019