1. Distinguish generated fake images from real images in the frequency domain. [2]

  2. Use frequency map as network input or output [1] [5] [6]

  3. Use intermediate frequency features [7] [9]

  4. An image can be composed of, or decomposed into, a low-frequency part and a high-frequency part [3] [8] [4] [10]
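Point 4 can be sketched with a simple FFT-based decomposition; the circular low-pass mask and the `radius` cutoff are illustrative choices of ours, not taken from any of the cited papers:

```python
import numpy as np

def split_frequency(img, radius=8):
    """Split a grayscale image into low- and high-frequency parts
    by masking the centered 2-D Fourier spectrum."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(F * ~low_mask)).real
    return low, high  # low + high reconstructs img exactly
```

Because the two masks partition the spectrum, the decomposition is lossless: adding the two parts back recovers the original image.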

Reference

  1. Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, Fengbo Ren, “Learning in the Frequency Domain”, CVPR, 2020.

  2. Wang, Sheng-Yu, et al. “CNN-generated images are surprisingly easy to spot… for now.” arXiv preprint arXiv:1912.11035 (2019).

  3. Aayush Bansal, Yaser Sheikh, Deva Ramanan, “PixelNN: Example-based Image Synthesis”, ICLR, 2018.

  4. Yanchao Yang, Stefano Soatto, “FDA: Fourier Domain Adaptation for Semantic Segmentation”, CVPR 2020.

  5. Roy, Hiya, et al. “Image inpainting using frequency domain priors.” arXiv preprint arXiv:2012.01832 (2020).

  6. Shen, Xing, et al. “DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation.” arXiv preprint arXiv:2011.09876 (2020).

  7. Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” WACV (2021).

  8. Yu, Yingchen, et al. “WaveFill: A Wavelet-based Generation Network for Image Inpainting.” ICCV, 2021.

  9. Mardani, Morteza, et al. “Neural FFTs for universal texture image synthesis.” NeurIPS (2020).

  10. Cai, Mu, et al. “Frequency domain image translation: More photo-realistic, better identity-preserving.” ICCV, 2021.

  1. Predict visual feature of one future frame [1]

  2. Predict optical flow of one future frame [2]

  3. Predict one future frame [4] (a special case of video prediction)

  4. Predict future trajectories [5]

  5. Predict optical flows of future frames, and then obtain future frames [3]

Reference

  1. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Anticipating visual representations from unlabeled video.” CVPR, 2016.

  2. Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

  3. Li, Yijun, et al. “Flow-grounded spatial-temporal video prediction from still images.” ECCV, 2018.

  4. Xue, Tianfan, et al. “Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks.” NIPS, 2016.

  5. Walker, Jacob, et al. “An uncertain future: Forecasting from static images using variational autoencoders.” ECCV, 2016.

Few-shot Feature Generation

  1. Meta-learning method: [1]

  2. Delta-based: delta between each pair of samples [2]; delta between each sample and class center [3] [4]
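The delta-based idea in point 2 can be sketched as follows. This is a hypothetical simplification of the delta-encoder [2]: the real method learns deltas with an encoder-decoder, whereas here raw feature differences from a data-rich base class are transplanted onto a single novel-class feature:

```python
import numpy as np

def generate_by_delta(base_feats, novel_feat, n=5, seed=0):
    """Transplant intra-class variation (deltas between random sample pairs
    of a base class) onto one feature of a data-poor novel class."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(base_feats), size=(n, 2))
    deltas = base_feats[idx[:, 0]] - base_feats[idx[:, 1]]
    return novel_feat + deltas  # n synthetic novel-class features
```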

Reference

[1] Zhang, Ruixiang, et al. “Metagan: An adversarial approach to few-shot learning.” NIPS, 2018.

[2] Schwartz, Eli, et al. “Delta-encoder: an effective sample synthesis method for few-shot object recognition.” Advances in Neural Information Processing Systems. 2018.

[3] Liu, Jialun, et al. “Deep Representation Learning on Long-tailed Data: A Learnable Embedding Augmentation Perspective.” arXiv preprint arXiv:2002.10826 (2020).

[4] Yin, Xi, et al. “Feature transfer learning for face recognition with under-represented data.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

  1. Light-weighted network structure

    SqueezeNet, MobileNet, and ShuffleNet share the same idea: decouple the spatial convolution from the channel-wise (pointwise) convolution to reduce the number of parameters, a spirit similar to how Pseudo-3D Residual Networks decouple temporal and spatial convolutions. SqueezeNet is serial while MobileNet and ShuffleNet are parallel. MobileNet is a special case of ShuffleNet when using only one group.

    Low-rank approximation (replace a $k\times k \times c\times d$ convolution with $k\times k\times c\times d'$ followed by $1\times 1\times d'\times d$) also falls into the above scope. The difference between MobileNet and low-rank approximation is whether the spatial convolution is applied depthwise (per channel) or across all channels.

  2. Tweak network structure

    • prune nodes based on certain criteria (e.g., response value, Fisher information): this requires special implementation and can take up more space than expected due to the irregular network structure.
  3. Compress weights

    • Quantization (fixed bit width): learn a codebook and encode the weights with it. Fine-tune the codebook after quantizing the weights by averaging the gradients of weights belonging to the same cluster. Extreme cases are binary and ternary nets: binary (resp. ternary) nets are quantized to {-1, 1} (resp. {-1, 0, 1}), with a different scaling factor $\alpha$ per layer.
    • Huffman Coding (flexible bit number): applied after quantization for further compression.
  4. Computation

    • spatial domain to frequency domain: convert convolution to pointwise multiplication by using FFT
  5. Sparsity regularization

  6. Efficient Inference

    • cascade of networks, early exit network (predict whether to exit or not after each layer) [1] [2]
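The parameter savings from the factorizations in item 1 can be checked with a quick count (bias terms omitted; the 3×3, 256-channel setting is an arbitrary illustrative choice):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """MobileNet-style: depthwise k x k conv (one spatial filter per input
    channel) followed by a 1 x 1 pointwise conv across channels."""
    return k * k * c_in + c_in * c_out

def low_rank_params(k, c_in, c_out, d_mid):
    """Low-rank: k x k conv into d_mid channels, then 1 x 1 back up to c_out."""
    return k * k * c_in * d_mid + d_mid * c_out

print(conv_params(3, 256, 256))                 # 589824
print(depthwise_separable_params(3, 256, 256))  # 67840 (~8.7x fewer)
print(low_rank_params(3, 256, 256, 32))         # 81920 (~7.2x fewer)
```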
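Item 3's codebook quantization can be sketched as a toy 1-D k-means (Lloyd's algorithm). Real pipelines would also fine-tune the codebook by averaging gradients per cluster, which is omitted here:

```python
import numpy as np

def quantize_weights(W, n_clusters=4, iters=20, seed=0):
    """Cluster the weights with 1-D k-means and replace each weight by its
    cluster centroid; only n_clusters distinct values remain."""
    w = W.ravel()
    rng = np.random.default_rng(seed)
    codebook = rng.choice(w, n_clusters, replace=False)
    assign = np.zeros(len(w), dtype=int)
    for _ in range(iters):
        # Assign each weight to its nearest codeword, then update centroids.
        assign = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                codebook[c] = w[assign == c].mean()
    return codebook[assign].reshape(W.shape), codebook
```

Because the quantized tensor holds at most `n_clusters` distinct values, weights can then be stored as small cluster indices plus the codebook (and further packed with Huffman coding, as in item 3).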
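Item 4's spatial-to-frequency trick rests on the convolution theorem; a 1-D circular-convolution sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(64)
k = rng.random(64)

# Circular convolution computed directly in the spatial domain.
direct = np.array([sum(x[j] * k[(i - j) % 64] for j in range(64))
                   for i in range(64)])

# The same result via pointwise multiplication in the frequency domain.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, via_fft))  # True
```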

Good introduction slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture15.pdf

Multi-scale fusion: HED [1], RCF [2]

Reference

  1. Xie, Saining, and Zhuowen Tu. “Holistically-nested edge detection.” ICCV, 2015.

  2. Liu, Yun, et al. “Richer convolutional features for edge detection.” CVPR, 2017.

Dynamic kernels: [1] [2]

Survey: [Dynamic neural networks: A survey]
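The idea in [1] — a filter generated from the input rather than learned as a fixed weight — can be sketched in 1-D; the `kernel_generator` callable is a stand-in for a learned filter-generating network:

```python
import numpy as np

def dynamic_conv1d(x, kernel_generator):
    """Apply a kernel predicted from the input sample itself, instead of a
    fixed learned weight (the dynamic filter network idea)."""
    k = kernel_generator(x)  # sample-conditioned kernel
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(k)] @ k for i in range(len(x))])

# With a generator that always emits an identity kernel, the output
# reproduces the input.
out = dynamic_conv1d(np.arange(5.0), lambda x: np.array([0.0, 1.0, 0.0]))
```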

References

  1. Jia, Xu, et al. “Dynamic filter networks.” Advances in neural information processing systems 29 (2016).

  2. Tian, Zhi, Chunhua Shen, and Hao Chen. “Conditional convolutions for instance segmentation.” European conference on computer vision. Springer, Cham, 2020.

a) When the domain labels are known:

  • reduce the distance between different domains: MMD [1][2], mutual information

  • domain-invariant and domain-specific components: [1][2]

b) When the domain labels are unknown:

  • first discover multiple latent domains: clustering [1][2], max-margin separation [1]

Methods

  1. learn projection matrices: minimize $F(PX_s, QX_t)$, a discrepancy between the projected source features $PX_s$ and target features $QX_t$

  2. sample selection: learn sample weights

  3. domain-invariant and domain-specific components

  4. low-rank reconstruction

  5. pixel-level image to image translation

    • paired input: conditional GAN [pdf]
    • unpaired input: CycleGAN [pdf], GAN with content-similarity loss [pdf], UNIT [pdf]
    • combine with feature-based method: GraspGAN [pdf]
    • A unified framework [pdf]
  6. adversarial network [1]: classification plus domain confusion. Domain separation and confusion form a min-max problem, which can be solved like a GAN or with the gradient reversal (RevGrad) algorithm.

  7. meta-learning

    • gradients on two domains should be consistent [pdf]
  8. domain alignment layer (batch normalization): [1] [2]

  9. guided learning: a tutor guides students and gets feedback from them. ACM-MM18 paper

  10. ensemble transfer learning: aggregate multiple transfer learning approaches [1]
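The RevGrad trick in item 6 can be sketched as a layer that is the identity in the forward pass and flips (and scales) gradients in the backward pass, so the feature extractor is trained to *confuse* the domain classifier. A framework-free sketch, with `lam` as the usual trade-off hyperparameter:

```python
class GradReverse:
    """Identity forward; multiplies incoming gradients by -lam on the way back."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        # The domain classifier's gradient is reversed before reaching the
        # feature extractor, turning its minimization into maximization.
        return -self.lam * grad_output

layer = GradReverse(lam=0.5)
```

In an autodiff framework this is typically registered as a custom op so the min-max objective trains with plain gradient descent, without the alternating updates a GAN needs.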

Settings

  1. open-set domain adaptation or partial transfer learning: [1][2][3]

  2. distant domain adaptation (two domains are too distant, so the transfer between them relies on transition domains): Transitive transfer learning, distant domain transfer learning

  3. open compound domain adaptation [1]

Domain adaptation for diverse applications

  1. pose estimation [1]

  2. person re-identification [1]

  3. object detection [1]

  4. segmentation [1]

  5. VQA [1]

Domain difference metric: To measure data distribution mismatch, the most commonly used metrics are MMD and its extensions, such as fast MMD, conditional MMD [1][2], and joint MMD. Other metrics include KL divergence, the HSIC criterion, Bregman divergence, the manifold criterion, and second-order statistics.
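As a concrete instance of the paragraph above, the (biased, V-statistic) estimate of squared MMD under an RBF kernel; the bandwidth `gamma` is an illustrative choice, not from the cited papers:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between sample sets X and Y
    with kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

`mmd_rbf(X, X)` is exactly zero, and the value grows as the two sample sets' distributions drift apart, which is what makes it usable as a training loss for aligning domains.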

Theories: A summary of related theories

Survey:

  1. An old survey of transfer learning [pdf]
  2. Recent advance on domain adaptation [pdf]
  3. My survey of old deep learning domain adaptation methods [pdf]
  4. A Chinese version of transfer learning tutorial [pdf]
  5. Datasets and code: [1]
  6. A Comprehensive Survey on Transfer Learning [pdf]

Methods:

The goal of disentangled representation learning [4] is to extract the explanatory factors of the data in the input distribution and generate a more meaningful representation. The literature variously speaks of disentangling codes, encodings, representations, latent factors, or latent variables, and an attribute may be encoded in a single dimension or across multiple dimensions.

A math definition of disentangled representation [11]

A survey on disentangled representation learning [19]

  • Unsupervised disentanglement

    Recently, InfoGAN [5] utilizes the GAN framework and maximizes the mutual information between a subset of the latent variables and the generated output to learn disentangled representations in an unsupervised manner. Different latent variables are enforced to be independent based on the independence assumption [6].

  • Supervised disentanglement

    Swap attribute representations under the supervision of attribute annotations, e.g., Dual Swap GAN [7] (semi-supervised) and DNA-GAN [8].

  • Disentangled representation for domain adaptation: disentangle the representation into class/domain-invariant and class/domain-specific components [9][10][12][13]

  • Instance-level disentanglement [14][15], FUNIT [16], COCO-FUNIT [17]

  • closed-form disentanglement [18]: after the model is trained, perform an eigendecomposition to obtain orthogonal latent directions.

Disentanglement metric:

  • disentanglement metric score [1]
  • perceptual path length, linear separability [2]

Reference

[1] Higgins, Irina, et al. “beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.” ICLR, 2017.

[2] Karras, Tero, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks.” CVPR, 2019.

[4] Representation learning: A review and new perspectives

[5] Infogan: Interpretable representation learning by information maximizing generative adversarial nets

[6] Learning Independent Features with adversarial Nets for Non-linear ICA

[7] Dual Swap Disentangling

[8] DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images

[9] Image-to-image translation for cross-domain disentanglement

[10] Diverse Image-to-Image Translation via Disentangled Representations

[11] Higgins, Irina, et al. “Towards a definition of disentangled representations.” arXiv preprint arXiv:1812.02230 (2018).

[12] Gabbay, Aviv, and Yedid Hoshen. “Demystifying Inter-Class Disentanglement.” arXiv preprint arXiv:1906.11796 (2019).

[13] Hadad, Naama, Lior Wolf, and Moni Shahar. “A two-step disentanglement method.” CVPR, 2018.

[14] Shen, Zhiqiang, et al. “Towards instance-level image-to-image translation.” CVPR, 2019.

[15] Mo, Sangwoo, Minsu Cho, and Jinwoo Shin. “InstaGAN: Instance-aware Image-to-Image Translation.” ICLR, 2019.

[16] Liu, Ming-Yu, et al. “Few-shot unsupervised image-to-image translation.” ICCV, 2019.

[17] Saito, Kuniaki, Kate Saenko, and Ming-Yu Liu. “COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder.” arXiv preprint arXiv:2007.07431 (2020).

[18] Shen, Yujun, and Bolei Zhou. “Closed-Form Factorization of Latent Semantics in GANs.” arXiv preprint arXiv:2007.06600 (2020).

[19] Xin Wang, Hong Chen, Siao Tang, Zihao Wu, and Wenwu Zhu. “Disentangled Representation Learning.”

  • CelebA [14] (dataset for human faces): [12, 2, 11, 17, 13, 8, 18]
  • MNIST [10], MNIST-M [4] (digits): [16, 15, 12, 5, 2, 11, 9, 17, 6, 8, 13, 3]
  • Yosemite [19] (summer and winter scenes): [11]
  • Artworks [19] (Monet and Van Gogh): [11]
  • 2D Sprites (game characters): [15, 9, 6, 8, 3]
  • LineMod [7] (3D object): [9]
  • 11k Hands [1] (hand gestures): [17]

Reference

[1] M. Afifi. Gender recognition and biometric identification using a large dataset of hand images. arXiv preprint arXiv:1711.04322, 2017.

[2] E. Dupont. Learning disentangled joint continuous and discrete representations. In S. Bengio, H.Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 708–718. Curran Associates, Inc., 2018.

[3] Z. Feng, X. Wang, C. Ke, A.-X. Zeng, D. Tao, and M. Song. Dual swap disentangling. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5898–5908. Curran Associates, Inc., 2018.

[4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[5] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio. Image-to-image translation for cross-domain disentanglement. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1294–1305. Curran Associates, Inc., 2018.

[6] N. Hadad, L. Wolf, and M. Shahar. A two-step disentanglement method. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[7] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pages 548–562. Springer, 2012.

[8] Q. Hu, A. Szabó, T. Portenier, P. Favaro, and M. Zwicker. Disentangling factors of variation by mixing them. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[9] A. H. Jha, S. Anand, M. Singh, and V. Veeravasarapu. Disentangling factors of variation with cycle-consistent variational autoencoders. In The European Conference on Computer Vision (ECCV), September 2018.

[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[11] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In The European Conference on Computer Vision (ECCV), September 2018.

[12] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In S. Bengio, H.Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2595–2604. Curran Associates, Inc., 2018.

[13] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[14] Z. Liu, P. Luo, X.Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

[15] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 5040–5048. Curran Associates, Inc., 2016.

[16] S. Narayanaswamy, T. B. Paige, J.-W. van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr. Learning disentangled representations with semi-supervised deep generative models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5925–5935. Curran Associates, Inc., 2017.

[17] Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In The European Conference on Computer Vision (ECCV), September 2018.

[18] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
