  1. Concatenation/summation, or weighted (attention-based) concatenation/summation.
  2. Product of experts: P(y|x1,x2) ∝ P(y|x1)P(y|x2), with a Gaussian distribution assumption so that the product has a closed form [1]
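
As a rough illustration of the second option, the sketch below fuses one-dimensional Gaussian experts in closed form. The function name `product_of_gaussians` and the toy numbers are assumptions made for this note, not code from [1]; it only shows the standard identity that a product of Gaussian densities is proportional to a Gaussian whose precision is the sum of the experts' precisions.

```python
def product_of_gaussians(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) into a single Gaussian.

    The product of Gaussian densities is proportional to a Gaussian whose
    precision (1 / variance) is the sum of the experts' precisions and whose
    mean is the precision-weighted average of the experts' means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_var = 1.0 / total_precision
    fused_mean = fused_var * sum(p * m for p, m in zip(precisions, means))
    return fused_mean, fused_var


# Hypothetical example: two modalities predict the same quantity
# with equal confidence but different means.
fused_mean, fused_var = product_of_gaussians(means=[0.0, 2.0], variances=[1.0, 1.0])
print(fused_mean, fused_var)
```

With equal variances the fused mean lands halfway between the experts and the fused variance is halved; a more confident expert (smaller variance) would pull the mean toward itself.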

Reference

  1. Huang, Xun, et al. “Multimodal Conditional Image Synthesis with Product-of-Experts GANs.” arXiv preprint arXiv:2112.05130 (2021).

A multi-modal problem is one in which a given input admits multiple plausible outputs rather than a single deterministic output. The main difficulty is mode collapse.

  1. Generate K candidate outputs and assume the ground-truth output matches one of them [1]. K is fixed beforehand.

  2. Ensure a bijection between the random vector and the output: associate the random factor (e.g., a random vector z) with specific information [2]. Either the random factor is conditioned on that information, or the random factor can be recovered from the generated output. If the mapping from random vector to output is invertible (e.g., Glow), the bijection between random vector and output holds by construction [6].

  3. Force different random vectors to produce different outputs: push apart the outputs generated from different random vectors z with a diversity loss or mode-seeking loss [3] [4] [5].
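
The third strategy can be sketched as a regularizer that grows when two different random vectors produce nearly identical outputs. The code below is a minimal sketch in the spirit of the mode-seeking term in [3], assuming a mean L1 distance for both outputs and latent codes; the helper names and the flat-list representation of outputs are illustrative choices, not the papers' implementations.

```python
def mean_l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_loss(out1, out2, z1, z2, eps=1e-5):
    """Small when outputs differ as much as their latent codes do.

    ratio = d(out1, out2) / d(z1, z2); the loss is the inverse of the ratio,
    so minimizing it pushes outputs of different z apart.
    """
    ratio = mean_l1(out1, out2) / mean_l1(z1, z2)
    return 1.0 / (ratio + eps)


# Hypothetical toy values: two latent codes and the outputs they produce.
z1, z2 = [0.0, 0.0], [1.0, 1.0]
collapsed = mode_seeking_loss([0.5, 0.5], [0.5, 0.5], z1, z2)  # identical outputs
diverse = mode_seeking_loss([0.5, 0.5], [1.5, 1.5], z1, z2)    # distinct outputs
print(collapsed > diverse)
```

In training, a term like this would be added to the generator loss with a weighting coefficient; the weight and the distance metric are design choices that vary across [3], [4], and [5].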

Reference

[1] Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Anticipating visual representations from unlabeled video.” CVPR, 2016.

[2] Zhu, Jun-Yan, et al. “Toward multimodal image-to-image translation.” Advances in Neural Information Processing Systems. 2017.

[3] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, 2019.

[4] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024, 2019.

[5] Shaohui Liu, Xiao Zhang, Jianqiao Wangni, Jianbo Shi: Normalized Diversification. CVPR 2019: 10306-10315

[6] Lugmayr, Andreas, et al. “SRFlow: Learning the Super-Resolution Space with Normalizing Flow.” arXiv preprint arXiv:2006.14200 (2020).

  • classification [1] [2]
  • detection, segmentation [3]

Reference

[1] Tolstikhin, Ilya O., et al. “MLP-Mixer: An all-MLP architecture for vision.” Advances in Neural Information Processing Systems 34 (2021).

[2] Melas-Kyriazi, Luke. “Do you even need attention? A stack of feed-forward layers does surprisingly well on ImageNet.” arXiv preprint arXiv:2105.02723 (2021).

[3] Lian, Dongze, et al. “AS-MLP: An axial shifted MLP architecture for vision.” arXiv preprint arXiv:2107.08391 (2021).

  • First memory network paper: four components, I (input feature map), G (generalization), O (output feature map), and R (response); the parameters of I, G, O, R are optimized with an objective function.

  • end-to-end memory network: easy back-propagation

  • semi-supervised learning with memory module [1]

  • few-shot learning with memory module [1] [2] [3]

  • global memory [1]

  • short-term memory and long-term memory [1]
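
To make the memory read concrete, here is a minimal sketch of a single soft-attention read over memory slots, which is the kind of differentiable operation that makes end-to-end training straightforward. The names `memory_read`, `keys`, and `values` are assumptions for illustration and do not come from any specific paper cited here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_read(query, keys, values):
    """Attend over memory slots and return the weighted sum of their values.

    Each slot i has a key (used for matching against the query) and a value
    (returned content). Every step is differentiable, so gradients can flow
    back into the query, keys, and values.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights


# Hypothetical two-slot memory; the query matches the first key strongly.
read, weights = memory_read(
    query=[10.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(read, weights)
```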

Help generate proposal:

  1. Combine semantic mask with feature map (e.g., concatenation, summation) to help predict bounding boxes: [1] [2] [3]

  2. Generate proposals from semantic mask: [4]

Help select proposal:

  1. Assign weights to proposals based on semantic mask: [5]

  2. Use the semantic mask surrounding each proposal as an auxiliary feature: [6]
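
One simple way to picture mask-based proposal weighting is to score each box by the average foreground probability it covers and normalize the scores. This is a hypothetical sketch, not the scheme used in [5] or [6]; the half-open box convention `(x0, y0, x1, y1)` and the function names are assumptions.

```python
def mean_mask_in_box(mask, box):
    """Average mask value inside a box given as (x0, y0, x1, y1), half-open."""
    x0, y0, x1, y1 = box
    vals = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(vals) / len(vals)


def weight_proposals(mask, boxes):
    """Normalize per-box mask scores into weights that sum to 1."""
    scores = [mean_mask_in_box(mask, b) for b in boxes]
    total = sum(scores)
    if total == 0:
        # No foreground anywhere: fall back to uniform weights.
        return [1.0 / len(boxes)] * len(boxes)
    return [s / total for s in scores]


# Toy 4x4 foreground mask with an object in the top-left 2x2 corner.
mask = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
boxes = [(0, 0, 2, 2), (2, 2, 4, 4), (0, 0, 4, 4)]
proposal_weights = weight_proposals(mask, boxes)
print(proposal_weights)
```

The tight box around the object gets the largest weight, the empty box gets zero, and the loose full-image box falls in between.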

Reference

[1] Yan Liu, Zhijie Zhang, Li Niu, Junjie Chen, Liqing Zhang, “Mixed Supervised Object Detection by Transferring Mask Prior and Semantic Similarity”, NeurIPS, 2021.

[2] Zitian Chen, Zhiqiang Shen, Jiahui Yu, Erik Learned-Miller: “Cross-Supervised Object Detection.” arXiv preprint arXiv:2006.15056 (2020)

[3] Zhao, Xiangyun, Shuang Liang, and Yichen Wei. “Pseudo mask augmented object detection.” CVPR, 2018.

[4] Diba, Ali, et al. “Weakly supervised cascaded convolutional networks.” CVPR, 2017.

[5] Li, Xiaoyan, et al. “Weakly supervised object detection with segmentation collaboration.” ICCV, 2019.

[6] Wei, Yunchao, et al. “TS2C: Tight box mining with surrounding segmentation context for weakly supervised object detection.” ECCV, 2018.

  1. Merging multiple LoRAs: orthogonal weights, masked dimensions

  2. Decomposition
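
For reference, the simplest form of LoRA merging adds a scaled sum of low-rank updates to the base weight, W = W0 + Σ s_i · B_i A_i. The sketch below assumes this plain weighted-sum merge with hypothetical helper names; it does not implement the orthogonality or masking strategies mentioned above, though the toy example uses two updates that touch disjoint entries and therefore do not interfere.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_loras(base, loras, scales):
    """Return base + sum_i scales[i] * (B_i @ A_i) without modifying base.

    Each LoRA is a pair (B, A) with B of shape (out, r) and A of shape (r, in).
    """
    merged = [row[:] for row in base]
    for (b_mat, a_mat), scale in zip(loras, scales):
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


# Hypothetical 2x2 base weight and two rank-1 LoRAs.
base = [[0.0, 0.0], [0.0, 0.0]]
lora_a = ([[1.0], [0.0]], [[1.0, 0.0]])  # updates entry (0, 0) only
lora_b = ([[0.0], [1.0]], [[0.0, 1.0]])  # updates entry (1, 1) only
merged = merge_loras(base, [lora_a, lora_b], scales=[1.0, 0.5])
print(merged)
```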

  1. VAE/GAN: [1] [6] [7] (hierarchical encoder/decoder)
  2. GNN: [2] [5]
  3. autoregressive: [3] [4]

Reference

  1. Zheng, Xinru, et al. “Content-aware generative modeling of graphic design layouts.” ACM Transactions on Graphics (TOG) 38.4 (2019): 1-15.

  2. Lee, Hsin-Ying, et al. “Neural design network: Graphic layout generation with constraints.” ECCV, 2020.

  3. Gupta, Kamal, et al. “Layout Generation and Completion with Self-attention.” arXiv preprint arXiv:2006.14615 (2020).

  4. Jyothi, Akash Abdu, et al. “LayoutVAE: Stochastic scene layout generation from a label set.” ICCV, 2019.

  5. Li, Jianan, et al. “LayoutGAN: Generating graphic layouts with wireframe discriminators.” ICLR, 2019.

  6. Arroyo, Diego Martin, Janis Postels, and Federico Tombari. “Variational Transformer Networks for Layout Generation.” CVPR, 2021.

  7. Patil, Akshay Gadi, et al. “READ: Recursive autoencoders for document layout generation.” CVPR Workshops. 2020.

  1. Reorganize patches [1]

  2. Reorganize pixels [2]
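
Patch reorganization can be sketched as cutting an image into a grid, shuffling the tiles, and keeping the permutation as the quantity a network would have to recover. The code below is a toy version on nested lists with assumed helper names; it does not reproduce the fixed permutation set or tile gaps used in [1].

```python
import random


def split_into_patches(image, grid):
    """Cut a 2-D image (nested lists) into grid x grid tiles, row-major."""
    ph, pw = len(image) // grid, len(image[0]) // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def assemble(patches, grid):
    """Inverse of split_into_patches for a row-major list of tiles."""
    rows = []
    for gy in range(grid):
        row_patches = patches[gy * grid:(gy + 1) * grid]
        for r in range(len(row_patches[0])):
            rows.append([v for p in row_patches for v in p[r]])
    return rows


def make_jigsaw(image, grid, rng):
    """Shuffle tiles; position i of the puzzle holds original tile perm[i]."""
    patches = split_into_patches(image, grid)
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    return assemble([patches[i] for i in perm], grid), perm


# Toy 4x4 image whose pixel values are 0..15, split into a 2x2 grid.
image = [[4 * y + x for x in range(4)] for y in range(4)]
puzzle, perm = make_jigsaw(image, grid=2, rng=random.Random(0))
print(perm)
```

Undoing the shuffle only requires placing puzzle tile i back at position `perm[i]`, which is why the permutation can serve as a self-supervised label.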

Reference

[1] Noroozi, Mehdi, and Paolo Favaro. “Unsupervised learning of visual representations by solving jigsaw puzzles.” ECCV, 2016.

[2] Shen, Wan Xiang, et al. “AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks.” Nucleic Acids Research (2022).

  1. Manipulate each layer/neuron, and observe the change of network parameters/activations.

  2. Saliency map

  3. Adversarial attack

  4. Correlation

  5. Information gain/loss
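
The first approach can be illustrated by ablating one hidden unit at a time and measuring how much the output moves. The tiny two-layer network, its weights, and the function names below are hypothetical, and the sketch observes the change in the network output rather than in downstream activations.

```python
def forward(x, w1, w2, ablate=None):
    """Two-layer ReLU network with scalar output; optionally zero one hidden unit."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if ablate is not None:
        hidden[ablate] = 0.0
    return sum(w * h for w, h in zip(w2, hidden))


def ablation_importance(x, w1, w2):
    """Absolute output change caused by zeroing each hidden unit in turn."""
    baseline = forward(x, w1, w2)
    return [abs(baseline - forward(x, w1, w2, ablate=j)) for j in range(len(w1))]


# Hypothetical weights: three hidden units, the third is inactive for this input.
x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w2 = [1.0, 3.0, 5.0]
importance = ablation_importance(x, w1, w2)
print(importance)
```

A unit that is inactive for the given input scores zero regardless of its outgoing weight, which is one reason ablation results are usually averaged over many inputs.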
