Multi-modality Fusion
- Concatenation/summation, or weighted (attention mechanism) concatenation/summation.
- Product of experts: P(y|x1,x2) ∝ P(y|x1)P(y|x2) (equal only after normalization), with a Gaussian assumption on each factor [1].
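The two fusion styles above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from [1]: the function names `attention_weighted_sum` and `gaussian_product_of_experts` are hypothetical, the attention weights are a plain softmax over given scores, and each expert is assumed to be a diagonal Gaussian, in which case the product is again Gaussian with summed precisions.

```python
import numpy as np

def attention_weighted_sum(features, scores):
    """Fuse M modality features of shape (M, D) with softmax weights from M scores."""
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ features                # (D,)

def gaussian_product_of_experts(means, variances):
    """Product of diagonal Gaussians N(mean_i, var_i), renormalized to a Gaussian."""
    means = np.asarray(means, dtype=float)          # (M, D)
    variances = np.asarray(variances, dtype=float)  # (M, D)
    precisions = 1.0 / variances
    fused_var = 1.0 / precisions.sum(axis=0)        # precisions add up
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Toy inputs (made up for illustration): two modalities, two dimensions.
fused_feat = attention_weighted_sum([[0.0, 0.0], [2.0, 4.0]], scores=[0.0, 0.0])
mu, var = gaussian_product_of_experts(
    means=[[0.0, 2.0], [2.0, 2.0]],
    variances=[[1.0, 1.0], [1.0, 0.5]],
)
```

In the product, the more confident expert (smaller variance) pulls the fused mean toward itself, and the fused variance is never larger than that of any single expert.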
Reference
[1] Huang, Xun, et al. “Multimodal Conditional Image Synthesis with Product-of-Experts GANs.” arXiv preprint arXiv:2112.05130 (2021).
Multi-modal Problem
A multi-modal problem is one in which a given input admits multiple plausible outputs rather than a single deterministic output. The key difficulty is mode collapse, where the model produces nearly the same output regardless of the random input. Common remedies:
Assume the ground-truth output is one of K generated possibilities [1]. K is fixed beforehand.
Ensure a bijection between random vector and output: associate the random factor (e.g., a random vector z) with specific information [2]. Either the random factor is conditioned on specific information, or the random factor can be recovered from the generated output. If the mapping from random vector to output is invertible (e.g., Glow), there is a natural bijection between random vector and output [6].

Force different random vectors to produce different outputs: push apart the outputs generated from different random vectors z with a diversity loss or a mode-seeking loss [3] [4] [5].
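The last idea, pushing apart outputs generated from different random vectors, can be sketched as a ratio-based regularizer in the spirit of the mode-seeking loss in [3]. This is a hedged sketch, not the reference code: the function name `mode_seeking_loss`, the use of mean absolute distance for both outputs and latents, and the value of `eps` are assumptions made for illustration.

```python
import numpy as np

def mode_seeking_loss(out1, out2, z1, z2, eps=1e-5):
    """Small when outputs differ a lot relative to how much the latents differ."""
    d_out = np.mean(np.abs(out1 - out2))  # distance between generated outputs
    d_z = np.mean(np.abs(z1 - z2))        # distance between random vectors
    ratio = d_out / (d_z + eps)           # quantity to be maximized
    return 1.0 / (ratio + eps)            # minimized as an extra generator loss term

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)

# Hypothetical "generators": one ignores z (mode collapse), one depends on z.
collapsed_loss = mode_seeking_loss(np.ones(8), np.ones(8), z1, z2)
diverse_loss = mode_seeking_loss(2.0 * z1, 2.0 * z2, z1, z2)
```

A collapsed generator gets a very large penalty because its output distance is zero, while a generator whose outputs track z gets a small one.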
Reference
[1] Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Anticipating visual representations from unlabeled video.” CVPR, 2016.
[2] Zhu, Jun-Yan, et al. “Toward multimodal image-to-image translation.” Advances in Neural Information Processing Systems. 2017.
[3] Mao, Qi, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. “Mode seeking generative adversarial networks for diverse image synthesis.” CVPR, 2019.
[4] Yang, Dingdong, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. “Diversity-sensitive conditional generative adversarial networks.” arXiv preprint arXiv:1901.09024 (2019).
[5] Liu, Shaohui, Xiao Zhang, Jianqiao Wangni, and Jianbo Shi. “Normalized diversification.” CVPR, 2019.
[6] Lugmayr, Andreas, et al. “SRFlow: Learning the Super-Resolution Space with Normalizing Flow.” arXiv preprint arXiv:2006.14200 (2020).
MLP
Reference
[1] Tolstikhin, Ilya O., et al. “Mlp-mixer: An all-mlp architecture for vision.” Advances in Neural Information Processing Systems 34 (2021).
[2] Melas-Kyriazi, Luke. “Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet.” arXiv preprint arXiv:2105.02723 (2021).
[3] Lian, Dongze, et al. “As-mlp: An axial shifted mlp architecture for vision.” arXiv preprint arXiv:2107.08391 (2021).
Memory Network
First memory network paper: four components, I (input feature map), G (generalization), O (output feature map), and R (response). The variables in I, G, O, and R are optimized with the following objective function.

End-to-end memory network: memory addressing is soft (differentiable), so back-propagation through the whole model is easy.
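To make the "easy back-propagation" point concrete, here is a minimal sketch of a single soft memory read, the kind of operation an end-to-end memory network relies on. It is an illustration under assumptions: the names `memory_read`, `input_memory`, and `output_memory` are hypothetical, the embeddings are taken as given rather than learned, and only one hop is shown.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def memory_read(query, input_memory, output_memory):
    """Soft attention over N memory slots; every step is differentiable."""
    attention = softmax(input_memory @ query)  # (N,) match between query and each slot
    response = attention @ output_memory       # (D,) weighted sum of output embeddings
    return response, attention

# Toy memory with three slots (values made up for illustration).
input_memory = 5.0 * np.eye(3)
output_memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.0, 1.0, 0.0])
response, attention = memory_read(query, input_memory, output_memory)
```

Because the slot selection is a softmax rather than a hard argmax, gradients flow to every memory slot, which is what removes the need for supervision of the intermediate memory lookups.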

Mask-aided Object Detection
Help generate proposal:
Combine semantic mask with feature map (e.g., concatenation, summation) to help predict bounding boxes: [1] [2] [3]
Generate proposals from semantic mask: [4]
Help select proposal:
Assign weights to proposals based on semantic mask: [5]
Use the semantic mask surrounding each proposal as an auxiliary feature: [6]
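As a rough illustration of weighting proposals by a semantic mask, the sketch below scores each box by the mean mask value it covers. This is a simplified stand-in and not the method of [5]; the function name `proposal_mask_scores` and the box convention (integer `x1, y1, x2, y2` with exclusive `x2, y2`) are assumptions.

```python
import numpy as np

def proposal_mask_scores(mask, boxes):
    """Score each proposal by the average mask value inside the box.

    mask:  (H, W) array with values in [0, 1]
    boxes: iterable of (x1, y1, x2, y2) integer pixel coordinates, x2/y2 exclusive
    """
    scores = []
    for x1, y1, x2, y2 in boxes:
        region = mask[y1:y2, x1:x2]
        scores.append(region.mean() if region.size else 0.0)  # empty box scores 0
    return np.array(scores)

# Toy 8x8 mask with a 4x4 foreground square (made up for illustration).
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
boxes = [(2, 2, 6, 6), (0, 0, 4, 4), (0, 0, 2, 2)]
scores = proposal_mask_scores(mask, boxes)
```

The box that tightly covers the foreground gets the highest weight, a partially overlapping box gets a lower one, and a background-only box gets zero.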
Reference
[1] Yan Liu, Zhijie Zhang, Li Niu, Junjie Chen, Liqing Zhang, “Mixed Supervised Object Detection by Transferring Mask Prior and Semantic Similarity”, NeurIPS, 2021.
[2] Zitian Chen, Zhiqiang Shen, Jiahui Yu, Erik Learned-Miller: “Cross-Supervised Object Detection.” arXiv preprint arXiv:2006.15056 (2020)
[3] Zhao, Xiangyun, Shuang Liang, and Yichen Wei. “Pseudo mask augmented object detection.” CVPR, 2018.
[4] Diba, Ali, et al. “Weakly supervised cascaded convolutional networks.” CVPR, 2017.
[5] Li, Xiaoyan, et al. “Weakly supervised object detection with segmentation collaboration.” ICCV, 2019.
[6] Wei, Yunchao, et al. “Ts2c: Tight box mining with surrounding segmentation context for weakly supervised object detection.” ECCV, 2018.
LoRA Finetuning
Merging multiple LoRAs: orthogonal weights, masked dimensions
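A minimal sketch of merging several LoRAs into one base weight, assuming each adapter is a low-rank pair (B, A) whose update is B @ A. The function name `merge_loras`, the scales, and the toy shapes are hypothetical. The example reads "masked dimensions" as restricting each adapter to a disjoint set of output rows, which is one possible interpretation of the note; under that reading the two updates are orthogonal and do not interfere.

```python
import numpy as np

def merge_loras(base_weight, loras, scales):
    """Return base_weight + sum_i scale_i * (B_i @ A_i)."""
    merged = base_weight.copy()
    for (B, A), scale in zip(loras, scales):
        merged += scale * (B @ A)
    return merged

rng = np.random.default_rng(0)
base = np.zeros((4, 3))  # toy base weight

# Adapter 1 only writes to output rows 0-1, adapter 2 only to rows 2-3.
B1 = np.zeros((4, 2)); B1[:2] = rng.normal(size=(2, 2))
B2 = np.zeros((4, 2)); B2[2:] = rng.normal(size=(2, 2))
A1, A2 = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))

delta1, delta2 = B1 @ A1, B2 @ A2
interference = np.sum(delta1 * delta2)  # Frobenius inner product of the two updates
merged = merge_loras(base, [(B1, A1), (B2, A2)], scales=[1.0, 1.0])
```

With disjoint row masks, each block of the merged weight is exactly the update of a single adapter, so merging cannot overwrite what either adapter learned.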
Decomposition
Layout Generation
Reference
Zheng, Xinru, et al. “Content-aware generative modeling of graphic design layouts.” ACM Transactions on Graphics (TOG) 38.4 (2019): 1-15.
Lee, Hsin-Ying, et al. “Neural design network: Graphic layout generation with constraints.” ECCV, 2020.
Gupta, Kamal, et al. “Layout Generation and Completion with Self-attention.” arXiv preprint arXiv:2006.14615 (2020).
Jyothi, Akash Abdu, et al. “Layoutvae: Stochastic scene layout generation from a label set.” ICCV, 2019.
Li, Jianan, et al. “Layoutgan: Generating graphic layouts with wireframe discriminators.” ICLR, 2019.
Arroyo, Diego Martin, Janis Postels, and Federico Tombari. “Variational Transformer Networks for Layout Generation.” CVPR, 2021.
Patil, Akshay Gadi, et al. “Read: Recursive autoencoders for document layout generation.” CVPR Workshops. 2020.
Jigsaw Puzzle
Reference
[1] Noroozi, Mehdi, and Paolo Favaro. “Unsupervised learning of visual representations by solving jigsaw puzzles.” ECCV, 2016.
[2] Shen, Wan Xiang, et al. “AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks.” Nucleic Acids Research (2022).