Newly Blog

Network Architecture

Posted on 2026-03-17 Edited on 2022-07-15 In paper note

Transformer
Large kernel: [1] [2] [3]

Reference

[1] Liu, Zhuang, et al. “A convnet for the 2020s.” CVPR, 2022.

[2] Ding, Xiaohan, et al. “Scaling up your kernels to 31x31: Revisiting large kernel design in cnns.” CVPR, 2022.

[3] Liu, Shiwei, et al. “More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity.” arXiv preprint, 2022.

NeRF

Posted on 2026-03-17 Edited on 2022-08-22 In paper note

NeRF [1]
GIRAFFE [2]

Reference

[1] Mildenhall, Ben, et al. “Nerf: Representing scenes as neural radiance fields for view synthesis.” ECCV, 2020.

[2] Niemeyer, Michael, and Andreas Geiger. “Giraffe: Representing scenes as compositional generative neural feature fields.” CVPR, 2021.

Multi-modality Fusion

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Concatenation/summation, or weighted (attention mechanism) concatenation/summation.
Product of experts: P(y|x1,x2) ∝ P(y|x1)P(y|x2) (equal only up to normalization), with each factor assumed to be Gaussian [1]
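
As a minimal sketch of the product rule above, assume each modality predicts a one-dimensional Gaussian over y and that the prior over y is flat. The product of Gaussian densities is again Gaussian up to normalization: the precisions (inverse variances) add, and the fused mean is the precision-weighted average of the expert means. The function name and the example numbers are illustrative only and are not taken from [1].

```python
def product_of_gaussians(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) into a single Gaussian.

    Assumption: the prior over y is flat, so the fused density is simply
    proportional to the product of the expert densities.
    """
    precisions = [1.0 / v for v in variances]
    fused_var = 1.0 / sum(precisions)  # precisions add
    fused_mean = fused_var * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_var


# Two hypothetical modality-specific predictions of the same quantity y.
mean, var = product_of_gaussians(means=[0.0, 2.0], variances=[1.0, 1.0])
print(mean, var)  # 1.0 0.5
```

A more confident expert (smaller variance) pulls the fused mean toward its own prediction, and the fused variance is never larger than that of the most confident expert.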

Reference

[1] Huang, Xun, et al. “Multimodal Conditional Image Synthesis with Product-of-Experts GANs.” arXiv preprint arXiv:2112.05130 (2021).

Multi-modal Problem

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

A multi-modal problem is one where a given input admits multiple plausible outputs rather than a single deterministic output. The main difficulty is mode collapse: the model ignores the random input and produces nearly the same output every time.

Generate K candidate outputs and require the ground-truth output to match one of them [1]. K is fixed beforehand.
Ensure a bijection between random vector and output: associate the random factor (e.g., random vector z) with specific information [2]. Either the random factor is conditioned on that information, or the random factor can be recovered from the generated output. If the mapping from random vector to output is invertible (e.g., Glow), the bijection between random vector and output holds by construction [6].
Force different random vectors to produce different outputs: push apart the outputs generated from different random vectors z with a diversity loss or mode seeking loss [3] [4] [5].
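
The third strategy can be sketched as a ratio-based regularizer in the spirit of the mode seeking loss [3]: the distance between two outputs is divided by the distance between the random vectors that produced them, and the generator is penalized when this ratio is small. This is a minimal sketch in plain Python on flat lists. The mean L1 distance, the inverse-ratio form, and the `eps` value are assumptions chosen for illustration, not a reproduction of any paper's exact implementation.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mode_seeking_loss(out1, out2, z1, z2, eps=1e-5):
    """Small when outputs differ as much as their latent codes do.

    out1, out2: outputs generated from z1, z2 for the same input.
    Minimizing 1 / ratio is one way to maximize the ratio itself.
    """
    d_z = l1_distance(z1, z2)
    if d_z == 0.0:
        raise ValueError("z1 and z2 must differ")
    ratio = l1_distance(out1, out2) / d_z
    return 1.0 / (ratio + eps)


z1, z2 = [0.0, 0.0], [1.0, 1.0]
collapsed = mode_seeking_loss([0.5, 0.5], [0.5, 0.5], z1, z2)  # same output
diverse = mode_seeking_loss([0.0, 0.0], [1.0, 1.0], z1, z2)    # outputs differ
print(collapsed > diverse)  # True
```

In training, this term would be added to the usual generator objective with a weighting coefficient, so that collapsed outputs incur a large penalty.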

Reference

[1] Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Anticipating visual representations from unlabeled video.” CVPR, 2016.

[2] Zhu, Jun-Yan, et al. “Toward multimodal image-to-image translation.” Advances in Neural Information Processing Systems. 2017.

[3] Mao, Qi, et al. “Mode seeking generative adversarial networks for diverse image synthesis.” CVPR, 2019.

[4] Yang, Dingdong, et al. “Diversity-sensitive conditional generative adversarial networks.” arXiv preprint arXiv:1901.09024 (2019).

[5] Liu, Shaohui, et al. “Normalized diversification.” CVPR, 2019.

[6] Lugmayr, Andreas, et al. “SRFlow: Learning the Super-Resolution Space with Normalizing Flow.” arXiv preprint arXiv:2006.14200 (2020).

MLP

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

classification [1] [2]
detection, segmentation [3]

Reference

[1] Tolstikhin, Ilya O., et al. “Mlp-mixer: An all-mlp architecture for vision.” Advances in Neural Information Processing Systems 34 (2021).

[2] Melas-Kyriazi, Luke. “Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet.” arXiv preprint arXiv:2105.02723 (2021).

[3] Lian, Dongze, et al. “As-mlp: An axial shifted mlp architecture for vision.” arXiv preprint arXiv:2107.08391 (2021).

Mixture-of-Experts

Posted on 2026-03-17 Edited on 2024-05-08 In paper note

The first paper: [1]

Switch Transformer: [2]

Reference

Memory Network

Posted on 2026-03-17 Edited on 2022-09-02 In paper note

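First memory network paper: four components, I (input feature map), G (generalization), O (output feature map), and R (response). The variables in I, G, O, and R are optimized with a single objective function.
End-to-end memory network: reads memory with soft attention instead of hard selection, so the whole model is easy to train by back-propagation.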
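
To illustrate why the end-to-end variant is easy to back-propagate through, here is a minimal sketch of a soft memory read: the query is scored against every memory key, the scores pass through a softmax, and the readout is the weighted sum of the memory values. Every step is differentiable, unlike a hard argmax over slots. The function names, the dot-product scoring, and the toy memory are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Soft read: attention weights over slots, then a weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    readout = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return readout, weights


# Hypothetical memory with two slots; the query matches the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
readout, weights = read_memory([10.0, 0.0], keys, values)
```

With this query, nearly all of the attention weight falls on the first slot, so the readout is close to the first value vector.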

semi-supervised learning with memory module [1]
few-shot learning with memory module [1] [2] [3]
global memory [1]
short-term memory and long-term memory [1]

Mask-aided Object Detection

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Help generate proposal:

Combine semantic mask with feature map (e.g., concatenation, summation) to help predict bounding boxes: [1] [2] [3]
Generate proposals from semantic mask: [4]

Help select proposal:

Assign weights to proposals based on semantic mask: [5]
Use semantic mask surrounding each proposal as auxiliary feature: [6]

Reference

[1] Yan Liu, Zhijie Zhang, Li Niu, Junjie Chen, Liqing Zhang, “Mixed Supervised Object Detection by Transferring Mask Prior and Semantic Similarity”, NeurIPS, 2021.

[2] Zitian Chen, Zhiqiang Shen, Jiahui Yu, Erik Learned-Miller: “Cross-Supervised Object Detection.” arXiv preprint arXiv:2006.15056 (2020)

[3] Zhao, Xiangyun, Shuang Liang, and Yichen Wei. “Pseudo mask augmented object detection.” CVPR, 2018.

[4] Diba, Ali, et al. “Weakly supervised cascaded convolutional networks.” CVPR, 2017.

[5] Li, Xiaoyan, et al. “Weakly supervised object detection with segmentation collaboration.” ICCV, 2019.

[6] Wei, Yunchao, et al. “Ts2c: Tight box mining with surrounding segmentation context for weakly supervised object detection.” ECCV, 2018.

LoRA Finetuning

Posted on 2026-03-17 Edited on 2026-01-02 In paper note

Merging multiple LoRAs: orthogonal weights, masked dimensions
Decomposition
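
As a baseline for the merging problem, the simplest way to combine several LoRAs is to add their weighted low-rank updates to the frozen base weight: W = W0 + sum_i w_i * B_i A_i. The sketch below shows only this naive linear merge with plain Python lists. The helper names and toy matrices are hypothetical, and the orthogonality and masking strategies listed above, which aim to reduce interference between the updates, are not implemented here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_loras(base, loras, weights):
    """Return base + sum_i weights[i] * (B_i @ A_i) without mutating base.

    loras: list of (B, A) pairs, B of shape (out, r) and A of shape (r, in).
    """
    merged = [row[:] for row in base]
    for (B, A), w in zip(loras, weights):
        delta = matmul(B, A)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += w * value
    return merged


base = [[0.0, 0.0], [0.0, 0.0]]
lora_a = ([[1.0], [0.0]], [[1.0, 0.0]])  # rank-1 update touching row 0 only
lora_b = ([[0.0], [1.0]], [[0.0, 2.0]])  # rank-1 update touching row 1 only
merged = merge_loras(base, [lora_a, lora_b], weights=[1.0, 0.5])
print(merged)  # [[1.0, 0.0], [0.0, 1.0]]
```

In this toy case the two updates modify disjoint rows, so neither overwrites the other. When updates overlap, a plain sum can make the adapters interfere.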

Layout Generation

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

VAE/GAN: [1] [6] [7](hierarchical encoder/decoder)
GNN: [2] [5]
autoregressive: [3] [4]

Reference

[1] Zheng, Xinru, et al. “Content-aware generative modeling of graphic design layouts.” ACM Transactions on Graphics (TOG) 38.4 (2019): 1-15.

[2] Lee, Hsin-Ying, et al. “Neural design network: Graphic layout generation with constraints.” ECCV, 2020.

[3] Gupta, Kamal, et al. “Layout Generation and Completion with Self-attention.” arXiv preprint arXiv:2006.14615 (2020).

[4] Jyothi, Akash Abdu, et al. “Layoutvae: Stochastic scene layout generation from a label set.” ICCV, 2019.

[5] Li, Jianan, et al. “Layoutgan: Generating graphic layouts with wireframe discriminators.” ICLR, 2019.

[6] Arroyo, Diego Martin, Janis Postels, and Federico Tombari. “Variational Transformer Networks for Layout Generation.” CVPR, 2021.

[7] Patil, Akshay Gadi, et al. “Read: Recursive autoencoders for document layout generation.” CVPR Workshops. 2020.