Deep EM

Posted on 2022-06-16 | In paper note

Learning from Massive Noisy Labeled Data for Image Classification: hidden variable is the label noise type
Expectation-Maximization Attention Networks for Semantic Segmentation: hidden variable is dictionary basis

Cut and Paste

Posted on 2022-06-16 | In paper note

Do segmentation, image enhancemnet, and inpainting simultaneously [1]
Learning to Segment via Cut-and-Paste [2]

Reference

[1] Ostyakov, Pavel, et al. “SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint.” arXiv preprint arXiv:1811.07630 (2018).

[2] Remez, Tal, Jonathan Huang, and Matthew Brown. “Learning to segment via cut-and-paste.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Conditional GAN

Posted on 2022-06-16 | In paper note

Conditioned on label vector: conditional GAN [4], CVAE-GAN [6]
Conditioned on a single image
- pix2pix [1]; high-resolution pix2pix [2] (add coarse-to-fine strategy); BicycleGAN [3] (combination of cVAE-GAN and cLR-GAN)
- DAGAN [5]

Reference

[1] Isola, Phillip, et al. “Image-to-image translation with conditional adversarial networks.” CVPR, 2017

[2] Wang, Ting-Chun, et al. “High-resolution image synthesis and semantic manipulation with conditional gans.” CVPR, 2018.

[3] Zhu, Jun-Yan, et al. “Toward multimodal image-to-image translation.” NIPS, 2017.

[4] Mirza, Mehdi, and Simon Osindero. “Conditional generative adversarial nets.” arXiv preprint arXiv:1411.1784 (2014).

[5] Antoniou, Antreas, Amos Storkey, and Harrison Edwards. “Data augmentation generative adversarial networks.” arXiv preprint arXiv:1711.04340 (2017).

[6] Bao, Jianmin, et al. “CVAE-GAN: fine-grained image generation through asymmetric training.” ICCV, 2017.

Color Mapping

Posted on 2022-06-16 | In paper note

Global color mapping:

3D LUT: [1], [2] non-uniform LUT
curve function: [1]
linear transformation: [1]

Local color mapping:

3D LUT: [1]
curve function: [1], DCE[2]
linear transformation: HDRNet[1]

CLIP

Posted on 2022-06-16 | In paper note

image classication: CLIP [1], learnable prompt [2]
video classification: ActionCLIP [3]
object detection: ViLD [4], ZSD-YOLO [6]
segmentation: [8] [9]
visual grounding: CPT [5]
image translation: StyleClip [7]

Reference

[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” arXiv preprint arXiv:2103.00020 (2021).

[2] Zhou, Kaiyang, et al. “Learning to Prompt for Vision-Language Models.” arXiv preprint arXiv:2109.01134 (2021).

[3] Wang, Mengmeng, Jiazheng Xing, and Yong Liu. “ActionCLIP: A New Paradigm for Video Action Recognition.” arXiv preprint arXiv:2109.08472 (2021).

[4] Gu, Xiuye, et al. “Zero-Shot Detection via Vision and Language Knowledge Distillation.” arXiv preprint arXiv:2104.13921 (2021).

[5] Yao, Yuan, et al. “CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models.” arXiv preprint arXiv:2109.11797 (2021).

[6] Xie, Johnathan, and Shuai Zheng. “ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language KnowledgeDistillation.” arXiv preprint arXiv:2109.12066 (2021).

[7] Patashnik, Or, et al. “Styleclip: Text-driven manipulation of stylegan imagery.” ICCV, 2021.

[8] Xu, Mengde, et al. “A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model.” arXiv preprint arXiv:2112.14757 (2021).

[9] Lüddecke, Timo, and Alexander Ecker. “Image Segmentation Using Text and Image Prompts.” CVPR, 2022.

Capsule Network

Posted on 2022-06-16 | In paper note

Typical works: CapsNet [1], CapProNet [2] [code]
The robustness of Capsule network: [3]

Reference

[1] Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules[C]//Advances in Neural Information Processing Systems. 2017: 3856-3866.

[2] Zhang L, Edraki M, Qi G J. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces[J]. arXiv preprint arXiv:1805.07621, 2018.

[3] Jindong Gu, Volker Tresp, Han Hu, “Capsule Network is Not More Robust than Convolutional Network”, CVPR 2021.

Camouflaged Object Detection and Segmentation

Posted on 2022-06-16 | In paper note

Camouflaged Object Detection: [1]
Camouflaged Object Segmentation [2]

Reference

Fan, Deng-Ping, et al. “Camouflaged object detection.” CVPR, 2020.
Yan, Jinnan, et al. “MirrorNet: Bio-Inspired Adversarial Attack for Camouflaged Object Segmentation.” arXiv preprint arXiv:2007.12881 (2020).

Boundary-guided Semantic Segmentation

Posted on 2022-06-16 | In paper note

propagate information within each non-boundary region [1]
focus on unconfident boundary regions [2]
fuse boundary feature and image feature [3]

Reference

[1] Ding, Henghui, et al. “Boundary-aware feature propagation for scene segmentation.” ICCV, 2019.

[2] Marin, Dmitrii, et al. “Efficient segmentation: Learning downsampling near semantic boundaries.” ICCV, 2019.

[3] Takikawa, Towaki, et al. “Gated-scnn: Gated shape cnns for semantic segmentation.” ICCV, 2019.

Bio-inspired Network

Posted on 2022-06-16 | In paper note

Use the first few layers to simulate neuron activation in human brain [1]
Use the attention learnt by network to mimick human attention [2]

Reference

[1] Dapello, Joel, et al. “Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations.” BioRxiv (2020).

[2] Linsley, Drew, et al. “Learning what and where to attend.” arXiv preprint arXiv:1805.08819 (2018).

Attention Mechanism

Posted on 2022-06-16 | In paper note

Attention in CNN:

According to [4], attention can be categorized into bottom-up attention (visual saliency, unsupervised) and top-down attention (task-driven, supervised).

According to [5], attention can be categorized into forward attention, post-hoc attention, and query-based attention.

forward attention: spatial attention [16], channel attention [10] [17] [18], full attention [11], Deformable CNN v1 [8] v2 [9],
post-hoc attention: CAM [6], GradCAM [7], scoreCAM [14], trainable CAM [20][21]
query-based attention: [5]
erase attention: [19] [12] [13]
high-order attention [15]

Attention in RNN:

survey paper: survey on the attention based RNN model and its applications in computer vision [1]

soft/hard attention: binary weight or soft weight
item-wise/location-wise attention: location-wise attention is to convert an image to a sequence of local regions, which is essentially item-wise.

Earliest papers [2] [3] are basically the same except design specs of RNN unit.

Reference

[1] Wang, Feng, and David MJ Tax. “Survey on the attention based RNN model and its applications in computer vision.” arXiv preprint arXiv:1601.06823 (2016).

[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).

[3] Vinyals, Oriol, et al. “Grammar as a foreign language.” NIPS, 2015.

[4] Drew Linsley, Dan Shiebler, Sven Eberhardt, Thomas Serre: Learning what and where to attend. ICLR, 2019.

[5] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr: Learn to Pay Attention. ICLR, 2018.

[6] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, Antonio Torralba: Learning Deep Features for Discriminative Localization. CVPR 2016.

[7] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra:
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017.

[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei:
Deformable Convolutional Networks. ICCV 2017.

[9] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai: Deformable ConvNets v2: More Deformable, Better Results. CoRR abs/1811.11168 (2018)

[10] Wei Li, Xiatian Zhu, Shaogang Gong: Harmonious Attention Network for Person Re-Identification. CVPR 2018.

[11] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang: Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-Identification. ECCV (4) 2018.

[12] Zintgraf, Luisa M., et al. “Visualizing deep neural network decisions: Prediction difference analysis.” arXiv preprint arXiv:1702.04595 (2017).

[13] Fong, Ruth C., and Andrea Vedaldi. “Interpretable explanations of black boxes by meaningful perturbation.” ICCV, 2017.

[14] Wang, Haofan, et al. “Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping.” arXiv preprint arXiv:1910.01279 (2019).

[15] Chen, Binghui, Weihong Deng, and Jiani Hu. “Mixed high-order attention network for person re-identification.” Proceedings of the IEEE International Conference on Computer Vision. 2019.

[16] Zhu, Xizhou, et al. “An empirical study of spatial attention mechanisms in deep networks.” ICCV, 2019.

[17] Wang, Qilong, et al. “ECA-net: Efficient channel attention for deep convolutional neural networks.” CVPR, 2020.

[18] Qin, Zequn, et al. “FcaNet: Frequency Channel Attention Networks.” arXiv preprint arXiv:2012.11879 (2020).

[19] Zhang, Xiaolin, et al. “Adversarial complementary learning for weakly supervised object localization.” CVPR, 2018.

[20] Jo, Sanhyun, and In-Jae Yu. “Puzzle-CAM: Improved localization via matching partial and full features.” arXiv preprint arXiv:2101.11253 (2021).

[21] Araslanov, Nikita, and Stefan Roth. “Single-stage semantic segmentation from image labels.” CVPR, 2020.