CLIP
Reference
[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” arXiv preprint arXiv:2103.00020 (2021).
[2] Zhou, Kaiyang, et al. “Learning to Prompt for Vision-Language Models.” arXiv preprint arXiv:2109.01134 (2021).
[3] Wang, Mengmeng, Jiazheng Xing, and Yong Liu. “ActionCLIP: A New Paradigm for Video Action Recognition.” arXiv preprint arXiv:2109.08472 (2021).
[4] Gu, Xiuye, et al. “Zero-Shot Detection via Vision and Language Knowledge Distillation.” arXiv preprint arXiv:2104.13921 (2021).
[5] Yao, Yuan, et al. “CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models.” arXiv preprint arXiv:2109.11797 (2021).
[6] Xie, Johnathan, and Shuai Zheng. “ZSD-YOLO: Zero-Shot YOLO Detection using Vision-Language Knowledge Distillation.” arXiv preprint arXiv:2109.12066 (2021).
[7] Patashnik, Or, et al. “StyleCLIP: Text-driven manipulation of StyleGAN imagery.” ICCV, 2021.
[8] Xu, Mengde, et al. “A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model.” arXiv preprint arXiv:2112.14757 (2021).
[9] Lüddecke, Timo, and Alexander Ecker. “Image Segmentation Using Text and Image Prompts.” CVPR, 2022.
Capsule Network
Reference
[1] Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules[C]//Advances in Neural Information Processing Systems. 2017: 3856-3866.
[2] Zhang L, Edraki M, Qi G J. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces[J]. arXiv preprint arXiv:1805.07621, 2018.
[3] Jindong Gu, Volker Tresp, Han Hu, “Capsule Network is Not More Robust than Convolutional Network”, CVPR 2021.
Bio-inspired Network
Use the first few layers to simulate neuron activation in the human brain [1]
Use the attention learnt by the network to mimic human attention [2]
Reference
[1] Dapello, Joel, et al. “Simulating a primary visual cortex at the front of CNNs improves robustness to image perturbations.” BioRxiv (2020).
[2] Linsley, Drew, et al. “Learning what and where to attend.” arXiv preprint arXiv:1805.08819 (2018).
Attention Mechanism
Attention in CNN:
According to [4], attention can be categorized into bottom-up attention (visual saliency, unsupervised) and top-down attention (task-driven, supervised).
According to [5], attention can be categorized into forward attention, post-hoc attention, and query-based attention.
forward attention: spatial attention [16], channel attention [10] [17] [18], full attention [11], Deformable ConvNets v1 [8] and v2 [9]
post-hoc attention: CAM [6], Grad-CAM [7], Score-CAM [14], trainable CAM [20] [21]
query-based attention: [5]
high-order attention [15]
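As a rough illustration of post-hoc attention, the class activation map of [6] is a weighted sum of the last convolutional feature maps, using the classifier weights of one class. The sketch below is a minimal pure-Python version; the feature values, the weights, and the helper names `class_activation_map` and `normalize` are hypothetical and chosen only for illustration.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using one class's K weights."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale a map to [0, 1] for display (a constant map becomes all zeros)."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in cam]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Toy input: two 2x2 feature maps from the last conv layer (hypothetical values).
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
# Hypothetical weights linking the pooled features to the chosen class.
class_weights = [0.5, 1.0]

cam = class_activation_map(feature_maps, class_weights)
heatmap = normalize(cam)
```

In practice the map is upsampled to the input resolution before being overlaid on the image; that step is omitted here.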
Attention in RNN:
survey paper: survey on the attention based RNN model and its applications in computer vision [1]
soft/hard attention: soft (continuous) weights or hard (binary) weights
item-wise/location-wise attention: location-wise attention converts an image into a sequence of local regions, so it is essentially item-wise.
The earliest papers [2] [3] are essentially the same except for the design of the RNN unit.
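The difference between soft and hard attention can be sketched in a few lines. This is a minimal illustration, assuming attention scores are already given; the hard variant here simply takes the arg-max, whereas hard attention in the literature is often sampled stochastically. All names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(scores, items):
    """Soft weights in (0, 1); the context is a weighted average of all items."""
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]
    return weights, context


def hard_attention(scores, items):
    """Binary weights; the context is the single selected item (arg-max variant)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    weights = [1.0 if i == best else 0.0 for i in range(len(scores))]
    return weights, list(items[best])


# Three items (e.g. local image regions) with 2-D features, hypothetical values.
items = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 2.0, -1.0]

soft_weights, soft_context = soft_attention(scores, items)
hard_weights, hard_context = hard_attention(scores, items)
```

The items could equally be word embeddings (item-wise) or features of image patches (location-wise); the weighting step is the same.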
Reference
[1] Wang, Feng, and David MJ Tax. “Survey on the attention based RNN model and its applications in computer vision.” arXiv preprint arXiv:1601.06823 (2016).
[2] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
[3] Vinyals, Oriol, et al. “Grammar as a foreign language.” NIPS, 2015.
[4] Drew Linsley, Dan Shiebler, Sven Eberhardt, Thomas Serre: Learning what and where to attend. ICLR, 2019.
[5] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr: Learn to Pay Attention. ICLR, 2018.
[6] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, Antonio Torralba: Learning Deep Features for Discriminative Localization. CVPR 2016.
[7] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra: Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV 2017.
[8] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei: Deformable Convolutional Networks. ICCV 2017.
[9] Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai: Deformable ConvNets v2: More Deformable, Better Results. CoRR abs/1811.11168 (2018)
[10] Wei Li, Xiatian Zhu, Shaogang Gong: Harmonious Attention Network for Person Re-Identification. CVPR 2018.
[11] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, Xinggang Wang: Mancs: A Multi-task Attentional Network with Curriculum Sampling for Person Re-Identification. ECCV (4) 2018.
[12] Zintgraf, Luisa M., et al. “Visualizing deep neural network decisions: Prediction difference analysis.” arXiv preprint arXiv:1702.04595 (2017).
[13] Fong, Ruth C., and Andrea Vedaldi. “Interpretable explanations of black boxes by meaningful perturbation.” ICCV, 2017.
[14] Wang, Haofan, et al. “Score-CAM: Improved Visual Explanations Via Score-Weighted Class Activation Mapping.” arXiv preprint arXiv:1910.01279 (2019).
[15] Chen, Binghui, Weihong Deng, and Jiani Hu. “Mixed high-order attention network for person re-identification.” Proceedings of the IEEE International Conference on Computer Vision. 2019.
[16] Zhu, Xizhou, et al. “An empirical study of spatial attention mechanisms in deep networks.” ICCV, 2019.
[17] Wang, Qilong, et al. “ECA-net: Efficient channel attention for deep convolutional neural networks.” CVPR, 2020.
[18] Qin, Zequn, et al. “FcaNet: Frequency Channel Attention Networks.” arXiv preprint arXiv:2012.11879 (2020).
[19] Zhang, Xiaolin, et al. “Adversarial complementary learning for weakly supervised object localization.” CVPR, 2018.
[20] Jo, Sanhyun, and In-Jae Yu. “Puzzle-CAM: Improved localization via matching partial and full features.” arXiv preprint arXiv:2101.11253 (2021).
[21] Araslanov, Nikita, and Stefan Roth. “Single-stage semantic segmentation from image labels.” CVPR, 2020.
Artistic and Photorealistic Style Transfer
Transfer strategy
Color transfer: learn a color transformation, either an explicit function or an implicit one (e.g., a look-up table), conditioned on color values, location, semantic information, or other guidance.
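As a small illustration of the look-up-table form of color transfer, the sketch below applies one 1D LUT per channel with linear interpolation. It only conditions on color values; conditioning on location or semantics is not modeled. The LUT entries and function names are hypothetical, and in a learned method the entries would be predicted rather than written by hand.

```python
def apply_lut(value, lut):
    """Map an intensity in [0, 1] through a 1D LUT using linear interpolation."""
    value = min(max(value, 0.0), 1.0)
    position = value * (len(lut) - 1)
    low = int(position)
    high = min(low + 1, len(lut) - 1)
    frac = position - low
    return (1.0 - frac) * lut[low] + frac * lut[high]


def transfer_colors(image, luts):
    """image: rows of (r, g, b) tuples in [0, 1]; luts: one 1D LUT per channel."""
    return [
        [tuple(apply_lut(c, lut) for c, lut in zip(pixel, luts)) for pixel in row]
        for row in image
    ]


identity = [0.0, 0.5, 1.0]   # leaves a channel unchanged
brighten = [0.0, 0.75, 1.0]  # hypothetical curve that lifts the mid-tones

image = [[(0.25, 0.25, 0.25), (1.0, 0.5, 0.0)]]
result = transfer_colors(image, [brighten, identity, identity])
```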
Compare Different Backbones
Losses:
- paired supervision: L2 loss
- unpaired supervision: adversarial loss
- smooth loss: total variation loss, Poisson loss
- content loss: perceptual loss
- style loss: Gram loss, AdaIN loss
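The two style losses listed above can be sketched on toy features. This is a simplified single-layer illustration, assuming each feature map is given as C channels of flattened activations; the AdaIN-style loss here uses squared differences of per-channel mean and standard deviation, which is a simplification of the formulation in [1]. All names and values are hypothetical.

```python
import math


def gram_matrix(features):
    """features: C channels, each a flat list of N activations. Returns a C x C matrix."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features] for fi in features]


def gram_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    g, s = gram_matrix(generated), gram_matrix(style)
    c = len(g)
    return sum((g[i][j] - s[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


def channel_stats(channel, eps=1e-5):
    """Per-channel mean and standard deviation (eps avoids division by zero)."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return mean, math.sqrt(var + eps)


def adain(content, style):
    """Shift and scale each content channel to the style channel's mean and std."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan)
        s_mean, s_std = channel_stats(s_chan)
        out.append([s_std * (x - c_mean) / c_std + s_mean for x in c_chan])
    return out


def adain_style_loss(generated, style):
    """Squared differences of per-channel mean and std (single layer)."""
    loss = 0.0
    for g_chan, s_chan in zip(generated, style):
        g_mean, g_std = channel_stats(g_chan)
        s_mean, s_std = channel_stats(s_chan)
        loss += (g_mean - s_mean) ** 2 + (g_std - s_std) ** 2
    return loss


content = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 2.0]]
style = [[10.0, 20.0, 30.0, 40.0], [0.0, 0.0, 4.0, 4.0]]
stylized = adain(content, style)
```

After `adain`, the stylized features carry the style statistics, so `adain_style_loss(stylized, style)` is close to zero while the spatial layout of `content` is preserved.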
Multi-scale stylization
Reference
[1] Huang, Xun, and Serge Belongie. “Arbitrary style transfer in real-time with adaptive instance normalization.” ICCV, 2017.
[2] Li, Yijun, et al. “Universal style transfer via feature transforms.” NeurIPS, 2017.
[3] Park, Dae Young, and Kwang Hee Lee. “Arbitrary style transfer with style-attentional networks.” CVPR, 2019.
[4] Liu, Songhua, et al. “Adaattn: Revisit attention mechanism in arbitrary neural style transfer.” ICCV, 2021.
[5] Xia, Xide, et al. “Joint bilateral learning for real-time universal photorealistic style transfer.” ECCV, 2020.
[6] Wang, Pei, Yijun Li, and Nuno Vasconcelos. “Rethinking and improving the robustness of image style transfer.” CVPR, 2021.
[7] Wei, Hua-Peng, et al. “A Comparative Study of CNN-and Transformer-Based Visual Style Transfer.” Journal of Computer Science and Technology 37.3 (2022): 601-614.
Aesthetic Evaluation and Cropping
Adversarial Attack
A comprehensive survey can be found here.
Terminology:
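black-box/white-box attack: the adversarial example is generated without (black-box) or with (white-box) knowledge of the target model, such as its architecture and parameters.
targeted/non-targeted attack: whether the attack forces the model to predict a specific label for the adversarial example (targeted) or just any incorrect label (non-targeted).
universal perturbation: a single perturbation that fools a given model on any image with high probability.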
Attack
Backward Update
add imperceptible distortion and increase the classification loss
universal adversarial perturbation: learn a residual perturbation that works on most clean images
Forward Update
one-pixel attack: use differential evolution algorithm
Adversarial Transformation Networks: learn a network to translate clean image to adversarial example.
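The gradient-based idea listed under Backward Update (add a small distortion that increases the classification loss) can be sketched with the fast gradient sign method. The notes do not name a specific method, so FGSM is used here only as one common instance. The model is a toy logistic regression with hypothetical weights, and the step size is far larger than an imperceptible one so that the effect is visible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def loss(weights, bias, x, label):
    """Binary cross-entropy for a single example."""
    p = predict(weights, bias, x)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weights, bias, x, label, epsilon):
    """White-box, non-targeted step: move x along the sign of the loss gradient."""
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]  # d(loss)/dx for logistic regression
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


weights, bias = [2.0, -1.0], 0.0  # hypothetical trained model
x, label = [0.3, 0.1], 1          # clean input, correctly classified
x_adv = fgsm(weights, bias, x, label, epsilon=0.3)
```

Because the gradient is computed from the model parameters, this is a white-box attack; a targeted variant would instead step so as to decrease the loss of the chosen target label.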
Defense
Use modified training samples during training or modified test samples during testing
Modify network: model parameters regularization, add a layer/module
Adversarial example detector: classify an example as adversarial or clean based on certain statistics
New perspective
3D Photography
Reference
[1] Shih, Meng-Li, et al. “3d photography using context-aware layered depth inpainting.” CVPR, 2020.
[2] Tucker, Richard, and Noah Snavely. “Single-view view synthesis with multiplane images.” CVPR, 2020.
[3] Li, Jiaxin, et al. “Mine: Towards continuous depth mpi with nerf for novel view synthesis.” ICCV, 2021.
[4] Niklaus, Simon, et al. “3d ken burns effect from a single image.” ACM Transactions on Graphics (TOG) 38.6 (2019): 1-15.
Makefile
Notes
Use nemiver to debug:
- gcc -g hello.c -o hello # -g adds debug information (use g++ for C++)
- nemiver hello # the compiled binary
The comment character # does not introduce a make comment in the text of commands.
Wildcards: *.* expands to all the files containing a period. A question mark (?) represents any single character, and [...] represents a character class.
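The wildcard patterns above follow shell conventions, so their matching behavior can be tried out with Python's `fnmatch` module. This is only an approximation of what make does (for example, `fnmatch` does not treat leading dots specially), and the file names are hypothetical.

```python
import fnmatch

# Hypothetical directory listing.
files = ["main.c", "lexer.l", "Makefile", "a.o", "b.o", "ab.o"]

# "*.*" matches every name that contains a period.
with_period = fnmatch.filter(files, "*.*")

# "?" matches exactly one character, so "ab.o" is excluded.
single_char_objects = fnmatch.filter(files, "?.o")

# "[ab]" is a character class: one character, either "a" or "b".
class_match = fnmatch.filter(files, "[ab].o")
```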
.PHONY: clean
Automatic Variables:
- $@ The name of the current target.
- $% The filename element of an archive member specification.
- $< The name of the first prerequisite.
- $? The names of all prerequisites that are newer than the target, separated by spaces.
- $^ The names of all the prerequisites, separated by spaces. This list has duplicate names removed since for most uses, such as compiling, copying, etc., duplicates are not wanted.
- $+ The names of all the prerequisites separated by spaces, including duplicates. This variable was created for specific situations such as arguments to linkers where duplicate values have meaning.
- $* The stem of the target filename. A stem is typically a filename without its suffix. Its use outside of pattern rules is discouraged.
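To make the difference between `$<`, `$?`, `$^`, and `$+` concrete, the following toy model computes their values for one rule. It is not how make is implemented and it ignores `$%` and `$*`; the function name, file names, and modification times are all hypothetical.

```python
def automatic_variables(target, prerequisites, mtimes):
    """Toy model of make's automatic variables for a single rule.

    mtimes maps file name -> modification time. A missing target is treated
    as older than everything; a missing prerequisite is treated as newer.
    """
    unique = list(dict.fromkeys(prerequisites))  # order-preserving de-duplication
    target_time = mtimes.get(target, float("-inf"))
    newer = [p for p in unique if mtimes.get(p, float("inf")) > target_time]
    return {
        "$@": target,
        "$<": prerequisites[0] if prerequisites else "",
        "$?": " ".join(newer),
        "$^": " ".join(unique),
        "$+": " ".join(prerequisites),
    }


# Rule "app: main.o util.o main.o" where only main.o changed after the last build.
variables = automatic_variables(
    "app",
    ["main.o", "util.o", "main.o"],
    {"app": 10, "main.o": 12, "util.o": 5},
)
```

Here `$^` drops the repeated `main.o` while `$+` keeps it, and `$?` lists only the prerequisite that is newer than the target.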
Run make with the --just-print option to view the commands it would execute without running them.
How to write Makefile
- single C-file
```makefile
hello: hello.c
	gcc -g hello.c -o hello
```
- multiple C-files
```makefile
count_words: count_words.o lexer.o -lfl
	gcc count_words.o lexer.o -lfl -o count_words
count_words.o: count_words.c
	gcc -g -c count_words.c
lexer.o: lexer.c
	gcc -g -c lexer.c
lexer.c: lexer.l
	flex -t lexer.l > lexer.c
```
- set VPATH and CPPFLAGS in implicit rules
```makefile
VPATH = src include
CPPFLAGS = -I include
count_words: counter.o lexer.o -lfl
count_words.o: counter.h
counter.o: counter.h lexer.h
lexer.o: lexer.h
```
The vpath directive allows a more advanced, per-pattern search path, as follows:
```makefile
vpath %.c src
vpath %.l src
vpath %.h include
```
- Use a static library (.a): pack .o files into a .a archive, similar to a .lib file on Windows.
```makefile
libcounter.a: libcounter.a(lexer.o) libcounter.a(counter.o)
libcounter.a(lexer.o): lexer.o
	$(AR) $(ARFLAGS) $@ $<
libcounter.a(counter.o): counter.o
	$(AR) $(ARFLAGS) $@ $<
```