(Object+Text)-Guided

Training-free

  • Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng: “Tuning-Free Image Customization with Image and Text Guidance.” arXiv preprint arXiv:2403.12658 (2024) [arXiv]

Training-based

  • Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu: “DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting.” arXiv preprint arXiv:2411.17223 (2024) [arXiv] [code]
  • Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C.K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou: “DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models.” arXiv preprint arXiv:2312.03771 (2023) [arXiv]
  • Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang: “Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance.” arXiv preprint arXiv:2403.19534 (2024) [arXiv] [code]

Foreground: 3D; Background: image

  • Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: “Scene-Conditional 3D Object Stylization and Composition.” arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]

Foreground: 3D; Background: 3D

  • Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: “InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes.” arXiv preprint arXiv:2401.05335 (2024) [arXiv]
  • Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: “Interactive Segmentation of Radiance Fields.” CVPR (2023) [arXiv] [code]
  • Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: “FusedRF: Fusing Multiple Radiance Fields.” CVPR Workshop (2023) [arXiv]
  • Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: “Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation.” WACV (2023) [arXiv]
  • Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: “Compressible-composable NeRF via Rank-residual Decomposition.” NeurIPS (2022) [arXiv] [code]
  • Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: “Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.” ICCV (2021) [arXiv] [code]

Foreground: video; Background: image

  • Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: “ActAnywhere: Subject-Aware Video Background Generation.” arXiv preprint arXiv:2401.10822 (2024) [arXiv]

Foreground: video; Background: video

  • Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: “Training-Free Semantic Video Composition via Pre-trained Diffusion Model.” arXiv preprint arXiv:2401.09195 (2024) [arXiv]
  • Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: “Inserting Videos into Videos.” CVPR (2019) [pdf]

Approaches

  1. Corneal reflection-based methods

    • Use NIR or LED illumination and learn the mapping (e.g., by regression) between the glint vector and the gaze direction.
  2. Appearance-based methods

    • Limbus model [pdf]: fit a limbus model (a fixed-diameter disc) to detected iris edges.
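
For the corneal reflection-based approach above, the learned mapping from glint vector to gaze can be sketched as a least-squares regression. This is only an illustration under assumptions: the second-order polynomial feature set, the synthetic calibration data, and the function names `poly_features`, `fit_gaze_mapping`, and `predict_gaze` are all hypothetical and not taken from any particular method.

```python
import numpy as np

def poly_features(v):
    """Second-order polynomial features of 2D glint vectors, shape (N, 2) -> (N, 6)."""
    x, y = v[:, 0], v[:, 1]
    return np.stack([np.ones_like(x), x, y, x * y, x**2, y**2], axis=1)

def fit_gaze_mapping(glint_vecs, gaze_points):
    """Least-squares fit of the coefficients that map glint vectors to gaze points."""
    design = poly_features(glint_vecs)
    coeffs, *_ = np.linalg.lstsq(design, gaze_points, rcond=None)
    return coeffs  # shape (6, 2): one column per gaze coordinate

def predict_gaze(coeffs, glint_vecs):
    """Apply the fitted polynomial mapping to new glint vectors."""
    return poly_features(glint_vecs) @ coeffs

# Synthetic, noise-free calibration session (assumption): 9 fixation targets.
rng = np.random.default_rng(0)
glint_vecs = rng.uniform(-1.0, 1.0, size=(9, 2))
true_coeffs = rng.normal(size=(6, 2))
gaze_points = poly_features(glint_vecs) @ true_coeffs

coeffs = fit_gaze_mapping(glint_vecs, gaze_points)
pred = predict_gaze(coeffs, glint_vecs)
max_err = float(np.abs(pred - gaze_points).max())
```

With real data the calibration targets would come from a per-person calibration step, and the residual would not be zero.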

Auxiliary Tools

  1. Calibration: obtain the visual axis and kappa angle for each person.

  2. Facial landmarks detection

    • One Millisecond Face Alignment with an Ensemble of Regression Trees [pdf] [code]
    • Continuous Conditional Neural Fields for Structured Regression [pdf]
  3. Head Pose Estimation

Dataset

  1. [MPIIGaze]: fine-grained annotation

  2. [Eyediap]: RGB-D

Object Detection:

  1. image label: [WSDDN]

  2. points that indicate the location of the object

  3. bounding boxes

Segmentation:

  1. image label: [SEC]

  2. points that indicate the location of the object

  3. scribbles that imply the extent of the object

  4. bounding boxes

  5. segmentation masks

  1. Distinguish generated fake images from real images in the frequency domain. [2]

  2. Use a frequency map as the network input or output [1] [5] [6]

  3. Use intermediate frequency features [7] [9]

  4. An image can be composed from, or decomposed into, a low-frequency part and a high-frequency part [3] [8] [4] [10]
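
The decomposition in the last item can be sketched with a plain FFT. This is a minimal illustration, assuming a hard circular low-pass mask on a grayscale image; the function name `split_frequencies` and the `radius` parameter are hypothetical, and the cited papers use their own (different) decompositions.

```python
import numpy as np

def split_frequencies(image, radius):
    """Split a 2D grayscale image into low- and high-frequency parts.

    A circular mask of the given radius around the zero frequency
    (centered via fftshift) selects the low-frequency band; the
    complement of the mask is the high-frequency band.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high

# Random test image (assumption): any 2D float array works.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
low, high = split_frequencies(image, radius=4)
reconstructed = low + high  # the two parts sum back to the original image
```

Because the two masks are complementary, adding the parts recovers the input, and the image mean (the zero-frequency term) ends up entirely in the low-frequency part.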

Reference

  1. Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, Fengbo Ren, “Learning in the Frequency Domain”, CVPR, 2020.

  2. Wang, Sheng-Yu, et al. “CNN-generated images are surprisingly easy to spot… for now.” arXiv preprint arXiv:1912.11035 (2019).

  3. Aayush Bansal, Yaser Sheikh, Deva Ramanan, “PixelNN: Example-based Image Synthesis”, ICLR 2018.

  4. Yanchao Yang, Stefano Soatto, “FDA: Fourier Domain Adaptation for Semantic Segmentation”, CVPR 2020.

  5. Roy, Hiya, et al. “Image inpainting using frequency domain priors.” arXiv preprint arXiv:2012.01832 (2020).

  6. Shen, Xing, et al. “DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation.” arXiv preprint arXiv:2011.09876 (2020).

  7. Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” WACV (2021).

  8. Yu, Yingchen, et al. “WaveFill: A Wavelet-based Generation Network for Image Inpainting.” ICCV, 2021.

  9. Mardani, Morteza, et al. “Neural FFTs for universal texture image synthesis.” NeurIPS (2020).

  10. Cai, Mu, et al. “Frequency domain image translation: More photo-realistic, better identity-preserving.” ICCV, 2021.

  1. Predict visual feature of one future frame [1]

  2. Predict optical flow of one future frame [2]

  3. Predict one future frame [4] (a special case of video prediction)

  4. Predict future trajectories [5]

  5. Predict optical flows of future frames, and then obtain future frames [3]

Reference

  1. Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. “Anticipating visual representations from unlabeled video.” CVPR, 2016.

  2. Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

  3. Li, Yijun, et al. “Flow-grounded spatial-temporal video prediction from still images.” ECCV, 2018.

  4. Xue, Tianfan, et al. “Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks.” NIPS, 2016.

  5. Walker, Jacob, et al. “An uncertain future: Forecasting from static images using variational autoencoders.” ECCV, 2016.

Few-shot Feature Generation

  1. Meta-learning method: [1]

  2. Delta-based: delta between each pair of samples [2]; delta between each sample and class center [3] [4]
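
The "delta between each sample and class center" variant can be sketched as a simple vector transfer. This is an illustration only: the cited methods learn how to produce or apply the deltas, whereas the sketch below just adds raw base-class deltas to a novel class center. The names `center_deltas` and `generate_features` and the synthetic features are assumptions.

```python
import numpy as np

def center_deltas(features):
    """Deltas between each sample and its class center; features has shape (N, D)."""
    center = features.mean(axis=0)
    return features - center

def generate_features(novel_features, base_deltas):
    """Synthesize features by adding base-class deltas to the novel class center."""
    novel_center = novel_features.mean(axis=0)
    return novel_center + base_deltas

# Synthetic features (assumption): a data-rich base class and a 2-shot novel class.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
novel = rng.normal(loc=3.0, size=(2, 8))

deltas = center_deltas(base)
synthetic = generate_features(novel, deltas)
```

The synthetic features keep the novel class center while borrowing the intra-class variation of the base class.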

Reference

[1] Zhang, Ruixiang, et al. “Metagan: An adversarial approach to few-shot learning.” NIPS, 2018.

[2] Schwartz, Eli, et al. “Delta-encoder: an effective sample synthesis method for few-shot object recognition.” NIPS, 2018.

[3] Liu, Jialun, et al. “Deep Representation Learning on Long-tailed Data: A Learnable Embedding Augmentation Perspective.” arXiv preprint arXiv:2002.10826 (2020).

[4] Yin, Xi, et al. “Feature transfer learning for face recognition with under-represented data.” CVPR, 2019.

  1. Lightweight network structure

    SqueezeNet, MobileNet, and ShuffleNet share the same idea: decouple the cross-channel convolution from the spatial convolution to reduce the number of parameters, in a similar spirit to Pseudo-3D Residual Networks, which decouple temporal and spatial convolution. SqueezeNet is serial while MobileNet and ShuffleNet are parallel. MobileNet is a special case of ShuffleNet when using only one group.

    Low-rank approximation (a $k\times k\times c\times d$ kernel factored into a $k\times k\times c\times d'$ kernel followed by a $1\times 1\times d'\times d$ kernel) also falls into the above scope. The difference between MobileNet and low-rank approximation is whether depthwise convolution is used.

  2. Tweak network structure

    • prune nodes based on certain criteria (e.g., response value, Fisher information): this requires a special implementation and takes up more space than expected because of the irregular network structure.
  3. Compress weights

    • Quantization (fixed bit number): learn a codebook and encode the weights. Fine-tune the codebook after quantizing the weights, which averages the gradient of weights belonging to the same cluster. Extreme cases are binary nets and ternary nets. Binary (resp., ternary) nets quantize weights to $\{-1, 1\}$ (resp., $\{-1, 0, 1\}$), with a different scaling weight $\alpha$ for each layer.
    • Huffman Coding (flexible bit number): applied after quantization for further compression.
  4. Computation

    • spatial domain to frequency domain: convert convolution to pointwise multiplication by using FFT
  5. Sparsity regularization

  6. Efficient Inference

    • cascade of networks, early exit network (predict whether to exit or not after each layer) [1] [2]
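
To make the parameter savings of the lightweight structures in item 1 concrete, the counts can be worked out directly. This is a small arithmetic sketch, assuming biases are ignored and using illustrative sizes ($k=3$, $c=64$, $d=128$, $d'=32$) that are not from the source; the function names are hypothetical.

```python
def standard_conv_params(k, c, d):
    """Weights of a standard k x k convolution with c input and d output channels."""
    return k * k * c * d

def low_rank_params(k, c, d, d_prime):
    """Low-rank factorization: a k x k conv to d' channels, then a 1 x 1 conv to d channels."""
    return k * k * c * d_prime + 1 * 1 * d_prime * d

def depthwise_separable_params(k, c, d):
    """MobileNet-style: one k x k filter per input channel, then a 1 x 1 pointwise conv."""
    return k * k * c + 1 * 1 * c * d

k, c, d, d_prime = 3, 64, 128, 32
standard = standard_conv_params(k, c, d)          # 73728
low_rank = low_rank_params(k, c, d, d_prime)      # 18432 + 4096 = 22528
separable = depthwise_separable_params(k, c, d)   # 576 + 8192 = 8768
```

For these sizes the low-rank form uses roughly 31% of the standard layer's weights and the depthwise separable form roughly 12%.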

Good introduction slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture15.pdf
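
The codebook quantization mentioned under "Compress weights" can be sketched as 1D k-means over the weights. This is a minimal pure-Python illustration, assuming linear initialization of the centroids and a fixed iteration count; `nearest`, `build_codebook`, and the toy weights are hypothetical, and the fine-tuning of the codebook is not shown.

```python
def nearest(weight, codebook):
    """Index of the codebook entry closest to the given weight."""
    return min(range(len(codebook)), key=lambda j: abs(weight - codebook[j]))

def build_codebook(weights, num_clusters, iterations=20):
    """Learn a codebook with 1D k-means and encode each weight as a cluster index."""
    lo, hi = min(weights), max(weights)
    # Linear initialization between the smallest and largest weight (assumption).
    codebook = [lo + (hi - lo) * i / (num_clusters - 1) for i in range(num_clusters)]
    for _ in range(iterations):
        assignments = [nearest(w, codebook) for w in weights]
        for j in range(num_clusters):
            members = [w for w, a in zip(weights, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                codebook[j] = sum(members) / len(members)
    assignments = [nearest(w, codebook) for w in weights]
    return codebook, assignments

# Toy weights (assumption) that cluster around -1, 0, and 1.
weights = [-0.9, -1.1, 0.05, -0.05, 1.0, 0.95, 1.05, 0.0]
codebook, assignments = build_codebook(weights, num_clusters=3)
quantized = [codebook[a] for a in assignments]
```

Only the small codebook and the per-weight indices need to be stored; with three clusters each index fits in 2 bits, and Huffman coding can shorten the indices further.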

Multi-scale fusion: HED [1], RCF [2]

Reference

  1. Xie, Saining, and Zhuowen Tu. “Holistically-nested edge detection.” ICCV, 2015.

  2. Liu, Yun, et al. “Richer convolutional features for edge detection.” CVPR, 2017.

Dynamic kernels: [1] [2]

Survey: [Dynamic neural networks: A survey]

References

  1. Jia, Xu, et al. “Dynamic filter networks.” NIPS, 2016.

  2. Tian, Zhi, Chunhua Shen, and Hao Chen. “Conditional convolutions for instance segmentation.” ECCV, 2020.

a) When the domain labels are known:

  • reduce the distance between different domains: MMD [1][2], mutual information

  • domain-invariant and domain-specific components: [1][2]
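
The MMD distance used to pull domains together can be sketched as follows. This is a minimal illustration, assuming an RBF kernel with a fixed bandwidth and the biased estimator of squared MMD; the function names, the bandwidth `sigma`, and the synthetic "domains" are assumptions, not the setup of the cited works.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Pairwise RBF kernel values between rows of a (N, D) and b (M, D)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

# Synthetic features (assumption): one matching domain and one shifted domain.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 4))
target_same = rng.normal(0.0, 1.0, size=(200, 4))
target_shifted = rng.normal(2.0, 1.0, size=(200, 4))

mmd_same = mmd_squared(source, target_same)
mmd_shifted = mmd_squared(source, target_shifted)
```

A training loss would add such a term so that features from different domains yield a small MMD; here the shifted domain gives a clearly larger value than the matching one.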

b) When the domain labels are unknown:

  • first discover multiple latent domains: cluster [1][2], max margin separation [1]