Background

The goal is to separate foreground from background given some user annotation (e.g., trimap, scribble). The prevalent technique, alpha matting, solves for $\mathbf{\alpha}$ (the primary target) and $\mathbf{F}$, $\mathbf{B}$ (subordinate targets) in the compositing equation $\mathbf{I}=\mathbf{\alpha}\circ\mathbf{F}+(1-\mathbf{\alpha})\circ \mathbf{B}$ [1] [2] [3].
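Numerically, the compositing equation is just an elementwise (Hadamard) blend; a minimal numpy sketch, with illustrative shapes and values:

```python
import numpy as np

# Toy 2x2 RGB example: alpha in [0, 1]; F and B are foreground/background colors.
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])[..., None]  # H x W x 1, broadcast over channels
F = np.full((2, 2, 3), 0.8)                 # constant foreground color
B = np.full((2, 2, 3), 0.2)                 # constant background color

# I = alpha ∘ F + (1 - alpha) ∘ B
I = alpha * F + (1.0 - alpha) * B

# Fully opaque pixels reproduce F; fully transparent pixels reproduce B.
assert np.allclose(I[0, 0], 0.8) and np.allclose(I[1, 1], 0.2)
```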

Datasets

Evaluation metrics

  • quantitative: Sum of Absolute Differences (SAD), Mean Square Error (MSE), Gradient error, Connectivity error.
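SAD and MSE are straightforward to compute on alpha mattes; a minimal numpy sketch (Gradient and Connectivity errors involve additional filtering and thresholding steps and are omitted here):

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of Absolute Differences between predicted and ground-truth alpha."""
    return np.abs(alpha_pred - alpha_gt).sum()

def mse(alpha_pred, alpha_gt):
    """Mean Squared Error between predicted and ground-truth alpha."""
    return ((alpha_pred - alpha_gt) ** 2).mean()

gt = np.array([[0.0, 1.0], [0.5, 1.0]])
pred = np.array([[0.1, 0.9], [0.5, 1.0]])
print(sad(pred, gt))  # ≈ 0.2
print(mse(pred, gt))  # ≈ 0.005
```

In benchmarks these metrics are usually accumulated only over the unknown region of the trimap.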

Methods

  1. Affinity-based [1]: propagate alpha values between pixels using similarity metrics based on color similarity or spatial proximity.

  2. Sampling-based [8]: the foreground/background color of unknown pixels can be obtained by sampling the foreground/background color of known pixels.

  3. Learning-based

    • With trimap:
      • Encoder-decoder network [2]: the first end-to-end method for image matting; takes the image and trimap as input and outputs alpha; trained with an alpha loss and a compositional loss; a second stage refines the predicted alpha.
      • DeepMattePropNet [4]: use deep learning to approximate affinity-based matting method; compositional loss.
      • AlphaGAN [6]: combine GAN with alpha loss and compositional loss.
      • Learning based sampling [7]
    • Without trimap:
      • Light Dense Network (LDN) + Feathering Block (FB) [3]: generate segmentation mask and refine the mask with feathering block.
      • T-Net + M-Net [5]: T-Net predicts a trimap via semantic segmentation; M-Net performs matting given the predicted trimap
      • [9]: capture the background image without subject and a corresponding video with subject
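The alpha loss and compositional loss mentioned for [2] (and reused by [4] and [6]) can be sketched as follows; this is a numpy approximation, with EPS an assumed smoothing constant for the Charbonnier-style absolute difference:

```python
import numpy as np

EPS = 1e-6  # assumed smoothing constant

def alpha_loss(alpha_pred, alpha_gt):
    # Smoothed absolute difference on the alpha matte.
    return np.mean(np.sqrt((alpha_pred - alpha_gt) ** 2 + EPS ** 2))

def compositional_loss(alpha_pred, image, fg, bg):
    # Re-composite with the predicted alpha and compare to the input image.
    comp = alpha_pred[..., None] * fg + (1.0 - alpha_pred[..., None]) * bg
    return np.mean(np.sqrt((comp - image) ** 2 + EPS ** 2))

# Sanity check: a perfect prediction drives both losses to (nearly) zero.
alpha = np.random.rand(4, 4)
fg = np.random.rand(4, 4, 3)
bg = np.random.rand(4, 4, 3)
image = alpha[..., None] * fg + (1.0 - alpha[..., None]) * bg
assert alpha_loss(alpha, alpha) < 1e-5
assert compositional_loss(alpha, image, fg, bg) < 1e-5
```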

Losses

  • gradient loss [11]
  • Laplacian loss [12]
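A common form of the gradient loss compares the spatial gradients of the predicted and ground-truth alpha; a minimal numpy sketch (the Laplacian loss is analogous but compares levels of a Laplacian pyramid):

```python
import numpy as np

def gradient_loss(alpha_pred, alpha_gt):
    """L1 distance between the spatial gradients of two alpha mattes."""
    gy_p, gx_p = np.gradient(alpha_pred)
    gy_g, gx_g = np.gradient(alpha_gt)
    return np.abs(gy_p - gy_g).mean() + np.abs(gx_p - gx_g).mean()

# A constant offset has identical gradients, so the loss is zero:
a = np.random.rand(8, 8)
assert np.isclose(gradient_loss(a, a + 0.3), 0.0)
```

This penalizes over-smoothed or over-sharpened matte boundaries that a plain pixelwise loss misses.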

Extension

Omnimatte [10]: segment objects and scene effects related to the objects (shadows, reflections, smoke)

User-guided Image Matting

unified interactive image matting: [13]

Reference:

[1] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” CVPR, 2017.

[2] Xu, Ning, et al. “Deep image matting.” CVPR, 2017.

[3] Zhu, Bingke, et al. “Fast deep matting for portrait animation on mobile phone.” ACM MM, 2017.

[4] Wang, Yu, et al. “Deep Propagation Based Image Matting.” IJCAI. 2018.

[5] Chen, Quan, et al. “Semantic Human Matting.” ACM MM, 2018.

[6] Lutz, Sebastian, Konstantinos Amplianitis, and Aljosa Smolic. “AlphaGAN: Generative adversarial networks for natural image matting.” BMVC, 2018.

[7] Tang, Jingwei, et al. “Learning-based Sampling for Natural Image Matting.” CVPR, 2019.

[8] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” ECCV, 2016.

[9] Sengupta, Soumyadip, et al. “Background Matting: The World is Your Green Screen.” CVPR, 2020.

[10] Lu, Erika, et al. “Omnimatte: Associating Objects and Their Effects in Video.” CVPR, 2021.

[11] Zhang, Yunke, et al. “A late fusion cnn for digital matting.” CVPR, 2019.

[12] Hou, Qiqi, and Feng Liu. “Context-aware image matting for simultaneous foreground and alpha estimation.” ICCV. 2019.

[13] Yang, Stephen, et al. “Unified interactive image matting.” arXiv preprint arXiv:2205.08324 (2022).

Datasets

Methods

  • MantraNet [code]: compare each pixel with neighboring pixels

  • MAGritte [code]: a combination of generation and discrimination

  • H-LSTM [paper] [code]: 1. resampling features 2. use Hilbert curve to determine the patch order

  • Constrained-RCNN [code]: constrained convolution

  • GSRNet [paper] [code]: data augmentation

  • SPAN [code]: pyramid self-attention

  1. perceptual loss [1]: two images have similar semantic information

  2. style loss [2]: two images have similar channel correlation; related to bilinear pooling [6]


  3. pairwise mean squared error (PMSE) [3] [4]: scale-invariant mean squared error (in log space)

  4. total variation (TV) loss [1]: smoothness

  5. alignment loss [5]: two images have similar spatial correlation, complementary to style loss

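Several of these losses can be sketched in numpy; the feature maps below are hypothetical C x H x W arrays, and the formulations follow the cited papers only loosely:

```python
import numpy as np

def gram_matrix(feat):
    """Channel-correlation (Gram) matrix of a C x H x W feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    # Two images are "style-similar" if their channel correlations match [2].
    return ((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2).sum()

def tv_loss(img):
    # Total variation: penalize differences between neighboring pixels [1].
    return np.abs(np.diff(img, axis=-1)).sum() + np.abs(np.diff(img, axis=-2)).sum()

def pmse(pred, gt):
    """Scale-invariant MSE in log space (as in [3]); inputs must be positive."""
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - d.mean() ** 2

# PMSE is invariant to a global scale factor on the prediction:
x = np.random.rand(4, 4) + 0.1
assert np.isclose(pmse(2.0 * x, x), 0.0)
```

The alignment loss [5] is analogous to the style loss but correlates spatial positions (an HW x HW matrix) rather than channels.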

Reference

[1] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” ECCV, 2016.

[2] Gatys, Leon, Alexander S. Ecker, and Matthias Bethge. “Texture synthesis using convolutional neural networks.” NIPS, 2015.

[3] Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014.

[4] Bousmalis, Konstantinos, et al. “Unsupervised pixel-level domain adaptation with generative adversarial networks.” CVPR, 2017.

[5] Abavisani, Mahdi, Hamid Reza Vaezi Joze, and Vishal M. Patel. “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training.” CVPR, 2019.

[6] Lin, Tsung-Yu, Aruni RoyChowdhury, and Subhransu Maji. “Bilinear cnn models for fine-grained visual recognition.” ICCV, 2015.

  1. Self-supervised learning: see video-to-image in this blog.

  2. Predict optical flow and use a two-stream network [1]

  3. Predict pose information (using a poselet detector) [2]

Reference:

[1] Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

[2] Chen, Chao-Yeh, and Kristen Grauman. “Watching unlabeled video helps learn new human actions from very few labeled snapshots.” CVPR, 2013.

Combine different components: [1] [2]

References

  1. Frühstück, Anna, et al. “InsetGAN for Full-Body Image Generation.” CVPR, 2022.

  2. Huang, Zehuan, et al. “From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation.” arXiv preprint arXiv:2404.15267 (2024).

  1. Geometry feature generation based on landmarks detected in an unsupervised manner. [1]

  2. Disentangle bottleneck features into category-invariant features and category-specific features. Category-invariant features encode the pose information.

Reference

  1. Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019

(Object+Text)-Guided

Training-free

  • Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng: “Tuning-Free Image Customization with Image and Text Guidance.” arXiv preprint arXiv:2403.12658 (2024) [arXiv]

Training-based

  • Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu: “DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting.” arXiv preprint arXiv:2411.17223 (2024) [arXiv] [code]
  • Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C.K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou: “DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models.” arXiv preprint arXiv:2312.03771 (2023) [arXiv]
  • Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang: “Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance.” arXiv preprint arXiv:2403.19534 (2024) [arXiv] [code]

Foreground: 3D; Background: image

  • Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: “Scene-Conditional 3D Object Stylization and Composition.” arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]

Foreground: 3D; Background: 3D

  • Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: “InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes.” arXiv preprint arXiv:2401.05335 (2024) [arXiv]
  • Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: “Interactive Segmentation of Radiance Fields.” CVPR (2023) [arXiv] [code]
  • Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: “FusedRF: Fusing Multiple Radiance Fields.” CVPR Workshop (2023) [arXiv]
  • Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: “Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation.” WACV (2023) [arXiv]
  • Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: “Compressible-composable NeRF via Rank-residual Decomposition.” NeurIPS (2022) [arXiv] [code]
  • Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: “Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.” ICCV (2021) [arXiv] [code]

Foreground: video; Background: image

  • Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: “ActAnywhere: Subject-Aware Video Background Generation.” arXiv preprint arXiv:2401.10822 (2024) [arXiv]

Foreground: video; Background: video

  • Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: “Training-Free Semantic Video Composition via Pre-trained Diffusion Model.” arXiv preprint arXiv:2401.09195 (2024) [arXiv]

  • Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: “Inserting Videos into Videos.” CVPR (2019) [pdf]

Approaches

  1. Corneal reflection-based methods

    • NIR or LED illumination; learn the mapping (e.g., via regression) between the glint vector and the gaze direction.
  2. Appearance-based methods

    • Limbus model [pdf]: fit a limbus model (a fixed-diameter disc) to detected iris edges.
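The glint-vector-to-gaze mapping in corneal-reflection methods is often fit as a low-order polynomial regression during calibration; a hypothetical least-squares sketch (the quadratic feature set is an assumption, not taken from any particular paper):

```python
import numpy as np

def fit_gaze_mapping(glints, gaze):
    """Fit gaze = features(glint) @ M by least squares.

    glints: N x 2 glint vectors; gaze: N x 2 gaze targets (e.g., screen coords).
    """
    gx, gy = glints[:, 0], glints[:, 1]
    X = np.stack([gx, gy, gx * gy, gx ** 2, gy ** 2, np.ones_like(gx)], axis=1)
    M, *_ = np.linalg.lstsq(X, gaze, rcond=None)
    return M

def predict_gaze(M, glint):
    gx, gy = glint
    return np.array([gx, gy, gx * gy, gx ** 2, gy ** 2, 1.0]) @ M

# Synthetic calibration: recover a known quadratic mapping from 20 samples.
rng = np.random.default_rng(0)
glints = rng.uniform(-1, 1, size=(20, 2))
true_M = rng.normal(size=(6, 2))
gaze = np.stack([predict_gaze(true_M, g) for g in glints])
M = fit_gaze_mapping(glints, gaze)
assert np.allclose(predict_gaze(M, (0.3, -0.2)),
                   predict_gaze(true_M, (0.3, -0.2)), atol=1e-6)
```

In practice the calibration points are the per-person fixation targets mentioned under Auxiliary Tools.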

Auxiliary Tools

  1. Calibration: obtain the visual axis and kappa angle for each person.

  2. Facial landmarks detection

    • One Millisecond Face Alignment with an Ensemble of Regression Trees [pdf] [code]
    • Continuous Conditional Neural Fields for Structured Regression [pdf]
  3. Head Pose Estimation

Dataset

  1. [MPIIGaze]: fine-grained annotation

  2. [Eyediap]: RGB-D

Object Detection:

  1. image label: [WSDDN]

  2. points that indicate the location of the object

  3. bounding boxes

Segmentation:

  1. image label: [SEC]

  2. points that indicate the location of the object

  3. scribbles that imply the extent of the object

  4. bounding boxes

  5. segmentation masks
