Newly Blog

Interpretable Machine Learning

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Manipulate each layer/neuron, and observe the change of network parameters/activations.
Saliency map
Adversarial attack
Correlation
Information gain/loss

Image-text Retrieval

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Webly supervised image-text retrieval

The first work [1] on using web images and their tags to augment image-sentence pairs. We try to reproduce it, but it does not work at all.

The text associated with a web image generally consists of tags, title, and description.
The tags are very noisy, but they are acceptable for webly supervised image classification. The titles and descriptions are noisier. Only a few descriptions are complete sentences and match the corresponding images.

Conceptual caption dataset [2] crawled web images and their alt text, and developed an automatic pipeline that extracts, filters, and transforms candidate image-caption pairs, resulting in relatively clean image-text pairs. This large corpus of web image-text pairs can be used for pretraining image-text retrieval model or image captioning model.

Image-text (Chinse) Datasets

AI challenger 2017: training set code:asix validation set code:dcnn
f30k-cn

Reference

[1] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury:
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. ACM MM, 2018.

[2] Sharma, Piyush, et al. “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.” ACL, 2018.

Image Matting

Posted on 2026-03-17 Edited on 2023-03-30 In paper note

Background

The target is separating foreground from background given some user annotation (e.g., trimask, scribble). The prevalent technique alpha matting is to solve $\mathbf{\alpha}$ (primary target), $\mathbf{F}$, $\mathbf{B}$ (subordinate target) in $\mathbf{I}=\mathbf{\alpha}\circ\mathbf{F}+(1-\mathbf{\alpha})\circ \mathbf{B}$ [1] [2] [3].

Datasets

Alphamatting.com Dataset: 25 train images, 8 test images, each has 3 different trimaps: small, large, user. Input: image and trimap.
Composition-1k Dataset: 1000 images and 50 unique foregrounds.
Matting Human Dataset: 34427 images, annotation is not very accurate.
Dinstinctions-646: composed of 646 foreground images
Text matting dataset

Evaluation metrics

quantitative: Sum of Absolute Differences (SAD), Mean Square Error (MSE), Gradient error, Connectivity error.

Methods

Affinity-based [1]: pixel similarity metrics that rely on color similarity or spatial proximity.
Sampling-based [8]: the foreground/background color of unknown pixels can be obtained by sampling the foreground/background color of known pixels.
Learning-based
- With trimap:
  - Encoder-Decoder network [2] is the first end-to-end method for image matting: input image and trimap, output alpha; alpha loss and compositional loss; refine alpha.
  - DeepMattePropNet [4]: use deep learning to approximate affinity-based matting method; compositional loss.
  - AlphaGAN [6]: combine GAN with alpha loss and compositional loss.
  - Learning based sampling [7]
- Without trimap:
  - Light Dense Network (LDN) + Feathering Block (FB) [3]: generate segmentation mask and refine the mask with feathering block.
  - T-Net+M-net [5]: use segmentation task as trimap
  - [9]: capture the background image without subject and a corresponding video with subject

Losses

gradient loss [11] Laplacian loss [12]

Extension

Omnimatte [10]: segment objects and scene effects related to the objects (shadows, reflections, smoke)

User-guided Image Matting

unified interactive image matting: [13]

Reference:

[1] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” CVPR, 2017.

[2] Xu, Ning, et al. “Deep image matting.” CVPR, 2017.

[3] Zhu, Bingke, et al. “Fast deep matting for portrait animation on mobile phone.” ACM MM, 2017.

[4] Wang, Yu, et al. “Deep Propagation Based Image Matting.” IJCAI. 2018.

[5] Quan Chen, Tiezheng Ge, Yanyu Xu, Zhiqiang Zhang, Xinxin Yang, Kun Gai, “Semantic Human Matting.” ACM MM, 2018.

[6] Lutz, Sebastian, Konstantinos Amplianitis, and Aljosa Smolic. “AlphaGAN: Generative adversarial networks for natural image matting.” BMVC, 2018.

[7] Jingwei Tang, Yagız Aksoy, Cengiz Oztireli, Markus Gross, Tunc Ozan Aydın. “Learning-based Sampling for Natural Image Matting”, CVPR, 2019.

[8] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” ECCV, 2016.

[9] Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman:
Background Matting: The World is Your Green Screen. CVPR, 2020.

[10] Lu, Erika, et al. “Omnimatte: Associating Objects and Their Effects in Video.” CVPR, 2021.

[11] Zhang, Yunke, et al. “A late fusion cnn for digital matting.” CVPR, 2019.

[12] Hou, Qiqi, and Feng Liu. “Context-aware image matting for simultaneous foreground and alpha estimation.” ICCV. 2019.

[13] Yang, Stephen, et al. “Unified interactive image matting.” arXiv preprint arXiv:2205.08324 (2022).

Image Manipulation Detection

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Datasets

Casia V1.0/2.0: [paper] [v1] [v2]
Columbia Uncompressed Image Splicing Detection: [dataset]
Pawel korus-Realistic Tampering Dataset: [dataset]
Coverage: [dataset]
NIST16: [dataset]
DEFACTO: [dataset]
self-made dataset with [code]

Methods

MantraNet [code]: compare each pixel with neighboring pixels
MAGritte [code]: a combination of generation and discrimination
H-LSTM [paper] [code]: 1. resampling features 2. use Hilbert curve to determine the patch order
Constrained-RCNN [code]: constrained convolution
GSRNet [paper] [code]: data augmentation
SPAN [code]: pyramid self-attention

Image Loss

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

perceptual loss [1]: two images have similar semantic information
$\frac{1}{C_j H_j W_j}||\phi_j(\hat{x})-\phi_j(x)||^2$
style loss [2]: two images have similar channel correlation; related to bilinear pooling [6]
$||G_j^{\phi}(\hat{x})-G_j^{\phi}(x)||_F^2$
with $G_j^{\phi}(x)_{c,c'}=\frac{1}{C_j H_j W_j}\sum_{h=1}^{H_j}\sum_{w=1}^{W_j}\phi_j(x)_{h,w,c}\phi_j(x)_{h,w,c'}$
pairwise mean squared error (PMSE) [3] [4]: scale-invariant mean squared error (in log space)
$\frac{1}{n}\sum_i d_i^2 - \frac{1}{n^2}(\sum_i d_i)^2$
total variation (TV) loss [1]: smoothness
$\sum_{(i,j)} ||x_{i,j+1}-x_{i,j}||_1 +||x_{i+1,j}-x_{i,j}||_1$
alignment loss [5]: two images have similar spatial correlation, complementary to style loss
$||F_j^{\phi}(\hat{x})-F_j^{\phi}(x)||_F^2$
with $F_j^{\phi}(x)_{d,d'}=\frac{1}{C_j H_j W_j}\sum_{c=1}^{C}\phi_j(x)_{d,c}\phi_j(x)_{d',c}$

Reference

[1] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” ECCV, 2016.

[2] Gatys, Leon, Alexander S. Ecker, and Matthias Bethge. “Texture synthesis using convolutional neural networks.” NIPS, 2015.

[3] Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014.

[4] Bousmalis, Konstantinos, et al. “Unsupervised pixel-level domain adaptation with generative adversarial networks.” CVPR, 2017.

[5] Abavisani, Mahdi, Hamid Reza Vaezi Joze, and Vishal M. Patel. “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training.” CVPR, 2019.

[6] Lin, Tsung-Yu, Aruni RoyChowdhury, and Subhransu Maji. “Bilinear cnn models for fine-grained visual recognition.” ICCV, 2015.

Image Action Recognition with Unlabeled Videos

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Self-supervised learning: see video-to-image in this blog.
predict optical flow and use two-stream network [1]
Predicting pose information (use poselet detector) [2]

Reference:

[1] Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

[2] Chen, Chao-Yeh, and Kristen Grauman. “Watching unlabeled video helps learn new human actions from very few labeled snapshots.” CVPR, 2013.

Human Generation

Posted on 2026-03-17 Edited on 2024-05-20 In paper note

Combine different components: [1] [2]

References

Frühstück, Anna, et al. “Insetgan for full-body image generation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Huang, Zehuan, et al. “From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation.” arXiv preprint arXiv:2404.15267 (2024).

Gradient Regularization

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Gradient harmonization: [1]

[1] Gradient Harmonized Single-stage Detector, AAAI, 2019

Geometry-aware Deep Feature

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Geometry feature generation based on unsupervisely detected landmarks. [1]
Disentangle bottleneck features into category-invariant features and category-specific features. Category-invariant features encode the pose information.

Reference

Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019

Generative Composition

Posted on 2026-03-17 Edited on 2025-01-06 In paper note

(Object+Text)-Guided

Training-free

Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng: “Tuning-Free Image Customization with Image and Text Guidance.“ arXiv preprint arXiv:2403.12658 (2024) [arXiv]
Training-based
Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu: “DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting.“ arXiv preprint arXiv:2411.17223 (2024) [arXiv] [code]
Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C.K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou: “DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models.“ arXiv preprint arXiv:2312.03771 (2023) [arXiv]
Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang: “Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance.“ arXiv preprint arXiv:2403.19534 (2024) [arXiv] [code]

Foreground: 3D; Background: image

Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: “Scene-Conditional 3D Object Stylization and Composition.“ arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]

Foreground: 3D; Background: 3D

Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: “InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes.“ arXiv preprint arXiv:2401.05335 (2024) [arXiv]
Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: “Interactive Segmentation of Radiance Fields.“ CVPR (2023) [arXiv] [code]
Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: “FusedRF: Fusing Multiple Radiance Fields.“ CVPR Workshop (2023) [arXiv]
Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: “Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation.“ WACV (2023) [arXiv]
Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: “Compressible-composable NeRF via Rank-residual Decomposition.“ NIPS (2022) [arXiv] [code]
Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: “Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.“ ICCV (2021) [arXiv] [code]

Foreground: video; Background: image

Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: “ActAnywhere: Subject-Aware Video Background Generation.“ arXiv preprint arXiv:2401.10822 (2024) [arXiv]

Foreground: video; Background: video

Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: “Training-Free Semantic Video Composition via Pre-trained Diffusion Model.“ arXiv preprint arXiv:2401.09195 (2024) [arXiv]
Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: “Inserting Videos into Videos.“ CVPR (2019) [pdf]

Webly supervised image-text retrieval

Image-text (Chinse) Datasets

Reference

Background

Datasets

Evaluation metrics

Methods

Losses

Extension

User-guided Image Matting

Reference:

Datasets

Methods

Reference

Reference:

References

Reference

(Object+Text)-Guided

Training-free

Training-based

Foreground: 3D; Background: image

Foreground: 3D; Background: 3D

Foreground: video; Background: image

Foreground: video; Background: video