1. Reorganize patches [1]

  2. Reorganize pixels [2]

Reference

[1] Noroozi, Mehdi, and Paolo Favaro. “Unsupervised learning of visual representations by solving jigsaw puzzles.” ECCV, 2016.

[2] Shen, Wan Xiang, et al. “AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks.” Nucleic Acids Research (2022).

  1. Manipulate each layer/neuron, and observe the change of network parameters/activations.

  2. Saliency map

  3. Adversarial attack

  4. Correlation

  5. Information gain/loss

Webly supervised image-text retrieval

The first work [1] on using web images and their tags to augment image-sentence pairs. We try to reproduce it, but it does not work at all.

The text associated with a web image generally consists of tags, title, and description.
The tags are very noisy, but they are acceptable for webly supervised image classification. The titles and descriptions are noisier. Only a few descriptions are complete sentences and match the corresponding images.

Conceptual caption dataset [2] crawled web images and their alt text, and developed an automatic pipeline that extracts, filters, and transforms candidate image-caption pairs, resulting in relatively clean image-text pairs. This large corpus of web image-text pairs can be used for pretraining image-text retrieval model or image captioning model.

Image-text (Chinse) Datasets

Reference

[1] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury:
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. ACM MM, 2018.

[2] Sharma, Piyush, et al. “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.” ACL, 2018.

Background

The target is separating foreground from background given some user annotation (e.g., trimask, scribble). The prevalent technique alpha matting is to solve $\mathbf{\alpha}$ (primary target), $\mathbf{F}$, $\mathbf{B}$ (subordinate target) in $\mathbf{I}=\mathbf{\alpha}\circ\mathbf{F}+(1-\mathbf{\alpha})\circ \mathbf{B}$ [1] [2] [3].

Datasets

Evaluation metrics

  • quantitative: Sum of Absolute Differences (SAD), Mean Square Error (MSE), Gradient error, Connectivity error.

Methods

  1. Affinity-based [1]: pixel similarity metrics that rely on color similarity or spatial proximity.

  2. Sampling-based [8]: the foreground/background color of unknown pixels can be obtained by sampling the foreground/background color of known pixels.

  3. Learning-based

    • With trimap:
      • Encoder-Decoder network [2] is the first end-to-end method for image matting: input image and trimap, output alpha; alpha loss and compositional loss; refine alpha.
      • DeepMattePropNet [4]: use deep learning to approximate affinity-based matting method; compositional loss.
      • AlphaGAN [6]: combine GAN with alpha loss and compositional loss.
      • Learning based sampling [7]
    • Without trimap:
      • Light Dense Network (LDN) + Feathering Block (FB) [3]: generate segmentation mask and refine the mask with feathering block.
      • T-Net+M-net [5]: use segmentation task as trimap
      • [9]: capture the background image without subject and a corresponding video with subject

Losses

gradient loss [11] Laplacian loss [12]

Extension

Omnimatte [10]: segment objects and scene effects related to the objects (shadows, reflections, smoke)

User-guided Image Matting

unified interactive image matting: [13]

Reference:

[1] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” CVPR, 2017.

[2] Xu, Ning, et al. “Deep image matting.” CVPR, 2017.

[3] Zhu, Bingke, et al. “Fast deep matting for portrait animation on mobile phone.” ACM MM, 2017.

[4] Wang, Yu, et al. “Deep Propagation Based Image Matting.” IJCAI. 2018.

[5] Quan Chen, Tiezheng Ge, Yanyu Xu, Zhiqiang Zhang, Xinxin Yang, Kun Gai, “Semantic Human Matting.” ACM MM, 2018.

[6] Lutz, Sebastian, Konstantinos Amplianitis, and Aljosa Smolic. “AlphaGAN: Generative adversarial networks for natural image matting.” BMVC, 2018.

[7] Jingwei Tang, Yagız Aksoy, Cengiz Oztireli, Markus Gross, Tunc Ozan Aydın. “Learning-based Sampling for Natural Image Matting”, CVPR, 2019.

[8] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” ECCV, 2016.

[9] Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman:
Background Matting: The World is Your Green Screen. CVPR, 2020.

[10] Lu, Erika, et al. “Omnimatte: Associating Objects and Their Effects in Video.” CVPR, 2021.

[11] Zhang, Yunke, et al. “A late fusion cnn for digital matting.” CVPR, 2019.

[12] Hou, Qiqi, and Feng Liu. “Context-aware image matting for simultaneous foreground and alpha estimation.” ICCV. 2019.

[13] Yang, Stephen, et al. “Unified interactive image matting.” arXiv preprint arXiv:2205.08324 (2022).

Datasets

Methods

  • MantraNet [code]: compare each pixel with neighboring pixels

  • MAGritte [code]: a combination of generation and discrimination

  • H-LSTM [paper] [code]: 1. resampling features 2. use Hilbert curve to determine the patch order

  • Constrained-RCNN [code]: constrained convolution

  • GSRNet [paper] [code]: data augmentation

  • SPAN [code]: pyramid self-attention

  1. perceptual loss [1]: two images have similar semantic information

  2. style loss [2]: two images have similar channel correlation; related to bilinear pooling [6]

    with

  3. pairwise mean squared error (PMSE) [3] [4]: scale-invariant mean squared error (in log space)

  4. total variation (TV) loss [1]: smoothness

  5. alignment loss [5]: two images have similar spatial correlation, complementary to style loss

    with

Reference

[1] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” ECCV, 2016.

[2] Gatys, Leon, Alexander S. Ecker, and Matthias Bethge. “Texture synthesis using convolutional neural networks.” NIPS, 2015.

[3] Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014.

[4] Bousmalis, Konstantinos, et al. “Unsupervised pixel-level domain adaptation with generative adversarial networks.” CVPR, 2017.

[5] Abavisani, Mahdi, Hamid Reza Vaezi Joze, and Vishal M. Patel. “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training.” CVPR, 2019.

[6] Lin, Tsung-Yu, Aruni RoyChowdhury, and Subhransu Maji. “Bilinear cnn models for fine-grained visual recognition.” ICCV, 2015.

  1. Self-supervised learning: see video-to-image in this blog.

  2. predict optical flow and use two-stream network [1]

  3. Predicting pose information (use poselet detector) [2]

Reference:

[1] Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

[2] Chen, Chao-Yeh, and Kristen Grauman. “Watching unlabeled video helps learn new human actions from very few labeled snapshots.” CVPR, 2013.

Combine different components: [1] [2]

References

  1. Frühstück, Anna, et al. “Insetgan for full-body image generation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

  2. Huang, Zehuan, et al. “From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation.” arXiv preprint arXiv:2404.15267 (2024).

  1. Geometry feature generation based on unsupervisely detected landmarks. [1]

  2. Disentangle bottleneck features into category-invariant features and category-specific features. Category-invariant features encode the pose information.

Reference

  1. Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019
0%