Background

The goal is to separate foreground from background given some user annotation (e.g., trimap, scribble). The prevalent technique, alpha matting, solves for $\mathbf{\alpha}$ (the primary target) and $\mathbf{F}$, $\mathbf{B}$ (subordinate targets) in the compositing equation $\mathbf{I}=\mathbf{\alpha}\circ\mathbf{F}+(1-\mathbf{\alpha})\circ \mathbf{B}$ [1] [2] [3].
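Numerically, the compositing equation is just an elementwise (Hadamard) blend; a minimal numpy sketch, with illustrative shapes and values:

```python
import numpy as np

# Toy 2x2 RGB example: alpha in [0, 1]; F and B are foreground/background colors.
alpha = np.array([[1.0, 0.5],
                  [0.25, 0.0]])[..., None]  # H x W x 1, broadcast over channels
F = np.full((2, 2, 3), 0.8)                 # constant foreground color
B = np.full((2, 2, 3), 0.2)                 # constant background color

# I = alpha ∘ F + (1 - alpha) ∘ B
I = alpha * F + (1.0 - alpha) * B

# Fully opaque pixels reproduce F; fully transparent pixels reproduce B.
assert np.allclose(I[0, 0], 0.8) and np.allclose(I[1, 1], 0.2)
```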

Datasets

Evaluation metrics

  • quantitative: Sum of Absolute Differences (SAD), Mean Square Error (MSE), Gradient error, Connectivity error.
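SAD and MSE are straightforward to compute on alpha mattes; a minimal numpy sketch (Gradient and Connectivity errors involve additional filtering and thresholding steps and are omitted here):

```python
import numpy as np

def sad(alpha_pred, alpha_gt):
    """Sum of Absolute Differences between predicted and ground-truth alpha."""
    return np.abs(alpha_pred - alpha_gt).sum()

def mse(alpha_pred, alpha_gt):
    """Mean Squared Error between predicted and ground-truth alpha."""
    return ((alpha_pred - alpha_gt) ** 2).mean()

gt = np.array([[0.0, 1.0], [0.5, 1.0]])
pred = np.array([[0.1, 0.9], [0.5, 1.0]])
print(sad(pred, gt))  # ≈ 0.2
print(mse(pred, gt))  # ≈ 0.005
```

In benchmarks these metrics are usually accumulated only over the unknown region of the trimap.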

Methods

  1. Affinity-based [1]: propagate alpha values between pixels using similarity metrics based on color similarity or spatial proximity.

  2. Sampling-based [8]: the foreground/background color of unknown pixels can be obtained by sampling the foreground/background color of known pixels.

  3. Learning-based

    • With trimap:
      • Encoder-decoder network [2]: the first end-to-end method for image matting; takes the image and trimap as input and outputs alpha; trained with an alpha loss and a compositional loss; a second stage refines the predicted alpha.
      • DeepMattePropNet [4]: use deep learning to approximate affinity-based matting method; compositional loss.
      • AlphaGAN [6]: combine GAN with alpha loss and compositional loss.
      • Learning based sampling [7]
    • Without trimap:
      • Light Dense Network (LDN) + Feathering Block (FB) [3]: generate segmentation mask and refine the mask with feathering block.
      • T-Net + M-Net [5]: T-Net predicts a trimap via semantic segmentation; M-Net performs matting given the predicted trimap
      • [9]: capture the background image without subject and a corresponding video with subject
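The alpha loss and compositional loss mentioned for [2] (and reused by [4] and [6]) can be sketched as follows; this is a numpy approximation, with EPS an assumed smoothing constant for the Charbonnier-style absolute difference:

```python
import numpy as np

EPS = 1e-6  # assumed smoothing constant

def alpha_loss(alpha_pred, alpha_gt):
    # Smoothed absolute difference on the alpha matte.
    return np.mean(np.sqrt((alpha_pred - alpha_gt) ** 2 + EPS ** 2))

def compositional_loss(alpha_pred, image, fg, bg):
    # Re-composite with the predicted alpha and compare to the input image.
    comp = alpha_pred[..., None] * fg + (1.0 - alpha_pred[..., None]) * bg
    return np.mean(np.sqrt((comp - image) ** 2 + EPS ** 2))

# Sanity check: a perfect prediction drives both losses to (nearly) zero.
alpha = np.random.rand(4, 4)
fg = np.random.rand(4, 4, 3)
bg = np.random.rand(4, 4, 3)
image = alpha[..., None] * fg + (1.0 - alpha[..., None]) * bg
assert alpha_loss(alpha, alpha) < 1e-5
assert compositional_loss(alpha, image, fg, bg) < 1e-5
```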

Losses

  • gradient loss [11]
  • Laplacian loss [12]
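A common form of the gradient loss compares the spatial gradients of the predicted and ground-truth alpha; a minimal numpy sketch (the Laplacian loss is analogous but compares levels of a Laplacian pyramid):

```python
import numpy as np

def gradient_loss(alpha_pred, alpha_gt):
    """L1 distance between the spatial gradients of two alpha mattes."""
    gy_p, gx_p = np.gradient(alpha_pred)
    gy_g, gx_g = np.gradient(alpha_gt)
    return np.abs(gy_p - gy_g).mean() + np.abs(gx_p - gx_g).mean()

# A constant offset has identical gradients, so the loss is zero:
a = np.random.rand(8, 8)
assert np.isclose(gradient_loss(a, a + 0.3), 0.0)
```

This penalizes over-smoothed or over-sharpened matte boundaries that a plain pixelwise loss misses.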

Extension

Omnimatte [10]: segment objects and scene effects related to the objects (shadows, reflections, smoke)

User-guided Image Matting

unified interactive image matting: [13]

Reference:

[1] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” CVPR, 2017.

[2] Xu, Ning, et al. “Deep image matting.” CVPR, 2017.

[3] Zhu, Bingke, et al. “Fast deep matting for portrait animation on mobile phone.” ACM MM, 2017.

[4] Wang, Yu, et al. “Deep Propagation Based Image Matting.” IJCAI. 2018.

[5] Chen, Quan, et al. “Semantic Human Matting.” ACM MM, 2018.

[6] Lutz, Sebastian, Konstantinos Amplianitis, and Aljosa Smolic. “AlphaGAN: Generative adversarial networks for natural image matting.” BMVC, 2018.

[7] Tang, Jingwei, et al. “Learning-based Sampling for Natural Image Matting.” CVPR, 2019.

[8] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” ECCV, 2016.

[9] Sengupta, Soumyadip, et al. “Background Matting: The World is Your Green Screen.” CVPR, 2020.

[10] Lu, Erika, et al. “Omnimatte: Associating Objects and Their Effects in Video.” CVPR, 2021.

[11] Zhang, Yunke, et al. “A late fusion cnn for digital matting.” CVPR, 2019.

[12] Hou, Qiqi, and Feng Liu. “Context-aware image matting for simultaneous foreground and alpha estimation.” ICCV. 2019.

[13] Yang, Stephen, et al. “Unified interactive image matting.” arXiv preprint arXiv:2205.08324 (2022).

Datasets

Methods

  • MantraNet [code]: compare each pixel with neighboring pixels

  • MAGritte [code]: a combination of generation and discrimination

  • H-LSTM [paper] [code]: 1. resampling features 2. use Hilbert curve to determine the patch order

  • Constrained-RCNN [code]: constrained convolution

  • GSRNet [paper] [code]: data augmentation

  • SPAN [code]: pyramid self-attention

  1. perceptual loss [1]: two images have similar semantic information

  2. style loss [2]: two images have similar channel correlation; related to bilinear pooling [6]


  3. pairwise mean squared error (PMSE) [3] [4]: scale-invariant mean squared error (in log space)

  4. total variation (TV) loss [1]: smoothness

  5. alignment loss [5]: two images have similar spatial correlation, complementary to style loss

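Several of these losses can be sketched in numpy; the feature maps below are hypothetical C x H x W arrays, and the formulations follow the cited papers only loosely:

```python
import numpy as np

def gram_matrix(feat):
    """Channel-correlation (Gram) matrix of a C x H x W feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    # Two images are "style-similar" if their channel correlations match [2].
    return ((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2).sum()

def tv_loss(img):
    # Total variation: penalize differences between neighboring pixels [1].
    return np.abs(np.diff(img, axis=-1)).sum() + np.abs(np.diff(img, axis=-2)).sum()

def pmse(pred, gt):
    """Scale-invariant MSE in log space (as in [3]); inputs must be positive."""
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - d.mean() ** 2

# PMSE is invariant to a global scale factor on the prediction:
x = np.random.rand(4, 4) + 0.1
assert np.isclose(pmse(2.0 * x, x), 0.0)
```

The alignment loss [5] is analogous to the style loss but correlates spatial positions (an HW x HW matrix) rather than channels.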

Reference

[1] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” ECCV, 2016.

[2] Gatys, Leon, Alexander S. Ecker, and Matthias Bethge. “Texture synthesis using convolutional neural networks.” NIPS, 2015.

[3] Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014.

[4] Bousmalis, Konstantinos, et al. “Unsupervised pixel-level domain adaptation with generative adversarial networks.” CVPR, 2017.

[5] Abavisani, Mahdi, Hamid Reza Vaezi Joze, and Vishal M. Patel. “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training.” CVPR, 2019.

[6] Lin, Tsung-Yu, Aruni RoyChowdhury, and Subhransu Maji. “Bilinear cnn models for fine-grained visual recognition.” ICCV, 2015.

  1. Self-supervised learning: see video-to-image in this blog.

  2. Predict optical flow and use a two-stream network [1]

  3. Predict pose information (using a poselet detector) [2]

Reference:

[1] Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.

[2] Chen, Chao-Yeh, and Kristen Grauman. “Watching unlabeled video helps learn new human actions from very few labeled snapshots.” CVPR, 2013.

Combine different components: [1] [2]

References

  1. Frühstück, Anna, et al. “InsetGAN for Full-Body Image Generation.” CVPR, 2022.

  2. Huang, Zehuan, et al. “From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation.” arXiv preprint arXiv:2404.15267 (2024).

  1. Geometry feature generation based on landmarks detected in an unsupervised manner. [1]

  2. Disentangle bottleneck features into category-invariant features and category-specific features. Category-invariant features encode the pose information.

Reference

  1. Wayne Wu, Kaidi Cao, Cheng Li, Chen Qian, Chen Change Loy: TransGaGa: Geometry-Aware Unsupervised Image-To-Image Translation. CVPR 2019

(Object+Text)-Guided

Training-free

  • Pengzhi Li, Qiang Nie, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng: “Tuning-Free Image Customization with Image and Text Guidance.” arXiv preprint arXiv:2403.12658 (2024) [arXiv]

Training-based

  • Yicheng Yang, Pengxiang Li, Lu Zhang, Liqian Ma, Ping Hu, Siyu Du, Yunzhi Zhuge, Xu Jia, Huchuan Lu: “DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting.” arXiv preprint arXiv:2411.17223 (2024) [arXiv] [code]
  • Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C.K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou: “DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models.” arXiv preprint arXiv:2312.03771 (2023) [arXiv]
  • Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang: “Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance.” arXiv preprint arXiv:2403.19534 (2024) [arXiv] [code]

Foreground: 3D; Background: image

  • Jinghao Zhou, Tomas Jakab, Philip Torr, Christian Rupprecht: “Scene-Conditional 3D Object Stylization and Composition.” arXiv preprint arXiv:2312.12419 (2023) [arXiv] [code]

Foreground: 3D; Background: 3D

  • Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari: “InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes.” arXiv preprint arXiv:2401.05335 (2024) [arXiv]
  • Rahul Goel, Dhawal Sirikonda, Saurabh Saini, PJ Narayanan: “Interactive Segmentation of Radiance Fields.” CVPR (2023) [arXiv] [code]
  • Rahul Goel, Dhawal Sirikonda, Rajvi Shah, PJ Narayanan: “FusedRF: Fusing Multiple Radiance Fields.” CVPR Workshop (2023) [arXiv]
  • Verica Lazova, Vladimir Guzov, Kyle Olszewski, Sergey Tulyakov, Gerard Pons-Moll: “Control-NeRF: Editable Feature Volumes for Scene Rendering and Manipulation.” WACV (2023) [arXiv]
  • Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng: “Compressible-composable NeRF via Rank-residual Decomposition.” NeurIPS (2022) [arXiv] [code]
  • Bangbang Yang, Yinda Zhang, Yinghao Xu, Yijin Li, Han Zhou, Hujun Bao, Guofeng Zhang, Zhaopeng Cui: “Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering.” ICCV (2021) [arXiv] [code]

Foreground: video; Background: image

  • Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang: “ActAnywhere: Subject-Aware Video Background Generation.” arXiv preprint arXiv:2401.10822 (2024) [arXiv]

Foreground: video; Background: video

  • Jiaqi Guo, Sitong Su, Junchen Zhu, Lianli Gao, Jingkuan Song: “Training-Free Semantic Video Composition via Pre-trained Diffusion Model.” arXiv preprint arXiv:2401.09195 (2024) [arXiv]

  • Donghoon Lee, Tomas Pfister, Ming-Hsuan Yang: “Inserting Videos into Videos.” CVPR (2019) [pdf]

Approaches

  1. Corneal reflection-based methods

    • NIR or LED illumination; learn the mapping (e.g., via regression) between the glint vector and the gaze direction.
  2. Appearance-based methods

    • Limbus model [pdf]: fit a limbus model (a fixed-diameter disc) to detected iris edges.
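The glint-vector-to-gaze mapping in corneal-reflection methods is often fit as a low-order polynomial regression during calibration; a hypothetical least-squares sketch (the quadratic feature set is an assumption, not taken from any particular paper):

```python
import numpy as np

def fit_gaze_mapping(glints, gaze):
    """Fit gaze = features(glint) @ M by least squares.

    glints: N x 2 glint vectors; gaze: N x 2 gaze targets (e.g., screen coords).
    """
    gx, gy = glints[:, 0], glints[:, 1]
    X = np.stack([gx, gy, gx * gy, gx ** 2, gy ** 2, np.ones_like(gx)], axis=1)
    M, *_ = np.linalg.lstsq(X, gaze, rcond=None)
    return M

def predict_gaze(M, glint):
    gx, gy = glint
    return np.array([gx, gy, gx * gy, gx ** 2, gy ** 2, 1.0]) @ M

# Synthetic calibration: recover a known quadratic mapping from 20 samples.
rng = np.random.default_rng(0)
glints = rng.uniform(-1, 1, size=(20, 2))
true_M = rng.normal(size=(6, 2))
gaze = np.stack([predict_gaze(true_M, g) for g in glints])
M = fit_gaze_mapping(glints, gaze)
assert np.allclose(predict_gaze(M, (0.3, -0.2)),
                   predict_gaze(true_M, (0.3, -0.2)), atol=1e-6)
```

In practice the calibration points are the per-person fixation targets mentioned under Auxiliary Tools.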

Auxiliary Tools

  1. Calibration: obtain the visual axis and kappa angle for each person.

  2. Facial landmarks detection

    • One Millisecond Face Alignment with an Ensemble of Regression Trees [pdf] [code]
    • Continuous Conditional Neural Fields for Structured Regression [pdf]
  3. Head Pose Estimation

Dataset

  1. [MPIIGaze]: fine-grained annotation

  2. [Eyediap]: RGB-D

Object Detection:

  1. image label: [WSDDN]

  2. points that indicate the location of the object

  3. bounding boxes

Segmentation:

  1. image label: [SEC]

  2. points that indicate the location of the object

  3. scribbles that imply the extent of the object

  4. bounding boxes

  5. segmentation masks
