Interpretable Machine Learning
Manipulate each layer/neuron, and observe the change of network parameters/activations.
Saliency map
Adversarial attack
Correlation
Information gain/loss
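The "manipulate and observe" idea behind these methods can be illustrated with an occlusion-based saliency map: hide one patch of the input at a time and record how much the model score drops. The sketch below is only an illustration under stated assumptions. `occlusion_saliency`, `score_fn`, `toy_score`, and the patch size are hypothetical names and settings, and the toy scorer stands in for a trained network.

```python
import numpy as np

def occlusion_saliency(score_fn, image, patch=4, fill=0.0):
    """Return a map of score drops caused by occluding each patch with `fill`."""
    h, w = image.shape[:2]
    base = score_fn(image)                      # score on the unmodified input
    saliency = np.zeros((h, w), dtype=float)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            # A large drop means the model relied on this region.
            saliency[top:top + patch, left:left + patch] = base - score_fn(occluded)
    return saliency

def toy_score(img):
    """Hypothetical stand-in for a network: only the centre region matters."""
    return float(img[4:8, 4:8].sum())

image = np.ones((12, 12))
sal = occlusion_saliency(toy_score, image, patch=4)
```

In this toy setting only the centre patch receives a non-zero saliency value, because it is the only region the scorer looks at.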
[1] is the first work to use web images and their tags to augment image-sentence pairs. We tried to reproduce it, but it did not work at all.
The text associated with a web image generally consists of tags, title, and description.
The tags are very noisy, but they are acceptable for webly supervised image classification. The titles and descriptions are noisier. Only a few descriptions are complete sentences and match the corresponding images.
The Conceptual Captions dataset [2] was built by crawling web images with their alt text and applying an automatic pipeline that extracts, filters, and transforms candidate image-caption pairs, which yields relatively clean image-text pairs. This large corpus of web image-text pairs can be used to pretrain image-text retrieval or image captioning models.
[1] Mithun, Niluthpol Chowdhury, et al. “Webly supervised joint embedding for cross-modal image-text retrieval.” ACM MM, 2018.
[2] Sharma, Piyush, et al. “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.” ACL, 2018.
The goal is to separate the foreground from the background given some user annotation (e.g., a trimap or scribbles). The prevalent technique, alpha matting, solves for $\mathbf{\alpha}$ (the primary target) and $\mathbf{F}$, $\mathbf{B}$ (subordinate targets) in $\mathbf{I}=\mathbf{\alpha}\circ\mathbf{F}+(1-\mathbf{\alpha})\circ \mathbf{B}$ [1] [2] [3].
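The compositing equation can be made concrete with a few lines of NumPy. This is a minimal sketch, not any paper's method: `composite` and `alpha_given_fg_bg` are hypothetical helper names, the random arrays are toy data, and the second function assumes $\mathbf{F}$ and $\mathbf{B}$ are already known, which is not the case in real matting.

```python
import numpy as np

def composite(alpha, fg, bg):
    """I = alpha * F + (1 - alpha) * B, applied per pixel and per channel."""
    a = alpha[..., None]              # (H, W) -> (H, W, 1) to broadcast over RGB
    return a * fg + (1.0 - a) * bg

def alpha_given_fg_bg(image, fg, bg, eps=1e-12):
    """Per-pixel least-squares alpha, assuming F and B are known (toy setting)."""
    diff = fg - bg
    num = np.sum((image - bg) * diff, axis=-1)
    den = np.sum(diff * diff, axis=-1) + eps
    return np.clip(num / den, 0.0, 1.0)

rng = np.random.default_rng(0)
h, w = 4, 5
alpha = rng.random((h, w))
fg = 0.8 + 0.2 * rng.random((h, w, 3))   # bright foreground colours
bg = 0.2 * rng.random((h, w, 3))         # dark background colours
image = composite(alpha, fg, bg)
alpha_hat = alpha_given_fg_bg(image, fg, bg)
```

With $\mathbf{F}$ and $\mathbf{B}$ unknown, each RGB pixel gives three equations for seven unknowns (one alpha value plus two colours), which is why matting methods rely on a trimap or other user annotation.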
Alphamatting.com Dataset: 27 training images and 8 test images; each image has 3 different trimaps: small, large, user. Input: image and trimap.
Composition-1k Dataset: the test set contains 1000 images composited from 50 unique foregrounds.
Matting Human Dataset: 34427 images; the annotations are not very accurate.
Distinctions-646: composed of 646 foreground images.
Affinity-based [1]: pixel similarity metrics that rely on color similarity or spatial proximity.
Sampling-based [8]: the foreground/background color of unknown pixels can be obtained by sampling the foreground/background color of known pixels.
Learning-based
gradient loss [11], Laplacian loss [12]
Omnimatte [10]: segment objects and scene effects related to the objects (shadows, reflections, smoke)
unified interactive image matting: [13]
[1] Aksoy, Yagiz, Tunc Ozan Aydin, and Marc Pollefeys. “Designing effective inter-pixel information flow for natural image matting.” CVPR, 2017.
[2] Xu, Ning, et al. “Deep image matting.” CVPR, 2017.
[3] Zhu, Bingke, et al. “Fast deep matting for portrait animation on mobile phone.” ACM MM, 2017.
[4] Wang, Yu, et al. “Deep propagation based image matting.” IJCAI, 2018.
[5] Chen, Quan, et al. “Semantic human matting.” ACM MM, 2018.
[6] Lutz, Sebastian, Konstantinos Amplianitis, and Aljosa Smolic. “AlphaGAN: Generative adversarial networks for natural image matting.” BMVC, 2018.
[7] Tang, Jingwei, et al. “Learning-based sampling for natural image matting.” CVPR, 2019.
[8] Feng, Xiaoxue, Xiaohui Liang, and Zili Zhang. “A cluster sampling method for image matting via sparse coding.” ECCV, 2016.
[9] Sengupta, Soumyadip, et al. “Background matting: The world is your green screen.” CVPR, 2020.
[10] Lu, Erika, et al. “Omnimatte: Associating Objects and Their Effects in Video.” CVPR, 2021.
[11] Zhang, Yunke, et al. “A late fusion cnn for digital matting.” CVPR, 2019.
[12] Hou, Qiqi, and Feng Liu. “Context-aware image matting for simultaneous foreground and alpha estimation.” ICCV, 2019.
[13] Yang, Stephen, et al. “Unified interactive image matting.” arXiv preprint arXiv:2205.08324 (2022).
Columbia Uncompressed Image Splicing Detection: [dataset]
Pawel Korus - Realistic Tampering Dataset: [dataset]
Coverage: [dataset]
NIST16: [dataset]
DEFACTO: [dataset]
self-made dataset with [code]
MantraNet [code]: compare each pixel with neighboring pixels
MAGritte [code]: a combination of generation and discrimination
H-LSTM [paper] [code]: (1) uses resampling features; (2) uses a Hilbert curve to determine the patch order
Constrained-RCNN [code]: constrained convolution
SPAN [code]: pyramid self-attention
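The Hilbert-curve patch ordering mentioned for H-LSTM can be sketched with the standard distance-to-coordinate conversion. This is only an illustration of the curve itself, not the H-LSTM code: `hilbert_d2xy` and `hilbert_patch_order` are hypothetical names, and the patch grid is assumed to be square with a power-of-two side.

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                    # rotate/flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_patch_order(order):
    """Visit every patch of the grid in Hilbert-curve order."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]

path = hilbert_patch_order(3)          # ordering for an 8 x 8 patch grid
```

Consecutive patches in this ordering are always spatial neighbours, which keeps nearby patches close together in the sequence fed to a recurrent model.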
perceptual loss [1]: two images have similar semantic information
style loss [2]: two images have similar channel correlation (compared through the Gram matrices of their feature maps); related to bilinear pooling [6]
pairwise mean squared error (PMSE) [3] [4]: scale-invariant mean squared error (in log space)
total variation (TV) loss [1]: smoothness
alignment loss [5]: two images have similar spatial correlation; complementary to style loss
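Several of the losses listed above can be written in a few lines of NumPy. The sketch below is illustrative rather than a reproduction of the cited papers: the function names are hypothetical, the feature maps are assumed to have shape (C, H, W), the Gram matrix uses a 1/(CHW) normalisation, the PMSE follows the fully scale-invariant log-space form, and the TV term is the common anisotropic L1 variant.

```python
import numpy as np

def perceptual_loss(feat_a, feat_b):
    """Mean squared difference between feature maps of shape (C, H, W)."""
    return float(np.mean((feat_a - feat_b) ** 2))

def gram_matrix(feat):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Squared Frobenius distance between the two Gram matrices."""
    return float(np.sum((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2))

def pmse(pred, target, eps=1e-8):
    """Scale-invariant MSE in log space: mean(d^2) - mean(d)^2."""
    d = np.log(pred + eps) - np.log(target + eps)
    return float(np.mean(d ** 2) - np.mean(d) ** 2)

def tv_loss(img):
    """Anisotropic total variation of a 2-D image (sum of absolute differences)."""
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return float(dh + dw)

rng = np.random.default_rng(0)
feat = rng.random((3, 4, 4))
flipped = feat[:, ::-1, :]            # same content, different spatial layout
depth = 0.5 + rng.random((4, 4))
step = np.zeros((4, 4))
step[:, 2:] = 1.0                     # one vertical edge
```

Flipping the feature map changes the perceptual loss but leaves the style loss at zero, because the Gram matrix sums over all spatial positions; this is the gap that a spatial-correlation term such as the alignment loss is meant to cover.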
[1] Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and super-resolution.” ECCV, 2016.
[2] Gatys, Leon, Alexander S. Ecker, and Matthias Bethge. “Texture synthesis using convolutional neural networks.” NIPS, 2015.
[3] Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014.
[4] Bousmalis, Konstantinos, et al. “Unsupervised pixel-level domain adaptation with generative adversarial networks.” CVPR, 2017.
[5] Abavisani, Mahdi, Hamid Reza Vaezi Joze, and Vishal M. Patel. “Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training.” CVPR, 2019.
[6] Lin, Tsung-Yu, Aruni RoyChowdhury, and Subhransu Maji. “Bilinear cnn models for fine-grained visual recognition.” ICCV, 2015.
Self-supervised learning: see video-to-image in this blog.
Predict optical flow and use a two-stream network [1]
Predict pose information (using a poselet detector) [2]
[1] Gao, Ruohan, Bo Xiong, and Kristen Grauman. “Im2flow: Motion hallucination from static images for action recognition.” CVPR, 2018.
[2] Chen, Chao-Yeh, and Kristen Grauman. “Watching unlabeled video helps learn new human actions from very few labeled snapshots.” CVPR, 2013.
Combine different components: [1] [2]
[1] Frühstück, Anna, et al. “InsetGAN for full-body image generation.” CVPR, 2022.
[2] Huang, Zehuan, et al. “From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation.” arXiv preprint arXiv:2404.15267 (2024).
[1] Gradient Harmonized Single-stage Detector, AAAI, 2019
Geometry feature generation based on landmarks detected in an unsupervised manner [1].
Disentangle bottleneck features into category-invariant features and category-specific features. Category-invariant features encode the pose information.