Transformer

The Transformer [1] is an architecture built around multi-head self-attention; it can be used for sequence-to-sequence or sequence-to-label applications.
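
A minimal PyTorch sketch of the core multi-head self-attention operation (an illustration with class and parameter names of my choosing, not the reference implementation of [1]):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)      # output projection W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # Project once, split into Q, K, V, reshape to (b, heads, n, d_head).
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.num_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = scores.softmax(dim=-1) @ v
        # Merge heads back into d_model and project.
        return self.out(out.transpose(1, 2).reshape(b, n, d))
```

Wrapped with residual connections, layer normalization, and a position-wise feed-forward network, this block forms the encoder layer of [1].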

Applications

  • classification: ViT [2], T2T-ViT [9], RVT [12] (see the patch-tokenization sketch after this list)
  • object detection: DETR [3]
  • person reID: [7]
  • general low-level vision (super-resolution, denoising, deraining): [4] [16]
  • unconditional GAN: [10]
  • video understanding: [11]
  • image colorization: [6] [17]
  • style transfer: [15]
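
On the classification entry above: ViT [2] casts an image as a token sequence by cutting it into fixed-size patches ("an image is worth 16x16 words") and linearly projecting each patch. A minimal sketch of that tokenization step, assuming the ViT-Base configuration (16x16 patches, 768-dim tokens); the strided convolution is a standard equivalent of "flatten each patch, then apply a shared Linear layer":

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 16, in_channels: int = 3, d_model: int = 768):
        super().__init__()
        # Stride == kernel size: each patch is projected independently.
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (b, c, h, w) -> (b, d_model, h/p, w/p) -> (b, num_patches, d_model)
        return self.proj(images).flatten(2).transpose(1, 2)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # shape (1, 196, 768)
```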

Combination of transformer and CNN

  • hybrid CNN-Transformer encoder for medical image segmentation: TransUNet [8]

Interpretability

  • beyond attention visualization [5]

Position embedding
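
Self-attention is permutation-invariant, so token order must be injected explicitly. A common baseline is the fixed sinusoidal encoding of [1] (ViT [2] instead learns its position embeddings); a minimal sketch, with the function name being illustrative:

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """PE[p, 2i] = sin(p / 10000^(2i/d_model)); PE[p, 2i+1] = cos(same)."""
    assert d_model % 2 == 0
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                     * (-math.log(10000.0) / d_model))                # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe  # added elementwise to the token embeddings
```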

Lightweight Transformer

  • cross feature attention: [13]
  • efficient deployment: Next-ViT [14]

A summary of efficient Transformers

Survey

References

[1] Vaswani, Ashish, et al. “Attention is all you need.” NeurIPS, 2017.

[2] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[3] Carion, Nicolas, et al. “End-to-End Object Detection with Transformers.” arXiv preprint arXiv:2005.12872 (2020).

[4] Chen, Hanting, et al. “Pre-Trained Image Processing Transformer.” arXiv preprint arXiv:2012.00364 (2020).

[5] Chefer, Hila, Shir Gur, and Lior Wolf. “Transformer Interpretability Beyond Attention Visualization.” arXiv preprint arXiv:2012.09838 (2020).

[6] Kumar, Manoj, Dirk Weissenborn, and Nal Kalchbrenner. “Colorization Transformer.” ICLR, 2021.

[7] He, Shuting, et al. “TransReID: Transformer-based Object Re-Identification.” arXiv preprint arXiv:2102.04378 (2021).

[8] Chen, Jieneng, et al. “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation.” arXiv preprint arXiv:2102.04306 (2021).

[9] Yuan, Li, et al. “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet.” arXiv preprint arXiv:2101.11986 (2021).

[10] Jiang, Yifan, Shiyu Chang, and Zhangyang Wang. “TransGAN: Two Transformers Can Make One Strong GAN.” arXiv preprint arXiv:2102.07074 (2021).

[11] Bertasius, Gedas, Heng Wang, and Lorenzo Torresani. “Is Space-Time Attention All You Need for Video Understanding?” ICML, 2021.

[12] Mao, Xiaofeng, et al. “Towards Robust Vision Transformer.” CVPR, 2022.

[13] Zhao, Youpeng, et al. “Lightweight Vision Transformer with Cross Feature Attention.” arXiv preprint arXiv:2207.07268 (2022).

[14] Li, Jiashi, et al. “Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios.” arXiv preprint arXiv:2207.05501 (2022).

[15] Deng, Yingying, et al. “StyTr2: Image Style Transfer with Transformers.” CVPR, 2022.

[16] Wang, Zhendong, et al. “Uformer: A General U-Shaped Transformer for Image Restoration.” CVPR, 2022.

[17] Weng, Shuchen, et al. “CT2: Colorization Transformer via Color Tokens.” ECCV, 2022.