The Transformer [1] is an architecture built on multi-head self-attention; it can be applied to sequence-to-sequence and sequence-to-label tasks.
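For quick reference, here is a minimal NumPy sketch of the multi-head self-attention operation from [1]. The function name and toy shapes are illustrative, not taken from any paper's code:

```python
import numpy as np

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over `num_heads` heads.

    x: (seq_len, d_model); each w_* matrix is (d_model, d_model).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project inputs to queries/keys/values and split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))

    # Attention weights: softmax(Q K^T / sqrt(d_head)) per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum of values, merge heads, final output projection
    out = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

# Toy usage: 10 tokens, model width 64, 8 heads
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))
w_q, w_k, w_v, w_o = (rng.standard_normal((64, 64)) * 0.1 for _ in range(4))
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=8)
print(y.shape)  # (10, 64)
```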
Applications
- classification: ViT [2], T2T-ViT [9], RVT [12] (a patch-tokenization sketch follows this list)
- object detection: DETR [3]
- person re-identification: TransReID [7]
- general low-level vision (super-resolution, denoising, deraining): IPT [4], Uformer [16]
- unconditional GAN: TransGAN [10]
- video understanding: TimeSformer [11]
- image colorization: Colorization Transformer [6], CT2 [17]
- style transfer: StyTr2 [15]
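ViT [2] turns an image into a token sequence by cutting it into fixed-size patches (16x16 in the title) and flattening each one. Below is a minimal NumPy sketch of that tokenization step; the helper name `patchify` is illustrative, and in the actual model each flattened patch additionally passes through a learned linear projection before position embeddings are added:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image into flattened non-overlapping patches (ViT-style tokens).

    image: (H, W, C) with H and W divisible by patch_size.
    Returns: (num_patches, patch_size * patch_size * C)
    """
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)   # cut rows and columns
    patches = patches.transpose(0, 2, 1, 3, 4)         # group the patch grid first
    return patches.reshape(-1, p * p * c)              # flatten each patch

# A 224x224 RGB image becomes 14 * 14 = 196 tokens of dimension 768
tokens = patchify(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 768)
```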
Combination of transformer and CNN
- TransUNet [8], Next-ViT [14]
Interpretability
- beyond attention visualization [5]
Position embedding
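For reference, the original Transformer [1] uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), while ViT [2] learns its position embeddings instead. A minimal NumPy sketch of the sinusoidal variant (function name is illustrative):

```python
import numpy as np

def sinusoidal_position_embedding(seq_len, d_model):
    """Fixed sinusoidal position encodings from [1]."""
    pos = np.arange(seq_len)[:, None]         # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]      # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dimensions
    pe[:, 1::2] = np.cos(angles)              # odd dimensions
    return pe

pe = sinusoidal_position_embedding(seq_len=196, d_model=768)
print(pe.shape)  # (196, 768)
```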
Lightweight Transformer
- XFormer [13]
A summary of efficient Transformers
Survey
References
[1] Vaswani, Ashish, et al. “Attention is all you need.” NeurIPS, 2017.
[2] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
[3] Carion, Nicolas, et al. “End-to-End Object Detection with Transformers.” arXiv preprint arXiv:2005.12872 (2020).
[4] Chen, Hanting, et al. “Pre-Trained Image Processing Transformer.” arXiv preprint arXiv:2012.00364 (2020).
[5] Chefer, Hila, Shir Gur, and Lior Wolf. “Transformer Interpretability Beyond Attention Visualization.” arXiv preprint arXiv:2012.09838 (2020).
[6] Kumar, Manoj, Dirk Weissenborn, and Nal Kalchbrenner. “Colorization Transformer.” ICLR, 2021.
[7] He, Shuting, et al. “TransReID: Transformer-based Object Re-Identification.” arXiv preprint arXiv:2102.04378 (2021).
[8] Chen, Jieneng, et al. “TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation.” arXiv preprint arXiv:2102.04306 (2021).
[9] Yuan, Li, et al. “Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet.” arXiv preprint arXiv:2101.11986 (2021).
[10] Jiang, Yifan, Shiyu Chang, and Zhangyang Wang. “TransGAN: Two Transformers Can Make One Strong GAN.” arXiv preprint arXiv:2102.07074 (2021).
[11] Bertasius, Gedas, Heng Wang, and Lorenzo Torresani. “Is Space-Time Attention All You Need for Video Understanding?” ICML, 2021.
[12] Mao, Xiaofeng, et al. “Towards Robust Vision Transformer.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[13] Zhao, Youpeng, et al. “Lightweight Vision Transformer with Cross Feature Attention.” arXiv preprint arXiv:2207.07268 (2022).
[14] Li, Jiashi, et al. “Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios.” arXiv preprint arXiv:2207.05501 (2022).
[15] Deng, Yingying, et al. “StyTr2: Image Style Transfer with Transformers.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[16] Wang, Zhendong, et al. “Uformer: A General U-Shaped Transformer for Image Restoration.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[17] Weng, Shuchen, et al. “CT2: Colorization Transformer via Color Tokens.” ECCV, 2022.