CLIP

  • image classification: CLIP [1], learnable prompts (CoOp) [2] (see the zero-shot sketch after this list)

  • video classification: ActionCLIP [3]

  • object detection: ViLD [4], ZSD-YOLO [6]

  • segmentation: zero-shot segmentation baseline [8], CLIPSeg [9]

  • visual grounding: CPT [5]

  • image manipulation: StyleCLIP [7]
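
The zero-shot recipe shared by most of these works is to score an image against a set of natural-language prompts ("a photo of a {class}") and pick the best-matching class. Below is a minimal sketch of this using the openai/CLIP package with ViT-B/32 weights; the image path and class names are placeholders, not taken from any of the papers above.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder labels and image; swap in your own classes and file.
class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Cosine similarity between the image and each prompt, softmaxed into class probabilities.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in zip(class_names, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```

Prompt-learning methods such as [2] keep this pipeline but replace the hand-written prompt template with learned context vectors, while the detection and segmentation works reuse the text encoder to score region or pixel features instead of whole-image features.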

References

[1] Radford, Alec, et al. “Learning Transferable Visual Models From Natural Language Supervision.” arXiv preprint arXiv:2103.00020 (2021).

[2] Zhou, Kaiyang, et al. “Learning to Prompt for Vision-Language Models.” arXiv preprint arXiv:2109.01134 (2021).

[3] Wang, Mengmeng, Jiazheng Xing, and Yong Liu. “ActionCLIP: A New Paradigm for Video Action Recognition.” arXiv preprint arXiv:2109.08472 (2021).

[4] Gu, Xiuye, et al. “Zero-Shot Detection via Vision and Language Knowledge Distillation.” arXiv preprint arXiv:2104.13921 (2021).

[5] Yao, Yuan, et al. “CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models.” arXiv preprint arXiv:2109.11797 (2021).

[6] Xie, Johnathan, and Shuai Zheng. “ZSD-YOLO: Zero-Shot YOLO Detection Using Vision-Language Knowledge Distillation.” arXiv preprint arXiv:2109.12066 (2021).

[7] Patashnik, Or, et al. “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery.” ICCV, 2021.

[8] Xu, Mengde, et al. “A Simple Baseline for Zero-Shot Semantic Segmentation with Pre-trained Vision-Language Model.” arXiv preprint arXiv:2112.14757 (2021).

[9] Lüddecke, Timo, and Alexander Ecker. “Image Segmentation Using Text and Image Prompts.” CVPR, 2022.