Webly supervised image-text retrieval
Mithun et al. [1] present the first work that uses web images and their tags to augment image-sentence pairs for cross-modal retrieval. We tried to reproduce it, but could not make it work at all.
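For reference, a minimal sketch of the general idea behind such a joint embedding: project image and text features into a shared space, train with a bidirectional triplet ranking loss on curated image-sentence pairs, and add a down-weighted loss on the noisy web image-tag pairs. The feature dimensions, the 0.5 weight on the web loss, and the exact loss form are illustrative assumptions, not the precise model of [1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two-branch embedding: projects pre-extracted image and text features
    into a shared space (dimensions are hypothetical)."""
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img, txt

def triplet_ranking_loss(img, txt, margin=0.2):
    """Bidirectional hinge loss over all in-batch negatives."""
    sim = img @ txt.t()                     # cosine similarities (B x B)
    pos = sim.diag().unsqueeze(1)           # matched pairs on the diagonal
    cost_txt = (margin + sim - pos).clamp(min=0)      # image -> text negatives
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text -> image negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# One training step: clean image-sentence loss plus a down-weighted loss
# on noisy web image-tag pairs (the 0.5 weight is an assumption).
model = JointEmbedding()
optim = torch.optim.Adam(model.parameters(), lr=2e-4)

img_s, sent = torch.randn(32, 2048), torch.randn(32, 300)   # curated pairs
img_w, tags = torch.randn(32, 2048), torch.randn(32, 300)   # web pairs

i1, t1 = model(img_s, sent)
i2, t2 = model(img_w, tags)
loss = triplet_ranking_loss(i1, t1) + 0.5 * triplet_ranking_loss(i2, t2)
loss.backward()
optim.step()
```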
The text associated with a web image generally consists of tags, a title, and a description.
Tags are very noisy but still acceptable for webly supervised image classification. Titles and descriptions are even noisier: only a few descriptions are complete sentences that match the corresponding images.
The Conceptual Captions dataset [2] was built by crawling web images together with their alt text and running an automatic pipeline that extracts, filters, and transforms candidate image-caption pairs, yielding relatively clean image-text pairs. This large corpus of web image-text pairs can be used to pretrain image-text retrieval or image captioning models.
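As a rough illustration of what such filtering can look like, here is a toy heuristic filter for candidate alt-text captions. The thresholds and rules are our own assumptions, not the actual Conceptual Captions pipeline (which additionally transforms captions, e.g. by hypernymizing named entities).

```python
import re

def keep_alt_text(caption, min_words=3, max_words=30):
    """Very rough filter for candidate alt-text captions; thresholds and
    rules are illustrative assumptions, not the pipeline of [2]."""
    words = caption.strip().split()
    if not (min_words <= len(words) <= max_words):
        return False
    if re.search(r"(https?://|www\.|\.jpg|\.png)", caption, re.I):
        return False                      # drop URLs / file names
    alpha = sum(w.isalpha() for w in words) / len(words)
    if alpha < 0.8:
        return False                      # mostly non-word tokens
    if caption.isupper():
        return False                      # all-caps boilerplate
    return True

candidates = [
    "A brown dog runs across a snowy field.",
    "IMG_20170401.jpg",
    "Musicians perform on stage during the festival.",
]
print([c for c in candidates if keep_alt_text(c)])  # keeps the two sentences
```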
Image-text (Chinese) Datasets
- AI Challenger 2017: training set (code: asix), validation set (code: dcnn)
- f30k-cn (Flickr30K-CN)
References
[1] Niluthpol Chowdhury Mithun, Rameswar Panda, Evangelos E. Papalexakis, Amit K. Roy-Chowdhury. Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval. ACM MM, 2018.
[2] Piyush Sharma, Nan Ding, Sebastian Goodman, Radu Soricut. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning. ACL, 2018.