Image-text Retrieval

Webly supervised image-text retrieval

Mithun et al. [1] were the first to use web images and their tags to augment image-sentence pairs for cross-modal retrieval. We tried to reproduce their results but were unable to get the method to work.
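For context, the core of such joint-embedding retrieval models is a two-branch network that projects images and text into a shared space and is trained with a triplet ranking loss. The sketch below is a minimal illustration of that idea, not the authors' code; the encoder dimensions, the margin, and the use of in-batch negatives are all our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two-branch joint embedding: project image and text features
    into a shared space and compare them by cosine similarity."""
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives."""
    sim = v @ t.t()                  # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)    # matched pairs lie on the diagonal
    # cost of ranking any mismatched text/image above the true match
    cost_t = (margin + sim - pos).clamp(min=0)
    cost_v = (margin + sim - pos.t()).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_t.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()
```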

The text associated with a web image generally consists of tags, a title, and a description.
The tags are very noisy but still acceptable for webly supervised image classification; the titles and descriptions are noisier still. Only a few descriptions are complete sentences that actually match the corresponding images.
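To make this noise concrete, a crude sentence-likeness filter of the kind one might apply to web descriptions could look as follows; the thresholds and rules are purely illustrative assumptions, not taken from any of the cited papers.

```python
def looks_like_sentence(text, min_words=5):
    """Crude heuristic for whether a web description reads like a
    complete sentence; thresholds are illustrative, not tuned."""
    text = text.strip()
    words = text.split()
    if len(words) < min_words:
        return False
    # complete sentences tend to start capitalized and end with punctuation
    if not text[0].isupper() or text[-1] not in ".!?":
        return False
    # reject tag-like strings that are mostly comma-separated tokens
    if text.count(",") >= len(words) - 1:
        return False
    return True

print(looks_like_sentence("sunset, beach, sky, summer, travel"))  # False
print(looks_like_sentence("A dog runs along the beach."))         # True
```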

The Conceptual Captions dataset [2] was built by crawling web images together with their alt-text and running an automatic pipeline that extracts, filters, and transforms candidate image-caption pairs, yielding relatively clean image-text pairs. This large corpus of web image-text pairs can be used to pretrain image-text retrieval or image captioning models.
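As a rough sketch of that extract-filter-transform idea, a toy version might look like the function below. The rules here are stand-ins of our own; the actual pipeline in [2] is far more involved (among other steps, it hypernymizes proper names, as the paper title indicates).

```python
def clean_alt_text(alt_text, min_tokens=3, max_tokens=30):
    """Toy extract-filter-transform step in the spirit of [2];
    rules and thresholds are illustrative assumptions only."""
    tokens = alt_text.strip().split()
    # filter: drop candidates that are too short or too long
    if not (min_tokens <= len(tokens) <= max_tokens):
        return None
    # filter: drop candidates with no alphabetic content
    if not any(tok.isalpha() for tok in tokens):
        return None
    # transform: lightweight normalization (a stand-in for the
    # hypernymization and other text transforms described in [2])
    return alt_text.strip().lower().rstrip(".")

print(clean_alt_text("Golden retriever playing in the park."))
# -> "golden retriever playing in the park"
print(clean_alt_text("IMG_1234"))  # -> None (filtered out)
```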

Image-text (Chinese) Datasets

References

[1] Mithun, Niluthpol Chowdhury, et al. “Webly supervised joint embedding for cross-modal image-text retrieval.” ACM MM, 2018.

[2] Sharma, Piyush, et al. “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.” ACL, 2018.