• interaction and active reasoning: [2] (Sokoban, Maze, and Taxi)

  • performs both policy and world model learning: [3]

Survey: [1]

Reference

[1] Zhao, Changyuan, et al. “Edge general intelligence through world models and agentic AI: Fundamentals, solutions, and challenges.” arXiv preprint arXiv:2508.09561 (2025).
[2] Shu, Bao, et al. “Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction.” arXiv preprint arXiv:2511.23476 (2025).
[3] Yu, Xiao, et al. “Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents.” arXiv preprint arXiv:2506.00320 (2025).

Survey: [1]

Reference

[1] Tong, Jingqi, et al. “Thinking with video: Video generation as a promising multimodal reasoning paradigm.” arXiv preprint arXiv:2511.04570 (2025).

Survey: [1]

Reference

[1] Su, Zhaochen, et al. “Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.” arXiv preprint arXiv:2506.23918 (2025).

We use the Dawid-Skene vote aggregation algorithm to obtain the ground-truth label for each snippet, since it is often considered the ‘gold standard’ for aggregation in practice. Dawid-Skene is an unsupervised inference algorithm that computes the maximum likelihood estimate of observer error rates using the EM algorithm.

1) Using the labels given by multiple annotators, estimate the most likely “correct” label for each video snippet.

2) Based on the estimated correct label for each snippet, compute the error rates for each annotator.

3) Taking into consideration the error rates for each annotator, recompute the most likely “correct” label for each snippet.

4) Repeat steps 2 and 3 until one of the termination criteria is met (error rates are below a pre-specified threshold or a pre-specified number of iterations is completed).
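The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `dawid_skene`, the input layout (a dict mapping each snippet to a list of `(annotator, label)` pairs), the small `smoothing` constant, and the convergence test on the change in label posteriors are all assumptions made for this example.

```python
def dawid_skene(votes, n_classes, max_iter=100, tol=1e-6, smoothing=0.01):
    """votes: dict snippet -> list of (annotator, label), labels in range(n_classes).

    Returns (labels, confusion), where confusion[a][k][l] estimates the
    probability that annotator a reports label l when the true label is k.
    """
    items = list(votes)
    annotators = sorted({a for pairs in votes.values() for a, _ in pairs})

    # Step 1: initialise label posteriors from per-snippet vote fractions.
    post = {}
    for i in items:
        counts = [0.0] * n_classes
        for _, label in votes[i]:
            counts[label] += 1.0
        total = sum(counts)
        post[i] = [c / total for c in counts]

    confusion = {}
    for _ in range(max_iter):
        # Step 2 (M-step): class priors and per-annotator error rates.
        prior = [sum(post[i][k] for i in items) / len(items)
                 for k in range(n_classes)]
        confusion = {a: [[smoothing] * n_classes for _ in range(n_classes)]
                     for a in annotators}
        for i in items:
            for a, label in votes[i]:
                for k in range(n_classes):
                    confusion[a][k][label] += post[i][k]
        for a in annotators:
            for k in range(n_classes):
                row_sum = sum(confusion[a][k])
                confusion[a][k] = [x / row_sum for x in confusion[a][k]]

        # Step 3 (E-step): recompute label posteriors given the error rates.
        new_post = {}
        for i in items:
            weights = list(prior)
            for a, label in votes[i]:
                for k in range(n_classes):
                    weights[k] *= confusion[a][k][label]
            z = sum(weights)
            new_post[i] = [w / z for w in weights]

        # Step 4: stop when the posteriors no longer change (assumed criterion).
        delta = max(abs(new_post[i][k] - post[i][k])
                    for i in items for k in range(n_classes))
        post = new_post
        if delta < tol:
            break

    labels = {i: max(range(n_classes), key=lambda k: post[i][k]) for i in items}
    return labels, confusion


# Toy example: annotators A and B are reliable, C always answers 0.
toy_votes = {
    "s1": [("A", 0), ("B", 0), ("C", 0)],
    "s2": [("A", 1), ("B", 1), ("C", 0)],
    "s3": [("A", 0), ("B", 0), ("C", 0)],
    "s4": [("A", 1), ("B", 1), ("C", 0)],
}
toy_labels, toy_confusion = dawid_skene(toy_votes, n_classes=2)
```

With many annotators per snippet, the product in the E-step can underflow, so a practical version would likely work with log-probabilities.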

Vector Quantization: VQVAE [1], VQVAE2 [2], VQGAN [6].

Residual Quantization: RQVAE [3]

Accelerated autoregressive generation: [4] [5]

Hierarchical residual quantization: VAR [7]
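To make the difference between plain vector quantization and residual quantization concrete, here is a small plain-Python sketch. It is a simplified illustration and not the implementation from any of the cited papers: the function names, the toy codebook, and the choice of one codebook shared across all depths are assumptions for this example.

```python
def nearest_code(vec, codebook):
    """Vector quantization: index of the closest codebook entry (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def residual_quantize(vec, codebook, depth):
    """Residual quantization: quantize, subtract, and quantize the residual again.

    Returns the list of chosen code indices and the running approximation,
    which is the sum of the selected codebook entries.
    """
    residual = list(vec)
    approx = [0.0] * len(vec)
    codes = []
    for _ in range(depth):
        k = nearest_code(residual, codebook)
        codes.append(k)
        approx = [a + c for a, c in zip(approx, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, approx


def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical 2-D codebook with four entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [1.4, 0.6]

codes_1, approx_1 = residual_quantize(x, codebook, depth=1)  # plain VQ
codes_2, approx_2 = residual_quantize(x, codebook, depth=2)  # two-level RQ
```

With `depth=1` the function reduces to ordinary vector quantization; each extra level refines the approximation with another code index instead of enlarging the codebook.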

References

[1] Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. “Neural discrete representation learning.” arXiv preprint arXiv:1711.00937 (2017).

[2] Razavi, Ali, Aaron van den Oord, and Oriol Vinyals. “Generating diverse high-fidelity images with vq-vae-2.” Advances in neural information processing systems. 2019.

[3] Lee, Doyup, et al. “Autoregressive Image Generation using Residual Quantization.” arXiv preprint arXiv:2203.01941 (2022).

[4] Bond-Taylor, Sam, et al. “Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes.” arXiv preprint arXiv:2111.12701 (2021).

[5] Chang, Huiwen, et al. “MaskGIT: Masked Generative Image Transformer.” arXiv preprint arXiv:2202.04200 (2022).

[6] Esser, Patrick, Robin Rombach, and Björn Ommer. “Taming Transformers for High-Resolution Image Synthesis.” CVPR, 2021.

[7] Tian, Keyu, et al. “Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction.” arXiv preprint arXiv:2404.02905 (2024).

  1. learn attribute vectors based on the relations and differences between categories (each dimension is uninterpretable): [1] (Laplacian matrix), [2] (triplet loss)

  2. exploit local information and encode it into the attribute vector (each dimension is interpretable): [3] (discriminative cluster, doublets), [4] (joint attribute learning and feature learning)

  3. learn attention map for each latent attribute [5]

Reference

  1. Yu, Felix X., et al. “Designing category-level attributes for discriminative visual recognition.” CVPR, 2013.

  2. Li, Yan, et al. “Discriminative learning of latent features for zero-shot recognition.” CVPR, 2018.

  3. Singh, Saurabh, Abhinav Gupta, and Alexei A. Efros. “Unsupervised discovery of mid-level discriminative patches.” ECCV, 2012.

  4. Huang, Chen, Chen Change Loy, and Xiaoou Tang. “Unsupervised learning of discriminative attributes and visual representations.” CVPR, 2016.

  5. Yang, Wenjie, et al. “Towards rich feature discovery with class activation maps augmentation for person re-identification.” CVPR, 2019.

Let us use $S$ to denote the set of training categories and $T$ to denote the set of testing categories.

  • $S=T$: the most common case
  • $S\cap T=\emptyset$: zero-shot learning
  • $S\subset T$: generalized zero-shot learning
  • $S\supset T$: pretrained model
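The four cases above are plain set relations, so they can be checked mechanically. The sketch below is only an illustration; the function name `learning_setting` and the returned strings are made up for this example, and partially overlapping sets, which the list does not cover, are reported as "other".

```python
def learning_setting(train_categories, test_categories):
    """Name the setting implied by training categories S and testing categories T."""
    S, T = set(train_categories), set(test_categories)
    if S == T:
        return "standard"                # S = T
    if S.isdisjoint(T):
        return "zero-shot"               # S and T share no category
    if S < T:
        return "generalized zero-shot"   # S is a proper subset of T
    if S > T:
        return "pretrained model"        # S is a proper superset of T
    return "other"                       # partial overlap, not in the list above


setting = learning_setting({"cat", "dog"}, {"cat", "dog", "zebra"})
```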

  1. Blend text and background images.

    • text image (font, color, border, blending): [1]
    • scene-text image (font, color, border, blending, geometry): [2] [3]

Reference:

  1. Jaderberg, Max, et al. “Synthetic data and artificial neural networks for natural scene text recognition.” arXiv preprint arXiv:1406.2227 (2014).

  2. Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman. “Synthetic data for text localisation in natural images.” CVPR, 2016.

  3. Zhan, Fangneng, Shijian Lu, and Chuhui Xue. “Verisimilar image synthesis for accurate detection and recognition of texts in scenes.” ECCV, 2018.

As mentioned in [1], one major concern of subjective annotation is that the annotations provided by different workers for each image may not be reliable, which calls for consistency analysis on the annotations. We use Spearman’s rank correlation ρ between pairs of workers to measure consistency and estimate p-values to evaluate the statistical significance of the correlation relative to a null hypothesis of uncorrelated responses. We use the Benjamini-Hochberg procedure to control the false discovery rate (FDR) for multiple comparisons [2]. At an FDR level of 0.05, we find that 98.45% of batches have significant agreement among raters. Further consistency analysis of the dataset can be found in the supplementary material of [1].
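The two ingredients of this analysis can be sketched in plain Python. This is an illustrative sketch, not the code used in [1]: the function names and the example numbers are made up, and the p-values are assumed to be computed elsewhere (for instance, `scipy.stats.spearmanr` returns a p-value together with the correlation).

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up scores from two workers on the same five images.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Made-up p-values, one per pair of workers.
example_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(example_pvalues, alpha=0.05)
```

In this sketch, the fraction of `True` entries in `significant` plays the role of the share of comparisons with significant agreement at the chosen FDR level.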

[1] Kong, Shu, et al. “Photo aesthetics ranking network with attributes and content adaptation.” European Conference on Computer Vision. Springer, Cham, 2016.

[2] Benjamini, Yoav, and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency.” Annals of statistics (2001): 1165-1188.
