GUI Agent

Posted on 2026-03-17 Edited on 2026-06-29 In paper note

Research Directions

Safety

[1] Chen, Baicheng, et al. “AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents.” CVPR, 2026. [pdf]

[2] Yan, Zihe, et al. “LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents.” CVPR, 2026. [pdf] [code]

Inference Efficiency

[1] Zhou, Xurui, et al. “HiconAgent: History Context-Aware Policy Optimization for GUI Agents.” CVPR, 2026. [pdf] [code] (dynamic history length)

[2] Mehrotra, Sarthak, et al. “iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception.” CVPR, 2026. [pdf]

Long Horizon

[1] Deng, Zehao, et al. “Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation.” CVPR, 2026. [pdf] [code]

[2] Kang, Bin, et al. “LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent.” ICLR, 2026. [pdf]

[3] Zeng, Ziyun, et al. “MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents.” arXiv preprint arXiv:2605.18652 (2026). [pdf]

[4] Zhou, Bowen, et al. “Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression.” arXiv preprint arXiv:2603.00188 (2026). [pdf]

[5] Wang, Jihong, et al. “ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution.” ACL Industry Track, 2026. [pdf]

[6] Lu, Zhengxi, et al. “UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization.” arXiv preprint arXiv:2604.13822 (2026). [pdf]

Extra Guidance

[1] Xie, Rui, et al. “GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation.” arXiv preprint arXiv:2603.26266 (2026). [pdf] [code]

[2] Liu, Jingjing, et al. “DocOS: Towards Proactive Document-Guided Actions in GUI Agents.” ICML, 2026. [pdf] [code]

[3] Einsia. “Scalable Behaviour Cloning on Browser Using via Skill Distillation.” [pdf] [code]

New Domain

[1] Li, Yang, et al. “GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents.” CVPR, 2026. [pdf]

[2] Chen, Yuxi, et al. “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training.” ICML, 2026. [pdf] [code]

[3] Liu, Ziwei, et al. “Continual GUI Agents.” ICML, 2026. [pdf] [code]

Training Data Synthesis

[1] Zhang, Bofei, et al. “TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents.” AAAI, 2026. [pdf] [code]

[2] Shao, Rui, et al. “HATS: Hardness-Aware Trajectory Synthesis for GUI Agents.” CVPR, 2026. [pdf] [code]

[3] Lv, Rui, et al. “M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining.” ICLR, 2026. [pdf]

[4] Xiong, Weimin, et al. “Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining.” ICML, 2026. [pdf] [code]

Optimization

[1] Xu, Yifan, et al. “MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents.” arXiv preprint arXiv:2509.18119 (2025). [pdf] [code]

World Model

[1] Guan, Yiming, et al. “Computer-Using World Model.” arXiv preprint arXiv:2602.17365 (2026). [pdf]

[2] Luo, Dezhao, et al. “ViMo: A Generative Visual GUI World Model for App Agents.” arXiv preprint arXiv:2504.13936 (2025). [pdf] [code]

[3] Cao, Yilin, et al. “MobileDreamer: Generative Sketch World Model for GUI Agent.” arXiv preprint arXiv:2601.04035 (2026). [pdf]

Challenging Benchmark

[1] Gong, Yichen, et al. “VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics.” ICML, 2026. [pdf] [code]