Newly Blog

Cut and Paste

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Do segmentation, image enhancement, and inpainting simultaneously [1]
Learning to Segment via Cut-and-Paste [2]

Reference

[1] Ostyakov, Pavel, et al. “SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint.” arXiv preprint arXiv:1811.07630 (2018).

[2] Remez, Tal, Jonathan Huang, and Matthew Brown. “Learning to segment via cut-and-paste.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

Consistent Video Editing

Posted on 2026-03-17 Edited on 2022-09-20 In paper note

Template based: [1] [2]

References

[1] Kasten, Yoni, et al. “Layered neural atlases for consistent video editing.” ACM Transactions on Graphics (TOG) 40.6 (2021): 1-12.
[2] Ye, Vickie, et al. “Deformable Sprites for Unsupervised Video Decomposition.” CVPR, 2022.

Conditional GAN

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Conditioned on label vector: conditional GAN [4], CVAE-GAN [6]
Conditioned on a single image
- pix2pix [1]; high-resolution pix2pix [2] (add coarse-to-fine strategy); BicycleGAN [3] (combination of cVAE-GAN and cLR-GAN)
- DAGAN [5]

Reference

[1] Isola, Phillip, et al. “Image-to-image translation with conditional adversarial networks.” CVPR, 2017

[2] Wang, Ting-Chun, et al. “High-resolution image synthesis and semantic manipulation with conditional gans.” CVPR, 2018.

[3] Zhu, Jun-Yan, et al. “Toward multimodal image-to-image translation.” NIPS, 2017.

[4] Mirza, Mehdi, and Simon Osindero. “Conditional generative adversarial nets.” arXiv preprint arXiv:1411.1784 (2014).

[5] Antoniou, Antreas, Amos Storkey, and Harrison Edwards. “Data augmentation generative adversarial networks.” arXiv preprint arXiv:1711.04340 (2017).

[6] Bao, Jianmin, et al. “CVAE-GAN: fine-grained image generation through asymmetric training.” ICCV, 2017.

GPU Cuda and CuDNN

Posted on 2026-03-17 Edited on 2024-07-01 In hardware , GPU

GPU

Look up GPU information: lspci or lshw -C display
NVIDIA System Management Interface (nvidia-smi): monitors GPU usage and reports the GPU driver version and the CUDA user-mode version

GPU Driver

Check the latest driver information on http://www.nvidia.com/Download/index.aspx. Then, look up the driver information on the local machine: cat /proc/driver/nvidia/version
Check the compatibility between the CUDA runtime version and the driver version: https://docs.nvidia.com/deploy/cuda-compatibility/
Install NVIDIA GPU driver using GUI: Software & Updates -> Additional Drivers
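
The driver lookup and compatibility check above can be scripted. The following is a minimal sketch, assuming the first dotted number in /proc/driver/nvidia/version is the driver version and that a sort with -V (version sort, as in GNU coreutils) is available. The function names are hypothetical, and the minimum version passed in should come from the compatibility table, not from this sketch.

```shell
#!/bin/sh
# Sketch: compare the installed NVIDIA driver version with a required minimum.
# All function names here are illustrative assumptions.

# Print the first dotted version number found in the given text.
parse_driver_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed if version $1 >= version $2 (relies on "sort -V").
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the local driver meets the minimum version given as $1.
check_driver() {
    required="$1"
    if [ ! -r /proc/driver/nvidia/version ]; then
        echo "NVIDIA driver not loaded"
        return 1
    fi
    installed="$(parse_driver_version "$(head -n 1 /proc/driver/nvidia/version)")"
    if version_ge "$installed" "$required"; then
        echo "driver $installed satisfies >= $required"
    else
        echo "driver $installed is older than $required"
        return 1
    fi
}
```

For example, check_driver 410.48 reports whether the loaded driver is at least version 410.48.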

Install NVIDIA GPU driver using apt-get

sudo add-apt-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install nvidia-current nvidia-current-modaliases nvidia-settings

Install NVIDIA GPU driver using *.run file downloaded from http://www.nvidia.com/Download/index.aspx
1. Hit CTRL+ALT+F1 and log in using your credentials.
2. Stop your current X server session by typing sudo service lightdm stop
3. Enter runlevel 3 by typing sudo init 3 and install your *.run file.
4. You might be required to reboot when the installation finishes. If not, run sudo service lightdm start or sudo start lightdm to start your X server again.

CUDA

When a deep learning framework is installed through Anaconda, the required CUDA libraries often come with it, so installing CUDA yourself is sometimes unnecessary.

Preprocessing
- Uninstall the GPU driver first: sudo /usr/bin/nvidia-uninstall or sudo apt-get remove --purge 'nvidia*' and sudo apt-get autoremove; sudo reboot
- Blacklist nouveau: add “blacklist nouveau” and “options nouveau modeset=0” at the end of /etc/modprobe.d/blacklist.conf; sudo update-initramfs -u; sudo reboot
- Stop your current X server session: sudo service lightdm stop
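
The nouveau blacklisting step can be made repeatable so that re-running it does not duplicate lines. A minimal sketch follows; append_once and blacklist_nouveau are hypothetical helper names, and the target file is passed in as an argument so the functions can be tried on a scratch file first.

```shell
#!/bin/sh
# Sketch: append a line to a config file only if it is not already present.
# append_once and blacklist_nouveau are hypothetical helpers.
append_once() {
    file="$1"
    line="$2"
    # -x: match the whole line, -F: fixed string, -q: quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Add both nouveau settings to the file given as $1.
blacklist_nouveau() {
    append_once "$1" "blacklist nouveau"
    append_once "$1" "options nouveau modeset=0"
}
```

In a root shell, blacklist_nouveau /etc/modprobe.d/blacklist.conf would apply the two settings; sudo update-initramfs -u and a reboot are still needed afterwards.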
Install CUDA

Download the *.run file from NVIDIA website
- The latest version: https://developer.nvidia.com/cuda-downloads
- All versions: https://developer.nvidia.com/cuda-toolkit-archive
  Run the installer:
  sudo sh cuda_10.0.130_410.48_linux.run
  and then add CUDA to PATH and LD_LIBRARY_PATH:
  echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
  echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
  source ~/.bashrc
Check the CUDA version after installation: nvcc -V. Then compile and run the CUDA samples.
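
The version check can also be scripted. A minimal sketch, assuming nvcc -V prints a line of the form "Cuda compilation tools, release X.Y, ..."; parse_nvcc_release and report_cuda_release are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: extract the toolkit release (e.g. "10.0") from "nvcc -V" output.
# parse_nvcc_release is a hypothetical helper; it reads the text on stdin.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the installed release, or a hint if nvcc is not on PATH yet.
report_cuda_release() {
    if command -v nvcc >/dev/null 2>&1; then
        nvcc -V | parse_nvcc_release
    else
        echo "nvcc not found; check that /usr/local/cuda/bin is on PATH"
    fi
}
```

Calling report_cuda_release in a new shell is a quick way to confirm that the PATH change took effect.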

CuDNN

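cuDNN is a library that accelerates deep learning primitives on top of CUDA. Download the compressed package from https://developer.nvidia.com/rdp/form/cudnn-download-survey, extract it, and copy the files into the CUDA directory: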

cd $CUDNN_PATH	
sudo cp include/* /usr/local/cuda/include/
sudo cp -P lib64/* /usr/local/cuda/lib64/ #use -P to retain symbolic links
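
After copying, the installed cuDNN version can be read back from the header. A minimal sketch, assuming the header defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL on separate #define lines; cudnn_version is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read the cuDNN version from the #define lines of a header file.
# cudnn_version is a hypothetical helper; pass the header path as $1.
cudnn_version() {
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}
```

For example, cudnn_version /usr/local/cuda/include/cudnn.h prints the version for older releases. From cuDNN 8 onwards these macros are in cudnn_version.h, so that file should be passed instead.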

Illumination Model

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Dichromatic Reflection Model [1] [2]: $I_i=\gamma_b C_i L_i R_{b,i}+\gamma_s C_i L_i R_{s,i}$, in which $i$ is the pixel index, $L$ is the global illumination, and $C_i$ is the sensor sensitivity. The chromatic terms $R_b$ and $R_s$ account for body and surface reflection and depend only on the object material.
Gray pixels: pixels with equal RGB values. Detecting gray pixels in a color-biased image is not easy. [3]
Albedo, shading, gloss [4] [5]

Reference

[1] Shafer, Steven A. “Using color to separate reflection components.” Color Research & Application 10.4 (1985): 210-218.
[2] Song, Shuangbing, et al. “Illumination Harmonization with Gray Mean Scale.” Computer Graphics International Conference. Springer, Cham, 2020.
[3] Qian, Yanlin, et al. “On finding gray pixels.” CVPR, 2019.
[4] Bhattad, Anand, and David A. Forsyth. “Cut-and-Paste Neural Rendering.” arXiv preprint arXiv:2010.05907 (2020).
[5] Yu, Ye, and William AP Smith. “InverseRenderNet: Learning single image inverse rendering.” CVPR, 2019.

Camera Survey

Posted on 2026-03-17 Edited on 2022-04-08 In hardware , camera

Interface Type:

GigE and USB interfaces are commonly used. The advantage of GigE is long-distance transmission.

Color vs. Monochrome

When the exposure begins, each photosite is uncovered to collect incoming light. When the exposure ends, the charge collected at each photosite is read as an electrical signal, which is then quantized and stored as a numerical value in an image file.

Unlike color sensors, monochrome sensors capture all incoming light at each pixel regardless of color.
Monochrome sensors also do not require demosaicing to create the final image, because the value recorded at each photosite directly becomes the value of the corresponding pixel. As a result, monochrome sensors are able to achieve a slightly higher resolution.

Sensor Type:

CCD (Charge-Coupled Device): a special manufacturing process allows the conversion to take place in the chip without distortion, which makes CCDs more expensive. A CCD can capture high-quality images with low noise and is sensitive to light.

CMOS (Complementary Metal Oxide Semiconductor): uses transistors at each pixel to move the charge through traditional wires. CMOS sensors are made with traditional manufacturing processes, the same as those used for microchips. CMOS is cheaper and has low power consumption.

Readout Method:

Global vs. rolling shutter: originally, CCD used global shutter while CMOS used rolling shutter. A rolling shutter is always active and rolls through the pixels line by line from top to bottom. In contrast, a global shutter stores the electrical charges of all pixels and reads them out when the shutter is closed and the pixels are reset for the next exposure, so the entire sensor area is output simultaneously. Nowadays, CMOS sensors can also have global shutter capabilities.

Advantage of global shutter: because the whole scene is exposed at one moment in time, a global shutter handles motion and pulsed light well, and the light or motion can be synchronized to the open-shutter phase. However, a rolling shutter can also handle motion and pulsed light to an extent through a combination of fast shutter speeds and timing of the light source.

Quantum Efficiency

The ability of a pixel to convert an incident photon to charge is specified by its quantum efficiency. For example, if for ten incident photons, four photo-electrons are produced, then the quantum efficiency is 40%. Typical values of quantum efficiency are in the range of 30 - 60%. The quantum efficiency depends on wavelength and is not necessarily uniform over the response to light intensity.
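
The example above can be written out as a formula. The symbols $N_e$ (number of photo-electrons produced) and $N_p$ (number of incident photons) are introduced here only for illustration:

$$\mathrm{QE} = \frac{N_e}{N_p} = \frac{4}{10} = 40\%$$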

Field of View

FOV (Field of View) depends on the lens and the sensor size. Generally, for the same lens, larger sensors yield greater FOV.
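
One common way to make this dependence concrete is the angular field of view of an ideal rectilinear lens focused at infinity. This is a sketch under that assumption, and the symbols $d$ (sensor width, height, or diagonal) and $f$ (focal length) are introduced here for illustration:

$$\mathrm{FOV} = 2\arctan\left(\frac{d}{2f}\right)$$

For instance, with $d = 36$ mm and $f = 50$ mm this gives $2\arctan(0.36) \approx 39.6^\circ$. A larger $d$ at the same $f$ gives a wider angle.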

Pixel Size

A small pixel size is desirable because it results in a smaller die size and/or higher spatial resolution; a large pixel size is desirable because it results in higher dynamic range and signal-to-noise ratio.

GUI Agent

Posted on 2026-03-17 Edited on 2026-06-29 In paper note

Research Directions

Safety

[1] Chen, Baicheng, et al. “AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents.” CVPR, 2026. [pdf]

[2] Yan, Zihe, et al. “Lasm: Layer-wise scaling mechanism for defending pop-up attack on gui agents.” CVPR, 2026. [pdf] [code]

Inference Efficiency

[1] Zhou, Xurui, et al. “Hiconagent: History context-aware policy optimization for gui agents.” CVPR, 2026. [pdf] [code] (dynamic history length)

[2] Mehrotra, Sarthak, et al. “ishift: Lightweight slow-fast gui agent with adaptive perception.” CVPR, 2026. [pdf]

Long Horizon

[1] Deng, Zehao, et al. “Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation.” CVPR, 2026. [pdf] [code]

[2] Kang, Bin, et al. “LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent.” ICLR, 2026. [pdf]

[3] Zeng, Ziyun, et al. “MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents.” arXiv preprint arXiv:2605.18652 (2026). [pdf]

[4] Zhou, Bowen, et al. “Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression.” arXiv preprint arXiv:2603.00188 (2026). [pdf]

[5] Wang, Jihong, et al. “ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution.” ACL Industry Track. 2026. [pdf]

[6] Lu, Zhengxi, et al. “UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization.” arXiv preprint arXiv:2604.13822 (2026). [pdf]

Extra Guidance

[1] Xie, Rui, et al. “GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation.” arXiv preprint arXiv:2603.26266 (2026). [pdf] [code]

[2] Liu, Jingjing, et al. “DocOS: Towards Proactive Document-Guided Actions in GUI Agents.” ICML, 2026. [pdf] [code]

[3] Einsia. “Scalable Behaviour Cloning on Browser Using via Skill Distillation.” [pdf] [code]

New Domain

[1] Li, Yang, et al. “Gui-ceval: A hierarchical and comprehensive chinese benchmark for mobile gui agents.” CVPR, 2026. [pdf]

[2] Chen, Yuxi, et al. “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training.” ICML, 2026. [pdf] [code]

[3] Liu, Ziwei, et al. “Continual GUI Agents.” ICML, 2026. [pdf] [code]

Training Data Synthesis

[1] Zhang, Bofei, et al. “Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents.” AAAI, 2026. [pdf] [code]

[2] Shao, Rui, et al. “Hats: Hardness-aware trajectory synthesis for gui agents.” CVPR, 2026. [pdf] [code]

[3] Lv, Rui, et al. “M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining.” ICLR, 2026. [pdf]

[4] Xiong, Weimin, et al. “Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining.” ICML, 2026. [pdf] [code]

Optimization

[1] Xu, Yifan, et al. “Mobilerl: Online agentic reinforcement learning for mobile gui agents.” arXiv preprint arXiv:2509.18119 (2025). [pdf] [code]

World Model

[1] Guan, Yiming, et al. “Computer-using world model.” arXiv preprint arXiv:2602.17365 (2026). [pdf]

[2] Luo, Dezhao, et al. “Vimo: A generative visual gui world model for app agents.” arXiv preprint arXiv:2504.13936 (2025). [pdf] [code]

[3] Cao, Yilin, et al. “MobileDreamer: Generative Sketch World Model for GUI Agent.” arXiv preprint arXiv:2601.04035 (2026). [pdf]

Challenging Benchmark

[1] Gong, Yichen, et al. “VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics.” ICML, 2026. [pdf] [code]

Survey

Zoom in

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

(1) Zoom in a bounding box [1] [2]

(2) Zoom in salient region [3] [4]

Relation to (1): if the salient region is a rectangle and its salience value tends to infinity, this is equivalent to zooming in on a bounding box.
Relation to pooling: weighted pooling with the salience map as the weight map.
Relation to deformable CNN: the salience map is used to calculate the offset for each position.

Reference

[1] Fu, Jianlong, Heliang Zheng, and Tao Mei. “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition.” CVPR, 2017.

[2] Zheng, Heliang, et al. “Learning multi-attention convolutional neural network for fine-grained image recognition.” ICCV, 2017.

[3] Recasens, Adria, et al. “Learning to zoom: a saliency-based sampling layer for neural networks.” ECCV, 2018.

[4] Zheng, Heliang, et al. “Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-grained Image Recognition.” arXiv preprint arXiv:1903.06150 (2019).

Word Vector

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Survey

For a brief survey summarizing skip-gram, CBOW, GloVe, etc, please refer to this.

Code

word2vec: TensorFlow

GloVe: C, TensorFlow

WikiCorpus

Download the WikiCorpus and process it with the shell script (e.g., remove numbers, invalid characters, and URLs), yielding a sequence of plain words.

Resources

English word vectors: https://github.com/3Top/word2vec-api
Non-English word vectors: https://github.com/Kyubyong/wordvectors

Visual Object Tracking

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Problem

Tracking is challenging due to the following factors: deformation, illumination variation, blur and fast motion, background clutter, rotation, scale, and boundary effect.

History

Tracking methods can be roughly categorized into generative methods and discriminative methods (feature + machine learning). Recently, correlation filter based methods and deep learning methods have become dominant.

Meanshift: density based, ASMS https://github.com/vojirt/asms
Particle filter: particle based statistical method
Optical flow: match feature points between neighboring frames
Correlation filter: KCF, DCF, CSK, CN, DSST, SRDCF, ECO. Basic CF methods are sensitive to deformation, fast motion, and boundary effect.
Deep learning: GOTURN, MDNet, TCNN, SiamFC

Two research groups have contributed the most to CF methods:

Oxford: https://www.robots.ox.ac.uk/~luca/
Linkoping: http://users.isy.liu.se/en/cvl/marda26/

Comparison of Speed and Performance

Survey papers

Object tracking: A survey, 2006
Object tracking benchmark, 2015

Benchmark

OTB50/100: http://cvlab.hanyang.ac.kr/tracker_benchmark/
VOT2016: http://www.votchallenge.net/vot2016/dataset.html

Challenge

Visual Object Tracking (VOT) challenge:
http://www.votchallenge.net/challenges.html
VOT2016 has released the code of many trackers: http://votchallenge.net/vot2016/trackers.html
Multiple Object Tracking (MOT) challenge:
https://motchallenge.net/

Detection based Tracking

Detection based tracking is also known as tracking by detection or multiple object tracking (MOT Challenge).

TLD (tracking-learning-detection): update tracker and detector during learning
http://personal.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html