Newly Blog

Weak-shot Object Detection

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Weak-shot object detection is also called cross-supervised or mixed-supervised object detection. Specifically, all categories are splitted into base categories and novel categories. Base categories have box-level annotation while novel categories only have image-level annotations.

Transfer common objectness: [1] [3]
Transfer the mapping from inaccurate bounding boxes to accurate bounding boxes: [2]

Reference

[1] Zhong, Yuanyi, et al. “Boosting weakly supervised object detection with progressive knowledge transfer.” European Conference on Computer Vision. Springer, Cham, 2020.

[2] Chen, Zitian, et al. “Cross-Supervised Object Detection.” arXiv preprint arXiv:2006.15056 (2020).

[3] Li, Yan, et al. “Mixed supervised object detection with robust objectness transfer.” IEEE transactions on pattern analysis and machine intelligence 41.3 (2018): 639-653.

Video Object Segmentation

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Given the segmentation mask of the first frame of a video clip, predict the segmentation masks in the subsequent frames.

Davis challenge https://davischallenge.org/ held since 2017, related papers [1] [2]
YouTube-VOS: A Large-Scale Benchmark for Video Object Segmentation https://youtube-vos.org/home
GyGO: an E-commerce Video Object Segmentation Dataset by Visualead https://github.com/ilchemla/gygo-dataset

Reference:

Perazzi, Federico, et al. “A benchmark dataset and evaluation methodology for video object segmentation.” CVPR, 2016.
Pont-Tuset, Jordi, et al. “The 2017 davis challenge on video object segmentation.” arXiv preprint arXiv:1704.00675 (2017).

Video Instance Segmentation

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Track-by-Detect: MaskTrack R-CNN [1]
Clip-Match: Vistr [2]
Propose-Reduce: [3]

Reference

[1] Yang, Linjie, Yuchen Fan, and Ning Xu. “Video instance segmentation.” ICCV, 2019.

[2] Wang, Yuqing, et al. “End-to-end video instance segmentation with transformers.” CVPR, 2021.

[3] Lin, Huaijia, et al. “Video instance segmentation with a propose-reduce paradigm.” ICCV, 2021.

Scene Text Recognition

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Scene text detection and recognition are challenging due to the following issues: scattered and sparse, blur, illumination, partial occlusion, multi-oriented, multi-lingual.

Scene text detection:

The detection methods can be grouped into proposal-based method and part-based method.

Paper list (in chronological order):

Detecting Text in Natural Scenes with
Stroke Width Transform, CVPR 2010: assume consistent stroke width within each character
Detecting Texts of Arbitrary Orientations in Natural Images, CVPR 2012: design rotation-invariant features
Deep Features for Text Spotting, ECCV 2014: add three branches for prediction
Robust scene text detection with convolution neural network induced mser trees, ECCV 2014
Real-time Lexicon-free Scene Text
Localization and Recognition, T-PAMI 2016
Reading Text in the Wild with Convolutional Neural Networks, IJCV 2016
Synthetic Data for Text Localisation in Natural Images, CVPR 2016: directly predict the bounding boxes, generate synthetic dataset
Multi-oriented text detection with fully convolutional networks, CVPR 2016
Detecting Text in Natural Image with Connectionist Text Proposal Network, ECCV 2016: look for text lines an fine vertical text pieces. sliding windows fed to Bi-LSTM.
SSD: single shot multibox detector, ECCV 2016
Reading Scene Text in Deep Convolutional Sequences, AAAI 2016
Scene text detection via holistic, multi-channel prediction, arxiv 2016: holistic and pixel-wise predictions on text region map, character map, and linking
orientation map
Deep Direct Regression for Multi-Oriented Scene Text Detection, ICCV 2017
WordSup: Exploiting Word Annotations for Character based Text Detection, ICCV 2017: a weakly supervised framework that can utilize word annotations for character detector training
TextBoxes: A Fast Text Detector with a Single Deep Neural Network, AAAI 2017
Detecting Oriented Text in Natural Images by Linking Segments, CVPR 2017: detect text with segments and links
EAST: An Efficient and Accurate Scene Text Detector, CVPR 2017: use DenseBox to generate quadrangle proposals
TextBoxes++: A Single-Shot Oriented Scene Text Detector, TIP 2018: extension of TextBoxes
Rotation-sensitive Regression for Oriented Scene Text Detection, CVPR 2018: rotation-sensitive feature maps for regression and rotation-invariant features for classification
Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation, CVPR 2018: combine corner localization and region segmentation
PixelLink: Detecting Scene Text via Instance Segmentation, AAAI 2018: rectangle enclosing instance segmentation mask, which is obtained based on text/non-text prediction and link prediction.
Arbitrary-Oriented Scene Text Detection via Rotation Proposals, TMM 2018: generate rotated proposals
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes, ECCV 2018: infer the center line area (TCL) and associated circle radius/rotation

Scene text recognition:

The recognition methods can be grouped into character-level, word-level, and sequence-level.

Paper list (in chronological order):

End-to-End Scene Text Recognition, ICCV 2011: detection using Random Ferns and recognition via Pictorial Structure with a Lexicon
Top-down and bottom-up cues for scene text recognition, CVPR 2012: construct a CRF model to impose both bottom-up (i.e. character detections) and top-down (i.e. language statistics) cues
Scene text recognition using part-based tree-structured character detection, CVPR 2013: build a CRF model to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework
PhotoOCR: Reading text in uncontrolled conditions, ICCV 2013: automatically generate training data and perform OCR on web images
Label embedding: A frugal baseline for text recognition, IJCV 2015: learn a common space for image and word
Reading Text in the Wild with Convolutional Neural Networks, IJCV 2016
Robust Scene Text Recognition with Automatic Rectification, CVPR 2016
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild, CVPR 2016: character-level language model embodied in a recurrent neural network
An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, T-PAMI 2017
Focusing Attention: Towards Accurate Text Recognition in Natural Images, ICCV 2017: Focusing Network to handle the attention drift
Visual attention models for scene text recognition, 2017 arxiv
AON: Towards Arbitrarily-Oriented Text Recognition , CVPR 2018
(recommended by Guo)An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, T-PAMI, 2017

End-to-end

Integrate scene text detection and recognition in an end-to-end system.

Paper list (in chronological order):

A method for text localization and recognition in real-world images, ACCV 2010
Real-Time Scene Text Localization and Recognition, CVPR 2012
Towards End-to-end Text Spotting with Convolutional
Recurrent Neural Networks, ICCV 2017: designed for horizontal scene text
Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework, ICCV 2017: detect and recognize horizontal and multioriented
scene text
FOTS: Fast Oriented Text Spotting with a Unified Network, CVPR 2018: using EAST as text detector and CRNN as text recognizer

Datasets

Surveys

Scene text detection and recognition: Recent advances and future trends, FCS 2015
Text detection and recognition in imagery: A survey, T-PAMI 2015

Special Sessions

Use Spatial Transformation Network (STN) [1] [2] [3] [4]
Use Deformable Convolution Network (DCN) [1]

Scale Variation for Object Detection

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

This problem is well discussed in https://arxiv.org/pdf/1506.01497.pdf. Different schemes for addressing multiple scales and sizes: (a) multi-scale input images (b) multi-scale feature maps (c) multi-scale anchor boxes on one feature map.

The first way is based on image/feature pyramids, e.g., in DPM and CNN-based methods. The images are resized at multiple scales, and feature maps (HOG or deep convolutional features) are computed for each scale. This way is often useful but is time-consuming.
The second way is to use sliding windows of multiple scales (and/or aspect ratios) of the feature maps. For example, in DPM, models of different aspect ratios are trained separately using different filter sizes. If this way is used to address multiple scales, it can be thought of as a “pyramid of filters”. The second way is usually adopted jointly with the first way.
As a comparison, our anchor-based method is built on comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes. Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector. The design of multi- scale anchors is a key component for sharing features without extra cost for addressing scales.
use different dilation rates to vary receptive fields
use feature pyramid [1]

Reference

[1] Lin, Tsung-Yi, et al. “Feature pyramid networks for object detection.” CVPR, 2017.

Object Detection

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

two-stage: use region proposal network (RPN) to generate proposals

faster-RCNN

one-stage: remove RPN and use anchors with associated fixed proposals based on predefined scales/aspect-ratios.

YOLO v1 v2 v3
SSD

Corner Points: remove anchors and directly predict corner points

CornerNet

No anchor: actually use aach cell as an anchor

RPDet: use object centers as positive cells; paired with deformable CNN
FoveaBox: use the cells in fovea area (object bounding box) as positive cells
Guided Anchoring: use deformable CNN to obtain adapted feature map

Object Detection Loss

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

Fast RCNN

$L(p,u,t^u,v) L_{cls} (p,u) + \lambda[u\geq 1]L_{loc}(t^u,v),$

where $p$ is $(K+1)$-dim class probability vector with 0 being the background class, $u$ is the groundtruth class, $v$ is the ground-truth regression tuple, and $t^u$ is the predicted regression tuple for class $u$. $L_{cls}$ is a multi-class softmax loss and $L_{loc}$ is a smooth L1 loss.

Faster RCNN

$L(p_i,t_i)=L_{cls} (p_i,p_i^*) + \lambda p_i^*L_{reg}(t_i,t_i^*),$

where $L_{cls}$ is a two-class (e.g., obj or not obg) (resp., multi-class) softmax loss for RPN (resp., gen) and $L_{reg}$ is a smooth L1 loss. So the loss of faster RCNN is basically the same as fast RCNN.

fast and faster RCNN generate proposals, so they have the pos/neg labels for anchor boxes. However, the following SSD and YOLO do not generate proposals, so they need to match anchor boxes with ground-truth boxes.

SSD

By using $x_{ij}^p$ as a binary indicator for matching the i-th default box to the j-th ground-truth box of category p. Multiple detection boxes can be matched to the same ground-truth box.

$l(x,c,l,g)=L_{conf}(x,c) + \alpha L_{loc}(x,l,g),$

where $L_{conf}$ is a (K+1)-class softmax loss, and

$L_{loc} (x,l,g)=\sum_i \sum_j x_{ij}^k |l_i-g_j|.$

YOLO

$\sum_{i=0}^{S^2}\sum_{j=0}^B \mathcal{1^{obj}}[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2] + \sum_{i=0}^{S^2}\sum_{j=0}^B \mathcal{1^{obj}}[(\sqrt{w_i}-\sqrt{\hat{w}_i)}^2+(\sqrt{h_i}-\sqrt{\hat{y}_i})^2] + \sum_{i=0}^{S^2}\sum_{j=0}^B \mathcal{1^{obj}} (C_i-\hat{C}_i)^2+ \sum_{i=0}^{S^2}\sum_{j=0}^B \mathcal{1^{noobj}}(C_i-\hat{C}_i)^2 + \sum_{i=0}^{S^2}\sum_{j=0}^B \mathcal{1^{obj}}\sum_{c} (p_i(c)-\hat{p}_i(c))^2$

Note that for the noobj anchorboxes, there is only one loss term involved.

Line Detection

Posted on 2026-03-17 Edited on 2024-01-03 In paper note

[1] MLSD

Reference

[1] Gu, Geonmo, et al. “Towards light-weight and real-time line segment detection.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 1. 2022.

GNN for Segmentation

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

row-wise and column-wise LSTM on feature map: [1]
graph LSTM on superpixels: [2]
3D graph: [3]
DAG on feature map: [4]

Reference

Li, Zhen, et al. “Lstm-cf: Unifying context modeling and fusion with lstms for rgb-d scene labeling.” ECCV, 2016.
Liang, Xiaodan, et al. “Semantic object parsing with graph lstm.” ECCV, 2016.
Qi, Xiaojuan, et al. “3d graph neural networks for rgbd semantic segmentation. ICCV, 2017.
Ding, Henghui, et al. “Boundary-aware feature propagation for scene segmentation.” ICCV, 2019.

From Anchor to ROI

Posted on 2026-03-17 Edited on 2022-04-08 In paper note

layer area

From layer i to layer i+1, assume the parameters on layer i are $s_i$ (stride), $p_i$ (patch), $k_i$ (kernel filter size), the width or height of layer i are $r_i$. Then, based on common sense,

$r_{i+1} = (r_i+2p_i-k_i)/s_i+1.$

In the reverse process, $r_i = s_i r_{i+1}-s_i-2p_i+k_i$ or $r_i = s_i r_{i+1}-s_i+k_i$ if counting in padding area.

coordinate map

Now consider mapping the point $x_i$ on the ROI to the point $x_{i+1}$ on the feature map, which can be transformed to the layer area problem above. In particular, the receptive field formed by left-up corner and $x_i$ on the ROI can be mapped to the region formed by left-up corner and $x_{i+1}$ on the feature map. Based on the similar formula for the layer area problem above (note the only difference is that we only include left padding and up padding, and subtract the radius of kernel filter $(k_i-1)/2$,

$x_i=s_i x_{i+1}-s_i-p_i+k_i-(k_i-1)/2.$

The above coordinate system starts from 1. When the coordinate system starts from 0,

$x_i+1=s_i (x_{i+1}+1)-s_i-p_i+k_i-(k_i-1)/2,$

which can be simplified as

$x_i=s_i x_{i+1}+(\frac{k_i-1}{2}-p_i).$

when $p_i=floor(k_i/2)$, $x_i=s_i x_{i+1}$ approximately, which is the simplest case.

By applying $x_i=s_i x_{i+1}+(\frac{k_i-1}{2}-p_i)$ recursively, we can achieve a general solution

$x_1 = \alpha_L x_{L}+\beta_L,$

in which $\alpha_L = \prod_{l=1}^{L-1} s_l$ and $\beta_L=\sum_{l=1}^{L-1} (\prod_{n=1}^{l-1} s_n)(\frac{k_l-1}{2}-p_l) $

anchor box to ROI

Given two corner points of an anchor box on the feature map, we can find their corresponding points on the original image, which determine the ROI.