
Semi-supervised Learning

Posted on 2022-06-16 | In paper note

Surveys

An Overview of Deep Semi-Supervised Learning

Semi-supervised learning

  1. MixMatch [4]: gracefully unifies data augmentation, label sharpening (low entropy), and mixup (see the sketch after this list).

  2. Unsupervised Data Augmentation (UDA) [5] [code]
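
The label-sharpening and mixup steps of MixMatch can be sketched as follows (a simplified illustration, not the full algorithm; the temperature T=0.5 and Beta parameter alpha=0.75 follow the defaults reported in [4]):

```python
import numpy as np

def sharpen(p, T=0.5):
    """Sharpen a predicted class distribution with temperature T (lower T -> lower entropy)."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def mixup(x1, y1, x2, y2, alpha=0.75):
    """Mix two (input, soft-label) pairs; lambda is biased toward the first pair as in MixMatch."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # keep the mixed example closer to (x1, y1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```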

Co-training for semi-supervised learning

  1. multi-view: co-training [1], tri-net [2]

  2. multi-graph: label propagation [3]
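
For reference, a minimal sketch of graph-based label propagation in the generic Zhou-style formulation ([3] proposes a more elaborate consensus-driven propagation; this only illustrates the basic idea):

```python
import numpy as np

def label_propagation(W, Y, alpha=0.99, iters=50):
    """Propagate labels over a graph.

    W: (n, n) affinity matrix; Y: (n, K) one-hot seed labels (zero rows for unlabeled nodes).
    """
    d = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    S = W * d[:, None] * d[None, :]          # symmetrically normalized affinity
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y  # spread labels, anchored to the seeds
    return F.argmax(axis=1)
```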

Reference

[1] Deep Co-Training for Semi-Supervised Image Recognition

[2] Tri-net for Semi-Supervised Deep Learning

[3] Consensus-Driven Propagation in Massive Unlabeled Data for Face Recognition

[4] Berthelot, David, et al. “Mixmatch: A holistic approach to semi-supervised learning.” arXiv preprint arXiv:1905.02249 (2019).

[5] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, Quoc V. Le, “Unsupervised Data Augmentation for Consistency Training.” arXiv preprint arXiv:1904.12848 (2019).

Self-supervised Learning

Posted on 2022-06-16 | In paper note

Design a proxy task using unlabeled or weakly-labeled data to help the original task. Essentially, self-supervised learning is multi-task learning in which the proxy task does not rely on heavy human annotation. The key question is which annotation-free proxy task is the most effective.

Please refer to the tutorial slides [1] [2], the survey, and the paper list.

  1. image-to-image

    • image-to-image translation: colorization [1], inpainting [2], cross-channel generation [3]
    • spatial location: relative location [1], jigsaw [2], predicting rotation [3]
    • contrastive learning: instance-wise contrastive learning (e.g., MOCO), prototypical contrastive learning (clustering) [1] [2]
    • MAE: Siamese MAE
  2. video-to-image

    • temporal coherence: [1] [2] [3]
    • temporal order: [1] [2] [3]
    • unsupervised image tasks with video clues: clustering [1], optical flow prediction [1], unsupervised segmentation based on optical flow [1] [2], unsupervised depth estimation based on optical flow [2]
    • video generation [1]
    • cross-modal consistency: consistency between visual kernel and optical flow kernel [1]
  3. video-to-video: all video-to-image methods can be used for video-to-video by averaging frame features.

    • 3D rotation [1]
    • Cubic puzzle [1]
    • video localization and classification [1]

Multi-task self-supervised learning: integrate multiple proxy tasks [1] [2]

Combined with other frameworks: self-supervised GAN [1]

A recent paper [1*] claims that the best self-supervised learning method is still the earliest image inpainting model. The design of the network architecture has a significant impact on the performance of self-supervised learning methods.

SimCLR [2*] is a state-of-the-art self-supervised learning method whose performance approaches that of supervised learning.
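
A minimal sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss at the core of SimCLR, assuming z1 and z2 are the projection-head outputs for two augmented views of the same batch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss: each sample's positive is the other view of the same image."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d) unit-norm embeddings
    sim = z @ z.t() / temperature                        # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of the positive
    return F.cross_entropy(sim, targets)
```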

Reference

[1*] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer: Revisiting Self-Supervised Visual Representation Learning. CVPR 2019.

[2*] Chen, Ting, et al. “A simple framework for contrastive learning of visual representations.” arXiv preprint arXiv:2002.05709 (2020).

Scene Text Recognition

Posted on 2022-06-16 | In paper note

Scene text detection and recognition are challenging due to the following issues: scattered and sparse text, blur, uneven illumination, partial occlusion, multiple orientations, and multiple languages.

Scene text detection:

The detection methods can be grouped into proposal-based methods and part-based methods.

Paper list (in chronological order):

  1. Detecting Text in Natural Scenes with Stroke Width Transform, CVPR 2010: assume consistent stroke width within each character

  2. Detecting Texts of Arbitrary Orientations in Natural Images, CVPR 2012: design rotation-invariant features

  3. Deep Features for Text Spotting, ECCV 2014: add three branches for prediction

  4. Robust scene text detection with convolution neural network induced mser trees, ECCV 2014

  5. Real-time Lexicon-free Scene Text Localization and Recognition, T-PAMI 2016

  6. Reading Text in the Wild with Convolutional Neural Networks, IJCV 2016

  7. Synthetic Data for Text Localisation in Natural Images, CVPR 2016: directly predict the bounding boxes, generate synthetic dataset

  8. Multi-oriented text detection with fully convolutional networks, CVPR 2016

  9. Detecting Text in Natural Image with Connectionist Text Proposal Network, ECCV 2016: detect text lines as sequences of fine-scale vertical text proposals; sliding-window features are fed into a Bi-LSTM

  10. SSD: single shot multibox detector, ECCV 2016

  11. Reading Scene Text in Deep Convolutional Sequences, AAAI 2016

  12. Scene text detection via holistic, multi-channel prediction, arXiv 2016: holistic and pixel-wise predictions on the text region map, character map, and linking orientation map

  13. Deep Direct Regression for Multi-Oriented Scene Text Detection, ICCV 2017

  14. WordSup: Exploiting Word Annotations for Character based Text Detection, ICCV 2017: a weakly supervised framework that can utilize word annotations for character detector training

  15. TextBoxes: A Fast Text Detector with a Single Deep Neural Network, AAAI 2017

  16. Detecting Oriented Text in Natural Images by Linking Segments, CVPR 2017: detect text with segments and links

  17. EAST: An Efficient and Accurate Scene Text Detector, CVPR 2017: a DenseBox-style network that directly predicts rotated rectangles or quadrangles

  18. TextBoxes++: A Single-Shot Oriented Scene Text Detector, TIP 2018: extension of TextBoxes

  19. Rotation-sensitive Regression for Oriented Scene Text Detection, CVPR 2018: rotation-sensitive feature maps for regression and rotation-invariant features for classification

  20. Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation, CVPR 2018: combine corner localization and region segmentation

  21. PixelLink: Detecting Scene Text via Instance Segmentation, AAAI 2018: rectangle enclosing instance segmentation mask, which is obtained based on text/non-text prediction and link prediction.

  22. Arbitrary-Oriented Scene Text Detection via Rotation Proposals, TMM 2018: generate rotated proposals

  23. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes, ECCV 2018: infer the text center line (TCL) area and the associated disk radius/orientation

Scene text recognition:

The recognition methods can be grouped into character-level, word-level, and sequence-level.

Paper list (in chronological order):

  1. End-to-End Scene Text Recognition, ICCV 2011: detection using Random Ferns and recognition via Pictorial Structure with a Lexicon

  2. Top-down and bottom-up cues for scene text recognition, CVPR 2012: construct a CRF model to impose both bottom-up (i.e. character detections) and top-down (i.e. language statistics) cues

  3. Scene text recognition using part-based tree-structured character detection, CVPR 2013: build a CRF model to incorporate the detection scores, spatial constraints and linguistic knowledge into one framework

  4. PhotoOCR: Reading text in uncontrolled conditions, ICCV 2013: automatically generate training data and perform OCR on web images

  5. Label embedding: A frugal baseline for text recognition, IJCV 2015: learn a common space for image and word

  6. Reading Text in the Wild with Convolutional Neural Networks, IJCV 2016

  7. Robust Scene Text Recognition with Automatic Rectification, CVPR 2016

  8. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild, CVPR 2016: character-level language model embodied in a recurrent neural network

  9. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, T-PAMI 2017

  10. Focusing Attention: Towards Accurate Text Recognition in Natural Images, ICCV 2017: Focusing Network to handle the attention drift

  11. Visual attention models for scene text recognition, arXiv 2017

  12. AON: Towards Arbitrarily-Oriented Text Recognition, CVPR 2018

  13. (recommended by Guo) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, T-PAMI 2017

End-to-end

Integrate scene text detection and recognition in an end-to-end system.

Paper list (in chronological order):

  1. A method for text localization and recognition in real-world images, ACCV 2010

  2. Real-Time Scene Text Localization and Recognition, CVPR 2012

  3. Towards End-to-end Text Spotting with Convolutional Recurrent Neural Networks, ICCV 2017: designed for horizontal scene text

  4. Deep TextSpotter: An End-To-End Trainable Scene Text Localization and Recognition Framework, ICCV 2017: detect and recognize horizontal and multi-oriented scene text

  5. FOTS: Fast Oriented Text Spotting with a Unified Network, CVPR 2018: using EAST as text detector and CRNN as text recognizer

Datasets

  • RCTW-17
  • MLT
  • SCUT-CTW1500
  • Total-Text
  • ICDAR 2015
  • MSRA-TD500
  • IIIT 5K-Word
  • COCO-Text

Surveys

  • Scene text detection and recognition: Recent advances and future trends, FCS 2015
  • Text detection and recognition in imagery: A survey, T-PAMI 2015

Special Sessions

  1. Use Spatial Transformer Network (STN) [1] [2] [3] [4]

  2. Use Deformable Convolution Network (DCN) [1]

Scale Variation for Object Detection

Posted on 2022-06-16 | In paper note

This problem is well discussed in https://arxiv.org/pdf/1506.01497.pdf (the Faster R-CNN paper). Different schemes for addressing multiple scales and sizes: (a) multi-scale input images (image/feature pyramids), (b) multi-scale filters (sliding windows) on one feature map, and (c) multi-scale anchor boxes on one feature map.

  1. The first way is based on image/feature pyramids, e.g., in DPM and CNN-based methods. The images are resized at multiple scales, and feature maps (HOG or deep convolutional features) are computed for each scale. This way is often useful but is time-consuming.

  2. The second way is to use sliding windows of multiple scales (and/or aspect ratios) of the feature maps. For example, in DPM, models of different aspect ratios are trained separately using different filter sizes. If this way is used to address multiple scales, it can be thought of as a “pyramid of filters”. The second way is usually adopted jointly with the first way.

  3. As a comparison, the anchor-based method of Faster R-CNN is built on a pyramid of anchors, which is more cost-efficient. It classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios (see the sketch after this list). It relies only on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. The paper shows by experiments the effect of this scheme for addressing multiple scales and sizes. Because of this multi-scale design based on anchors, the convolutional features computed on a single-scale image can simply be reused, as is also done by the Fast R-CNN detector. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

  4. use different dilation rates to vary receptive fields

  5. use feature pyramid [1]
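
A rough sketch of the "pyramid of anchors" idea from scheme 3: anchor boxes of multiple scales and aspect ratios centered on a single feature-map cell (the base size, scales, and ratios follow common Faster R-CNN defaults but are illustrative):

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) anchors in (x1, y1, x2, y2) form, centered at the origin."""
    anchors = []
    for scale in scales:
        for ratio in ratios:           # ratio is interpreted as height / width
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(generate_anchors().shape)  # (9, 4): 9 anchors per feature-map cell
```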

Reference

[1] Lin, Tsung-Yi, et al. “Feature pyramid networks for object detection.” CVPR, 2017.

ROI Feature

Posted on 2022-06-16 | In paper note
  1. RoI pooling
  2. RoI Align [1]
  3. Precise RoI Pooling [2]
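
A minimal comparison of RoI pooling and RoI Align using torchvision (the feature-map size, stride, and box coordinates are illustrative; Precise RoI Pooling [2] is not part of torchvision and is omitted):

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 50, 50)                       # feature map at stride 16 w.r.t. the image
rois = torch.tensor([[0, 100.0, 120.0, 300.0, 280.0]])   # (batch_index, x1, y1, x2, y2) in image coords

# RoI pooling quantizes the box to the feature grid before max-pooling each bin,
# which causes misalignment; RoI Align samples each bin at exact fractional
# locations with bilinear interpolation.
pooled  = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)  # both torch.Size([1, 256, 7, 7])
```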

Reference

  1. He, Kaiming, et al. “Mask R-CNN.” ICCV, 2017.
  2. Jiang, Borui, et al. “Acquisition of localization confidence for accurate object detection.” ECCV, 2018.

Repeated Patterns

Posted on 2022-06-16 | In paper note
  1. detect repeated patterns [1]

  2. inpaint corrupted images with repeated patterns [2]: uses Fourier convolutions (see the sketch below)
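
A rough sketch of a spectral-transform block in the spirit of the Fourier convolutions used in [2] (the 1x1 frequency-domain convolution and layer sizes are simplified assumptions, not the exact LaMa architecture):

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Convolve in the frequency domain so every output pixel sees the whole image."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")            # (b, c, h, w//2+1), complex
        freq = torch.cat([freq.real, freq.imag], dim=1)    # stack real/imag parts as channels
        freq = self.relu(self.conv(freq))
        real, imag = freq.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

x = torch.randn(1, 8, 64, 64)
print(FourierUnit(8)(x).shape)  # torch.Size([1, 8, 64, 64])
```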

Reference

[1] Louis Lettry, Michal Perdoch, Kenneth Vanhoey, and Luc Van Gool. Repeated pattern detection using cnn activations. In WACV, 2017

[2] Suvorov, Roman, et al. “Resolution-robust Large Mask Inpainting with Fourier Convolutions.” arXiv preprint arXiv:2109.07161 (2021).

Receptive Field

Posted on 2022-06-16 | In paper note
  1. The real (effective) receptive field is smaller than the theoretical receptive field; its ratio to the theoretical one shrinks at a rate of $\frac{1}{\sqrt{n}}$, with $n$ being the number of layers.

  2. Advanced networks (e.g., ResNet) have larger receptive fields than older networks (e.g., AlexNet). In the latest networks, the receptive field of each pixel in the last layer can be as large as the whole image. Generally, a larger receptive field leads to higher accuracy, but it is not the only factor that influences accuracy.

Fomoro: a website to calculate receptive fields.

Distill: mathematical derivations and an open-source library to compute receptive fields.
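
For reference, a minimal sketch of the standard recurrence such calculators use for the theoretical receptive field (the layer configuration is an AlexNet-style example for illustration):

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of conv/pool layers given as (kernel_size, stride) pairs."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each kernel extends the RF by (k-1) times the cumulative stride
        jump *= s
    return rf

# 11x11/4 conv -> 3x3/2 pool -> 5x5/1 conv -> 3x3/2 pool
print(receptive_field([(11, 4), (3, 2), (5, 1), (3, 2)]))  # 67
```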

Reference

  1. Wenjie Luo, Yujia Li, Raquel Urtasun, Richard S. Zemel:
    Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. NIPS, 2016.

Privileged Information

Posted on 2022-06-16 | In paper note

Learning Using Privileged Information (LUPI) or SVM+ was proposed by Vapnik in [the first paper].

High-level ideas:

  • Use privileged information in the same way as for multi-view learning
  • Transfer between privileged information and primary information
  • Use privileged information to control the training process, e.g., to model training uncertainty or training difficulty (training loss, noise).

Applications:

  • SVM for binary classification

    • model the slack variable: SVM+ [1] (see the sketch after this list)
    • model the margin: [1] [2]
    • structural SVM: [1]
    • theoretical analysis: [1] [2]
  • Gaussian process classification

    • GPC [1]
  • L2 loss for classification/Hash

    • multi-labeling [1]
    • Hash ITQ [1]
  • clustering

    • clustering [1]
  • metric learning for verification/classification

    • ITML+ [1] [2]
    • DML+ [1]
    • OITML [1]: ordinal-based ITML
  • CRF

    • probabilistic inference [1]: similar to multi-view learning, but marginalizes over the latent privileged-information space during testing
  • random forest

    • conditional regression forest [1]: design node splitting criterion
  • matrix factorization for collaborative filtering

    • PriMF [1]
  • Maximum Entropy Discrimination

    • MED [1]
  • Deep Learning

    • Hallucination network
    • classification loss [1]
    • model drop-out [1]
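
As a reminder of the slack-modeling idea in SVM+ (see the SVM item above), the commonly used formulation replaces the usual slack variable $\xi_i$ with a correcting function of the privileged feature $x_i^*$ (the notation is the standard one and may differ slightly from the cited papers):

$$\min_{w, b, w^*, b^*} \ \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\|w^*\|^2 + C \sum_{i=1}^{n} \bigl( \langle w^*, x_i^* \rangle + b^* \bigr) \quad \text{s.t.} \quad y_i \bigl( \langle w, x_i \rangle + b \bigr) \ge 1 - \bigl( \langle w^*, x_i^* \rangle + b^* \bigr), \ \ \langle w^*, x_i^* \rangle + b^* \ge 0, \ \forall i.$$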

Settings:

  • multi-view + LUPI [1]
  • multi-task multi-class LUPI [1]
  • multi-instance LUPI [1]
  • active learning + LUPI [1]
  • distillation + LUPI [1]
  • domain adaptation + LUPI [1]

Object Detection

Posted on 2022-06-16 | In paper note

two-stage: use region proposal network (RPN) to generate proposals

  1. faster-RCNN

one-stage: remove the RPN and use anchor boxes as fixed proposals with predefined scales/aspect ratios.

  1. YOLO v1 v2 v3
  2. SSD

Corner Points: remove anchors and directly predict corner points

  1. CornerNet

No anchor: actually use each cell as an anchor

  1. RPDet: use object centers as positive cells; paired with deformable CNN

  2. FoveaBox: use the cells in fovea area (object bounding box) as positive cells

  3. Guided Anchoring: use deformable CNN to obtain adapted feature map

Object Detection Loss

Posted on 2022-06-16 | In paper note

Fast RCNN

The multi-task loss is
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \ge 1] L_{loc}(t^u, v),$$
where $p$ is the $(K+1)$-dim class probability vector with 0 being the background class, $u$ is the ground-truth class, $v$ is the ground-truth regression tuple, and $t^u$ is the predicted regression tuple for class $u$. $L_{cls}$ is a multi-class softmax loss and $L_{loc}$ is a smooth L1 loss, which is only counted for foreground classes ($u \ge 1$).
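
For reference, the smooth L1 loss used for $L_{loc}$ is defined as
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{cases}$$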

Faster RCNN

The RPN loss is
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*),$$
where $p_i$ is the predicted objectness probability of anchor $i$, $p_i^*$ is its ground-truth label, and $t_i$ and $t_i^*$ are the predicted and ground-truth regression tuples. $L_{cls}$ is a two-class (i.e., object or not object) softmax loss for the RPN (resp., a multi-class softmax loss for the detection head) and $L_{reg}$ is a smooth L1 loss. So the loss of Faster R-CNN is basically the same as that of Fast R-CNN.

Fast and Faster R-CNN generate proposals, so they have pos/neg labels for the proposal boxes. However, the following SSD and YOLO do not generate proposals, so they need to match anchor (default) boxes with ground-truth boxes directly.

SSD

Let $x_{ij}^p \in \{0, 1\}$ be a binary indicator for matching the $i$-th default box to the $j$-th ground-truth box of category $p$. Multiple default boxes can be matched to the same ground-truth box.

The overall loss is
$$L(x, c, l, g) = \frac{1}{N} \bigl( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \bigr),$$
where $N$ is the number of matched default boxes, $L_{conf}$ is a $(K+1)$-class softmax loss, and $L_{loc}$ is a smooth L1 loss between the predicted boxes $l$ and the ground-truth boxes $g$ over the matched pairs.

YOLO

The YOLO v1 loss sums squared errors over the $S \times S$ grid cells and the $B$ boxes predicted per cell:
$$\begin{aligned} L ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \bigl[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \bigr] \\ & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\ & + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \bigl( p_i(c) - \hat{p}_i(c) \bigr)^2. \end{aligned}$$
Note that for the no-object anchor boxes, only the $\lambda_{noobj}$ confidence term is involved.
