  1. VIB [1]: uses a KL-divergence term as a variational upper bound on mutual information (MI), so minimizing the bound minimizes the MI; the variational marginal $r(z)$ can be set to a unit Gaussian for simplicity.
  2. MINE [2]: a lower bound on MI derived from the Donsker–Varadhan representation of the KL divergence. Thanks to its strong consistency, MINE can serve as a tight estimator of MI.
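The KL upper-bound term in [1] has a closed form when the encoder is a diagonal Gaussian and $r(z)=\mathcal{N}(0, I)$; a minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def kl_to_unit_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), the per-sample
    upper-bound term of [1] when r(z) is a unit Gaussian."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

mu = np.zeros(4)
logvar = np.zeros(4)
# KL is zero when the encoder already matches r(z) = N(0, I)
print(kl_to_unit_gaussian(mu, logvar))  # 0.0
```

Shifting the mean away from zero makes the term grow quadratically, which is what pushes the encoder toward the prior during training.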

References

  1. Alemi, Alexander A., et al. “Deep variational information bottleneck.” arXiv preprint arXiv:1612.00410 (2016).

  2. Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., & Hjelm, D. Mutual information neural estimation, ICML, 2018.

Normalize weights:

  1. weight normalization [1]: $\mathbf{w}=\frac{g}{\|\mathbf{v}\|} \mathbf{v}$, which decouples the direction of $\mathbf{v}$ from the magnitude $g$; weight normalization can be viewed as a cheaper and less noisy approximation to batch normalization
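A minimal sketch of the reparameterization (function name is mine):

```python
import numpy as np

def weight_norm(v, g):
    # w = (g / ||v||) * v : direction comes from v, magnitude from the scalar g
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
print(np.linalg.norm(w))  # ≈ 2.0: the norm of w always equals g
```

Because the norm of $\mathbf{w}$ is fixed to $g$, gradient descent on $g$ and $\mathbf{v}$ can adjust magnitude and direction independently.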

Normalize outputs:

  1. batch normalization [2]: normalizes each channel to zero mean and unit variance over the batch and spatial axes

  2. layer normalization [3]

  3. instance normalization [4]

  4. group normalization [5]

Here N is the batch axis, C is the channel axis, and (H, W) are the spatial axes.
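The four output-normalization schemes differ only in which axes the statistics are computed over; a NumPy sketch for an (N, C, H, W) tensor (the ε and the group count are arbitrary choices):

```python
import numpy as np

x = np.random.randn(8, 6, 4, 4)  # (N, C, H, W)

def normalize(x, axes):
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

bn = normalize(x, (0, 2, 3))   # batch norm: per channel, over N, H, W
ln = normalize(x, (1, 2, 3))   # layer norm: per sample, over C, H, W
inorm = normalize(x, (2, 3))   # instance norm: per (sample, channel), over H, W
# group norm: split C into groups, normalize each group over (C/G, H, W)
g = x.reshape(8, 2, 3, 4, 4)   # 2 groups of 3 channels each
gn = normalize(g, (2, 3, 4)).reshape(x.shape)
```

Learnable scale and shift parameters (per channel) are applied after the statistics step in all four methods; they are omitted here.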

[1] Salimans T, Kingma D P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks[C]//Advances in Neural Information Processing Systems. 2016: 901-909.

[2] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[J]. arXiv preprint arXiv:1502.03167, 2015.

[3] Ba J L, Kiros J R, Hinton G E. Layer normalization[J]. arXiv preprint arXiv:1607.06450, 2016.

[4] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.

[5] Wu Y, He K. Group normalization[J]. arXiv preprint arXiv:1803.08494, 2018.

Taxonomy

1) metric-based: learn a good metric

  • matching network [1]
  • relation network [2]
  • prototypical network [3] [4]

2) optimization-based: gradient-based adaptation

  • Meta-Learner LSTM [5]
  • MAML [6] [7] [8]
  • REPTILE (an approximation of MAML) [9]

    Optimization-based methods aim to obtain a good parameter initialization. If we simply train on multiple tasks jointly, the obtained model parameters may be sub-optimal for each individual task.

3) model-based: predict model parameters [10] [11]
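The optimization-based idea can be illustrated with a toy REPTILE [9] loop on synthetic quadratic tasks (the task distribution, step sizes, and iteration counts are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # meta-initialization to be learned

def inner_sgd(theta, target, lr=0.1, steps=5):
    """A few SGD steps on one task's loss ||w - target||^2 / 2."""
    w = theta.copy()
    for _ in range(steps):
        w -= lr * (w - target)
    return w

# REPTILE outer loop: nudge the initialization toward each task's adapted weights
eps = 0.5
for _ in range(200):
    target = np.array([3.0, -1.0]) + rng.normal(size=2)  # tasks centered at (3, -1)
    w_task = inner_sgd(theta, target)
    theta += eps * (w_task - theta)
print(theta)  # ends up near the task-distribution center (3, -1)
```

The learned initialization sits close to all task optima, so a few inner gradient steps suffice to adapt to any single task, which is exactly the failure mode of joint multi-task training that the note above describes.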

Reference:

  1. Vinyals, Oriol, et al. “Matching networks for one shot learning.” NIPS, 2016.
  2. Sung, Flood, et al. “Learning to compare: Relation network for few-shot learning.” CVPR, 2018.
  3. Snell, Jake, Kevin Swersky, and Richard Zemel. “Prototypical networks for few-shot learning.” NIPS, 2017.
  4. Ren, Mengye, et al. “Meta-learning for semi-supervised few-shot classification.” arXiv preprint arXiv:1803.00676 (2018).
  5. Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-Shot Learning.” ICLR, 2017.
  6. Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-agnostic meta-learning for fast adaptation of deep networks.” ICML, 2017.
  7. Finn, Chelsea, and Sergey Levine. “Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm.” arXiv preprint arXiv:1710.11622 (2017).
  8. Grant, Erin, et al. “Recasting gradient-based meta-learning as hierarchical bayes.” arXiv preprint arXiv:1801.08930 (2018).
  9. A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv, 1803.02999v2, 2018.
  10. Adam Santoro, et al. “Meta-learning with memory-augmented neural networks.” ICML. 2016.
  11. Munkhdalai, Tsendsuren, and Hong Yu. “Meta networks.” ICML, 2017.

Mask representations:

  1. binary map

  2. frequency: DCT [1]

  3. PolarMask [2]

  4. Hyperbolic [3]

Reference

[1] Shen, Xing, et al. “DCT-Mask: Discrete cosine transform mask representation for instance segmentation.” CVPR, 2021.

[2] Xie, Enze, et al. “Polarmask: Single shot instance segmentation with polar representation.” CVPR, 2020.

[3] GhadimiAtigh, Mina, et al. “Hyperbolic Image Segmentation.” arXiv preprint arXiv:2203.05898 (2022).

Regulating latent variables or latent features can improve the generalizability of a classifier and lower its error bound.

Regulating latent variables essentially decreases the entropy of the latent variables. There are some common tricks to decrease this entropy, for example,

  1. dropout
  2. weight decay
  3. add random noise to the latent variables in VAEs and GANs.
  4. add random perturbation to model parameters

For theoretical proof, please refer to here.
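As an illustration of trick 3, a minimal sketch of injecting Gaussian noise into latent features during training (the noise scale is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_latent(z, sigma=0.1):
    """Trick 3: perturb latent features with Gaussian noise during training,
    discouraging the model from relying on fine-grained latent details."""
    return z + sigma * rng.normal(size=z.shape)

z = np.ones((4, 8))
z_train = noisy_latent(z)  # used in the training forward pass
z_eval = z                 # noise is disabled at evaluation time
```

Dropout and weight decay play the same regularizing role from the list above; only their perturbation target (activations vs. parameters) differs.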

  1. label smoothing [1]: interpolates the ground-truth label with a uniform label

  2. bootstrapping [2]: interpolates the noisy label with the label predicted in the previous iteration

  3. noisy data + clean data [3]: interpolates the noisy label with a distilled label
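All three schemes share the same convex-interpolation form; a small sketch of label smoothing [1] and soft bootstrapping [2] (function names and coefficient values are mine):

```python
import numpy as np

def label_smoothing(y_onehot, eps=0.1):
    """[1]: interpolate the one-hot label with the uniform distribution."""
    K = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / K

def bootstrap_target(y_noisy, p_model, beta=0.8):
    """[2] (soft bootstrapping): interpolate the noisy label with the
    model's own prediction from the previous iteration."""
    return beta * y_noisy + (1 - beta) * p_model

y = np.array([0.0, 0.0, 1.0])
print(label_smoothing(y, eps=0.1))  # [0.0333..., 0.0333..., 0.9333...]
```

Both outputs remain valid probability distributions (non-negative, summing to 1), so they can be dropped into a standard cross-entropy loss unchanged.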

[1] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” CVPR, 2016.

[2] Reed, Scott, et al. “Training deep neural networks on noisy labels with bootstrapping.” arXiv preprint arXiv:1412.6596 (2014).

[3] Li, Yuncheng, et al. “Learning from noisy labels with distillation.” ICCV, 2017.

Definition: a knowledge graph consists of entities, attributes, and relationships.

Two ways to construct a knowledge graph:

  1. probabilistic models (graphical model/random walk)

  2. embedding based models

1) Approximate incremental SVM: pass through the dataset many times

2) Exact incremental or decremental SVM: only pass through the dataset once

  1. simulates an infinite-depth network by the fixed-point iteration $h = f_{\theta}(h; x)$, where $x$ is the input and $\theta$ parameterizes the weight-tied transformation. As the iteration is repeated, the hidden state converges to the fixed point $h^{\star}$ satisfying $h^{\star} = f_{\theta}(h^{\star}; x)$. DEQ [1], MDEQ [2], iFPN [3]
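A minimal sketch of the fixed-point view, using naive forward iteration rather than the root-finding solvers of [1]; the contractive layer is a toy choice that guarantees convergence:

```python
import numpy as np

def fixed_point(f, x, h0, tol=1e-8, max_iter=1000):
    """Iterate h <- f(h, x) until convergence; the limit plays the role of
    the output of an infinite-depth weight-tied network (as in DEQ [1])."""
    h = h0
    for _ in range(max_iter):
        h_next = f(h, x)
        if np.max(np.abs(h_next - h)) < tol:
            return h_next
        h = h_next
    return h

# A contractive layer f(h; x) = tanh(W h + x) has a unique fixed point
W = 0.4 * np.eye(3)
f = lambda h, x: np.tanh(W @ h + x)
x = np.array([0.5, -0.2, 0.1])
h_star = fixed_point(f, x, h0=np.zeros(3))
print(np.allclose(h_star, f(h_star, x)))  # True: h* = f(h*; x)
```

In practice [1] finds $h^{\star}$ with a quasi-Newton root solver and backpropagates through the fixed point via the implicit function theorem, so memory does not grow with the effective depth.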

Reference:

  1. Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. “Deep equilibrium models.” Advances in Neural Information Processing Systems. 2019.
  2. Bai, Shaojie, Vladlen Koltun, and J. Zico Kolter. “Multiscale deep equilibrium models.” arXiv preprint arXiv:2006.08656 (2020).
  3. Wang, Tiancai, Xiangyu Zhang, and Jian Sun. “Implicit Feature Pyramid Network for Object Detection.” arXiv preprint arXiv:2012.13563 (2020).