Notes on Core ML Techniques

  1. Kullback-Leibler Divergence
  2. What is the main difference between a GAN and an autoencoder?
  3. What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?

  4. Knowledge Distillation
  5. Confidence penalty & Label Smoothing & Output Regularisation

  6. Uncertainty
  7. Long-tailed Recognition (Sample Imbalance)
  8. Meta-learning
  9. Ensemble methods

Knowledge Distillation

  • Distilling the Knowledge in a Neural Network
    • Knowledge definition: A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.

    • An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.

    • A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions; however, such an ensemble is cumbersome and may be too computationally expensive to deploy.

    • Compress/distill the knowledge in an ensemble into a single model that is much easier to deploy (distilling the knowledge in an ensemble of models into a single model); see the loss sketch below.
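
A minimal PyTorch-style sketch of the soft-target loss described above; this is my own illustration rather than code from the paper, and `T` (temperature) and `alpha` (soft/hard mixing weight) are assumed hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation: soften teacher and student with temperature T,
    match them via KL divergence, and mix in the usual hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Scaling by T^2 keeps the soft-target gradients comparable to the hard-label ones.
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Here the teacher's logits would be computed with gradients disabled, so only the student is updated.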

Confidence penalty & Label Smoothing & Output Regularisation

  • Regularizing Neural Networks by Penalizing Confident Output Distributions
    • Output regularisation: Regularizing the output distribution of large, deep neural networks has largely been unexplored. Output regularization has the property that it is invariant to the parameterization of the underlying neural network.
    • Knowledge definition: To motivate output regularizers, we can view the knowledge of a model as the conditional distribution it produces over outputs given an input (Hinton et al., 2015) as opposed to the learned values of its parameters.
    • Distillation definition: explicitly training a small network to assign the same probabilities to incorrect classes as a large network or ensemble of networks that generalizes well.
    • Two output regularizers (see the sketch after this list):
      • A maximum-entropy-based confidence penalty;
      • Label smoothing (uniform and unigram).
    • The paper connects the maximum-entropy confidence penalty to label smoothing through the direction of the KL divergence.
    • ANNEALING AND THRESHOLDING THE CONFIDENCE PENALTY
      • Suggesting a confidence penalty that is weak at the beginning of training and strong near convergence.
      • Only penalize output distributions when they are below a certain entropy threshold
  • Label/Objective smoothing:
    • Label-Smoothing Regularization (LSR), proposed in Rethinking the Inception Architecture for Computer Vision: a mechanism for encouraging the model to be less confident. Hard one-hot targets cause two problems that LSR mitigates:
      • Over-fitting to the training labels;
      • A reduced ability of the model to adapt, because the gap between the largest logit and the others keeps growing while the gradient is bounded.
  • Virtual adversarial training (VAT), proposed in Distributional Smoothing by Virtual Adversarial Examples
    • Another promising smoothing regularizer; however, it has multiple hyperparameters, and grid-searching them requires significantly more computation.
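
A minimal PyTorch-style sketch of the two output regularizers (my own illustration of the descriptions above; `beta` and `eps` are assumed coefficients):

```python
import torch.nn.functional as F

def confidence_penalty_loss(logits, labels, beta=0.1):
    """Cross-entropy minus beta times the entropy of the predicted distribution,
    i.e. the maximum-entropy confidence penalty (low-entropy outputs are penalised)."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return F.nll_loss(log_probs, labels) - beta * entropy

def label_smoothing_loss(logits, labels, eps=0.1):
    """Uniform label smoothing: train against (1 - eps) * one_hot + eps / K."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, labels)            # -log p(correct class)
    uniform_ce = -log_probs.mean(dim=-1).mean()    # cross-entropy against uniform targets
    return (1.0 - eps) * nll + eps * uniform_ce
```

Up to additive constants, the confidence penalty adds $\beta \, \mathrm{KL}(p \,\|\, u)$ to the loss while uniform label smoothing adds $\epsilon \, \mathrm{KL}(u \,\|\, p)$, with $u$ the uniform distribution; this is the direction-of-the-KL-divergence connection noted above. The annealing/thresholding variants simply ramp up $\beta$ during training or apply the penalty only when the output entropy falls below a threshold.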

Uncertainty

Long-tailed Recognition

  • Decoupling Representation and Classifier for Long-Tailed Recognition (ICLR 2020)
    • Representation Learning: We first train models to learn representations with different sampling strategies, including the standard instance-based sampling, class-balanced sampling and a mixture of them.
    • Classification: We study three different basic approaches to obtain a classifier with balanced decision boundaries on top of the learned representations (see the sketch below).
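
A rough PyTorch sketch of two ingredients related to the points above, as I understand them (not the authors' code): class-balanced sampling for the representation-learning stage, and rescaling (tau-normalising) the learned classifier weights, with `tau` an assumed hyperparameter normally tuned on validation data:

```python
from collections import Counter

from torch.utils.data import WeightedRandomSampler

def class_balanced_sampler(labels):
    """Class-balanced sampling: every class is drawn with equal probability,
    implemented by weighting each sample with 1 / (frequency of its class)."""
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

def tau_normalise(classifier_weight, tau=1.0):
    """Rescale each class's weight vector by its L2 norm raised to tau, pulling back
    the large norms that frequent classes tend to acquire under instance-balanced
    training (tau=1.0 is an illustrative value)."""
    norms = classifier_weight.norm(p=2, dim=1, keepdim=True)
    return classifier_weight / norms.pow(tau)
```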

Meta-learning

Ensemble methods

  • Ensemble methods in machine learning
    • Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions.
    • The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. The paper reviews these methods and explains why ensembles can often perform better than any single classifier.
      • Bayesian voting: enumerating the hypotheses.
      • Bagging: the learning algorithm is presented with a training set that consists of a sample of $m$ training examples drawn randomly with replacement from the original training set of $m$ items (see the sketch below).
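
A minimal NumPy sketch of bagging as described above, assuming scikit-learn-style base estimators (with `fit`/`predict`) and integer class labels:

```python
import numpy as np

def bagging_predict(make_estimator, X_train, y_train, X_test, n_estimators=10, seed=0):
    """Fit n_estimators base learners on bootstrap samples of size m (drawn with
    replacement from the m training items) and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    m = len(X_train)
    all_preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, m, size=m)       # bootstrap sample: m draws with replacement
        model = make_estimator()
        model.fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    votes = np.stack(all_preds)                # shape: (n_estimators, n_test)
    # Unweighted majority vote over the ensemble for each test point.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), axis=0, arr=votes)
```

For example, `bagging_predict(lambda: DecisionTreeClassifier(), X_tr, y_tr, X_te)` would bag scikit-learn decision trees as the base learners.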

Kullback-Leibler Divergence
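
For reference (standard definition, not from the notes): for discrete distributions $P$ and $Q$ over the same support,

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)},$$

which is non-negative, zero only when $P = Q$, and asymmetric in general ($D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$). This asymmetry is exactly why the direction of the KL divergence distinguishes the confidence penalty from label smoothing above.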

What is the main difference between a GAN and an autoencoder?

What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?
