Notes on Core ML Techniques

  1. Kullback-Leibler Divergence
  2. What is the main difference between GAN and autoencoder?
  3. What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?

  4. Knowledge Distillation
  5. Confidence penalty & Label Smoothing & Output Regularisation

  6. Uncertainty
  7. Long-tailed Recognition-Sample Imbalance
  8. Meta-learning
  9. Ensemble methods

Knowledge Distillation

  • Distilling the Knowledge in a Neural Network
    • Knowledge definition: A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.

    • An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.

    • A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. However, making predictions with the whole ensemble is cumbersome and may be too computationally expensive.

    • Compress/distill the knowledge in an ensemble of models into a single model, which is much easier to deploy.
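
A minimal NumPy sketch of the soft-target idea above. The temperature parameter comes from the distillation paper and is not mentioned in these notes. The function names, the example logits, and the temperature values are illustrative assumptions, and the loss is only the soft-target cross-entropy term, not a full training recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis; a higher temperature gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs):
    """Shannon entropy (in nats) of each row of probabilities."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's outputs."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = np.clip(softmax(student_logits, temperature), 1e-12, 1.0)
    return -(soft_targets * np.log(student_probs)).sum(axis=-1).mean()

# Hypothetical logits of a cumbersome model for one 3-class example.
teacher_logits = np.array([[6.0, 2.0, 1.0]])
hard = softmax(teacher_logits, temperature=1.0)  # nearly one-hot
soft = softmax(teacher_logits, temperature=4.0)  # higher entropy, more information
```

Comparing `entropy(hard)` with `entropy(soft)` shows how the softened distribution carries more information about the incorrect classes.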

Confidence penalty & Label Smoothing & Output Regularisation

  • Regularizing Neural Networks by Penalizing Confident Output Distributions
    • Output regularisation: Regularizing the output distribution of large, deep neural networks has largely been unexplored. Output regularization has the property that it is invariant to the parameterization of the underlying neural network.
    • Knowledge definition: To motivate output regularizers, we can view the knowledge of a model as the conditional distribution it produces over outputs given an input (Hinton et al., 2015) as opposed to the learned values of its parameters.
    • Distillation definition: explicitly training a small network to assign the same probabilities to incorrect classes as a large network or ensemble of networks that generalizes well.
    • Two output regularizers:
      • A maximum entropy based confidence penalty;
      • Label smoothing (uniform and unigram).
      • We connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence.
    • Annealing and thresholding the confidence penalty:
      • Annealing: use a confidence penalty that is weak at the beginning of training and strong near convergence.
      • Thresholding: only penalize output distributions when their entropy is below a certain threshold.
  • Label/Objective smoothing:
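  • Label-Smoothing Regularization, proposed in "Rethinking the Inception Architecture for Computer Vision": a mechanism for encouraging the model to be less confident. Training against hard one-hot targets has two problems:
    • Over-fitting.
    • Reduced ability of the model to adapt: the gap between the largest logit and the others grows large while the gradient is bounded.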
  • Virtual adversarial training (VAT): Distributional Smoothing by Virtual Adversarial Examples
    • Another promising smoothing regularizer. However, it has multiple hyperparameters, so grid-searching them requires significantly more computation.
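
The two output regularizers above can be sketched in NumPy as follows. This is a rough illustration, not the papers' code: the function names, `beta`, `epsilon`, and `threshold` are assumptions chosen for the example, and the thresholded variant is written as an added hinge term so that the penalty is active only when the entropy falls below the threshold. The `kl_divergence` helper illustrates the link to KL direction: the confidence penalty matches KL(p‖u) with a uniform u up to a constant, while uniform label smoothing corresponds to the reverse direction KL(u‖p).

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs):
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def cross_entropy(probs, targets):
    return -(targets * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=-1)

def kl_divergence(p, q):
    p = np.clip(p, 1e-12, 1.0)
    q = np.clip(q, 1e-12, 1.0)
    return (p * np.log(p / q)).sum(axis=-1)

def confidence_penalty_loss(logits, onehot, beta=0.1):
    """Negative log-likelihood minus beta times the output entropy."""
    probs = softmax(logits)
    return (cross_entropy(probs, onehot) - beta * entropy(probs)).mean()

def thresholded_penalty_loss(logits, onehot, beta=0.1, threshold=0.5):
    """Penalize only outputs whose entropy is below `threshold` (hinge form)."""
    probs = softmax(logits)
    hinge = np.maximum(0.0, threshold - entropy(probs))
    return (cross_entropy(probs, onehot) + beta * hinge).mean()

def smooth_labels(onehot, epsilon=0.1, prior=None):
    """Mix one-hot labels with a prior: uniform by default, or a unigram
    (class-frequency) distribution if `prior` is given."""
    k = onehot.shape[-1]
    if prior is None:
        prior = np.full(k, 1.0 / k)
    return (1.0 - epsilon) * onehot + epsilon * np.asarray(prior, dtype=float)

onehot = np.array([[1.0, 0.0, 0.0, 0.0]])
smoothed = smooth_labels(onehot, epsilon=0.1)
confident_logits = np.array([[10.0, 0.0, 0.0, 0.0]])
```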

Uncertainty

Long-tailed Recognition

  • Decoupling Representation and Classifier for Long-Tailed Recognition-ICLR 2020
    • Representation Learning: We first train models to learn representations with different sampling strategies, including the standard instance-based sampling, class-balanced sampling and a mixture of them.
    • Classification: We study three different basic approaches to obtain a classifier with balanced decision boundaries, on top of the learned representations.
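
One way to picture the sampling strategies above is as a per-class sampling probability proportional to the class size raised to a power q: q = 1 gives instance-based sampling and q = 0 gives class-balanced sampling. The sketch below assumes that formulation. The function names and the class counts are made up for illustration, and the epoch-dependent interpolation is only one possible reading of "a mixture of them".

```python
import numpy as np

def class_sampling_probs(class_counts, q):
    """Probability of drawing each class, proportional to count ** q.
    Assumes every class has at least one example."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts ** q
    return weights / weights.sum()

def mixed_sampling_probs(class_counts, epoch, total_epochs):
    """Interpolate from instance-based (start of training) to class-balanced (end)."""
    t = epoch / total_epochs
    instance_based = class_sampling_probs(class_counts, 1.0)
    class_balanced = class_sampling_probs(class_counts, 0.0)
    return (1.0 - t) * instance_based + t * class_balanced

# Hypothetical long-tailed dataset: one head class and two tail classes.
class_counts = [900, 90, 10]
instance_based = class_sampling_probs(class_counts, 1.0)
class_balanced = class_sampling_probs(class_counts, 0.0)
halfway = mixed_sampling_probs(class_counts, epoch=5, total_epochs=10)
```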

Meta-learning

  • Confusion on the definition of Meta-learning

  • Few-shot Learning is an instantiation of Meta-learning
  • MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
    • Noisy Weighting: estimate the confidence scores of all the images in the noisy subset; MetaCleaner compares these representations in the feature space => discover relations between images => generate the confidence score of each image in the subset.
    • Clean Hallucinating: hallucinate a ‘clean’ representation of a class from the noisy subset by summarizing the noisy images with their confidence scores.
    • MetaCleaner as a new layer before classifier: batch size K×N=>K, K categories, N images per class in the batch.
    • Different from prototypical network, our MetaCleaner mainly develops a robust classifier to reduce confusion of noisy labels. Hence, it adaptively uses the weighted prototype as a ‘clean’ representation to generalize softmax classifier, instead of using the mean prototype to construct a metric classifier for low-shot learning.
    • Why is this called meta-learning?
  • Learning to Learn From Noisy Labeled Data-CVPR 2019
    • My Understanding: https://github.com/LiJunnan1992/MLNT/issues/1
      • Iteratively Improve the Teacher/Oracle == Soft Target
      • Meta-objective (meta-training/meta-testing): The meta-training sees synthetic noisy training examples. After training on them, the meta-testing evaluates the model's consistency with the oracle and aims to maximise that consistency, i.e., making the model unaffected after seeing synthetic noise.
    • Reddit Analysis: Extremely complex in practice. However, the ideas are interesting and novel.
  • Learning to Reweight Examples for Robust Deep Learning-ICML 2018: example weights are chosen to simultaneously minimize the loss on a clean, unbiased validation set.
    • Meta-objective: a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions.
    • Solution: If a pair of training and validation examples are very similar and also provide similar gradient directions, then this training example is helpful and should be up-weighted. Conversely, if they provide opposite gradient directions, this training example is harmful and should be down-weighted.
  • Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting-NeurIPS 2019
    • The major difference with Learning to Reweight Examples is that the weights are implicitly learned there, without an explicit weighting function.
      • I am skeptical and not convinced here!
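
The gradient-direction idea behind Learning to Reweight Examples can be sketched as below. This is a simplified illustration rather than the paper's algorithm: it assumes a logistic-regression model so that per-example gradients have a closed form, it scores each training example by the inner product between its gradient and the mean gradient on the clean validation set, and it clips negative scores to zero before normalizing. All names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grads(theta, X, y):
    """Gradient of the logistic loss w.r.t. theta, one row per example."""
    return (sigmoid(X @ theta) - y)[:, None] * X

def alignment_weights(theta, X_train, y_train, X_val, y_val):
    """Up-weight training examples whose gradient agrees with the validation
    gradient; examples with an opposing gradient get weight zero."""
    g_train = per_example_grads(theta, X_train, y_train)
    g_val = per_example_grads(theta, X_val, y_val).mean(axis=0)
    raw = np.maximum(0.0, g_train @ g_val)
    total = raw.sum()
    return raw / total if total > 0 else raw

theta = np.zeros(2)
X_val = np.array([[1.0, 0.0]])
y_val = np.array([1.0])
# Two training examples with identical inputs; the second has a flipped label.
X_train = np.array([[1.0, 0.0], [1.0, 0.0]])
y_train = np.array([1.0, 0.0])
weights = alignment_weights(theta, X_train, y_train, X_val, y_val)
```

In this toy case the correctly labeled example receives all of the weight and the flipped one receives none.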

Ensemble methods

  • Ensemble methods in machine learning
    • Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions.
    • The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier.
      • Bayesian voting: enumerating the hypotheses.
      • Bagging: Bagging presents the learning algorithm with a training set that consists of a sample of m training examples drawn randomly with replacement from the original training set of m items.
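
A small NumPy sketch of the two mechanics described above: drawing a bootstrap sample of m examples with replacement, and combining classifier outputs by a weighted vote. The function names, the toy predictions, and the vote weights are illustrative assumptions. Training the individual classifiers is left out.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw m examples with replacement from a training set of m items."""
    m = len(X)
    idx = rng.integers(0, m, size=m)
    return X[idx], y[idx]

def weighted_vote(predictions, weights, n_classes):
    """Combine predicted labels of shape (n_models, n_points) by weighted vote."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    scores = np.zeros((predictions.shape[1], n_classes))
    for preds, w in zip(predictions, weights):
        # Add this model's weight to the class it predicted for each point.
        scores[np.arange(len(preds)), preds] += w
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
X_boot, y_boot = bootstrap_sample(X, y, rng)

# Three hypothetical classifiers voting on two data points.
predictions = [[0, 1], [1, 1], [0, 0]]
equal_vote = weighted_vote(predictions, [1.0, 1.0, 1.0], n_classes=2)
skewed_vote = weighted_vote(predictions, [1.0, 5.0, 1.0], n_classes=2)
```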

Kullback-Leibler Divergence

What is the main difference between GAN and autoencoder?

What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?


© 2019-2023. All rights reserved.

Welcome to Xinshao Wang's Personal Website