Notes on Core ML Techniques

14 Feb 2020 in Blogs

Kullback-Leibler Divergence
What is the main difference between GAN and autoencoder?
What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?
Knowledge Distillation
Confidence penalty & Label Smoothing & Output Regularisation
Uncertainty
Long-tailed Recognition-Sample Imbalance
Meta-learning
Ensemble methods

Knowledge Distillation

Distilling the Knowledge in a Neural Network
- Knowledge definition: A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.
- An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.
- A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then average their predictions. However, predicting with a whole ensemble is cumbersome and may be too computationally expensive to deploy.
- Distillation compresses the knowledge in an ensemble of models into a single model that is much easier to deploy.
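The soft-target idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the logits, the temperature value, and the function names are all hypothetical, and the usual extra term for the hard labels is left out.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's soft targets and the student's softened output."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

# Hypothetical logits for one training case with three classes.
teacher_logits = [8.0, 2.0, 0.5]
student_logits = [5.0, 1.0, 1.0]

hard_like = softmax(teacher_logits, temperature=1.0)  # close to one-hot
soft = softmax(teacher_logits, temperature=4.0)       # higher entropy
loss = distillation_loss(student_logits, teacher_logits)
```

Raising the temperature increases the entropy of the teacher's distribution, which is what lets the soft targets carry information about the relative probabilities of the incorrect classes.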

Confidence penalty & Label Smoothing & Output Regularisation

Regularizing Neural Networks by Penalizing Confident Output Distributions
- Output regularisation: Regularizing the output distribution of large, deep neural networks has largely been unexplored. Output regularization has the property that it is invariant to the parameterization of the underlying neural network.
- Knowledge definition: To motivate output regularizers, we can view the knowledge of a model as the conditional distribution it produces over outputs given an input (Hinton et al., 2015) as opposed to the learned values of its parameters.
- Distillation definition: explicitly training a small network to assign the same probabilities to incorrect classes as a large network or ensemble of networks that generalizes well.
- Two output regularizers:
  - A maximum entropy based confidence penalty;
  - Label smoothing (uniform and unigram).
  - We connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence.
- Annealing and thresholding the confidence penalty:
  - Annealing: use a confidence penalty that is weak at the beginning of training and strong near convergence.
  - Thresholding: only penalize output distributions whose entropy falls below a certain threshold.
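A rough sketch of the confidence penalty described above: the per-example loss is the negative log-likelihood minus $\beta$ times the entropy of the output distribution. The thresholded variant and the linear annealing schedule below are my own reading of the two bullets, so treat their exact form, the parameter names, and the default values as assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def confidence_penalty_loss(probs, target, beta=0.1, threshold=None):
    """Negative log-likelihood with a penalty on low-entropy (over-confident) outputs.

    Without a threshold: nll - beta * H(p).
    With a threshold (assumed hinge form): the penalty is only active
    while the entropy is below the threshold.
    """
    nll = -math.log(probs[target])
    h = entropy(probs)
    if threshold is None:
        return nll - beta * h
    return nll + beta * max(0.0, threshold - h)

def annealed_beta(step, total_steps, beta_max=0.1):
    """Linear ramp from 0 to beta_max (the schedule shape is an assumption)."""
    return beta_max * min(1.0, step / total_steps)

confident = [0.98, 0.01, 0.01]  # low entropy, gets penalized under the threshold
uncertain = [0.60, 0.20, 0.20]  # entropy above the threshold, left alone

loss_confident = confidence_penalty_loss(confident, 0, beta=0.1, threshold=0.5)
loss_uncertain = confidence_penalty_loss(uncertain, 0, beta=0.1, threshold=0.5)
```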
Label/objective smoothing:
- Smoothing the labels with a uniform distribution: Rethinking the Inception Architecture for Computer Vision.
- Smoothing the labels with a teacher model: Distilling the Knowledge in a Neural Network (Hinton et al., 2015).
- Smoothing the labels with the model’s own distribution: Training Deep Neural Networks on Noisy Labels with Bootstrapping (Reed et al., 2014).
- Simply adding label noise: DisturbLabel: Regularizing CNN on the Loss Layer (CVPR 2016).
- Distillation and self-distillation both regularize a network by incorporating information about the ratios between incorrect classes.
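Label-smoothing regularization, proposed in Rethinking the Inception Architecture for Computer Vision, is a mechanism for encouraging the model to be less confident. It targets two problems of training on hard one-hot labels:
- Over-fitting.
- Reduced ability of the model to adapt: the gap between the largest logit and all the others grows large, which, combined with the bounded gradient, makes the model slow to adapt.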
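For reference, uniform label smoothing replaces the one-hot target with $(1-\epsilon)\,\delta_{k,y} + \epsilon / K$ for $K$ classes. A minimal sketch, with hypothetical function names and an illustrative $\epsilon$:

```python
import math

def smooth_labels(target, num_classes, epsilon=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = 1.0 / num_classes
    return [
        (1.0 - epsilon) * (1.0 if k == target else 0.0) + epsilon * uniform
        for k in range(num_classes)
    ]

def cross_entropy(target_dist, probs):
    """Cross-entropy of predicted probabilities against a (possibly soft) target."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)

smoothed = smooth_labels(target=0, num_classes=4, epsilon=0.1)
# An over-confident prediction is now charged for the mass it withholds
# from the other classes.
overconfident = [0.997, 0.001, 0.001, 0.001]
loss = cross_entropy(smoothed, overconfident)
```

Because every class keeps a small target probability, the loss can no longer be driven towards zero by pushing one logit arbitrarily far above the rest.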
Virtual adversarial training (VAT): Distributional Smoothing by Virtual Adversarial Examples
- Another promising smoothing regularizer. However, it has multiple hyperparameters and requires significantly more computation for grid search.

Uncertainty

Long-tailed Recognition

Decoupling Representation and Classifier for Long-Tailed Recognition (ICLR 2020)
- Representation Learning: We first train models to learn representations with different sampling strategies, including the standard instance-based sampling, class-balanced sampling and a mixture of them.
- Classification: We study three different basic approaches to obtain a classifier with balanced decision boundaries, on top of the learned representations.
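As I understand the paper, the sampling strategies can be written with one formula: the probability of drawing class $j$ is $p_j = n_j^q / \sum_i n_i^q$, where $n_j$ is the class size, $q = 1$ gives instance-based sampling and $q = 0$ gives class-balanced sampling. The sketch below also includes one possible mixture that moves linearly from the first to the second over training; the class counts and function names are made up for illustration.

```python
def class_sampling_probs(class_counts, q):
    """Probability of sampling each class: n_j**q / sum_i n_i**q."""
    weights = [n ** q for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]

def mixed_sampling_probs(class_counts, epoch, total_epochs):
    """Interpolate from instance-based (early) to class-balanced (late) sampling."""
    t = epoch / total_epochs
    instance_based = class_sampling_probs(class_counts, 1.0)
    class_balanced = class_sampling_probs(class_counts, 0.0)
    return [(1.0 - t) * a + t * b for a, b in zip(instance_based, class_balanced)]

# Hypothetical long-tailed dataset: one head class and two tail classes.
counts = [900, 90, 10]
instance_based = class_sampling_probs(counts, q=1.0)  # follows the data frequency
class_balanced = class_sampling_probs(counts, q=0.0)  # every class equally likely
halfway = mixed_sampling_probs(counts, epoch=50, total_epochs=100)
```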

Meta-learning

Confusion on the definition of Meta-learning
Few-shot Learning is an instantiation of Meta-learning
MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
- Noisy Weighting: estimate the confidence scores of all the images in the noisy subset; MetaCleaner compares these representations in the feature space => discover relations between images => generate the confidence score of each image in the subset.
- Clean Hallucinating: hallucinate a ‘clean’ representation of a class from the noisy subset by summarizing the noisy images with their confidence scores.
- MetaCleaner acts as a new layer before the classifier: it reduces a batch of $K \times N$ image representations to $K$ class representations, where $K$ is the number of categories and $N$ is the number of images per class in the batch.
- Different from prototypical network, our MetaCleaner mainly develops a robust classifier to reduce confusion of noisy labels. Hence, it adaptively uses the weighted prototype as a ‘clean’ representation to generalize softmax classifier, instead of using the mean prototype to construct a metric classifier for low-shot learning.
- Why is this called meta-learning?
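The Clean Hallucinating step can be pictured as a confidence-weighted prototype. The sketch below is only an illustration of that idea: the scoring network from Noisy Weighting is omitted, the raw scores are hypothetical inputs, and normalising them with a softmax is my assumption rather than something stated in these notes.

```python
import math

def hallucinate_clean(features, scores):
    """Summarize N noisy feature vectors of one class into a single weighted prototype."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # confidence scores that sum to 1
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

# Three images of one class; the third looks like a mislabeled outlier.
features = [[1.0, 0.0], [3.0, 0.0], [2.0, 6.0]]

mean_prototype = hallucinate_clean(features, scores=[0.0, 0.0, 0.0])
weighted_prototype = hallucinate_clean(features, scores=[2.0, 2.0, -2.0])
```

With equal scores this reduces to the mean prototype of a prototypical network; unequal scores pull the prototype away from the low-confidence image.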
Learning to Learn From Noisy Labeled Data (CVPR 2019)
- My Understanding: https://github.com/LiJunnan1992/MLNT/issues/1
  - Iteratively Improve the Teacher/Oracle == Soft Target
  - Meta-objective (meta-training/meta-testing): meta-training sees synthetic noisy training examples. After training on them, meta-testing evaluates the model's consistency with the oracle and aims to maximise it, i.e., making the model unaffected by the synthetic noise.
- Reddit Analysis: Extremely complex in practice. However, the ideas are interesting and novel.
Learning to Reweight Examples for Robust Deep Learning (ICML 2018)
- Example weights are chosen to simultaneously minimize the loss on a clean, unbiased validation set.
- Meta-objective: a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions.
- Solution: Suppose that a training example and a validation example are very similar and also provide similar gradient directions; then this training example is helpful and should be up-weighted. Conversely, if they provide opposite gradient directions, the training example is harmful and should be down-weighted.
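The gradient-direction intuition above can be sketched with toy vectors. This is a simplification under stated assumptions: the gradients are given directly as lists instead of being computed by backpropagation, the weight is the rectified dot product with the validation gradient, and the weights are normalised to sum to one. The function name is hypothetical.

```python
def example_weights(train_grads, val_grad):
    """Weight each training example by how well its gradient agrees with the validation gradient."""
    raw = [
        max(0.0, sum(g * v for g, v in zip(grad, val_grad)))  # opposing gradients get zero
        for grad in train_grads
    ]
    total = sum(raw)
    if total == 0:
        return raw  # no example agrees with the validation set
    return [r / total for r in raw]

# Toy per-example gradients in a two-parameter model.
train_grads = [
    [1.0, 0.0],   # same direction as the validation gradient: helpful
    [-1.0, 0.0],  # opposite direction: harmful
    [0.5, 0.5],   # partially aligned
]
val_grad = [1.0, 0.0]

weights = example_weights(train_grads, val_grad)
```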
Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting (NeurIPS 2019)
- The major difference from Learning to Reweight Examples is that there the weights are learned implicitly, without an explicit weighting function.
  - I am skeptical and not convinced here!

Ensemble methods

Ensemble methods in machine learning
- Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions.
- The original ensemble method is Bayesian averaging but more recent algorithms include error correcting output coding, Bagging and boosting. This paper reviews these methods and explains why ensembles can often perform better than any single classifier.
  - Bayesian voting: enumerating the hypotheses.
  - Bagging: Bagging presents the learning algorithm with a training set that consists of a sample of $m$ training examples drawn randomly with replacement from the original training set of $m$ items.
  - …
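The bagging and voting steps above can be sketched as follows. The dataset, the seed, and the function names are placeholders, and the base learner itself is left out; only the bootstrap resampling and the weighted vote are shown.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw m items with replacement from a dataset of m items."""
    m = len(dataset)
    return [dataset[rng.randrange(m)] for _ in range(m)]

def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

rng = random.Random(0)         # fixed seed so the sketch is repeatable
dataset = list(range(10))      # stand-in for m = 10 training examples
replicates = [bootstrap_sample(dataset, rng) for _ in range(3)]

# Each replicate would train one classifier; their predictions are then combined.
label = weighted_vote(["cat", "dog", "dog"])
```

Because sampling is with replacement, each replicate typically repeats some examples and misses others, which is what makes the trained classifiers differ from one another.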

Kullback-Leibler Divergence

How to approximate our data (choose a parameterized distribution => optimise its parameters): KL Divergence helps us to measure just how much information we lose when we choose an approximation compared with our observations.
- The most important metric in information theory is called Entropy, typically denoted as $\mathbf{H}$. The definition of Entropy for a probability distribution is: $\mathbf{H}=-\sum_{i=1}^{n} p(\mathbf{x}_i) \log p(\mathbf{x}_i) $.
- If we use $\log_2$ for our calculation we can interpret entropy as “the minimum number of bits it would take us to encode our information”.
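A small sketch of the entropy definition above using $\log_2$, so the result reads as a number of bits. The example distributions and the function name are illustrative.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
eight_outcomes = [1.0 / 8.0] * 8
biased_coin = [0.9, 0.1]

h_coin = entropy_bits(fair_coin)        # one bit per toss
h_eight = entropy_bits(eight_outcomes)  # three bits for eight equally likely outcomes
h_biased = entropy_bits(biased_coin)    # less than one bit: the outcome is more predictable
```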
Intuitive Guide to Understanding KL Divergence
- What is a distribution?
- What is an event?
- Problem we’re trying to solve: choose a parameterized distribution, then optimise its parameters. KL divergence measures how much information we lose when we choose an approximation compared with our observations.
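For a discrete observed distribution $p$ and an approximation $q$, the KL divergence is $D_{KL}(p \| q) = \sum_{i=1}^{n} p(\mathbf{x}_i) \log \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}$. The sketch below compares two candidate approximations of the same observations; the distributions are invented for illustration, and the function assumes $q$ is non-zero wherever $p$ is.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

observed = [0.5, 0.3, 0.2]                 # empirical distribution of the data
uniform = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
skewed = [0.6, 0.3, 0.1]

# The candidate with the smaller divergence loses less information.
candidates = {"uniform": uniform, "skewed": skewed}
best = min(candidates, key=lambda name: kl_divergence(observed, candidates[name]))
```

Note that the measure is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so the order of the arguments matters when choosing an approximation.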

Ex-Postdoc@OxfordU
