# Notes on Core ML Techniques


- Kullback-Leibler Divergence
- What is the main difference between GAN and autoencoder?
- What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?
- Knowledge Distillation
- Confidence penalty & Label Smoothing & Output Regularisation
- Uncertainty
- Long-tailed Recognition (sample imbalance)
- Meta-learning
- Ensemble methods

### Knowledge Distillation

- Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
  - **Knowledge definition**: A more abstract view of the knowledge, one that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.
  - **Motivation**: A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. The resulting ensemble is cumbersome and may be too computationally expensive.
  - **Soft targets**: An obvious way to **transfer the generalization ability of the cumbersome model to a small model** is to use the class probabilities produced by the cumbersome model as **“soft targets” for training the small model.** When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and with a much higher learning rate.
  - **Goal**: **Compress/distill the knowledge in an ensemble into a single model**, which is much easier to deploy.
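
A minimal sketch of soft targets, assuming the temperature-scaled softmax from the distillation paper (the notes above do not spell it out). The logits and the temperature `T` are made-up values, and `distill_loss` is only the soft-target term, not a full training objective.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

# Made-up logits for one training case with three classes.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [3.0, 1.5, 0.5]
T = 4.0  # assumed temperature

hard_target = [1.0, 0.0, 0.0]              # zero entropy: says nothing about wrong classes
soft_target = softmax(teacher_logits, T)   # high entropy: keeps ratios between wrong classes
student_probs = softmax(student_logits, T)

# Soft-target term of the distillation objective.
distill_loss = cross_entropy(soft_target, student_probs)
```

Comparing `entropy(hard_target)` with `entropy(soft_target)` shows why a soft target carries more information per training case.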

### Confidence penalty & Label Smoothing & Output Regularisation

- Regularizing Neural Networks by Penalizing Confident Output Distributions
  - **Output regularisation**: Regularizing the output distribution of large, deep neural networks has largely been unexplored. Output regularization has the property that it is invariant to the parameterization of the underlying neural network.
  - **Knowledge definition**: To motivate output regularizers, we can view the knowledge of a model as the conditional distribution it produces over outputs given an input (Hinton et al., 2015), as opposed to the learned values of its parameters.
  - **Distillation definition**: explicitly training a small network to assign the same probabilities to incorrect classes as a large network or ensemble of networks that generalizes well.
  - **Two output regularizers**:
    - A maximum entropy based confidence penalty;
    - Label smoothing (uniform and unigram).
  - The paper connects the maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence.
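
The two regularizers and their KL connection can be checked numerically. This is a sketch under assumptions: the prediction `p`, the smoothing strength `epsilon`, and the penalty weight `beta` are made-up values, and only uniform label smoothing is shown.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(target, p):
    return -sum(t * math.log(pi) for t, pi in zip(target, p) if t > 0)

def kl(a, b):
    """KL(a || b) in nats."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def smooth_labels(one_hot, epsilon):
    """Uniform label smoothing: mix the one-hot target with the uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

def confidence_penalty_loss(one_hot, p, beta):
    """Cross-entropy minus beta times the entropy of the prediction."""
    return cross_entropy(one_hot, p) - beta * entropy(p)

def label_smoothing_loss(one_hot, p, epsilon):
    return cross_entropy(smooth_labels(one_hot, epsilon), p)

one_hot = [1.0, 0.0, 0.0]
p = [0.7, 0.2, 0.1]      # made-up model output
u = [1.0 / 3] * 3        # uniform distribution over 3 classes

# Confidence penalty: -H(p) equals KL(p || u) up to the constant log K.
penalty_as_kl = kl(p, u) - math.log(3)
# Label smoothing: the extra term is KL(u || p) up to the constant log K.
smoothing_as_kl = (1 - 0.1) * cross_entropy(one_hot, p) + 0.1 * (kl(u, p) + math.log(3))
```

The only difference between the two penalties is the order of the arguments to `kl`, which is the "direction of the KL divergence" mentioned above.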

- Annealing and thresholding the confidence penalty
  - Annealing: use a confidence penalty that is weak at the beginning of training and strong near convergence.
  - Thresholding: only penalize output distributions when they are below a certain entropy threshold.
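
Both variants fit in a few lines. In this sketch the linear ramp in `annealed_beta` is an assumed schedule (the notes only say "weak at the beginning, strong near convergence"), and the hinge in `thresholded_penalty` is one direct way to penalize only below an entropy threshold. All numeric values are illustrative.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def annealed_beta(step, total_steps, beta_max):
    """Assumed linear ramp: 0 at the start of training, beta_max at the end."""
    return beta_max * min(1.0, step / total_steps)

def thresholded_penalty(p, beta, entropy_threshold):
    """Hinge penalty: nonzero only when the entropy of p is below the threshold."""
    return beta * max(0.0, entropy_threshold - entropy(p))

uncertain = [1.0 / 3] * 3          # entropy log(3), above the threshold used below
confident = [0.98, 0.01, 0.01]     # entropy close to 0.11 nats

penalty_uncertain = thresholded_penalty(uncertain, beta=0.1, entropy_threshold=0.5)
penalty_confident = thresholded_penalty(confident, beta=0.1, entropy_threshold=0.5)
```

The penalty is added to the training loss, so only the over-confident prediction is pushed back towards higher entropy.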

- Label/objective smoothing:
  - Smoothing the labels with a uniform distribution: Rethinking the Inception Architecture for Computer Vision
  - Smoothing the labels with a teacher model: Distilling the Knowledge in a Neural Network (Hinton et al., 2015)
  - Smoothing the labels with the model’s own distribution: Training Deep Neural Networks on Noisy Labels with Bootstrapping (Reed et al., 2014)
  - Simply adding label noise: DisturbLabel: Regularizing CNN on the Loss Layer (CVPR 2016)

**Distillation and self-distillation both regularize a network by incorporating information about the ratios between incorrect classes.**

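- Label-Smoothing Regularization, proposed in Rethinking the Inception Architecture for Computer Vision: a mechanism for encouraging the model to be less confident. It targets two problems of training with hard one-hot labels:
  - Over-fitting
  - Reduced ability of the model to **adapt: bounded gradient** (the gap between the largest logit and the others grows large, and the gradient is bounded)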

- Virtual adversarial training (VAT): Distributional Smoothing by Virtual Adversarial Examples
  - Another promising smoothing regularizer. However, it has multiple hyperparameters and needs significantly more computation for grid search.

### Uncertainty

- Predictive Uncertainty Estimation via Prior Networks (NeurIPS 2018)
- Information Constraints on Auto-Encoding Variational Bayes (NeurIPS 2018)
- The information bottleneck (IB) principle: The Information Bottleneck Method

### Long-tailed Recognition

- Decoupling Representation and Classifier for Long-Tailed Recognition (ICLR 2020)
  - Representation learning: We first train models to learn representations with different sampling strategies, including the standard instance-based sampling, class-balanced sampling, and a mixture of them.
  - Classification: We study three different basic approaches to obtain a classifier with balanced decision boundaries, on top of the learned representations.
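
The sampling strategies differ only in how likely each class is to be drawn. The sketch below assumes the family $p_j \propto n_j^q$ (with $q=1$ for instance-based and $q=0$ for class-balanced sampling) and reads the "mixture" as a linear interpolation over epochs. The class counts and the helper names are hypothetical.

```python
def sampling_probs(class_counts, q):
    """Probability of drawing each class when class j is weighted by n_j ** q."""
    weights = [n ** q for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]

def progressive_probs(class_counts, epoch, total_epochs):
    """Assumed mixture: move linearly from instance-based to class-balanced sampling."""
    t = epoch / total_epochs
    instance = sampling_probs(class_counts, 1.0)
    balanced = sampling_probs(class_counts, 0.0)
    return [(1 - t) * pi + t * pb for pi, pb in zip(instance, balanced)]

counts = [1000, 100, 10]                       # made-up long-tailed class sizes
instance_based = sampling_probs(counts, 1.0)   # head class dominates
class_balanced = sampling_probs(counts, 0.0)   # every class equally likely
square_root = sampling_probs(counts, 0.5)      # in between
```

With these counts the tail class is drawn in under 1% of cases with instance-based sampling and in one third of cases with class-balanced sampling.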

### Meta-learning

- Few-shot Learning is an instantiation of Meta-learning
- MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
  - Noisy Weighting: estimate the confidence scores of all the images in the noisy subset. **MetaCleaner compares these representations in the feature space => discovers relations between images => generates the confidence score of each image in the subset.**
  - Clean Hallucinating: hallucinate a ‘clean’ representation of a class from the noisy subset by summarizing the noisy images with their confidence scores. **MetaCleaner acts as a new layer before the classifier: batch size $K \times N \Rightarrow K$, with $K$ categories and $N$ images per class in the batch.**
  - Different from the prototypical network, MetaCleaner mainly develops a robust classifier to reduce the confusion caused by noisy labels. Hence, it adaptively uses the weighted prototype as a ‘clean’ representation to generalize the softmax classifier, instead of using the mean prototype to construct a metric classifier for low-shot learning.
  - **Why is this called meta-learning?**
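
A rough sketch of the $K \times N \Rightarrow K$ reduction described for MetaCleaner. The scoring rule in `confidence_scores` (softmax over closeness to the class mean) is a hypothetical stand-in for the learned Noisy Weighting module, not the paper's network. The features are made up.

```python
import math

def confidence_scores(features):
    """Hypothetical stand-in: score each feature by its closeness to the class mean."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    neg_dist = [-math.dist(f, mean) for f in features]
    m = max(neg_dist)
    exps = [math.exp(s - m) for s in neg_dist]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_prototype(features, scores):
    """Summarize N noisy features into one representation using their scores."""
    dim = len(features[0])
    return [sum(s * f[d] for s, f in zip(scores, features)) for d in range(dim)]

# K = 2 classes, N = 3 images per class; the last "cat" feature plays a mislabeled image.
batch = {
    "cat": [[1.0, 0.0], [1.1, 0.1], [5.0, 5.0]],
    "dog": [[0.0, 1.0], [0.1, 1.1], [0.0, 0.9]],
}
cat_scores = confidence_scores(batch["cat"])
prototypes = {c: weighted_prototype(f, confidence_scores(f)) for c, f in batch.items()}
```

The six input features become two prototypes, and the outlier contributes less to the "cat" prototype than it would to a plain mean.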

- Learning to Learn From Noisy Labeled Data (CVPR 2019)
  - My understanding: https://github.com/LiJunnan1992/MLNT/issues/1
  - Iteratively improve the teacher/oracle == soft target
  - **Meta-objective (training/testing)**: Meta-training sees synthetic noisy training examples. After training on them, meta-testing evaluates the model’s consistency with the oracle and aims to maximise that consistency, i.e., to make the model unaffected after seeing synthetic noise.
  - Reddit analysis: extremely complex in practice. However, the ideas are interesting and novel.
- Learning to Reweight Examples for Robust Deep Learning (ICML 2018)
  - Simultaneously minimize the loss on **a clean unbiased validation set**.
  - **Meta-objective**: a novel meta-learning algorithm that learns to assign weights to training examples based on their gradient directions.
  - **Solution**: Suppose that **a pair of training and validation examples are very similar** and also provide **similar gradient directions**; then this training example is helpful and should be up-weighted. Conversely, if they provide **opposite gradient directions**, this training example is harmful and should be down-weighted.
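
The gradient-direction rule can be sketched with a linear model. This is a first-order reading of the idea, not the paper's full algorithm: each training example is weighted by the clipped dot product between its gradient and the mean gradient on the clean validation set, then the weights are normalized. The squared loss, the data, and the function names are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

def example_weights(w, train, val):
    """Up-weight training examples whose gradient agrees with the validation gradient."""
    val_grads = [grad(w, x, y) for x, y in val]
    g_val = [sum(g[d] for g in val_grads) / len(val_grads) for d in range(len(w))]
    raw = [max(0.0, dot(grad(w, x, y), g_val)) for x, y in train]  # clip opposing gradients
    total = sum(raw)
    return [r / total if total > 0 else 0.0 for r in raw]

w = [0.0, 0.0]
val = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]   # clean examples consistent with w* = [1, 2]
train = [
    ([1.0, 1.0], 3.0),    # clean: gradient points the same way as validation
    ([1.0, 1.0], -3.0),   # corrupted label: gradient points the opposite way
    ([1.0, 0.0], 1.0),    # clean
]
weights = example_weights(w, train, val)
```

Here the corrupted example gets weight zero and the two clean examples share the remaining weight.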

- Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting (NeurIPS 2019)
  - The major difference from Learning to Reweight Examples is that the weights are implicitly learned there, without an explicit weighting function. **I am skeptical and not convinced here!**

### Ensemble methods

- Ensemble methods in machine learning
  - Ensemble methods are learning algorithms that construct **a set of classifiers** and then classify new data points by taking **a weighted vote of their predictions.**
  - The original ensemble method is **Bayesian averaging**, but more recent algorithms include error-correcting output coding, bagging, and boosting. **This paper reviews these methods and explains why ensembles can often perform better than any single classifier.**
  - Bayesian voting: enumerating the hypotheses.
  - Bagging: Bagging presents the learning algorithm with a training set that consists of a sample of $m$ training examples drawn randomly with replacement from the original training set of $m$ items.
  - …
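
Bagging as described above can be sketched with the standard library. The base learner `fit_stump` is a toy threshold classifier chosen for brevity, the vote uses equal weights, and the data set is made up. None of this comes from the reviewed paper.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw m examples with replacement from a training set of m items."""
    m = len(data)
    return [data[rng.randrange(m)] for _ in range(m)]

def fit_stump(sample):
    """Toy base learner: threshold halfway between the two class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    if not pos or not neg:  # degenerate bootstrap sample: predict the only class seen
        label = 1 if pos else 0
        return lambda x: label
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0

def bagging_predict(models, x):
    """Equal-weight majority vote over the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(11)]
```

Calling `bagging_predict(models, x)` then classifies a new point by vote. An odd number of models avoids ties between the two classes.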


### Kullback-Leibler Divergence

- How to approximate our data (choose a parameterized distribution => optimise its parameters): KL divergence measures how much information we lose when we choose an approximation instead of our observations.
  - The most important metric in information theory is called **Entropy**, typically denoted as $\mathbf{H}$. The definition of entropy for a probability distribution is $\mathbf{H}=-\sum_{i=1}^{n} p(\mathbf{x}_i) \log p(\mathbf{x}_i)$.
  - If we use $\log_2$ for our calculation, we can interpret entropy as “the minimum number of bits it would take us to encode our information”.
- Intuitive Guide to Understanding KL Divergence
  - What is a distribution?
  - What is an event?
  - Problem we’re trying to solve: choose a parameterized distribution, then optimise its parameters.
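
A short sketch of both quantities in bits. The observed distribution and the uniform approximation are made-up values. `kl_divergence_bits(p, q)` follows the usual definition $\sum_i p(\mathbf{x}_i) \log_2 \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}$, with `p` as the observations and `q` as the approximation.

```python
import math

def entropy_bits(p):
    """Entropy in bits: -sum p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence_bits(p, q):
    """KL(p || q) in bits: information lost when q is used to approximate p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

observed = [0.5, 0.25, 0.25]   # made-up empirical distribution
uniform = [1.0 / 3] * 3        # candidate approximation

h = entropy_bits(observed)                       # 1.5 bits
loss = kl_divergence_bits(observed, uniform)     # log2(3) - 1.5 bits
```

The divergence is zero only when the approximation matches the observations exactly, and it is not symmetric in its two arguments.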

### What is the main difference between GAN and autoencoder?

- An autoencoder learns to represent some input information very efficiently, and subsequently how to reconstruct the input from its compressed form. It compresses its input down to a vector with far fewer dimensions than the input data, and then transforms it back into a tensor with the same shape as its input over several neural net layers. Autoencoders are trained to reproduce their input, so it’s kind of like learning a compression algorithm for that specific dataset.
- A GAN uses an adversarial feedback loop to learn how to generate some information that “seems real” (i.e. looks the same/sounds the same/is otherwise indistinguishable from some real data). Instead of being given a bit of data as input, the generator network is given a small vector of random numbers, which it tries to transform into a realistic sample resembling the training data. The discriminator network then takes this generated sample (and some real samples from the dataset) and learns to guess whether the samples are real or fake.
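
The difference in data flow can be shown with a toy linear example. Everything here is a hypothetical stand-in: the weights `U` are fixed by hand rather than trained, the generator simply reuses the decoder's linear map, and no discriminator is included.

```python
import random

# Hypothetical fixed weights: a unit direction in 2-D shared by encoder and decoder.
U = (0.6, 0.8)

def encode(x):
    """Encoder: compress a 2-D input to a 1-D code (projection onto U)."""
    return x[0] * U[0] + x[1] * U[1]

def decode(z):
    """Decoder: map the 1-D code back to the 2-D input space."""
    return (z * U[0], z * U[1])

def generate(rng):
    """GAN-style generator flow: starts from random noise, not from a data point."""
    z = rng.gauss(0.0, 1.0)
    return decode(z)  # reuses the same linear map purely for illustration

x = (3.0, 4.0)                        # lies on the line spanned by U
x_hat = decode(encode(x))             # autoencoder path: data -> code -> reconstruction
sample = generate(random.Random(0))   # generator path: noise -> sample
```

The autoencoder path needs an input to reconstruct, while the generator path needs only noise. In a real GAN, a separate discriminator would judge whether `sample` looks like the training data.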

- Building Autoencoders in Keras
- Coding: GANs vs. Autoencoders: Comparison of Deep Generative Models

### What’s the difference between a Variational Autoencoder (VAE) and an Autoencoder?

Intuitively Understanding Variational Autoencoders – Towards Data Science by Irhum Shafkat.

Read Vishal Sharma's answer to What's the difference between a Variational Autoencoder (VAE) and an Autoencoder? on Quora

- Building Autoencoders in Keras
- VAEs and GANs Mihaela Rosca
- Going Beyond GAN? New DeepMind VAE Model Generates High Fidelity Human Faces