# Progressive Self Label Correction (ProSelfLC) for Training Robust Deep Neural Networks

For any specific discussion or potential future collaboration, please feel free to contact me. As a young researcher, I would greatly appreciate your interest, stars, and citations; they mean a lot to me and my collaborators. The source code is available on request, for academic use only and provided that this work is cited.

@article{wang2020proselflc,
title={ProSelfLC: Progressive Self Label Correction
for Training Robust Deep Neural Networks},
author={Wang, Xinshao and Hua, Yang and Kodirov, Elyor and Robertson, Neil M},
journal={arXiv preprint arXiv:2005.03788},
year={2020}
}



## Storyline

• Human annotations contain bias, subjectiveness, and errors.
• Therefore, some prior work penalises low-entropy statuses, so that the fitting of wrong labels is alleviated to some degree. Representative proposals are label smoothing and confidence penalty.
• Our new finding on Entropy Minimisation:
• We can still address the problem with the minimum entropy regularisation principle;
• Diverse minimum-entropy statuses exist (e.g., when a learner perfectly fits random labels, the entropy also reaches a minimum):
• The minimum-entropy status defined by untrusted human-annotated labels is incorrect, thus leading to poor generalisation.
CCE => Non-meaningful minimum-entropy status => poor generalisation.
• We propose to redefine a more meaningful minimum-entropy status by exploiting the knowledge of a learner itself, which shows promising results.
Label correction => Meaningful low-entropy status => good generalisation.
• We highlight that ProSelfLC’s underlying principle is ‘‘contradictory’’ to Maximum-Entropy Learning, Confidence Penalty and Label Smoothing, which have been popular recently. We hope our community will think critically about two principles:
• Rewarding a correct low-entropy status (ProSelfLC)
• Penalising a non-meaningful low-entropy status (CCE+LS, or CCE+CP)
• In our experiments: ProSelfLC > (CCE+LS, or CCE+CP) > CCE
• Although they are contradictory in terms of entropy, both help, but from different angles:
• CCE fits non-meaningful patterns => LS and CP penalise such fitting;
• CCE fits non-meaningful patterns => ProSelfLC first corrects them => then fits.
• Why does CCE fit non-meaningful patterns?

## Open ML Research Questions

• Should we trust and exploit a learner’s knowledge as training goes, or always trust human annotations?
• As a learner, to trust yourself or supervision/textbooks?
• The answer should depend on what a learner has learned.
• Should we optimise a learner towards a correct low-entropy status, or penalise a low-entropy status?
• As a supervisor/evaluator, to reward or penalise a confident learner?
• Open discussion: we show it’s fine for a learner to be confident towards a correct low-entropy status. More future research attention should therefore be paid to the definition of correct knowledge, since it is generally accepted that human annotations used for learning supervision may be biased, subjective, and wrong.
• As a supervisor, before training multiple learners, to think about how to train one great learner first?
• 1st context: recently, many techniques for training multiple learners (co-training, mutual learning, knowledge distillation, adversarial training, etc.) have been proposed.
• 2nd context: in our work, we study how to train a single learner better.
• 1st personal comment: training multiple learners is much more expensive and complex;
• 2nd personal comment: when training multiple learners collaboratively, if one learner does not perform well, it tends to hurt the other learners.

## Noticeable Findings

• Rewarding low entropy (towards a meaningful status) leads to better generalisation than penalising low entropy.
• Result analysis:
• Revising the semantic class and perceptual similarity structure. Generally, the semantic class of an example is defined according to its perceptual similarities with the training classes, and is chosen to be the most similar class. Figures 3b and 3c show how different approaches behave in terms of fitting wrong labels and correcting them. We remark that ProSelfLC performs the best.

• To reward or penalise low entropy? LS and CP are proposed to penalise low entropy. On the one hand, we observe that LS and CP work, which is consistent with prior evidence. As shown in Figures 3d and 3e, the entropies of both the clean and noisy subsets are the largest for LS and CP, and correspondingly their generalisation performance is the best except for ProSelfLC in Figure 3f. On the other hand, ProSelfLC has the lowest entropy while performing the best, which demonstrates that it does not hurt for a learner to be confident. However, a learning model needs to be careful about what to be confident in. In Figures 3b and 3c, ProSelfLC has the least wrong fitting and the highest semantic class correction rate, which indicates that it is confident in learning meaningful patterns.
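To make "low entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a one-hot annotation, a label-smoothed target, and a confident prediction. The smoothing rate of 0.1 and the example distributions are hypothetical values chosen for illustration; they are not taken from the experiments in Figure 3.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def label_smoothing(one_hot, eps):
    """Standard label smoothing: mix the one-hot target with a uniform distribution."""
    n = len(one_hot)
    return [(1.0 - eps) * q + eps / n for q in one_hot]


one_hot = [0.0, 0.0, 1.0]                      # human annotation
smoothed = label_smoothing(one_hot, eps=0.1)   # hypothetical smoothing rate
confident = [0.95, 0.01, 0.04]                 # hypothetical confident prediction
uniform = [1.0 / 3.0] * 3                      # maximum-entropy reference

for name, dist in [("one-hot", one_hot), ("smoothed", smoothed),
                   ("confident", confident), ("uniform", uniform)]:
    print(f"{name:>10}: H = {entropy(dist):.4f}")
```

Label smoothing pushes the target away from the zero-entropy one-hot vector, whereas a confident prediction stays well below the entropy of the uniform distribution. Whether that confidence is desirable depends on whether it points at the correct class, which is the point made above.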

## In Self LC, a core question is not well answered:

$\textit{How much do we trust a learner to leverage its knowledge?}$

## Underlying Principle of ProSelfLC

• When a learner starts to learn, it trusts the supervision from human annotations.

This idea is inspired by the paradigm that deep models learn simple meaningful patterns before fitting noise, even when severe label noise exists in human annotations [1];

• As a learner attains confident knowledge over time, we leverage its confident knowledge to correct labels.

This is motivated by minimum entropy regularisation, which has been widely evaluated in unsupervised and semi-supervised scenarios [10, 2].
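The two ideas above (trust the annotations early, trust confident predictions later) can be sketched as a single trust score. This post only states qualitative properties of the score, so the concrete forms below are assumptions made for illustration: a logistic curve over training progress for `global_trust`, a hypothetical `sharpness` parameter, normalised entropy for `local_uncertainty`, and their combination in `trust_score`. They should not be read as the authors' exact definitions.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def global_trust(t, total_iters, sharpness=10.0):
    """Assumed logistic schedule: below 0.5 before the halfway point, above 0.5 after."""
    return 1.0 / (1.0 + math.exp(-sharpness * (t / total_iters - 0.5)))


def local_uncertainty(pred):
    """Assumed uncertainty measure: entropy normalised by the uniform entropy, in [0, 1]."""
    return min(1.0, entropy(pred) / math.log(len(pred)))


def trust_score(pred, t, total_iters, sharpness=10.0):
    """Assumed combination: high only when training is advanced AND the prediction is confident."""
    return global_trust(t, total_iters, sharpness) * (1.0 - local_uncertainty(pred))


confident = [0.95, 0.01, 0.04]
uncertain = [1.0 / 3.0] * 3

# Early in training the score stays below 0.5 even for a confident prediction.
print(trust_score(confident, t=10, total_iters=100))
# Late in training a confident prediction is trusted, an uncertain one is not.
print(trust_score(confident, t=90, total_iters=100))
print(trust_score(uncertain, t=90, total_iters=100))
```

Multiplying the two factors means either one can veto the learner: too little training or too much uncertainty both keep the human annotation dominant.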

## Design Reasons of ProSelfLC

• Regarding $g(t)$, in the earlier learning phase, i.e., $t < \Gamma/2$, we have $g(t) < 0.5 \Rightarrow \epsilon_{\mathrm{ProSelfLC}} < 0.5, \forall \mathbf{p}$, so that the human annotations dominate and ProSelfLC only modifies the similarity structure. This is because a learner that has not seen the training data enough times is assumed not to be well trained, which is one of the most elementary concepts in deep learning. Most importantly, more randomness exists in the earlier phase; as a result, the learner may output a wrong but confident prediction. In our design, $\epsilon_{\mathrm{ProSelfLC}} < 0.5, \forall \mathbf{p}$ mitigates the bad impact of such unexpected cases.
When it comes to the later learning phase, i.e., $t > \Gamma/2$, we have $g(t) > 0.5$, which means that overall we give enough credit to a learner, as it has been trained for more than half of the total iterations.

• Regarding $l(\mathbf{p})$, we discuss its effect in the later learning phase, when it becomes more meaningful. If $\mathbf{p}$ is not confident, $l(\mathbf{p})$ will be large and $\epsilon_{\mathrm{ProSelfLC}}$ will be small, which means we choose to trust a one-hot annotation more when the prediction is of high entropy, so that we can further reduce the entropy of the output distributions. In this case, ProSelfLC only modifies the similarity structure. Furthermore, when $\mathbf{p}$ is highly confident, there are two finer cases: if $\mathbf{p}$ is consistent with $\mathbf{q}$ in the semantic class, ProSelfLC only modifies the similarity structure too; if they are inconsistent, ProSelfLC further corrects the semantic class of a human annotation.

• Correct the similarity structure for every data point in all cases. Given any data point $\mathbf{x}$, by a convex combination of $\mathbf{p}$ and $\mathbf{q}$, we add the information about its relative probabilities of being different training classes using the knowledge of a learner itself.

• Revise the semantic class of an example only when the learning time is long and its prediction is confidently inconsistent. As highlighted in Table 2, only when two conditions are met, namely $\epsilon_{\mathrm{ProSelfLC}} > 0.5$ and $\arg\max_j \mathbf{p}(j|\mathbf{x}) \neq \arg\max_j \mathbf{q}(j|\mathbf{x})$, is the semantic class in $\tilde{\mathbf{q}}_{\mathrm{ProSelfLC}}$ changed to be determined by $\mathbf{p}$. For example, we can deduce $\mathbf{p} = [0.95, 0.01, 0.04], \mathbf{q} = [0, 0, 1], \epsilon_{\mathrm{ProSelfLC}}=0.8 \Rightarrow \tilde{\mathbf{q}}_{\mathrm{ProSelfLC}}=(1- \epsilon_{\mathrm{ProSelfLC}}) \mathbf{q}+\epsilon_{\mathrm{ProSelfLC}} \mathbf{p}=[0.76, 0.008, 0.232]$.
Theoretically, ProSelfLC also becomes robust to prolonged exposure to the training data, so that early stopping is not required.
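The convex combination in the worked example above can be checked with a few lines of plain Python. The function names `corrected_target` and `argmax` are illustrative helpers, not part of any released code, and the second trust value (0.3) is a hypothetical early-phase setting added for contrast.

```python
def corrected_target(annotation, prediction, eps):
    """Convex combination (1 - eps) * q + eps * p of annotation q and prediction p."""
    return [(1.0 - eps) * q + eps * p for q, p in zip(annotation, prediction)]


def argmax(dist):
    """Index of the largest entry, i.e. the semantic class of a distribution."""
    return max(range(len(dist)), key=lambda j: dist[j])


p = [0.95, 0.01, 0.04]   # confident prediction from the example above
q = [0.0, 0.0, 1.0]      # one-hot human annotation from the example above

late = corrected_target(q, p, eps=0.8)    # trust score from the example above
early = corrected_target(q, p, eps=0.3)   # hypothetical early-phase trust score

print(late, "semantic class changed:", argmax(late) != argmax(q))
print(early, "semantic class changed:", argmax(early) != argmax(q))
```

With a trust score above 0.5 and a prediction that disagrees with the annotation, the largest entry of the new target moves to the predicted class. With a trust score below 0.5, the annotated class remains the largest entry and only the similarity structure is adjusted.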

• Contradictory Underlying Principle: Maximum-Entropy Learning, Confidence Penalty, Label Smoothing
• Deep models learn simple meaningful patterns before fitting noise, even when severe label noise exists in human annotations.
• 2019-Derivative manipulation for general example weighting
  @article{wang2019derivative,
title={Derivative Manipulation for
General Example Weighting},
author={Wang, Xinshao and Kodirov, Elyor and Hua, Yang and Robertson, Neil M},
journal={arXiv preprint arXiv:1905.11233},
year={2019}
}

• 2019-IMAE for noise-robust learning: Mean absolute error does not treat examples equally and gradient magnitude’s variance matters.
  @article{wang2019imae,
title={ {IMAE} for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally