Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic

Yan, Lian and Dodier, Robert H. and Mozer, Michael and Wolniewicz, Richard H.

International Conference on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp


In a binary classification task on an imbalanced dataset, we often report the *area under the curve* (AUC) of the *receiver operating characteristic* (ROC) as a measure of the classifier's ability to distinguish the two classes. If there are $k$ errors, accuracy is the same irrespective of how those $k$ errors are made, i.e. whether positive or negative samples are misclassified. AUC-ROC treats these two kinds of misclassification asymmetrically, making it an appropriate statistic for classification on imbalanced datasets. However, AUC-ROC is a step-like quantity that is hard to differentiate, so before this paper it could not be optimized directly by gradient descent. This paper exploits the equivalence of AUC-ROC to the Wilcoxon-Mann-Whitney (WMW) statistic, which counts the "number of wins" over all pairwise comparisons: $ U = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}I(x_i, x_j)}{mn}, $ where $m$ is the number of positive samples, $n$ is the number of negative samples, and $I(x_i, x_j)$ is $1$ if $x_i$ is ranked higher than $x_j$ and $0$ otherwise. Figure 1 in the paper shows the variance of this statistic with increasing imbalance in the dataset, justifying its close correspondence with AUC-ROC. To make the metric smooth and differentiable, the step function in the pairwise comparison is replaced by a sigmoid or hinge-like function. Further extensions apply the method to multi-class classification and to a focus on top-$K$ predictions, i.e. optimizing the lower-left part of the ROC curve.
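The pairwise construction above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact surrogate: it uses the sigmoid replacement of the step function, with a hypothetical smoothness parameter `beta`.

```python
import numpy as np

def exact_auc(pos_scores, neg_scores):
    """Exact WMW statistic: fraction of correctly ordered (pos, neg) pairs,
    counting ties as half a win. Equals AUC-ROC."""
    diff = pos_scores[:, None] - neg_scores[None, :]  # (m, n) pairwise diffs
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def wmw_sigmoid_surrogate(pos_scores, neg_scores, beta=0.1):
    """Smooth, differentiable approximation of the WMW statistic:
    the step I(x_i > x_j) is replaced by a sigmoid of the pairwise
    score difference; smaller beta gives a sharper approximation."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return np.mean(1.0 / (1.0 + np.exp(-diff / beta)))

pos = np.array([0.9, 0.8, 0.3])  # scores of positive samples (m = 3)
neg = np.array([0.1, 0.4])       # scores of negative samples (n = 2)

print(exact_auc(pos, neg))                          # 5/6 pairs correctly ordered
print(wmw_sigmoid_surrogate(pos, neg, beta=0.01))   # close to the exact AUC
```

Because the surrogate is an average of sigmoids of score differences, its gradient with respect to the scores is well defined everywhere, which is exactly what makes gradient-descent training on this objective possible.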

Meta-Learning with Implicit Gradients

Aravind Rajeswaran and Chelsea Finn and Sham Kakade and Sergey Levine

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, cs.AI, math.OC, stat.ML

**First published:** 2019/09/10 (1 year ago)

**Abstract:** A core capability of intelligent systems is the ability to quickly learn new
tasks by drawing on prior experience. Gradient (or optimization) based
meta-learning has recently emerged as an effective approach for few-shot
learning. In this formulation, meta-parameters are learned in the outer loop,
while task-specific models are learned in the inner-loop, by using only a small
amount of data from the current task. A key challenge in scaling these
approaches is the need to differentiate through the inner loop learning
process, which can impose considerable computational and memory burdens. By
drawing upon implicit differentiation, we develop the implicit MAML algorithm,
which depends only on the solution to the inner level optimization and not the
path taken by the inner loop optimizer. This effectively decouples the
meta-gradient computation from the choice of inner loop optimizer. As a result,
our approach is agnostic to the choice of inner loop optimizer and can
gracefully handle many gradient steps without vanishing gradients or memory
constraints. Theoretically, we prove that implicit MAML can compute accurate
meta-gradients with a memory footprint that is, up to small constant factors,
no more than that which is required to compute a single inner loop gradient and
at no overall increase in the total computational cost. Experimentally, we show
that these benefits of implicit MAML translate into empirical gains on few-shot
image recognition benchmarks.


This paper builds on previous work in gradient-based meta-learning. The objective of meta-learning is to find meta-parameters ($\theta$) that can be "adapted" to yield task-specific parameters ($\phi$); thus $\theta$ and $\phi$ lie in the same parameter space. A meta-learning problem involves several tasks, each specified by its own training and test datasets. At inference time in gradient-based meta-learning, before each new task, one performs some gradient-descent (GD) steps initialized from the meta-parameters to obtain the task-specific parameters. The goal is to find $\theta$ such that GD on each task's training data yields parameters that generalize well on its test data. The meta-objective is therefore the average test loss over tasks ($\mathcal{L}_{i}(\phi)$), where each task's parameters $\phi$ are obtained by GD initialized from the meta-parameters ($\theta$). \begin{equation} F(\theta) = \frac{1}{M}\sum_{i=1}^{M} \mathcal{L}_i(\phi) \end{equation} To backpropagate this loss to the meta-parameters, one must differentiate through the task-specific loss ($\mathcal{L}_{i}$) and through the GD steps (or whatever optimization algorithm was used) that produced $\phi$. Since GD is a sequence of steps, the whole trajectory of updates from $\theta$ must be considered during backpropagation, so past approaches have relied on RNN-style backpropagation through time (BPTT) or truncated BPTT. However, the authors show that, by adding a proximal term to the task-specific (inner) optimization, one can obtain the gradients without tracking the entire parameter trajectory. They call these implicit gradients.
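The proximally-regularized inner loop can be sketched on a toy problem. This is an assumed illustration: `grad_loss` is a hypothetical gradient oracle for a quadratic task loss, and the numbers are chosen only to verify the fixed-point condition that the next paragraph differentiates.

```python
import numpy as np

def inner_adapt(theta, grad_loss, lam=1.0, lr=0.1, steps=500):
    """Adapt meta-parameters theta to one task by gradient descent on the
    proximally-regularized inner objective
        L_i(phi) + (lam / 2) * ||phi - theta||^2,
    which keeps phi close to theta."""
    phi = theta.copy()
    for _ in range(steps):
        phi = phi - lr * (grad_loss(phi) + lam * (phi - theta))
    return phi

# Toy quadratic task loss L_i(phi) = 0.5 * ||phi - target||^2 (assumed).
target = np.array([1.0, -2.0])
grad_loss = lambda phi: phi - target

theta = np.zeros(2)
phi = inner_adapt(theta, grad_loss, lam=1.0)

# At the inner optimum, the implicit equation phi = theta - (1/lam) * grad_loss(phi) holds.
print(np.allclose(phi, theta - grad_loss(phi)))  # True
```

The final check is the stationarity condition of the regularized inner objective; it is this implicit relation between $\phi$ and $\theta$ that the derivation below differentiates.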
The idea is to constrain $\phi$ to lie close to $\theta$ via a proximal term, similar to an L2-regularization penalty. Due to this constraint, the inner solution satisfies an implicit equation of $\phi$ in terms of $\theta$: \begin{equation} \phi = \theta - \frac{1}{\lambda}\nabla\mathcal{L}_i(\phi) \end{equation} Differentiating this gives the implicit gradient \begin{equation} \frac{d\phi}{d\theta} = \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big)^{-1} \end{equation} and the contribution of $\mathcal{L}_i$ to the meta-gradient is \begin{equation} \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big)^{-1} \nabla \mathcal{L}_i(\phi) \end{equation} Forming and inverting the Hessian in this expression is memory-intensive and becomes infeasible for deep neural networks. The authors therefore approximate the term by minimizing an equivalent quadratic with the conjugate gradient method, which requires only Hessian-vector products (cheaply available via reverse-mode backpropagation): \begin{equation} \min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^\intercal \big( \mathbf{I} + \frac{1}{\lambda}\nabla^{2} \mathcal{L}_i(\phi) \big) \mathbf{w} - \mathbf{w}^\intercal \nabla \mathcal{L}_i(\phi) \end{equation} Thus, the paper introduces a computationally cheap, constant-memory gradient computation for meta-learning.
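The conjugate-gradient step can be sketched concretely. This is a minimal numpy illustration on an assumed small explicit Hessian `H` (in a real network one would only have Hessian-vector products, which is all the solver below actually uses); it verifies the CG solution against a direct matrix solve.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    """Solve A w = b for symmetric positive-definite A, using only
    matrix-vector products A @ v (the role played by Hessian-vector
    products in implicit MAML)."""
    w = np.zeros_like(b)
    r = b - matvec(w)          # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        w = w + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

# Toy setup (assumed): an SPD Hessian of the inner loss and its gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + np.eye(5)        # symmetric positive-definite Hessian
g = rng.standard_normal(5)     # gradient of the inner loss at phi
lam = 1.0

# (I + H / lambda) v, the operator from the equations above, via "HVPs".
matvec = lambda v: v + (H @ v) / lam
implicit_grad = conjugate_gradient(matvec, g)

direct = np.linalg.solve(np.eye(5) + H / lam, g)
print(np.allclose(implicit_grad, direct, atol=1e-6))  # True
```

The key design point is that `conjugate_gradient` never materializes the matrix: in a deep network, `matvec` would be a reverse-mode Hessian-vector product, which is why the memory footprint stays comparable to a single inner-loop gradient.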
