### Introduction * *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks. * Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum like a strategy. * [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper. ### Contributions of the paper * Explore cases that show that curriculum learning benefits machine learning. * Offer hypothesis around when and why does it happen. * Explore relation of curriculum learning with other machine learning approaches. ### Experiments with convex criteria * Training perceptron where some input data is irrelevant(not predictive of the target class). * Difficulty can be defined in terms of the number of irrelevant samples or margin from the separating hyperplane. * Curriculum learning model outperforms no-curriculum based approach. * Surprisingly, in the case of difficulty defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms no-curriculum strategy. ### Experiments on shape recognition with datasets having different variability in shapes * Standard(target) dataset - Images of rectangles, ellipses, and triangles. * Easy dataset - Images of squares, circles, and equilateral triangles. * Start performing gradient descent on easy dataset and switch to target data set at a particular epoch (called *switch epoch*). * For no-curriculum learning, the first epoch is the *switch epoch*. * As *switch epoch* increases, the classification error comes down with the best performance when *switch epoch* is half the total number of epochs. * Paper does not report results for higher values of *switch epoch*. ### Experiments on language modeling * Standard data set is the set of all possible windows of the text of size 5 from Wikipedia where all words in the window appear in 20000 most frequent words. * Easy dataset considers only those windows where all words appear in 5000 most frequent words in vocabulary. * Each word from the vocabulary is embedded into a *d* dimensional feature space using a matrix **W** (to be learnt). * The model predicts the score of next word, given a window of words. * Expected value of ranking loss function is minimized to learn **W**. * Curriculum Learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that curriculum-based model quickly learns new words. ### Curriculum as a continuation method * Continuation methods start with a smoothed objective function and gradually move to less smoothed function. * Useful in the case where the objective function in non-convex. * Consider a family of cost functions $C_\lambda (\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function. * Start with $C_0 (\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda (\theta)$. * Idea is to move $\theta$ towards a dominant (if not global) minima of $C_1(\theta)$. * Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective. * The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting anyone training example at any step). ### Advantages of Curriculum Learning * Faster training in the online setting as learner does not try to learn difficult examples when it is not ready. * Guiding training towards better local minima in parameter space, specifically useful for non-convex methods. ### Relation to other machine learning approaches * **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error. * **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy. * **Boosting Algorithms** - Difficult examples are gradually emphasised though the curriculum starts with a focus on easier examples and the training criteria do not change. * **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem. ### Criticism * Curriculum Learning is not well understood, making it difficult to define the curriculum. * In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.