arXiv is an e-print service in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance and statistics.

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stephane Ross and Geoffrey J. Gordon and J. Andrew Bagnell

arXiv e-Print archive - 2010 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2010/11/02 (10 years ago)

**Abstract:** Sequential prediction problems such as imitation learning, where future
observations depend on previous predictions (actions), violate the common
i.i.d. assumptions made in statistical learning. This leads to poor performance
in theory and often in practice. Some recent approaches provide stronger
guarantees in this setting, but remain somewhat unsatisfactory as they train
either non-stationary or stochastic policies and require a large number of
iterations. In this paper, we propose a new iterative algorithm, which trains a
stationary deterministic policy, that can be seen as a no regret algorithm in
an online learning setting. We show that any such no regret algorithm, combined
with additional reduction assumptions, must find a policy with good performance
under the distribution of observations it induces in such sequential settings.
We demonstrate that this new approach outperforms previous approaches on two
challenging imitation learning problems and a benchmark sequence labeling
problem.
more
less

Stephane Ross and Geoffrey J. Gordon and J. Andrew Bagnell

arXiv e-Print archive - 2010 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
## General Framework The imitation learning problem is here cast into a classification problem: label the state with the corresponding expert action. With this, you can see structured prediction (predict next label knowing your previous prediction) as a degenerated IL problem. They make the **reduction assumption** that you can make the probability of mistake $\epsilon$ as small as desired on the **training distribution** (expert or mixture). They also assume that the difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by expert policy is small (which does not hold if one mistake makes you for fall from a cliff for example!). ## Problem definition There are three problems that are highlighted in this paper **1-interaction between policy and distribution**, **2-violation of the i.i.d assumption**, and **3-distributional shift and resulting compounding error of BC**. The paper implies that **1** implies **2** that implies **3** but they eventually only address **3** with the algorithm they propose (DAgger). 1. **interaction between policy and distribution**: the fact that the learned policy is present in the expectation (through the state-action distribution) as well as in the loss makes the **objective non-convex** and this even for convex loss functions. **Yet this also implies non i.i.d. samples!** Indeed, even if one could directly sample from the state-action distribution (like having its analytical form or an infinite experience replay buffer) and thus draw i.i.d. samples, the dependency will occur across optimization steps: if I draw a sample and use it to update my policy, I also update the distribution from which I will draw my next sample and then my next sample depends on my previous sample (since it conditioned my policy update). 2. **violation of the i.i.d. assumption**: So we just discussed that 1. implies non-iid samples across updates (batches) yet there is another form of dependency (inside a batch this time) as in practice we collect samples using rollouts and in a trajectory s' depends on (s,a) so we have a dependency because we collected the **samples sequentially**. 3. **distribution-shift and compounding error of Behavior cloning (supervised learning)**: During training, you assume iid data and you do not model that the next state you will have to classify (act upon) is influenced by your previous action. But at execution samples are not iid: if you do a mistake you reach a state that you may have never seen (distribution shift) and you cannot recover(compounding error). If the classifier's probability of making a mistake is $\epsilon$ and the episode length is $T$, then BC can make as many as $\epsilon T^2$ mistakes (CF Levine's class example with the funambulist and the worst-case scenario: as soon as you make a mistake, you cannot recover and you make mistakes until the end of the episode). **This is what this paper (and the papers it compares to) addresses, they propose a worst-case scenario linear in both** $\epsilon$ and $T$. **Additionnal assumptions compared to BC:**During training you are allowed to query the expert as much as you want. ## Other methods * **Forward Training**: trains a **non-stationary policy** (one policy for each time step). Each policy is trained to match the expert on the distribution of states encountered at that time-step when doing rollouts with the previously learned policies (there is no distribution shift since each policy is trained on the actual distribution of states it will see). **Problem**: you have to train $T$ different policies, you cannot stop until you trained all the policies. * **Stochastic Mixing Iterative Learning (SMILe)**: iteratively trains a **stochastic mixture of policies**. Each new policy is trained to match the expert on the state distribution of the current mixture, therefore iteratively accounting for the distribution drift, eventually, the distribution should converge. This yields near-linear regret on $T$ and $\epsilon$. **Problem**: At execution time, we sample a policy from the mixture, which can perform poorly on the given state, especially if it corresponded to a different distribution (early iteration for example). ## DAgger -- Dataset Aggregation **Algo** ![](https://i.imgur.com/FpQAyI2.png =400x) $\beta_i$ is related to the proportion of state distribution that we want to collect similar to the expert distribution. This may help in the early iterations, where the policy is bad and may spend a lot of time collecting datapoints in states that are irrelevant. In practice using $\beta_i=I(i=0)$ works well. Intuitively, by training on generated transitions, DAgger reduces the distribution shift. **Link to no Regret Online Learning** DAgger can be seen as a *Follow-the-Leader* (FTL) algorithm where at iteration $n$ we pick the best policy in hindsight, i.e. with respect to all the transitions seen so far (in previous iterations). The use of $\beta_i$ is an "extra trick" that doesn't exist as is in FTL algos. *Regret*: the difference between what I did and the best thing that I could have done. Each iteration of DAgger can be seen as a step of Online Learning (treat mini-batches of trajectories under a single policy as a single online-learning example) where the online algo used is FTL but could actually be any no-regret algo. Indeed, by using the no-regret algo view, the authors show that using no-regret algos (regret decays with $1/N$) as a meta-algorithm for IL (just like FTL for DAgger) yield that, for N big enough (in the order of T), we have a worst-case performance linear in $\epsilon$, $u$ and $T$. Where it is **assumed** that we can reach an error probability of $\epsilon$ on any training distribution and a difference in the cost-to-go (-Q-value) of $u$. In other words, if we can guarantee a classification error rate of $\epsilon$ **on the training distribution** and that the expert is able to recover from a mistake ($u$ is bounded) then the use of a **no-regret online learning algorithm** will yield a policy that will make at most (proportional to) $\epsilon T u$ errors **on its own state distribution (test distribution)**. **Requirements** * An error rate of $\epsilon$ on the training distribution * A strongly convex loss function between expert and learner labels (0-1 loss is not but binary cross-entropy that is usually used for classification is) so that FTL is indeed a no-regret algo. If the loss in not strongly convex, FTL has no guarantees and we must use another no-regret method * The difference in the cost-to-go (Q-value) between expert-action followed by expert policy and any action followed by the expert policy must be small (expert must be able to recover: this does not account for falling from a cliff for example!). **Results** ![](https://i.imgur.com/DZTK2nu.png =500x) ![](https://i.imgur.com/rbtyUbY.png =500x) ![](https://i.imgur.com/ZViGENA.png =500x) ![](https://i.imgur.com/fExmxeb.png =500x) ![](https://i.imgur.com/kv1X01w.png =500x) |

About