## General Framework

Really **similar to DAgger** (see [summary](https://www.shortscience.org/paper?bibtexKey=journals/corr/1011.0686&a=muntermulehitch)) but considers **cost-sensitive classification** ("some mistakes are worse than others": you should be more careful about imitating a particular expert action if failing to do so incurs a large cost-to-go). By doing so they improve DAgger's bound of $\epsilon_{class} u T$, where $u$ is the difference in cost-to-go (between the expert and one error followed by the expert policy), to $\epsilon_{class} T$, where $\epsilon_{class}$ is the error due to the lack of expressiveness of the policy class. In brief, by accounting for the effect of a mistake on the cost-to-go, they remove the cost-to-go contribution from the bound (the difference in performance between the learned policy and the expert policy) and thus obtain a tighter bound.

In the paper the word "regret" is used for two distinct concepts, which can be confusing: one refers to the no-regret online learning meta-approach to IL (similar to DAgger), the other to the fact that Aggrevate aims at minimizing the cost-to-go difference with the expert (the sum of the costs endured because the learner did not behave like the expert = *regret*), whereas DAgger aims at minimizing the error rate w.r.t. the expert. Additionally, the paper extends the view of imitation learning as an online learning procedure to reinforcement learning.

## Assumptions

**Interactive**: you can re-query the expert and thus reach $\epsilon T$ bounds instead of the $\epsilon T^2$ of non-interactive methods (e.g. Behavioral Cloning), which suffer from compounding errors. One also needs a **reward/cost** that **cannot** be defined relative to the expert (no 0-1 loss w.r.t. the expert, for example) since the cost-to-go is computed under the expert policy (which would always yield zero cost).
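A minimal sketch of the difference between the 0-1 surrogate that DAgger minimizes and the cost-sensitive loss that Aggrevate minimizes (the toy rollout and cost numbers below are illustrative, not from the paper):

```python
# Each record: (expert_action, learner_action, cost_to_go_gap_if_wrong).
# The last entry plays the role of u(s, a): how much a mistake in that
# state hurts when followed by the expert policy.
rollout = [
    ("a", "a", 1.0),   # no mistake
    ("b", "c", 10.0),  # rare but catastrophic mistake
    ("a", "b", 0.5),   # frequent but benign mistake
]

# DAgger-style 0-1 loss: every mistake counts the same.
zero_one = sum(1.0 for expert, learner, _ in rollout if learner != expert)

# Aggrevate-style cost-sensitive loss: mistakes weighted by cost-to-go.
cost_sensitive = sum(c for expert, learner, c in rollout if learner != expert)

print(zero_one)        # 2.0
print(cost_sensitive)  # 10.5
```

Under the 0-1 loss both mistakes look equally bad; the cost-sensitive loss makes the learner prioritize avoiding the catastrophic one, which is what removes the $u$ factor from the bound.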
## Other methods

**SEARN**: also reasons about the **cost-to-go, but under the current policy** instead of the expert's (though in practice one can use the expert's, which makes it really similar to Aggrevate). SEARN uses **stochastic policies** and can be seen as an Aggrevate variant where stochastic mixing is used to solve the online learning problem instead of **Regularized-Follow-The-Leader (RFTL)**.

## Aggrevate: IL with cost-to-go

![](https://i.imgur.com/I1otJwV.png)

Pretty much like DAgger, but one has to use a no-regret online learning algorithm to do **cost-sensitive** instead of regular classification. In the paper, they use the RFTL algorithm and train the policy on all previous iterations. Indeed, using RFTL with a strongly convex loss (like the squared error) and a stable batch learner (like stochastic gradient descent) ensures the no-regret property. In practice (to deal with infinite policy classes and knowing the cost of only a few actions per state), they reduce cost-sensitive classification to an **argmax regression problem**: they train a model to match the cost given state-action (and time, if one wants non-stationary policies) using the collected datapoints and the (strongly convex) squared-error loss. Then, they argmin this model to find which action minimizes the cost-to-go (cost-sensitive classification). This is close to what we do in **Q-learning** (DQN or DDPG): fit a critic (Q-values) with the TD-error (instead of full-rollout costs-to-go of the expert), then argmax the critic to get the policy. Similarly to DQN, the way you explore the actions for which you compute the cost-to-go matters (in this paper they use uniform exploration).

**Limitations**: if the policy class is not expressive enough to match the expert policy's performance, this algorithm may fail to learn a reasonable policy. Example: the task is to go from point A to point B; there exists a narrow shortcut and a safer but longer road. The expert can handle both roads, so it prefers taking the shortcut.
Even if the learned policy class can handle the safer road, it will keep trying to use the narrow one and fail to reach the goal. This is because all the costs-to-go are computed under the expert's policy, thus ignoring the fact that they cannot be achieved by any policy of the learned policy class.

## RL via No-Regret Policy Iteration: NRPI

![](https://i.imgur.com/X4ckv1u.png)

NRPI does not require an expert policy anymore, only a **state exploration distribution**. NRPI can also be preferred when no policy in the policy class can match the expert's, since it allows for more exploration by considering the **cost-to-go of the current policy**. Here, the equivalent argmax regression problem is really similar to Q-learning (using sampled costs-to-go from rollouts instead of Bellman errors), except that **the costs-to-go** in the aggregate dataset correspond to **outdated policies!** (in contrast, DQN's data consists of rewards instead of costs-to-go). Yet, since RFTL is a no-regret online learning method, the learned policy performs well under all the costs-to-go of previous iterations, and both the policies and the costs-to-go converge. The performance of NRPI is strongly limited by the quality of the exploration distribution. Yet if the exploration distribution is optimal, then NRPI is also optimal (the bound $T\epsilon_{regret} \rightarrow 0$ with enough online iterations). This may be a promising method for non-interactive, state-only IL (if one has access to a reward).

## General limitations

Both methods are much less sample-efficient than DAgger, as they require costs-to-go: one full rollout for a single datapoint.

## Broad contribution

Seeing iterative learning methods such as Q-learning in the light of online learning methods is insightful and yields better bounds and a better understanding of why some methods work. It provides a good tool to analyze the dynamics that interleave learning and execution (optimizing and collecting data) for the purpose of generalization.
For example, the bound for NRPI can seem quite counterintuitive to someone familiar with the on-policy/off-policy distinction: NRPI optimizes a policy w.r.t. the **costs-to-go of other policies**, yet RFTL tells us that it converges to what we want. Additionally, this may give a practical advantage in terms of stability, as the policy is optimized with larger batches and thus has to be good across many states and many cost-to-go formulations.
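The argmax-regression reduction at the core of Aggrevate (and NRPI) can be sketched in a toy tabular form, where the "regressor" is simply a running mean of the sampled costs-to-go per state-action pair (with a tabular model, the mean is the exact minimizer of the squared error; the paper uses a function approximator instead, and the states, actions, and costs below are illustrative, not from the paper):

```python
from collections import defaultdict

# Aggregated dataset of (state, explored_action, sampled cost-to-go),
# collected over the online iterations by rolling out after each action.
data = [
    ("s0", "left", 5.0), ("s0", "left", 7.0),    # costly deviation
    ("s0", "right", 1.0), ("s0", "right", 2.0),  # expert-like action
]

# Regression step: fit Q_hat(s, a) as the mean sampled cost-to-go,
# i.e. the minimizer of the squared-error loss on the aggregate dataset.
sums, counts = defaultdict(float), defaultdict(int)
for s, a, c in data:
    sums[(s, a)] += c
    counts[(s, a)] += 1
q_hat = {sa: sums[sa] / counts[sa] for sa in sums}

def policy(state, actions=("left", "right")):
    # Cost-sensitive classification step: argmin of predicted cost-to-go.
    return min(actions, key=lambda a: q_hat.get((state, a), float("inf")))

print(policy("s0"))  # "right": the action with the lower mean cost-to-go
```

The same skeleton covers both algorithms; they differ only in which policy generates the sampled costs-to-go (the expert's for Aggrevate, the current learner's for NRPI).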