It’s a commonly understood problem in Reinforcement Learning: that it is difficult to fully specify your exact reward function for an agent you’re training, especially when that agent will need to operate in conditions potentially different than those it was trained in. The canonical example of this, used throughout the Inverse Rewards Design paper, is that of an agent trained on an environment of grass and dirt, that now encounters an environment with lava. In a typical problem setup, the agent would be indifferent to passing or not passing over the lava, because it was never disincentivized from doing so during training. The fundamental approach this paper takes is to explicitly assume that there exists a program designer who gave the agent some proxy reward, and that that proxy reward is a good approximation of the true reward on training data, but might not be so on testing. This framing, of the reward as a noisy signal, allows the model to formalize its uncertainty about scenarios where the proxy reward might be a poor mapping to the real one. The way the paper tests this is through a pretty simplified model. In the example, the agent is given a reward function expressed by a weighting of different squares it could navigate into: it has a strong positive weight on dirt, and a strong negative one on grass. The agent then enters an environment where there is lava, which, implicitly, it has a 0 penalty for in its rewards function. However, it’s the case that, if you integrate over all possible weight values for “lava”, none of them would have produced different behavior over the training trajectories. Thus, if you assume high uncertainty, and adopt a risk-averse policy where under cases of uncertainty you assume bad outcomes, this leads to avoiding values of the environment feature vector that you didn’t have data weighting against during training. Overall, the intuition of this paper makes sense to me, but it’s unclear to me if the formulation it uses generalizes outside of a very trivial setting, where your reward function is an explicit and given function of your feature vectors, rather than (as is typical) a scalar score not explicitly parametrized by the states of game prior to the very last one. It’s certainly possible that it might, but, I don’t feel like I quite have the confidence to say at this point.