The fundamental unit of Reinforcement Learning is the reward function, with a core assumption of the area being that actions induce rewards, with some actions being higher reward than others. But, reward functions are just artificial objects we design to induce certain behaviors; the universe doesn’t hand out “true” rewards we can build off of. Inverse Reinforcement Learning as a field is rooted in the difficulty of designing reward functions, and has the aspiration of, instead of requiring a human to hard code a reward function, inferring rewards from observing human behavior. The rough idea is that if we imagine a human is (even if they don’t know it) operating so as to optimize some set of rewards, we might be able to infer that set of underlying incentives from their actions, and, once we’ve extracted a reward function, use that to train new agents. This is a mathematically quite tricky problem, for the basic reason that a human’s actions are often consistent with a wide range of possible underlying “policy” parameters, and also that a given human policy could be an optimal for a wide range of underlying reward functions. This paper proposes using an adversarial frame on the problem, where you learn a reward function by trying to make reward higher for the human demonstrations you observe, relative to the actions the agent itself is taking. This has the effect of trying to learn an agent that can imitate human actions. However, it specifically designs its model structure to allow it to go beyond just imitation. The problem with learning a purely imitative policy is that it’s hard for the model to separate out which actions the human is taking because they are intrinsically high reward (like, perhaps, eating candy), versus actions which are only valuable in a particular environment (perhaps opening a drawer if you’re in a room where that’s where the candy is kept). If you didn’t realize that the real reward was contained in the candy, you might keep opening drawers, even if you’re in a room where the candy is laying out on the table. In mathematical terms, separating out intrinsic vs instrumental (also known as "shaped") rewards is a matter of making sure to learn separate out the reward associated with a given state from value of taking a given action at that state, because the value of your action is only born out based on assumptions about how states transition between each other, which is a function of the specific state to state dynamics of the you’re in. The authors do this by defining a g(s) function, and a h(s) function. They then define their overall reward of an action as (g(s) + h(s’)) - h(s), where s’ is the new state you end up in if you take an action. https://i.imgur.com/3ENPFVk.png This follows the natural form of a Bellman update, where the sum of your future value at state T should be equal to the sum of your future value at time T+1 plus the reward you achieve at time T. https://i.imgur.com/Sd9qHCf.png By adopting this structure, and learning a separate neural network to capture the h(s) function representing the value from here to the end, the authors make it the case that the g(s) function is a purer representation of the reward at a state, regardless of what we expect to happen in the future. Using this, they’re able to use this learned reward to bootstrap good behavior in new environments, even in contexts where a learned value function would be invalid because of the assumptions of instrumental value. They compare their method to the baseline of GAIL, which is a purely imitation-learning approach, and show that theirs is more able to transfer to environments with similar states but different state-to-state dynamics.