First published: 2017/11/27
Abstract: We present a suite of reinforcement learning environments illustrating
various safety properties of intelligent agents. These problems include safe
interruptibility, avoiding side effects, absent supervisor, reward gaming, safe
exploration, as well as robustness to self-modification, distributional shift,
and adversaries. To measure compliance with the intended safe behavior, we
equip each environment with a performance function that is hidden from the
agent. This allows us to categorize AI safety problems into robustness and
specification problems, depending on whether the performance function
corresponds to the observed reward function. We evaluate A2C and Rainbow, two
recent deep reinforcement learning agents, on our environments and show that
they are not able to solve them satisfactorily.
The paper proposes a standardized benchmark for a number of safety-related problems and provides an implementation that other researchers can use. The problems fall into two categories: specification and robustness. Specification problems are cases where it is difficult to specify a reward function that encodes our intentions, so the observed reward diverges from the hidden performance function. Robustness problems are cases where the reward is correct, but the agent's behavior must hold up against various complexities of a real-world environment. Here is the list of problems:
Specification problems (category 1):
1.1. Safe interruptibility: agents should neither seek nor avoid interruption.
1.2. Avoiding side effects: agents should minimize effects unrelated to their main objective.
1.3. Absent supervisor: agents should not behave differently depending on whether a supervisor is present.
1.4. Reward gaming: agents should not try to exploit errors in the reward function.
Robustness problems (category 2):
2.1. Self-modification: agents should behave well when the environment allows self-modification.
2.2. Robustness to distributional shift: agents should behave robustly when the test environment differs from the training environment.
2.3. Robustness to adversaries: agents should detect and adapt to adversarial intentions in the environment.
2.4. Safe exploration: agents should also behave safely during learning.
It is worth noting that problems 1.2, 1.4, 2.2, and 2.4 were already described in "Concrete Problems in AI Safety" (Amodei et al., 2016).
The authors suggest tackling each of these problems in a "gridworld": a 2D environment where the agent lives on a grid and the only actions available to it are up/down/left/right movements. The benchmark consists of 10 environments, each corresponding to one of the 8 problems mentioned above. Each environment is an extremely simple instance of its problem, but the environments are nevertheless of interest because current state-of-the-art algorithms usually fail to solve them.
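To make the setup concrete, here is a minimal sketch of a gridworld that returns an observed reward while tracking a separate performance score the agent never sees. It is not the paper's implementation: the layout, the `ToyGridworld` class, the `run_episode` helper, and all numeric values are assumptions made up for illustration.

```python
# Hypothetical layout: A = agent start, G = goal, X = unsafe cell, # = wall.
LAYOUT = [
    "#####",
    "#AXG#",
    "#   #",
    "#####",
]

# The four movement actions described in the text.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class ToyGridworld:
    """Toy environment with an observed reward and a hidden performance score."""

    def __init__(self, layout=LAYOUT):
        self.grid = [list(row) for row in layout]
        # Locate the agent's start cell.
        self.pos = next(
            (r, c)
            for r, row in enumerate(self.grid)
            for c, cell in enumerate(row)
            if cell == "A"
        )
        self.done = False
        self._performance = 0  # never returned by step()

    def step(self, action):
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if self.grid[r][c] != "#":  # walls block movement
            self.pos = (r, c)
        cell = self.grid[self.pos[0]][self.pos[1]]
        reward = -1       # assumed per-step cost, visible to the agent
        performance = -1  # the hidden score starts from the same cost
        if cell == "X":
            performance -= 10  # assumed penalty that the reward omits
        elif cell == "G":
            reward += 50
            performance += 50
            self.done = True
        self._performance += performance
        return self.pos, reward, self.done

    def hidden_performance(self):
        """Evaluation-only accessor; a learning agent would not call this."""
        return self._performance


def run_episode(actions):
    """Replay a fixed action sequence; return (total reward, hidden performance)."""
    env = ToyGridworld()
    total_reward = 0
    for action in actions:
        _, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward, env.hidden_performance()
```

In this toy layout, the shortcut through the unsafe cell (`["right", "right"]`) earns a reward of 48 but a performance of 38, while the detour (`["down", "right", "right", "up"]`) earns 46 on both. A reward-maximizing agent therefore prefers the path that the performance function ranks lower, which is the shape of a specification problem.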
Specifically, the authors trained A2C and Rainbow with the DQN update on each of the environments and showed that both algorithms fail on all of the specification problems, except for Rainbow on 1.1. This is expected, as neither algorithm is designed for cases where the reward function is misspecified. Both algorithms also failed on 2.2--2.4, except for A2C on 2.3. On 2.1, the authors swapped A2C for Rainbow with the Sarsa update and showed that Rainbow DQN failed while Rainbow Sarsa performed well.
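The review does not spell out why the update rule matters on 2.1. One common explanation is that the Sarsa target bootstraps from the action the agent actually takes next, while the DQN (Q-learning) target bootstraps from the best next action regardless of behavior. The sketch below shows the two targets in their tabular form. It is not Rainbow itself, and the function names, action labels, and values are assumptions chosen for illustration.

```python
def q_learning_target(reward, next_q_values, gamma=0.99):
    """Off-policy target: assumes the greedy action is taken next."""
    return reward + gamma * max(next_q_values.values())


def sarsa_target(reward, next_q_values, next_action, gamma=0.99):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * next_q_values[next_action]


# Hypothetical next-state values: a safe action and a risky one.
next_q = {"safe": 10.0, "risky": -5.0}

# Suppose the agent's behavior has been altered so that it ends up taking
# the risky action. Q-learning ignores this; Sarsa reflects it in the target.
off_policy = q_learning_target(1.0, next_q, gamma=0.5)
on_policy = sarsa_target(1.0, next_q, "risky", gamma=0.5)
```

With these numbers the Q-learning target is 6.0 and the Sarsa target is -1.5, so only the on-policy target lowers the value of a state from which the agent's actual behavior leads somewhere bad.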
Overall, this is a good groundwork paper with only a few questionable design decisions, such as the design of the actual reward in 1.2. It is unlikely to have an impact similar to that of MNIST or ImageNet, but it should stimulate safety-related research.