AI Safety Gridworlds on ShortScience.org

AI Safety Gridworlds
Jan Leike and Miljan Martic and Victoria Krakovna and Pedro A. Ortega and Tom Everitt and Andrew Lefrancq and Laurent Orseau and Shane Legg
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.AI

Summaries/Notes 1

Summary by dniku 4 years ago

The paper proposes a standardized benchmark for a number of safety-related problems and provides an implementation that other researchers can use. The problems fall into two categories: specification and robustness. Specification covers cases where it is difficult to write down a reward function that encodes our intentions. Robustness means that an agent's actions should remain sound when it faces the various complexities of a real-world environment. Here is the list of problems:

1. Specification:
    1. Safe interruptibility: agents should neither seek nor avoid interruption.
    2. Avoiding side effects: agents should minimize effects unrelated to their main objective.
    3. Absent supervisor: agents should not behave differently depending on the presence of a supervisor.
    4. Reward gaming: agents should not try to exploit errors in the reward function.

2. Robustness:
    1. Self-modification: agents should behave well when the environment allows self-modification.
    2. Robustness to distributional shift: agents should behave robustly when the test environment differs from the training environment.
    3. Robustness to adversaries: agents should detect and adapt to adversarial intentions in the environment.
    4. Safe exploration: agents should behave safely during learning as well.

It is worth noting that problems 1.2, 1.4, 2.2, and 2.4 were already described in "Concrete Problems in AI Safety".

The authors suggest tackling each of these problems in a "gridworld": a 2D environment where the agent lives on a grid and the only actions available to it are up/down/left/right movements. The benchmark consists of 10 environments, each corresponding to one of the 8 problems listed above. Each environment is an extremely simple instance of its problem, yet the environments are still of interest because current state-of-the-art algorithms usually do not solve the posed task.
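To make the gridworld setting concrete, here is a minimal sketch of such an environment in Python. It is not the implementation released with the paper: the class name `GridWorld`, the layout characters, and the reward values (a per-step cost of -1 and a goal bonus of 50) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical sketch; names and reward values are assumptions,
# not the API or settings of the paper's released code.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class GridWorld:
    # Rows of equal-length strings: '#' wall, 'A' agent start, 'G' goal, ' ' free cell.
    # The layout is assumed to be fully enclosed by walls.
    layout: list
    step_reward: float = -1.0  # assumed cost per move
    goal_reward: float = 50.0  # assumed bonus for reaching the goal

    def __post_init__(self):
        self.start = self._find("A")
        self.goal = self._find("G")
        self.pos = self.start

    def _find(self, char):
        """Return the (row, column) of the first cell containing `char`."""
        for r, row in enumerate(self.layout):
            c = row.find(char)
            if c != -1:
                return (r, c)
        raise ValueError(f"layout has no {char!r} cell")

    def reset(self):
        """Put the agent back on its start cell and return that position."""
        self.pos = self.start
        return self.pos

    def step(self, action):
        """Apply one of the four moves; walking into a wall leaves the agent in place."""
        dr, dc = ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if self.layout[r][c] != "#":
            self.pos = (r, c)
        done = self.pos == self.goal
        reward = self.step_reward + (self.goal_reward if done else 0.0)
        return self.pos, reward, done


LAYOUT = [
    "#####",
    "#A G#",
    "#####",
]

env = GridWorld(LAYOUT)
env.reset()
print(env.step("right"))  # moves one cell toward the goal
print(env.step("right"))  # reaches the goal and ends the episode
```

A safety-specific environment would add further cell types on top of this, for example an interruption cell or an object whose displacement counts as a side effect; those are omitted here to keep the sketch short.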

Specifically, the authors trained A2C and Rainbow with the DQN update on each of the environments and showed that both algorithms fail on all of the specification problems, except for Rainbow on 1.1. This is expected, as neither algorithm is designed for cases where the reward function is misspecified. Both algorithms also failed on 2.2 through 2.4, except for A2C on 2.3. On 2.1, the authors replaced A2C with Rainbow using the Sarsa update and showed that Rainbow DQN failed while Rainbow Sarsa performed well.
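For readers unfamiliar with the difference between the DQN and Sarsa updates, the following tabular sketch shows how the two bootstrap targets differ. It is only an illustration of the general Q-learning and Sarsa rules: Rainbow itself uses a neural network with several additional components, which this sketch does not reproduce, and the function names and the toy Q-table are assumptions.

```python
# Hypothetical tabular illustration; Rainbow uses a neural network, not a table.

def q_learning_target(q, reward, next_state, gamma=0.99):
    """Off-policy (DQN-style) target: bootstrap from the best next action."""
    return reward + gamma * max(q[next_state].values())


def sarsa_target(q, reward, next_state, next_action, gamma=0.99):
    """On-policy (Sarsa) target: bootstrap from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


# Toy Q-table with a single next state and two actions (assumed values).
q = {"s1": {"left": 1.0, "right": 3.0}}

print(q_learning_target(q, 0.0, "s1", gamma=0.5))     # uses max(1.0, 3.0)
print(sarsa_target(q, 0.0, "s1", "left", gamma=0.5))  # uses q["s1"]["left"]
```

Because the Sarsa target depends on the action the agent really takes next, its value estimates reflect the agent's actual behaviour, including exploratory or otherwise altered actions, whereas the Q-learning target assumes the greedy action will be taken regardless.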

Overall, this is a good groundwork paper with only a few questionable design decisions, such as the design of the actual reward in 1.2. It is unlikely to have an impact comparable to that of MNIST or ImageNet, but it should stimulate safety-related research.
