Prioritized Experience ReplayPrioritized Experience ReplayTom Schaul and John Quan and Ioannis Antonoglou and David Silver2015

Paper summarytxzhaothis paper: develop a framework to replay important transitions more frequently -> learn efficienty
prior work: uniformly sample a replay memory to get experience transitions
evaluate: DQN + PER outperform DQN on 41 out of 49 Atari games
## Introduction
**issues with online RL:** (solution: experience replay)
1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later
**key idea:**
more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error
**issues with prioritization:**
1. loss of diversity -> alleviate with stochastic prioritization
2. introduce bias -> correct with importance sampling
## Prioritized Replay
**criterion:**
- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate
**stochastic sampling:**
$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used.
*two variants:*
1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (small positive constant to avoid zero prob)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust as it is insensitive to outliers**
https://i.imgur.com/T8je5wj.png
**importance sampling:**
IS weights:
$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta $$
- weights can be folded into the Q-learning update by using $w_i*\delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max w_i}$

First published: 2015/11/18 (4 years ago) Abstract: Experience replay lets online reinforcement learning agents remember and
reuse experiences from the past. In prior work, experience transitions were
uniformly sampled from a replay memory. However, this approach simply replays
transitions at the same frequency that they were originally experienced,
regardless of their significance. In this paper we develop a framework for
prioritizing experience, so as to replay important transitions more frequently,
and therefore learn more efficiently. We use prioritized experience replay in
Deep Q-Networks (DQN), a reinforcement learning algorithm that achieved
human-level performance across many Atari games. DQN with prioritized
experience replay achieves a new state-of-the-art, outperforming DQN with
uniform replay on 41 out of 49 games.

this paper: develop a framework to replay important transitions more frequently -> learn efficienty
prior work: uniformly sample a replay memory to get experience transitions
evaluate: DQN + PER outperform DQN on 41 out of 49 Atari games
## Introduction
**issues with online RL:** (solution: experience replay)
1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later
**key idea:**
more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error
**issues with prioritization:**
1. loss of diversity -> alleviate with stochastic prioritization
2. introduce bias -> correct with importance sampling
## Prioritized Replay
**criterion:**
- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate
**stochastic sampling:**
$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used.
*two variants:*
1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (small positive constant to avoid zero prob)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust as it is insensitive to outliers**
https://i.imgur.com/T8je5wj.png
**importance sampling:**
IS weights:
$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta $$
- weights can be folded into the Q-learning update by using $w_i*\delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max w_i}$