Mehr als 0 und 1. Schule in einer digitalisierten Welt. Döbeli Honegger, Beat. 2016

Paper summary by abhishm

Most Q-learning methods, such as DQN or the dueling network, rely on experience replay. However, experience replay is memory-intensive, and it restricts us to off-policy learning algorithms such as Q-learning. The authors instead propose running multiple agents in parallel, all of which update a shared set of global parameters. The **benefits** of using multiple agents are as follows:
1. The use of multiple agents provides a stabilizing effect.
2. Learning can be much faster without using GPUs. Each agent can run as a CPU thread. Learning is faster because, with multiple agents, more updates are made and more data is consumed in the same amount of time.
3. Learning is more robust and stable because there exists a wide range of learning rates and initial weights for which a good score can be achieved.
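
The idea of several agents updating shared global parameters from CPU threads can be illustrated with a minimal sketch. It is not the authors' implementation: the toy objective `(theta - TARGET)^2`, the names `worker` and `train`, and the values of `steps`, `lr`, and `t_max` are all assumptions chosen for illustration. Each thread reads the shared parameter, accumulates a few noisy gradients locally, and then writes its update back without a lock.

```python
import random
import threading

TARGET = 3.0  # hypothetical optimum of the toy objective (theta - TARGET)^2


def worker(shared, seed, steps=500, lr=0.01, t_max=5):
    """One agent: accumulate gradients locally, then update the shared parameter."""
    rng = random.Random(seed)  # each agent sees its own stream of noisy data
    for _ in range(steps):
        local = shared[0]  # copy the current global parameter
        grad = 0.0
        for _ in range(t_max):
            # Noisy gradient of (local - TARGET)^2, standing in for one environment step.
            grad += 2.0 * (local - TARGET) + rng.gauss(0.0, 0.1)
        # Lock-free write: updates from other threads may interleave or be overwritten.
        shared[0] -= lr * grad


def train(num_workers=4):
    shared = [0.0]  # global parameters shared by all agents
    threads = [
        threading.Thread(target=worker, args=(shared, seed))
        for seed in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


if __name__ == "__main__":
    print(train())  # ends up close to TARGET
```

Because every agent draws different samples, the updates applied to the shared parameter are less correlated than those of a single agent, which is the intuition behind the stabilizing effect listed above.
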
**Key Points**
1. The best performing algorithm is Asynchronous Advantage Actor Critic (A3C).
2. A3C uses $n$-step updates to trade off bias against variance in the policy gradient. Essentially, the policy-gradient update is proportional to
$$
\nabla \log \pi(a_t | x_t; \theta) (r_t + \gamma r_{t+1}+\ldots + \gamma^{n-1}r_{t+n-1} + \gamma^n V(x_{t+n}) - V(x_t))
$$
where $V(\cdot)$ is the value function of the underlying MDP.
3. The value network and the policy network share all parameters except those of the last layer, which separately predicts the action probabilities and the value.
4. The authors found that an entropy bonus helped keep the network from converging to sub-optimal policies.
5. The hyper-parameters (learning rate and gradient-norm clipping) were chosen by a random search on 6 games and kept fixed for the rest of the games.
6. A3C-LSTM additionally incorporates an LSTM layer with 128 cells. The output of this layer is used to produce the action probabilities and the value function.
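
The $n$-step advantage in key point 2 and the entropy bonus in key point 4 can be sketched in plain Python. This is a hedged illustration, not the authors' code: the function names, the default `gamma`, and the entropy weight `beta` are assumptions. The first entry of the returned list matches the bracketed term in the formula above; later entries use the shorter remaining horizon, which is a choice made for this sketch.

```python
import math


def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards[i] = r_{t+i}, values[i] = V(x_{t+i}), bootstrap_value = V(x_{t+n})."""
    ret = bootstrap_value
    advantages = [0.0] * len(rewards)
    # Work backwards so each return reuses the one computed for the next step.
    for i in reversed(range(len(rewards))):
        ret = rewards[i] + gamma * ret
        advantages[i] = ret - values[i]
    return advantages


def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def policy_loss(log_prob, advantage, probs, beta=0.01):
    """Loss whose gradient follows the policy-gradient term plus an entropy bonus.

    beta is a hypothetical weight for the entropy bonus.
    """
    return -log_prob * advantage - beta * entropy(probs)


if __name__ == "__main__":
    adv = n_step_advantages([1.0, 0.0, 2.0], [0.5, 0.5, 0.5], 1.0, gamma=0.5)
    print(adv)  # [1.125, 0.75, 2.0]
```

A near-deterministic policy has entropy close to zero, so subtracting `beta * entropy(probs)` from the loss penalizes early collapse onto a single action.
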
