[link]
The idea of the paper is to use reinforcement learning methods to learn a step size adaption rule to be used with evolutionary strategies. The authors use SARSA \cite{Rummery:1994} to learn a step size adaption. Their goal is to learn an adaption rule similar to the $1/5$ success rule \cite{schwefel1995evolution-and-o}, which updates the step size every $n$ mutations. It checks the number of successes over the preceding $10n$ mutations. If the number is less than $2n$ the step lengths are multiplied by a factor $0.85$ otherwise they are divided by $0.85$. SARSA learns a policy which decides (after every $n$ mutations) to either divide the step size; multiply the step size or keep the step size constant (using the same adaption factor of $0.85$) They evaluate their approach on the Rosenbrock function in several dimensions as well on optimization of the sphere. The optimization is terminated as soon as the optimized function value is smaller than a threshold $f < 10 ^{-10}$. If this criterion is not met within $N$ generations, the run is counted as not converged. The authors consider four reward funcions to learn the desired policy: $$\begin{array}{ll} R_1(s_t)=\left\{ \begin{array}{ll} +1&\mathrm{SR}\;\mathrm{increased}\\ \;\;\;0&\mathrm{SR}\;\mathrm{unchanged}\\ -1&\mathrm{SR}\;\mathrm{decreased} \end{array}\right\. & R_2(s_t)=f_{s_t} - f_{s_{t-1}}\\ R_3(s_t)=\mathrm{sgn}(f_{s_t} - f_{s_{t-1}})& R_4(s_t)=||x_{s_t}-x_{s_{t-1}}||\cdot\mathrm{sgn}(f_{s_t} - f_{s_{t-1}}) \end{array}$$ where SR is the convergence rate, $R(s_t)$ is the reward at step $t$, $s_t$ the current state, $s_{t-1}$ the state at which the previous reward was computed and $x_{s_t}$ the point at which $f$ is evaluated at time step $t$. Additionally they constrain the step size to lie in the interval $\left[10^{-15}, 10^{15}\right\]$, giving a reward of $-1$ if the step size is out of bounds. With their experiments they demonstrated that they could learn a policy that is very close to the $1/5$ rule. The combined RL-ES approach converged slower (e.g. $R_1$ $3$ to $7$ times slower on the Sphere function and $R_2\approx 3$times slower) than the pure ES approach and the following table shows that different rewards lead to different convergence rates. Problem | $N$ | $R_1$ conv. rate |$R_2$ conv. rate |$R_3$ conv. rate |$R_4$ conv. rate ---------:|:------:|------------------------:|----------------------:|-----------------------:|-----------------------: Sph 1D| $10^4$| $0.72$|$1.00$|$0.96$|$1.00$ Sph 5D| $10^4$| $0.64$|$0.98$|$0.94$|$0.99$ Sph 20D| $10^4$| $0.50$|$0.92$|$0.90$|$0.94$ Sph 80D| $10^5$| $0.57$|$1.00$|$0.99$|$1.00$ Ros 2D| $10^6$| $0.63$|$0.99$|$0.80$|$1.00$ Ros 5D| $10^7$| $0.77$|$1.00$|$0.77$|$1.00$ where $N$ is the number of generations within which the optimized function value has to be lower than $10^{-10}$. The authors demonstrated in a small setting that they were able to learn a well tested adaptation rule. They demonstrated neatly that the choice of the reward function is crucial in learning and that similar rewards can lead to very different results. |
[link]
The authors present an iterative approach for optimizing policies with guaranteed monotonic improvement. TRPO is similar to natural policy gradient methods and can be applied effectively in optimization of large nonlinear policies. \cite{KakadeL02} gave monotonic improvement guarantees for mixture of policies $\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s) + \alpha\pi'(a|s)$ where $\pi'=\mathrm{arg}\max_{\pi'}L_{\pi_{old}}(\pi')$ is the approximated expected return of a policy $\pi'$ in terms of the advantage over $\pi_{old}$, as $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$ with $\eta$ the true expected return and $\epsilon$ the maximally expected advantage. The authors extend this approach to be applicable for all stochastic policy classes by replacing $\alpha$ with a distance measure between two policies $\pi_{new}$ and $\pi_{old}$. As distance measure they use the maximal Kullback–Leibler divergence $D_{KL}^{\max}(\pi_{new},\pi_{old})$ and show that $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new}) -CD_{KL}^{\max}(\pi_{new},\pi_{old})$, with $C= \frac{4\epsilon\gamma}{(1-\gamma)^2}$. From this follows, that one is guaranteed to improve the true objective $\eta$ when performing the following maximaization $\mathrm{maximize}_\pi\left[L_{\pi_{old}}(\pi)-CD_{KL}^{\max}(\pi,\pi_{old})\right]$. In practice however $C$ would only allow for small steps. Thus constraining $\mathrm{maximize}_\pi L_{\pi_{old}}(\pi)$ subject to $D_{KL}^{\max}(\pi,\pi_{old}) \leq \delta$ allows for larger steps in a **Trust Region** Due to the large number of constraints this problem is impractical to solve, which is why the authors replace the maximum KL divergence with approximated average KL. TRPO then works as follows: 1. Use a rollout procedure to collet a set of state-action-pairs wit Monte Carlo estimates of their $Q$-Values 2. Average over the samples to construct the estimate objective $L_{\pi}$ as well as the constraint 3. Approximately solve the constrained optimization problem to update the policy parameters. They use the conjugate gradient algorithm followed by a linesearch. Their experiments support the claim that TRPO is able to effectively optimize large nonlinear policies. |