Over the past few days, I've been reading about the different generative neural networks being tried out for molecular generation. So far this has mostly meant latent-variable models like autoencoders, but today I shifted attention to a different approach rooted in reinforcement learning. The goal of most of these methods is 1) to build a generative model that can sample plausible molecular structures, but more saliently 2) to generate molecules specifically optimized to exhibit some property of interest.

The two autoencoder methods I read about did this by building a model to predict properties from the latent space, and then optimizing the latent vector to push up the value of those predictions. A central challenge for those methods was explicitly incentivizing structurally valid molecular representations when going "off distribution" in search of molecules that are not in your training set but that you predict will be better along some axis. Optimizing in any direction without constraints, particularly a direction governed by an imperfect predictive model, is likely to produce a model that takes the easy route of exploiting edge cases of your property-prediction model, rather than finding the more difficult, truly valid and novel structures.

https://i.imgur.com/NafoeDr.png

An advantage of using reinforcement learning as a framework here is that, because the quantity you optimize doesn't need to be a continuous, differentiable function of your outputs, you can explicitly add molecular validity, as calculated by some external program, as part of your reward signal. This allows you to penalize a model for optimizing away from valid outputs.

The specific approach proposed by the authors of this paper has two phases of training.

1) An RNN sequence model is trained to do character-by-character prediction of SMILES strings (a character-based molecular representation).
This is just a probability distribution over SMILES strings, with no RL involved yet, and is referred to as the Prior.

2) Take that pretrained sequence model, cache it, and then fine-tune on top of it with a hybrid RL and maximum likelihood loss. As seen in the equations below, this loss creates a hybrid, posterior-esque likelihood that combines the probability of an action sequence under the Prior (where an action is "the choice of next character given the currently generated string") with the reward (or "score", S(A)) of that action sequence, and tries to push the probability under the learned policy closer to that hybrid likelihood.

https://i.imgur.com/U4ZvKsJ.png

https://i.imgur.com/b28Ea7m.png

The notion here is that by including the Prior in your RL loss, you keep your generated molecules closer to the learned molecular distribution, rather than letting the policy drift towards edge cases that are improbable, but not in ways you were able to directly disincentivize with your reward function. This is an interesting framing of the problem as having two failure modes: generating molecules that are structurally invalid, in that they don't correspond to syntactically feasible representations, and generating molecules that are technically feasible but unlikely under the actual distribution of molecules, which captures more nuanced and hard-to-hardcode facts about energetic feasibility.

The authors experiment with three tasks:

- Learning to avoid structures that contain sulphur (with a reward function that penalizes both invalid molecules and the presence of sulphur). On this task, they show that methods that make use of the Prior do a better job of solving the problem in realistic ways, rather than overly simplified ones, than ablations that are only incentivized to increase reward or that are pulled towards the Prior in a less direct way.
- Learning to generate structures with high similarity to a particular reference molecule.
Here, they perform an interesting test where they remove the reference molecule and things similar to it from the training set of the Prior. This keeps the model from immediately falling into the easy solution of just generating exact copies of the reference molecule, and instead leads it to produce more interesting similar-but-not-identical analogues.
- Learning to generate structures that are predicted, by a separate predictive model, to be active against a target of interest. A similar Prior-limitation test was performed, where all the true positives from the bioactivity model were removed from sequence training, and this led to a more diverse set of solutions that did less of just mimicking the existing known positives.

Overall, while this paper was relatively straightforward from a machine learning perspective, I enjoyed it: the methods seemed a sensible improvement over prior work I'd read, and the evaluations were an interesting test of some of the paper's ideas.
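
To make the hybrid loss described earlier a bit more concrete, here is a minimal Python sketch of how I picture it. It assumes a squared-difference form, where the target is the Prior's log-likelihood of the action sequence plus the score scaled by a coefficient `sigma`. That exact form, `sigma`, the -1/0/1 score values, and all of the function names are my own assumptions for illustration, not details taken from the text above. The validity and sulphur checks are passed in as callables, standing in for whatever external program computes them.

```python
import math
from typing import Callable, Sequence


def sequence_log_prob(step_probs: Sequence[float]) -> float:
    """log P(A): sum of the log-probabilities of each chosen next character."""
    return sum(math.log(p) for p in step_probs)


def score(smiles: str,
          is_valid: Callable[[str], bool],
          has_sulphur: Callable[[str], bool]) -> float:
    """Toy S(A) for the sulphur task. The values -1 / 0 / 1 are assumed."""
    if not is_valid(smiles):
        return -1.0  # invalid string: penalized
    return 0.0 if has_sulphur(smiles) else 1.0


def augmented_log_likelihood(prior_log_prob: float, s: float, sigma: float) -> float:
    """Hybrid target: Prior log-likelihood shifted by the scaled score."""
    return prior_log_prob + sigma * s


def agent_loss(agent_log_prob: float, prior_log_prob: float,
               s: float, sigma: float) -> float:
    """Squared gap between the hybrid target and the policy's log-likelihood."""
    target = augmented_log_likelihood(prior_log_prob, s, sigma)
    return (target - agent_log_prob) ** 2


if __name__ == "__main__":
    # Hypothetical per-character probabilities for one sampled string.
    prior_lp = sequence_log_prob([0.5, 0.4, 0.9])
    agent_lp = sequence_log_prob([0.6, 0.5, 0.9])
    # Crude stand-ins for an external checker; a plain "S" test would
    # misfire on atoms such as [Si], so this is for illustration only.
    s = score("CCO", is_valid=lambda x: True, has_sulphur=lambda x: "S" in x)
    print(agent_loss(agent_lp, prior_lp, s, sigma=2.0))
```

Because the score only enters as a number added to the target, nothing in it has to be differentiable, which is what lets an external validity check sit inside the reward. The Prior term is what anchors the policy: with a score of zero, the loss is minimized by matching the Prior exactly.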