[link]
This paper's proposed method, the cleverly named ORGAN, combines techniques from GANs and reinforcement learning to generate candidate molecular sequences that incentivize desirable properties while still remaining plausibly ondistribution. Prior papers I've read on molecular generation have by and large used approaches based in maximum likelihood estimation (MLE)  where you construct some distribution over molecular representations, and maximize the probability of your true data under that distribution. However, MLE methods can't be less powerful when it comes to incentivizing your model to precisely conform with structural details of your target data distribution. Generative Adversarial Networks (GANs) on the other hand, use a discriminator loss that directly penalizes your generator for being recognizably different from the true data. However, GANs have historically been difficult to use on data like the stringbased molecular representations used in this paper. That's because strings are made up of discrete characters, which need to be sampled from underlying distributions, and we don't naively have good ways of making sampling from discrete distributions a differentiable process. SeqGAN was proposed to remedy this: instead of using the discriminator loss directly as the generator's loss  which would require backpropogating through the sampling operation  the generator is trained with reinforcement learning, using the discriminator score as a reward function. Each addition of an element to the sequence  or, in our case, each addition of a character to our molecular representation string  represents an action, and full sequences are rewarded based on the extent to which they resemble true sequences. https://i.imgur.com/dqtcJDU.png This paper proposes taking that model as a base, but then adding a more actuallyrewardoriented reward onto it, incentivizing the model to produce molecules that have certain predicted properties, as determined by a (also very not differentiable) external molecular simulator. So, just taking a weighted sum of discriminator loss and reward, and using that as your RL reward. After all, if you're already using the policy gradient structures of RL to train the underlying generator, you might as well add on some more traditionallooking RL rewards. The central intuition behind using RL in both of these cases is that it provides a way of using training signals that are unknown or  more to the point  nondifferentiable functions functions of model output. In their empirical tests focusing on molecules, the authors target the RL to optimize for one of solubility, synthesizability, and druggability (three welldefined properties within molecular simulator RDKit), as well as for uniqueness, penalizing any molecule that had been generated already. https://i.imgur.com/WszVd1M.png For all that this is an interesting mechanic, the empirical results are more equivocal. Compared to Naive RL, which directly optimizes for reward without the discriminator loss, ORGAN (Or, ORWGAN, the betterperforming method using a Wasserstein GAN) doesn't have notably better rates of how often its generated strings are valid, and (as you would expect) performs comparably or slightly worse when it comes to optimizing the underlying reward. It does exhibit higher diversity than naive RL on two of the three tasks, but it's hard to get an intuition for the scales involved, and how much that scale of diversity increase would impact real results.
Your comment:
