In the last three years, Transformers, or models based entirely on attention for aggregating information from across multiple places in a sequence, have taken over the world of NLP. In this paper, the authors propose using a Transformer to learn a molecular representation, and then building a model to predict drug/target interaction on top of that learned representation. A drug/target interaction model takes in two inputs - a protein involved in a disease pathway, and a (typically small) molecule being considered as a candidate drug - and predicts the binding affinity they'll have for one another. If binding takes place between the two, that protein will be prevented from playing its role in the chain of disease progression, and the drug will be effective. The mechanics of the proposed Molecular Transformer DTI (or MT-DTI) model work as follows: https://i.imgur.com/ehfjMK3.png - Molecules are represented as SMILES strings, a character-based representation of atoms and their structural connections. Proteins are represented as sequences of amino acids. - A Transformer network is constructed over the characters in the SMILES string, where, at each level, the representation of each character comes from taking an attention-weighted average of the representations at other positions in the character string at that layer. At the final layer, the aggregated molecular representation is taken from a special "REP" token. - The molecular transformer is pre-trained in BERT-like way: by taking a large corpus (97M molecules) of unsupervised molecular representations, masking or randomizing tokens within the strings, and training the model to predict the true correct token at each point. The hope is that this task will force the representations to encode information relevant to the global structures and chemical constraints of the molecule in question - This pre-trained Transformer is then plugged into the DTI model, alongside a protein representation model in the form of multiple layers convolutions acting on embedded representations of amino acids. The authors noted that they considered a similar pretrained transformer architecture for the protein representation side of the model, but that they chose not to because (1) the calculations involved in attention are N^2 in the length of the sequence, and proteins are meaningfully longer than the small molecules being studied, and (2) there's no comparably large dataset of protein sequences that could be effectively used for pretraining - The protein and molecule representations are combined with multiple dense layers, and then produce a final affinity score. Although the molecular representation model starts with a set of pretrained weights, it also fine tunes on top of them. https://i.imgur.com/qybLKvf.png When evaluated empirically on two DTI datasets, this attention based model outperforms the prior SOTA, using a convolutional architecture, by a small but consistent amount across all metrics. Interestingly, their model is reasonably competitive even if they don't fine-tune the molecular representation for the DTI task, but only pretraining and fine-tuning together get the MT-DTI model over the threshold of prior work.