This paper comes from the creators of the DeepChem library, and describes the work they've put into making tests of new methods straightforward and reproducible. They advocate for consistency between tests along three main axes:

1. On the most basic level, that methods evaluate on the same datasets
2. That they use canonical train/test splits
3. That they use canonical metrics

To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing standardized interfaces to datasets, metrics, and train/test splits.

**Datasets**

MoleculeNet contains 17 different datasets, where "dataset" here just means a collection of data labeled for a certain task or set of tasks. The tasks fall into one of four groups:

- quantum mechanical prediction (atomization energy, spectra)
- prediction of properties of physical chemistry (solubility, lipophilicity)
- prediction of biophysical interactions like binding affinity
- prediction of human-level physiological properties (toxicity, side effects, whether a molecule passes the blood-brain barrier)

An interesting thing to note here is that only some datasets contain 3D coordinates of molecules, because spatial arrangements are properties of *a given conformation* of a molecule, and while some output measures (like binding geometry) depend on 3D arrangement, others (like solubility) don't.

**Metrics**

The metrics chosen were pretty straightforward: Root Mean Squared Error or Mean Absolute Error for continuous prediction (regression) tasks, and ROC-AUC or PRC-AUC for classification ones. The only notable nuance was that the paper argued for PRC-AUC as the standard metric for datasets with a low number of positives, since it is the metric that penalizes false positives most heavily.

**Test/Train Split**

Most of these were fairly normal (random split and time-based split), but I found the idea of a scaffold split interesting: you group molecules by their shared two-dimensional structural scaffold, so that structurally similar molecules end up together, and assign each group wholly to either train or test.
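To make the scaffold split concrete, here is a minimal pure-Python sketch. It is not DeepChem's implementation: the function name `scaffold_split`, the `frac_train` parameter, and the largest-group-first ordering are my own assumptions for illustration, and the scaffold strings are assumed to have been computed elsewhere (for example with a cheminformatics toolkit).

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Split molecule indices so that no scaffold appears in both train and test.

    `scaffolds` holds one scaffold identifier per molecule. Computing those
    identifiers is out of scope here; they are assumed to be given.
    """
    # Group molecule indices by their scaffold identifier.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Assumption: visit the largest groups first (ties broken by first index),
    # so common scaffolds tend to land in train and rare ones in test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    train_cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A group is assigned as a whole; it is never divided between sets.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy input: ten molecules over five hypothetical scaffold strings.
scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "c1ccncc1",
    "C1CCCCC1", "c1ccoc1", "c1ccsc1", "c1ccccc1", "C1CCCCC1",
]
train_idx, test_idx = scaffold_split(scaffolds, frac_train=0.8)
```

In this toy example the two singleton scaffolds end up in the test set, so the model is evaluated only on scaffolds it never saw during training.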
The idea behind the scaffold split is that if molecules are similar enough to one another, seeing one of a pair during training might be comparable to having an actual shared example between training and test, with the same propensity for overconfident results.

**Models**

DeepChem has put together implementations of a number of standard machine learning methods (SVM, Random Forest, XGBoost, Logistic Regression) on molecular features, as well as a number of molecule-specific graph-structured methods. At a high level, these are:

https://i.imgur.com/x4yutlp.png

- Graph Convolutions, which update atom representations by combining transformations of the features of bonded neighbor atoms
- DAGs, which create an "atom-centric" graph for each atom in the molecule and "pull" information inwards from farther-away nodes (for the record, I don't fully follow how this one works, since I haven't read the underlying paper)
- Weave Model, which maintains both atom representations and pair representations between all pairs of atoms, not just ones bonded to one another, and updates each in a cross-cutting way: updating an atom representation from all of its pairs (as well as itself), and then updating a pair representation from the atoms in its pairing (as well as itself). This has the benefit of making information from far-away atoms available immediately, rather than having to propagate through a graph, but is also more computationally taxing
- Message Passing Neural Network, which operates like Graph Convolutions except that the feature transform used to pull in information from neighboring atoms changes depending on the type of the bond between atoms
- Deep Tensor Neural Network, which represents atoms in 3D space instead of through bonds, and pulls in information from other atoms that are nearby in spatial distance

**Results**

As part of creating its benchmark, MoleculeNet also tested its implementations of its models on all its datasets.
It's interesting how much the results form a narrative about which tasks benefit most from flexible structure-based methods (like graph approaches) vs hand-engineered features.

https://i.imgur.com/dCAdJac.png

Predictions of quantum mechanical properties and physical chemistry properties do consistently better with graph-based methods, potentially suggesting that the features we've thought to engineer aren't in line with useful features for those tasks. By contrast, on biophysical tasks, hand-engineered features combined with traditional machine learning mostly come out on top, a fact I found a bit surprising, given how much I'd read about deep learning methods claiming strong results on prediction of things like binding affinity. This is a useful pointer to something I should read more about to resolve the apparent conflict. And when it comes to physiological properties like toxicity and side effects, results are pretty mixed between graph-based and traditional methods.
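Since the graph-based methods carry so much of this story, it may help to see the kind of update the Graph Convolutions entry describes. The following is a minimal pure-Python sketch, not DeepChem's implementation: the function name `graph_conv_step`, the two weight matrices `w_self` and `w_neigh`, and the ReLU nonlinearity are all assumptions chosen for illustration.

```python
def graph_conv_step(atom_feats, bonds, w_self, w_neigh):
    """One round of neighbor aggregation over a molecular graph.

    atom_feats: list of feature vectors, one per atom
    bonds:      list of (i, j) index pairs for bonded atoms
    w_self, w_neigh: weight matrices as lists of rows (hypothetical
                     stand-ins for learned parameters)
    """
    def matvec(w, v):
        return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

    # Build an adjacency list; bonds are treated as undirected.
    neighbors = [[] for _ in atom_feats]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    updated = []
    for i, feats in enumerate(atom_feats):
        # Start from a transform of the atom's own features ...
        total = matvec(w_self, feats)
        # ... then add a transform of each bonded neighbor's features.
        for j in neighbors[i]:
            msg = matvec(w_neigh, atom_feats[j])
            total = [t + m for t, m in zip(total, msg)]
        # ReLU nonlinearity (an assumption of this sketch).
        updated.append([max(0.0, t) for t in total])
    return updated


# Toy chain of three atoms (0-1-2) with one-hot features and identity weights.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bonds = [(0, 1), (1, 2)]

h1 = graph_conv_step(feats, bonds, identity, identity)
h2 = graph_conv_step(h1, bonds, identity, identity)
```

After one round, atom 0 has only seen its bonded neighbor (`h1[0]` is `[1.0, 1.0, 0.0]`); it takes a second round for information from atom 2 to arrive (`h2[0]` is `[2.0, 2.0, 1.0]`). That propagation delay is what the Weave Model avoids by keeping pair representations for all atom pairs. Making `w_neigh` depend on the bond type would give the bond-dependent variant described in the Message Passing Neural Network entry.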