By Boris Sattarov
The Computational Science Lab, a research group at Universitat Pompeu Fabra led by Gianni De Fabritiis (Acellera’s CEO), recently participated in the 2019 D3R SAMPL6 challenge, which consisted of predicting octanol/water partition coefficients (logP) of chemical compounds. logP measures the difference in solubility of a compound between two immiscible phases. If these two phases are water and a highly hydrophobic solvent (octanol in this case), logP tells us how hydrophobic or hydrophilic that compound is, which is one of several properties that can determine the viability of a drug candidate. Hence, accurately predicting this property helps medicinal chemists make better decisions.
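For reference, the partition coefficient is conventionally written as the base-10 logarithm of a concentration ratio. The formula below is the standard textbook definition rather than something specific to the challenge, and it assumes the concentrations refer to the neutral (un-ionized) compound at equilibrium:

```latex
\log P = \log_{10}\!\left(\frac{[\text{solute}]_{\text{octanol}}}{[\text{solute}]_{\text{water}}}\right)
```

Under this convention, a positive logP indicates a compound that prefers octanol (more hydrophobic), and a negative logP indicates one that prefers water (more hydrophilic).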
Our group has extensive experience in applying machine learning (ML) algorithms to problems in the natural sciences. We therefore decided to use ML for logP prediction, and for that we needed training data, i.e. molecules with experimentally measured logP values. For this purpose, we used the EPI Kow dataset (from EPA’s OPERA toolkit), from which we extracted chemical structures with experimentally measured logP as the label. This training set was pre-processed in the generally accepted QSAR-ready manner (desalting, disconnecting metals, standardizing NO2 groups, etc.). Furthermore, we filtered out of the training set compounds that are dissimilar to the 11 target ones (Tanimoto index on ECFP4 fingerprints lower than 0.05). This allowed us to easily get rid of compounds that are too far from the target ones in chemical space and might disrupt the training. The preprocessing pipeline resulted in about 12,746 compounds in the training set.
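The similarity filter can be sketched in a few lines. This is a minimal illustration, not the pipeline we actually ran: it assumes each fingerprint is given as a set of on-bit indices (in practice these would come from a cheminformatics toolkit such as RDKit), and it assumes that a training compound is kept when its best Tanimoto similarity to any target reaches the threshold. The function names and the toy bit indices are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_by_similarity(training, targets, threshold=0.05):
    """Keep training entries whose best similarity to any target reaches the threshold."""
    kept = {}
    for name, fp in training.items():
        best = max(tanimoto(fp, t) for t in targets)
        if best >= threshold:
            kept[name] = fp
    return kept


# Toy fingerprints (hypothetical bit indices, not real ECFP4 output).
targets = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6})]
training = {
    "close": frozenset({1, 2, 3, 9}),   # shares bits with the first target
    "far": frozenset({100, 101, 102}),  # shares no bits with any target
}
kept = filter_by_similarity(training, targets)
print(sorted(kept))
```

With a threshold as low as 0.05, only compounds that share almost nothing with any target are removed, which is why most of the dataset survives the filter.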
In the final data preparation step, we featurized the molecules in the training set. This means we transformed the initial molecular structures in SMILES format into a suitable feature space that can be used as input to the machine learning algorithm of choice. There are multiple ways to do this, for example SMILES-based seq2seq fingerprints, graph-based representations, 3D voxelization, 2D pictures of the compounds, etc. We decided to use a combination of several types of fingerprints (ECFP4 + Avalon1024 + MACCS keys) and 199 descriptors available in RDKit (MolWt, Chi’s, Kappa, BalabanJ, etc.).
Finally, we trained the ML model. In this case, we used the XGBoost library’s implementation of extreme gradient boosting, a tree-based method. 10-fold cross-validation was used to evaluate the performance of the model in logP prediction (Q^2, RMSE, MAE).
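To make the evaluation metrics concrete, here is a small sketch of how Q^2, RMSE and MAE can be computed from out-of-fold predictions. It is illustrative only: the fold assignment is a simple round-robin split, the "model" is a stand-in that predicts the mean of the training folds (not XGBoost), the data are made up, and Q^2 is taken as 1 - PRESS/TSS with TSS computed around the mean of the observed values, which is one common convention.

```python
import math


def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs, assigning sample i to fold i % n_folds."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def regression_metrics(y_true, y_pred):
    """Return Q^2 (1 - PRESS/TSS), RMSE and MAE for out-of-fold predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"q2": 1.0 - press / tss, "rmse": math.sqrt(press / n), "mae": mae}


# Toy logP-like values; the stand-in model predicts the training-fold mean.
y = [0.5, 1.2, 2.0, 2.9, 3.1, 1.8, 0.9, 2.4, 3.5, 1.1, 2.2, 0.7]
oof = [0.0] * len(y)  # out-of-fold predictions, filled fold by fold
for train_idx, test_idx in k_fold_indices(len(y), n_folds=4):
    fold_mean = sum(y[i] for i in train_idx) / len(train_idx)
    for i in test_idx:
        oof[i] = fold_mean

print(regression_metrics(y, oof))
```

Because every prediction comes from a model that never saw that sample, the resulting Q^2 is a cross-validated analogue of R^2 and is typically lower than the R^2 obtained on the training data.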
Our best submission (code: gmoq5) of logP predictions for the 11 compounds ranked 2nd against 90 other submissions in mean RMSE (0.39) and achieved a mean R^2 (coefficient of determination) of 0.74, calculated by bootstrapping with 10,000 samples. We want to highlight that the results of this challenge are highly stochastic because of the small number of compounds predicted (only 11 in this case). As a consequence, the 95% confidence intervals of the mean RMSE overlap for approximately the top 20 submissions. All results are available here.
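The effect of having only 11 compounds can be illustrated with a simple bootstrap. The sketch below is not the organizers' evaluation code: it assumes the bootstrap resamples compounds with replacement and uses a percentile interval, and the (experimental, predicted) pairs are hypothetical numbers chosen only for illustration.

```python
import math
import random


def rmse(pairs):
    """Root-mean-square error over (experimental, predicted) pairs."""
    return math.sqrt(sum((e - p) ** 2 for e, p in pairs) / len(pairs))


def bootstrap_rmse(pairs, n_samples=10000, seed=0):
    """Resample pairs with replacement; return mean RMSE and a 95% percentile interval."""
    rng = random.Random(seed)
    stats = sorted(
        rmse([rng.choice(pairs) for _ in pairs]) for _ in range(n_samples)
    )
    lower = stats[int(0.025 * n_samples)]
    upper = stats[int(0.975 * n_samples) - 1]
    return sum(stats) / n_samples, lower, upper


# Hypothetical (experimental, predicted) logP values for 11 compounds.
pairs = [(2.2, 2.5), (3.1, 2.8), (4.0, 4.3), (1.9, 2.4), (3.0, 3.1), (2.7, 2.2),
         (3.9, 3.6), (4.1, 4.6), (3.3, 3.0), (2.6, 2.9), (3.7, 3.2)]
mean_rmse, low, high = bootstrap_rmse(pairs)
print(f"mean RMSE {mean_rmse:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With so few compounds, each resample can omit or duplicate the worst-predicted molecules, so the interval around the mean RMSE is wide relative to the differences between closely ranked submissions.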
The results of the challenge suggest that simple data-driven approaches such as machine learning-based QSPR models are still extremely useful tools for predicting compound lipophilicity and might serve as a reasonable external benchmark for advanced ab initio QM and MM methods.