Science Inventory

Open‑source QSAR models for pKa prediction using multiple machine learning approaches

Citation:

Mansouri, K., N. Cariello, A. Korotcov, V. Tkachenko, C. Grulke, C. Sprankle, D. Allen, W. Casey, N. Kleinstreuer, AND A. Williams. Open‑source QSAR models for pKa prediction using multiple machine learning approaches. Journal of Cheminformatics. Springer, New York, NY, 11(60):1-20, (2019). https://doi.org/10.1186/s13321-019-0384-1

Impact/Purpose:

The logarithmic acid dissociation constant, pKa, reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and the ability of the chemical to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available dataset and three machine learning approaches, we developed open-source models for pKa prediction.

Description:

Experimental pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for QSAR modeling using KNIME, and 79% of the initial set was used for modeling. To evaluate different modeling approaches, several datasets were constructed that varied in how chemical structures with acidic and/or basic pKa values were processed. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) Support Vector Machine (SVM) combined with k-Nearest Neighbors (kNN), (2) Extreme Gradient Boosting (XGB), and (3) Deep Neural Networks (DNN). The three methods delivered comparable performance on the training and test sets, with a Root Mean Squared Error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors, from ACD/Labs and ChemAxon, were used to benchmark the three best models developed in this work. This work provides multiple QSAR models to predict pKa, built using publicly available data and provided as free and open-source software on GitHub.
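
The two performance statistics quoted above can be illustrated with a minimal Python sketch. It is not code from the published models. It assumes the common definitions RMSE = sqrt(mean squared residual) and R² = 1 - SS_res/SS_tot; the paper may define R² differently (for example, as a squared correlation). The function names and the example pKa values are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical pKa values, for illustration only (not from the dataset).
    experimental = [4.8, 9.2, 3.1, 10.3, 7.0]
    model_output = [5.5, 8.1, 4.0, 9.6, 6.2]
    print(f"RMSE = {rmse(experimental, model_output):.2f}")
    print(f"R2   = {r_squared(experimental, model_output):.2f}")
```

RMSE is expressed in the same units as the predicted quantity (here, pKa units), while R² is unitless, which is why the two are usually reported together.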

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 09/18/2019
Record Last Revised: 11/22/2019
OMB Category: Other
Record ID: 347559