Science Inventory

Transparency in Modeling through Careful Application of OECD’s QSAR/QSPR Principles via a Curated Water Solubility Data Set

Citation:

Lowe, C., N. Charest, C. Ramsland, D. Chang, T. Martin, AND A. Williams. Transparency in Modeling through Careful Application of OECD’s QSAR/QSPR Principles via a Curated Water Solubility Data Set. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, 36(3):465-478, (2023). https://doi.org/10.1021/acs.chemrestox.2c00379

Impact/Purpose:

The goal of this product is to develop cheminformatics and computational chemistry tools and datasets to support environmental chemistry and toxicology. Prediction of key endpoints is often necessary due to the sparsity of available experimental toxicity and environmental data. Such models, while less desirable than concrete measured values, often provide the only quantitative or binary hazard metrics for the vast majority of chemicals in the environment. As interest in large-scale evaluation and prioritization of potential toxicants increases, the development of reliable models that follow community-accepted QSAR best practices for hazard endpoints is necessary to provide defensible hazard estimations to support risk-based prioritization. This product gives regulatory scientists, students, and researchers the ability to effectively access and exploit the many in silico data streams for different regulatory purposes, and it supports current Agency efforts to reduce mammal study requests by 30% by 2025 and to completely eliminate all mammal study requests and funding by 2035.

Description:

The need for careful assembly, training, and validation of quantitative structure–activity/property models (QSAR/QSPR) is more significant than ever as data sets become larger and sophisticated machine learning tools become increasingly ubiquitous and accessible to the scientific community. Regulatory agencies such as the United States Environmental Protection Agency must carefully scrutinize each aspect of a resulting QSAR/QSPR model to determine its potential use in environmental exposure and hazard assessment. Herein, we revisit the goals of the Organisation for Economic Cooperation and Development (OECD) in our application and discuss the validation principles for structure–activity models. We apply these principles to a model for predicting water solubility of organic compounds derived using random forest regression, a common machine learning approach in the QSAR/QSPR literature. Using public sources, we carefully assembled and curated a data set consisting of 10,200 unique chemical structures with associated water solubility measurements. This data set was then used as a focal narrative to methodically consider the OECD’s QSAR/QSPR principles and how they can be applied to random forests. Despite some expert, mechanistically informed supervision of descriptor selection to enhance model interpretability, we achieved a model of water solubility with comparable performance to previously published models (5-fold cross-validated performance of 0.81 R² and 0.98 RMSE). We hope this work will catalyze a necessary conversation around the importance of cautiously modernizing and explicitly leveraging OECD principles while pursuing state-of-the-art machine learning approaches to derive QSAR/QSPR models suitable for regulatory consideration.
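
The performance figures quoted above come from 5-fold cross-validation. As a rough illustration of how such figures might be computed, the sketch below pools out-of-fold predictions and reports R² and RMSE over them. It rests on several assumptions that are not from the source: the data are synthetic, a one-descriptor least-squares line stands in for the random forest, the function names (`k_fold_indices`, `fit_line`, `cross_validated_metrics`) are hypothetical, and the record does not say whether the authors pooled predictions or averaged per-fold scores.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a stand-in, not a random forest)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def cross_validated_metrics(xs, ys, k=5, seed=0):
    """Return (R2, RMSE) computed over pooled out-of-fold predictions."""
    preds = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit only on the training folds, then predict the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = a * xs[i] + b
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))


# Synthetic data only: one made-up descriptor x and a noisy linear response y.
rng = random.Random(42)
xs = [rng.uniform(-2.0, 6.0) for _ in range(200)]
ys = [-1.0 * x + 0.5 + rng.gauss(0.0, 1.0) for x in xs]

r2, rmse = cross_validated_metrics(xs, ys, k=5)
print(f"5-fold CV R2 = {r2:.2f}, RMSE = {rmse:.2f}")
```

Because every prediction is made by a model that never saw that compound during fitting, metrics computed this way estimate performance on unseen structures rather than goodness of fit to the training data.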

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 03/20/2023
Record Last Revised: 04/25/2023
OMB Category: Other
Record ID: 357679