Science Inventory

Determining the Predictive Limit of QSAR Models (QSAR 2021)

Citation:

Kolmar, S. AND Chris Grulke. Determining the Predictive Limit of QSAR Models (QSAR 2021). QSAR 2021 International Workshop on QSAR in Environmental and Health Sciences, Virtual, NC, June 07 - 10, 2021. https://doi.org/10.23645/epacomptox.15070269

Impact/Purpose:

Presentation to the QSAR 2021 International Workshop on QSAR in Environmental and Health Sciences, June 2021. QSAR models provide an automated method for estimating many types of chemical-safety-relevant endpoints for data-poor chemicals. To provide robust QSAR models to inform chemical evaluation, a set of best practices for modeling and dataset collection will be defined. These procedures will then be applied to endpoints with known value to the Agency in its chemical safety efforts, including the prediction of toxicities, bioactivities, and environmental fate and physicochemical properties to support exposure modeling. Where appropriate, the predictive performance of models will be compared with that of current models used by the program offices to ensure accuracy and fitness for purpose. Finally, research into the interplay between dataset attributes (e.g., size, noisiness, curation level, source disparities) and model quality (predictive power) will be completed to better estimate the uncertainty of our predictions and to provide guidance for improving our QSAR modeling strategies in the future.

Description:

Quantitatively evaluating QSAR models is becoming more important and more challenging as the number of predictive models grows. The impact of experimental uncertainty on model evaluation has been recognized in the field, but one assumption is repeated throughout the literature: that a QSAR model cannot predict more accurately than the data it is trained on. This study questions that assumption by observing how the addition of simulated random error affects the prediction error for several common algorithms and for 7 diverse endpoints. First, an algorithm is trained on a dataset with added noise. Two main quantities are then calculated for comparison. The first is the root mean squared error (RMSE) of the predicted quantities versus the “noisy” experimental quantities. The second is the RMSE of the predicted quantities versus the original “true” data, which we term RMSEtrue. Comparing these quantities shows how much better the algorithm predicts the true values than the noisy values. The results of this study show that RMSE is always larger (worse) than RMSEtrue for the datasets and algorithms studied. The main conclusion is that QSAR models can make predictions that are more accurate than the noisy data on which they were trained; however, quantitatively assessing that accuracy using equally noisy validation sets obscures this. This conclusion has implications for many QSAR-adjacent fields in which datasets have high levels of uncertainty, such as toxicology, and suggests that model predictions may be more accurate than previously thought. The views expressed in this abstract are those of the author(s) and do not necessarily reflect the views or policies of the US EPA.
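
The comparison of RMSE and RMSEtrue described above can be illustrated with a minimal toy sketch. Everything specific here is an assumption made for illustration and is not taken from the presentation: the "true" relationship is a noise-free straight line, the simulated error is Gaussian with standard deviation `sigma`, and the "algorithm" is ordinary least squares. The function names (`rmse`, `fit_line`, `noise_experiment`) are hypothetical, and the study's actual datasets, endpoints, and algorithms are not reproduced.

```python
import math
import random


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def noise_experiment(n_train=500, n_test=500, sigma=1.0, seed=0):
    """Train on noisy data, then score against noisy and true test values."""
    rng = random.Random(seed)

    def true_value(x):
        # Assumed noise-free "true" endpoint (illustrative only).
        return 2.0 * x + 1.0

    x_train = [rng.uniform(0.0, 10.0) for _ in range(n_train)]
    x_test = [rng.uniform(0.0, 10.0) for _ in range(n_test)]

    # Simulated random error is added to both training and test values.
    y_train_noisy = [true_value(x) + rng.gauss(0.0, sigma) for x in x_train]
    y_test_true = [true_value(x) for x in x_test]
    y_test_noisy = [y + rng.gauss(0.0, sigma) for y in y_test_true]

    # The model only ever sees the noisy training values.
    slope, intercept = fit_line(x_train, y_train_noisy)
    predictions = [slope * x + intercept for x in x_test]

    rmse_vs_noisy = rmse(predictions, y_test_noisy)  # "RMSE"
    rmse_vs_true = rmse(predictions, y_test_true)    # "RMSEtrue"
    return rmse_vs_noisy, rmse_vs_true


rmse_noisy, rmse_true = noise_experiment()
print(f"RMSE vs noisy test values: {rmse_noisy:.3f}")
print(f"RMSE vs true test values:  {rmse_true:.3f}")
```

In this toy setup the test-set noise is independent of the model's predictions and has zero mean, so the expected squared RMSE is approximately the squared RMSEtrue plus the noise variance. RMSE therefore stays near the noise level even when the fitted line tracks the underlying trend closely, which is the sense in which an equally noisy validation set can understate a model's accuracy.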

URLs/Downloads:

DOI: Determining the Predictive Limit of QSAR Models (QSAR 2021)

SKOLMAR_051221_QSAR2021.PDF (PDF, NA pp, 1194.822 KB)

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 06/10/2021
Record Last Revised: 07/28/2021
OMB Category: Other
Record ID: 352425