Science Inventory

An Analysis of Overfitting in Modern QSAR Models

Citation:

Charest, N., G. Sinclair, C. Ramsland, T. Martin, AND A. Williams. An Analysis of Overfitting in Modern QSAR Models. Fall ACS, Chicago, IL, August 21 - 25, 2022. https://doi.org/10.23645/epacomptox.20505615

Impact/Purpose:

N/A

Description:

‘Overfitting’ is a phenomenon in which a mathematical model attains high performance on its training data by ‘memorizing’ its points. This is reflected in a significant difference between the model’s statistical performance in predicting its training data and its performance in predicting external validation data unseen during training. Conventional wisdom suggests such models will necessarily internalize noise or spurious correlations within the training data, and that models which achieve parity between training and validation performance have reached a more desirable state of generalization. Typically, such generalized models are observed to perform better on their external sets, while overfit models perform worse. Across numerous endpoints, however, we empirically observe QSA/PR (Quantitative Structure Activity/Property Relationship) models that perform better on external data when they are more ‘overfit’, a trend at odds with modeling best practices. This project analyzes possible causes of the observed behavior and attempts to reconcile it with the theories of computational modeling. We consider the possibility of multiple signals within the data describing different mechanisms generating the endpoint, the influence of coverage within chemical space, and the relevance of hyperparameterization to model fitting and performance. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
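The training-versus-external gap described in the abstract can be illustrated with a minimal sketch. It is not the authors' method or data: the synthetic linear endpoint, the choice of R² as the statistic, and the helper names (`r2_score`, `fit_line`, `fit_nearest_neighbor`, `generalization_gap`) are all assumptions made for illustration. A 1-nearest-neighbor model stands in for a 'memorizing' model and a least-squares line for a more constrained one.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a constrained model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return lambda x: a * x + b

def fit_nearest_neighbor(xs, ys):
    """1-nearest-neighbor: reproduces every training point exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def generalization_gap(model, train, external):
    """Return (R2 on training set, R2 on external set, their difference)."""
    (x_tr, y_tr), (x_ex, y_ex) = train, external
    r2_train = r2_score(y_tr, [model(x) for x in x_tr])
    r2_ext = r2_score(y_ex, [model(x) for x in x_ex])
    return r2_train, r2_ext, r2_train - r2_ext

def make_data(n, rng):
    """Hypothetical endpoint: a linear signal plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(50, rng)      # data seen during fitting
external = make_data(50, rng)   # data unseen during fitting

results = {}
for name, fit in [("line", fit_line), ("1-NN", fit_nearest_neighbor)]:
    model = fit(*train)
    results[name] = generalization_gap(model, train, external)
    r2_train, r2_ext, gap = results[name]
    print(f"{name}: train R2={r2_train:.3f} external R2={r2_ext:.3f} gap={gap:.3f}")
```

On this toy data the memorizing model reaches a training R² of exactly 1 by construction, so its gap is driven entirely by its external performance. Whether a larger gap goes with worse or better external performance on real QSA/PR endpoints is the question the presentation examines; this sketch only shows how the gap itself might be measured.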

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 08/25/2022
Record Last Revised: 08/30/2022
OMB Category: Other
Record ID: 355573