Science Inventory

Categorizing Continuous Data in QSAR

Citation:

Kolmar, S. Categorizing Continuous Data in QSAR. Spring ACS National Meeting, San Diego, CA, March 20 - 24, 2022. https://doi.org/10.23645/epacomptox.19333196

Impact/Purpose:

QSAR models provide an automated method for estimating all types of chemical-safety-relevant endpoints for data-poor chemicals. To provide robust QSAR models that inform chemical evaluation, a set of best practices for modeling and dataset collection will be defined. These procedures will then be applied to endpoints with known value to the Agency in its chemical safety efforts, including the prediction of toxicities, bioactivities, and the environmental fate and physicochemical properties that support exposure modeling. Where appropriate, the predictive performance of the models will be compared with that of models currently used by the program offices to ensure accuracy and fitness for purpose. Finally, research into the interplay between dataset attributes (e.g., size, noisiness, curation level, source disparities) and model quality (predictive power) will be completed to better estimate the uncertainty of our predictions and to provide guidance for improving our QSAR modeling strategies in the future.

Description:

The growing application of Quantitative Structure-Activity Relationship (QSAR) principles to the field of computational toxicology motivates many QSAR modelers to present binary predictions, rather than continuous predictions, to their toxicological audiences. Additionally, the idea that categorizing continuous data mitigates the effect of systematic and random error has led to the practice of categorizing natively continuous data. On a fundamental statistical level, categorizing continuous data prior to algorithm training and prediction results in a significant loss of information and statistical power. This work discusses the fundamental statistics that make continuous data categorization bad practice and investigates how the categorization of continuous data affects the performance of QSAR models. Using several benchmark datasets and machine learning algorithms, models in which continuous data are categorized before algorithm training are compared with models in which the data are categorized after algorithm training and prediction. Benchmark datasets include a quantum mechanical calculation of the free energy of atomization (ΔG⁰ₐₜ), experimental enthalpy of hydration, and experimental in vitro AC50 from the U.S. EPA’s ToxCast database. Algorithms used in this work include decision trees, k-nearest neighbors, support vector machines, and random forest. The results suggest that the effect of categorizing before versus after algorithm training depends on the number of feature variables, the variance of each feature variable, the covariance matrix of the feature variables, and the algorithm and its hyperparameters. This work does not necessarily reflect U.S. EPA policy.
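
The two workflows compared in the description can be illustrated with a minimal sketch. This is not the presentation's code or data: the synthetic endpoint, the threshold of 0.0, the choice of k = 5, and every function name below (make_data, knn_neighbors, categorize, compare) are assumptions introduced only for illustration. The sketch uses a hand-written k-nearest-neighbors predictor, one of the algorithm families named above, and reports the classification accuracy obtained when the continuous endpoint is categorized before training and when it is categorized after prediction.

```python
import random
from statistics import mean


def make_data(n, seed):
    """Hypothetical continuous endpoint: y = x1 + x2 + Gaussian noise."""
    rng = random.Random(seed)
    X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    y = [x1 + x2 + rng.gauss(0, 0.5) for x1, x2 in X]
    return X, y


def knn_neighbors(X_train, x, k):
    """Indices of the k training points closest to x (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), i)
        for i, row in enumerate(X_train)
    )
    return [i for _, i in dists[:k]]


def categorize(value, threshold):
    """Binarize a continuous value: 1 if at or above the threshold, else 0."""
    return 1 if value >= threshold else 0


def compare(k=5, threshold=0.0, n_train=300, n_test=200):
    """Return (accuracy_categorize_before, accuracy_categorize_after)."""
    X_train, y_train = make_data(n_train, seed=0)
    X_test, y_test = make_data(n_test, seed=1)
    truth = [categorize(v, threshold) for v in y_test]
    labels_train = [categorize(v, threshold) for v in y_train]

    before, after = [], []
    for x in X_test:
        idx = knn_neighbors(X_train, x, k)
        # Categorize BEFORE training: majority vote over binary neighbor labels.
        vote = mean(labels_train[i] for i in idx)
        before.append(1 if vote >= 0.5 else 0)
        # Categorize AFTER prediction: average continuous neighbor values, then threshold.
        after.append(categorize(mean(y_train[i] for i in idx), threshold))

    def accuracy(pred):
        return mean(int(p == t) for p, t in zip(pred, truth))

    return accuracy(before), accuracy(after)


if __name__ == "__main__":
    acc_before, acc_after = compare()
    print(f"categorize before training: {acc_before:.3f}")
    print(f"categorize after prediction: {acc_after:.3f}")
```

Which workflow scores higher in this toy setting says nothing about the benchmark datasets above; as the description notes, the effect depends on the features, their variances and covariances, and the algorithm and its hyperparameters. Varying k, the threshold, or the noise level in make_data is one way to explore that dependence.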

URLs/Downloads:

DOI: Categorizing Continuous Data in QSAR (https://doi.org/10.23645/epacomptox.19333196)

SK_030922_ACS.PDF (PDF, NA pp, 1191.319 KB)

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 03/24/2022
Record Last Revised: 07/14/2022
OMB Category: Other
Record ID: 355251