Science Inventory

The Role of Feature Selection and Statistical Weighting in Predicting In Vivo Toxicity Using In Vitro Assay and QSAR Data (SOT)

Citation:

Wignall, J., M. Martin, A. Varghese, AND J. Trgovcich. The Role of Feature Selection and Statistical Weighting in Predicting In Vivo Toxicity Using In Vitro Assay and QSAR Data (SOT). Presented at Society of Toxicology annual meeting, New Orleans, LA, March 13 - 17, 2016. https://doi.org/10.23645/epacomptox.5155681

Impact/Purpose:

Poster presentation at the SOT 2016 annual meeting. Our study assesses the value of both in vitro assay and quantitative structure-activity relationship (QSAR) data in predicting in vivo toxicity, using numerous statistical models and data-processing approaches.

Description:

Our study assesses the value of both in vitro assay and quantitative structure-activity relationship (QSAR) data in predicting in vivo toxicity, using numerous statistical models and data-processing approaches. Our models are built on datasets of (i) 586 chemicals for which both in vitro and in vivo data are currently available in EPA’s ToxCast and ToxRefDB databases, respectively, and (ii) 769 chemicals for which both QSAR data and in vivo data exist. As in a previous study based on just 309 chemicals (Thomas et al. 2012), we find that, after the continuous values in each dataset are converted to binary values, the majority of the more than 1,000 in vivo endpoints are poorly predicted. Even for the endpoints that are well predicted (about 40 with an F1 score >0.75), imbalances in the in vivo endpoint data or cytotoxicity across in vitro assays may be skewing results. To better account for these considerations, we examine best practices in data preprocessing and model fitting in real-world contexts where data are rife with imperfections. We discuss options for dealing with missing data, including omitting observations, aggregating variables, and imputing values. We also examine the impacts of feature selection (from both a statistical and a biological perspective) on performance and efficiency. In addition, we weight outcome data to reduce endpoint imbalances, which accounts for potential chemical selection bias, and then reassess performance. For example, initial weighting strategies drastically decrease the number of models with an F1 score >0.75 (to 6), but these models are better able to predict nontoxic chemicals in certain contexts. The results of these analyses can be used to inform screening or other decisions, especially in the context of future data enhancements, such as more biologically relevant in vitro assays, additional in vivo endpoint data, and extension of chemical space.
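
The point about endpoint imbalance skewing F1 scores can be illustrated with a small sketch. It is not the authors' code: the data are synthetic, the function names (f1_score, balanced_weights) are hypothetical, and the inverse-frequency weighting shown is only one plausible weighting strategy, assumed here for illustration. A trivial model that labels every chemical toxic is scored on an endpoint where 90 of 100 chemicals are toxic.

```python
from collections import Counter


def f1_score(y_true, y_pred, positive=1, weights=None):
    """F1 for the `positive` class, with optional per-sample weights."""
    if weights is None:
        weights = [1.0] * len(y_true)
    tp = fp = fn = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if pred == positive and truth == positive:
            tp += w  # weighted true positive
        elif pred == positive:
            fp += w  # weighted false positive
        elif truth == positive:
            fn += w  # weighted false negative
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def balanced_weights(y_true):
    """Inverse-frequency weights so each class carries equal total weight.

    Assumption: this is an illustrative scheme, not the poster's method.
    """
    counts = Counter(y_true)
    n_samples, n_classes = len(y_true), len(counts)
    return [n_samples / (n_classes * counts[t]) for t in y_true]


# Synthetic, imbalanced endpoint: 90 toxic (1) and 10 nontoxic (0) chemicals.
y_true = [1] * 90 + [0] * 10
# Trivial model that calls every chemical toxic.
y_pred = [1] * 100

unweighted = f1_score(y_true, y_pred)
nontoxic = f1_score(y_true, y_pred, positive=0)
weighted = f1_score(y_true, y_pred, weights=balanced_weights(y_true))

print(f"F1 (toxic class, unweighted):   {unweighted:.3f}")
print(f"F1 (nontoxic class):            {nontoxic:.3f}")
print(f"F1 (toxic class, class-weighted): {weighted:.3f}")
```

In this synthetic case the unweighted F1 is about 0.947, well above the 0.75 threshold, even though the model never identifies a nontoxic chemical (F1 of 0 for that class). With class-balancing weights the same predictions score about 0.667, which shows how weighting can push an apparently well-predicted endpoint below the threshold.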

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/17/2016
Record Last Revised: 05/24/2017
OMB Category: Other
Record ID: 336399