Science Inventory

Ensemble QSAR Modeling to Predict Multispecies Fish Toxicity Points of Departure

Citation:

Sheffield, T. AND R. Judson. Ensemble QSAR Modeling to Predict Multispecies Fish Toxicity Points of Departure. Presented at Society of Toxicology, San Antonio, TX, March 12 - 15, 2018.

Impact/Purpose:

Poster for the Society of Toxicology Meeting, March 2018

Description:

Because many new chemicals are being developed and may be introduced into aquatic ecosystems, there is a need to prioritize those with the greatest likelihood of ecological hazard for further research. Quantitative structure-activity relationship (QSAR) models can provide a useful in silico estimate of ecotoxicity for this purpose. We used the ECOTOX database to build a QSAR model that predicts toxicological endpoints across a wide array of chemicals and fish species. Because our goal is to prioritize chemicals for further study, the model maximizes domain of applicability at the cost of prediction accuracy. The model predicts the 50% lethal concentration (LC50) in mg/m3 and was built from experimental data covering 347 species of fish, 2,095 chemicals, and 33,282 experimental values. The top 50 fish species accounted for 88% of the data, and the three most common species (rainbow trout, bluegill, and fathead minnow) accounted for 41.7% of the data. Chemical inputs to the model were OPERA chemical property predictions and ToxPrint chemical fingerprints. To maximize prediction robustness, we used an ensemble of machine learners with statistically comparable accuracies, including random forest, support vector regression, k-nearest neighbors, and neural networks. Likewise, we used a range of feature selection methods and model parameters. Additionally, the model was bootstrapped by repeatedly sampling different experimental data for individual chemicals. Fivefold cross-validation and leave-one-out external validation were used to evaluate model performance. These validation approaches yielded a root-mean-squared error of 0.91 for log10 LC50. The coefficient of determination (R2) for the model predictions is 0.58. The bootstrapped predictions fall within two standard deviations of at least one experimental value 57.7% of the time. These results suggest that aggregating various kinds of chemicals and species yields predictions generally within an order of magnitude of the observed result, consistent with the inherent experimental variation; however, further refinements may increase model accuracy. This abstract does not necessarily represent EPA policy.
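
The evaluation figures quoted in the abstract (root-mean-squared error on the log10 scale, R2, and the two-standard-deviation check on bootstrapped predictions) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, the numbers in the toy lists are made up, and the reading of "within two standard deviations" as "within two standard deviations of the mean bootstrapped prediction" is an assumption.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root-mean-squared error between paired observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def covers_an_experiment(bootstrap_preds, experimental_values, n_sd=2.0):
    """True if any experimental value lies within n_sd sample standard
    deviations of the mean bootstrapped prediction (assumed interpretation)."""
    center = statistics.fmean(bootstrap_preds)
    spread = statistics.stdev(bootstrap_preds)
    return any(abs(v - center) <= n_sd * spread for v in experimental_values)


# Toy, made-up log10(LC50) values; not data from the poster.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

toy_rmse = rmse(observed, predicted)    # every residual has magnitude 0.5
toy_r2 = r_squared(observed, predicted)

# An error in log10 units converts to a fold change by exponentiating.
# Applied to the abstract's reported RMSE of 0.91:
reported_fold_error = 10 ** 0.91

# Hypothetical bootstrapped predictions for one chemical, compared with
# that chemical's (made-up) experimental values.
hit = covers_an_experiment([2.0, 2.2, 2.4], [2.5, 3.5])
miss = covers_an_experiment([2.0, 2.2, 2.4], [3.5])

print(f"toy RMSE = {toy_rmse:.2f}, toy R2 = {toy_r2:.2f}")
print(f"RMSE of 0.91 log10 units is about a {reported_fold_error:.1f}-fold error")
print(f"coverage examples: {hit}, {miss}")
```

Because the error is measured on log10 LC50, an RMSE below 1.0 corresponds to a typical error of less than a factor of ten, which matches the abstract's statement that predictions are generally within an order of magnitude of the observed result.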

URLs/Downloads:

SHEFFIELD 2018 SOT RSJ.PDF (PDF, NA pp, 428.188 KB)

SHEFFIELD_SOT_ABSTRACT V4_JC.PDF (PDF, NA pp, 80.082 KB)

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/15/2018
Record Last Revised: 07/19/2018
OMB Category: Other
Record ID: 341060