Science Inventory

Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure-Activity Relationship and Machine Learning Methods


Zang, Q., D. Rotroff, AND R. Judson. Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure-Activity Relationship and Machine Learning Methods. Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, 53:3244-3261, (2013).


Paper modeling estrogenic activity via QSAR and machine learning methods.


ABSTRACT: There are thousands of environmental chemicals subject to regulatory decisions for endocrine disrupting potential. A promising approach to manage this large universe of untested chemicals is to use a prioritization filter that combines in vitro assays with in silico QSAR models to identify putative active compounds for specific pathways. These can then be followed up with more detailed in vitro and in vivo tests. The ToxCast and Tox21 programs have tested ~8,200 chemicals in a broad screening panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity, as well as for multiple targets related to other adverse outcome pathways. The present work uses this large in vitro data set to develop in silico QSAR models using machine learning (ML) methods and a novel approach to manage the imbalanced data sets seen in all targets we have tested. This basic approach is designed to be readily extensible to other targets of ToxCast/Tox21 testing. Training compounds from the ToxCast project were classified as active or inactive (binding or non-binding) based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays using a variety of readout technologies and cell types. A total of 1,537 chemicals from ToxCast were used to derive and optimize the binary classification models, while 5,073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate the model performance. The imbalanced distribution of active and inactive chemicals within the HTS data set can result in over- and under-sampling biases. Therefore, we developed a cluster-selection strategy to minimize information loss and increase predictive performance, and compared this strategy to three currently popular techniques for handling imbalanced data sets in ML applications: cost-sensitive learning, over-sampling of the minority class, and under-sampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM), with 51 molecular descriptors from QikProp and 4,328 structural fingerprints as explanatory variables. A random forest (RF) feature selection method was used to extract the structural features most relevant to ER activity. The ML methods were investigated and compared regarding their simplicity, interpretability, and predictive ability. The performance was evaluated using various metrics, including overall accuracy, sensitivity, specificity, G-mean, and area under the receiver operating characteristic (ROC) curve (AUC). The best model was obtained using SVM in combination with a set of descriptors identified from a large set via the RF algorithm; it recognized active and inactive compounds with accuracies of 76.1% and 82.8%, respectively, and achieved a total accuracy of 81.6% on the internal test set and 70.8% on the external test set. These results demonstrate that a combination of high-quality experimental data and ML methods can lead to robust models with excellent predictive accuracy that are potentially useful for facilitating the virtual screening of chemicals for environmental risk assessment.

Disclaimer: The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
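The evaluation metrics named in the abstract (overall accuracy, sensitivity, specificity, and G-mean) can be illustrated with a short sketch. It is not the authors' code: the function name `binary_metrics` and the confusion-matrix counts are hypothetical, chosen only to show why G-mean is informative when inactive chemicals far outnumber active ones. The last lines assume that the reported per-class accuracies of 76.1% (actives) and 82.8% (inactives) correspond to sensitivity and specificity, and that G-mean is their geometric mean, which is the usual definition.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute common binary-classification metrics from confusion counts.

    tp/fn: active chemicals predicted active/inactive.
    tn/fp: inactive chemicals predicted inactive/active.
    """
    sensitivity = tp / (tp + fn)  # fraction of actives recognized
    specificity = tn / (tn + fp)  # fraction of inactives recognized
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "g_mean": g_mean,
    }


# Hypothetical imbalanced test set: 50 actives, 200 inactives.
balanced_model = binary_metrics(tp=40, fn=10, tn=160, fp=40)

# A trivial model that labels every chemical inactive reaches the same
# overall accuracy on these counts, but its G-mean collapses to zero.
trivial_model = binary_metrics(tp=0, fn=50, tn=200, fp=0)

# G-mean implied by the per-class accuracies reported in the abstract,
# under the assumption stated above.
reported_g_mean = math.sqrt(0.761 * 0.828)

print(balanced_model)
print(trivial_model)
print(round(reported_g_mean, 3))
```

On these illustrative counts both models have the same overall accuracy, yet only the first recognizes any active compounds. This is the kind of distortion that motivates reporting sensitivity, specificity, and G-mean alongside accuracy for imbalanced data.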

Record Details:

Product Published Date: 11/26/2013
Record Last Revised: 02/23/2015
OMB Category: Other
Record ID: 306571