Science Inventory

Development of skin sensitization, skin irritation, and eye irritation models using online data sources and Python-based machine learning

Citation:

Ramsland, C., G. Sinclair, T. Martin, AND A. Williams. Development of skin sensitization, skin irritation, and eye irritation models using online data sources and Python-based machine learning. American Chemical Society (ACS) Fall 2021 National Meeting, Atlanta, GA, August 22 - 26, 2021. https://doi.org/10.23645/epacomptox.17431424

Impact/Purpose:

Presentation at the American Chemical Society (ACS) Fall 2021 National Meeting, August 2021. The skin sensitization models developed here are intended to reduce animal testing and are important for chemical prioritization under TSCA.

Description:

In 2018, US EPA released a draft policy to reduce animal testing for skin sensitization. The goal of this study was to assemble experimental data from online data sources and develop QSAR (quantitative structure-activity relationship) models to predict skin sensitization, skin irritation, and eye irritation. Data was extracted from a variety of online data sources including eChemPortal, NICEATM, QSAR Toolbox, and the open literature. Using Java code, the data was converted to a consistent format and stored in an SQLite database. Each record was mapped to a unique substance ID in EPA’s Distributed Structure-Searchable Toxicity (DSSTox) Database. The substance ID allows each record to be associated with a “QSAR-ready” SMILES string, which is then used to generate molecular descriptors. Data set records consist of an ID value, a property value, and the molecular descriptor values. Records that shared the same two-dimensional InChIKey were merged. Discordant records were omitted, and the data sets were randomly split into training and prediction sets. For the skin irritation models, corrosive behavior was accounted for with two layers of binary classification derived from intervals of the primary irritation index endpoint: the first distinguishes active from inactive substances, and the second, within the active set, distinguishes irritant from corrosive substances. Models were built using methods including random forest, support vector machines (SVM), XGBoost, deep neural networks (DNN), and k-nearest neighbors (kNN). We optimized the hyperparameters for each model by selecting the set that performed best in internal cross-validation of the training set or across many different external validation sets. We optimized the classification error, gamma, and nu parameters for the SVM method and the learning rate, estimator count, and maximum depth for the XGBoost method. Consensus models averaging the results from the approaches listed above were also evaluated.
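
The record merging, discordance filtering, and random splitting described above can be illustrated with a minimal sketch. This is not the authors' code (their data preparation was done in Java). It assumes that the "two-dimensional InChIKey" is the first hyphen-separated block of the key, that labels are binary, and that 80% of structures go to the training set; the function name `merge_and_split`, the split fraction, and the sample keys are all hypothetical.

```python
import random
from collections import defaultdict


def merge_and_split(records, train_fraction=0.8, seed=42):
    """Merge (inchikey, label) records by 2D key, drop conflicts, split randomly.

    Assumption: the first InChIKey block stands in for the 2D structure.
    Assumption: train_fraction=0.8 is illustrative, not taken from the study.
    """
    # Collect every label reported for each two-dimensional structure.
    labels_by_key = defaultdict(set)
    for inchikey, label in records:
        two_d_key = inchikey.split("-")[0]
        labels_by_key[two_d_key].add(label)

    # Keep one merged record per structure; omit structures whose labels disagree.
    merged = sorted(
        (key, next(iter(labels)))
        for key, labels in labels_by_key.items()
        if len(labels) == 1
    )

    # Seeded shuffle so the random split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(merged)
    n_train = round(train_fraction * len(merged))
    return merged[:n_train], merged[n_train:]


# Hypothetical placeholder keys, not real substances.
records = [
    ("AAAAAAAAAAAAAA-UHFFFAOYSA-N", 1),
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", 1),  # same 2D block, same label: merged
    ("CCCCCCCCCCCCCC-UHFFFAOYSA-N", 0),
    ("CCCCCCCCCCCCCC-UHFFFAOYSA-N", 1),  # conflicting labels: omitted
    ("DDDDDDDDDDDDDD-UHFFFAOYSA-N", 0),
    ("EEEEEEEEEEEEEE-UHFFFAOYSA-N", 1),
    ("FFFFFFFFFFFFFF-UHFFFAOYSA-N", 0),
]
train_set, prediction_set = merge_and_split(records)
```

In a real data set each merged record would also carry its molecular descriptor values; they are left out here to keep the sketch short.
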
The views expressed here are those of the authors and do not necessarily represent the views or the policies of the U.S. Environmental Protection Agency.
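
The two-layer skin irritation scheme and the consensus averaging mentioned in the abstract can be combined in a short sketch. It assumes each modeling method returns a probability for the positive class of each layer, that the consensus is the plain mean of those probabilities, and that 0.5 is the decision threshold. None of these details are stated in the abstract, and the names `consensus` and `two_layer_call` are hypothetical.

```python
from statistics import mean


def consensus(probabilities, threshold=0.5):
    """Average per-method probabilities and threshold the mean.

    Assumption: unweighted mean and a 0.5 cutoff, chosen for illustration.
    """
    return int(mean(probabilities) >= threshold)


def two_layer_call(active_probs, corrosive_probs):
    """Layer 1: active vs. inactive. Layer 2 (actives only): corrosive vs. irritant."""
    if not consensus(active_probs):
        return "inactive"
    return "corrosive" if consensus(corrosive_probs) else "irritant"


# Hypothetical per-method probabilities for one substance
# (for example, one value each from three of the modeling methods).
call = two_layer_call([0.9, 0.7, 0.4], [0.2, 0.3, 0.6])
```

The second layer is only consulted when the first layer calls a substance active, which mirrors the "within the active set" wording of the abstract.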

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 08/26/2021
Record Last Revised: 12/23/2021
OMB Category: Other
Record ID: 353763