Science Inventory

Systematic Approaches for the Encoding of Chemical Groups: A Case Study

Citation:

Karamertzanis, P., G. Patlewicz, M. Sannicola, K. Friedman, AND I. Shah. Systematic Approaches for the Encoding of Chemical Groups: A Case Study. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, 37(4):600-619, (2024). https://doi.org/10.1021/acs.chemrestox.3c00411

Impact/Purpose:

This study investigates the feasibility of deriving a model to predict ECHA’s regulatory groupings. The dataset is already publicly available, but we have mapped it to DSSTox content to meet our analysis needs.

Description:

Regulatory authorities aim to organize substances into groups to facilitate prioritization within hazard and risk assessment processes. Often, such chemical groupings are not explicitly defined by structural rules or physicochemical property information. This is largely because the groupings are developed through manual expert curation, which makes updating and refining them as new substances are evaluated a practical challenge. Herein, machine learning methods were leveraged to build models that could preliminarily assign substances to predefined groups. A set of 86 groupings containing 2,184 substances, as published on the European Chemicals Agency (ECHA) website, was mapped to the U.S. Environmental Protection Agency (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database content to extract chemical and structural information. Substances were represented using Morgan fingerprints, and two machine learning approaches, k-nearest neighbor (kNN) and random forest (RF), were used to classify test substances into the 56 groups containing at least 10 substances with a structural representation in the data set. These led to mean 5-fold cross-validation test accuracies (average F1 scores) of 0.781 and 0.853, respectively. With a relative improvement of about 9%, the RF classifier was significantly more accurate than kNN (p-value = 0.001). The approach offers promise as a means of initially profiling new substances into predefined groups to facilitate prioritization efforts and streamline the assessment of new substances when earlier groupings are available. The algorithm to fit and use these models has been made available in the accompanying repository, enabling both use of the produced models and refitting of these models as new groupings become available from regulatory authorities or industry.
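
The kNN step described above can be illustrated with a minimal, standard-library sketch. It is not the code from the accompanying repository. It assumes each substance has already been reduced to the set of "on" bit indices of a fingerprint (for example, a Morgan fingerprint computed elsewhere), uses Tanimoto similarity as the neighbor metric, and macro-averages the F1 score. The abstract specifies neither the metric nor the averaging scheme, so both are assumptions, as are the toy fingerprints, the group names, and the function names `tanimoto`, `knn_predict`, and `macro_f1`. The 5-fold cross-validation loop is omitted for brevity.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def knn_predict(query, train, k=3):
    """Return the majority group among the k most similar training substances."""
    ranked = sorted(train, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-group F1 scores (averaging scheme is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical fingerprints: (set of on-bit indices, group label).
train = [
    (frozenset({1, 2, 3, 4}), "group_A"),
    (frozenset({1, 2, 3, 5}), "group_A"),
    (frozenset({2, 3, 4, 6}), "group_A"),
    (frozenset({10, 11, 12, 13}), "group_B"),
    (frozenset({10, 11, 12, 14}), "group_B"),
    (frozenset({11, 12, 13, 15}), "group_B"),
]
test = [
    (frozenset({1, 2, 3, 6}), "group_A"),
    (frozenset({10, 12, 13, 14}), "group_B"),
]

predictions = [knn_predict(fp, train, k=3) for fp, _ in test]
score = macro_f1([label for _, label in test], predictions)
print(predictions, score)
```

In practice the fingerprints would come from a cheminformatics toolkit and the classifiers from a machine learning library; the sketch only shows how similarity-based voting assigns a new substance to an existing group.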

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 04/15/2024
Record Last Revised: 05/31/2024
OMB Category: Other
Record ID: 361612