Science Inventory

Using machine learning to encode substance groups

Citation:

Karamertzanis, P., G. Patlewicz, M. Sannicola, K. Paul-Friedman, AND I. Shah. Using machine learning to encode substance groups. QSAR, Copenhagen, DENMARK, June 05 - 09, 2023. https://doi.org/10.23645/epacomptox.23527902

Impact/Purpose:

N/A

Description:

Grouping approaches that inform chemical categories and the associated read-across have been in practical use in regulatory programmes for many years. Under the auspices of the Toxic Substances Control Act (TSCA), the US Environmental Protection Agency (EPA) performs new chemical assessments using tools and approaches that include grouping chemicals with shared chemical and toxicological properties into categories. Candidate categories (NCC) have been proposed by New Chemical Program reviewers based on their experience in reviewing chemical assessments on related substances. Many of the existing 56 categories are amenable to codification on the basis of structural rules and physical property information (e.g. LogKow, MW, water solubility). A preliminary analysis of the substances in the non-confidential TSCA inventory that could be represented by a discrete chemical structure found that only 47% of them were captured by the current set of categories. The European Chemicals Agency (ECHA) also generates groups of industrial chemicals to make inferences about possible risk management measures. For each of these groups, an “Assessment of Regulatory Needs” (ARN) is carried out, and approximately 100 such screening-level assessments have already been published. This helps European Union (EU) authorities to decide on the most appropriate regulatory actions to take (if needed) to address any potential or confirmed concerns that a substance within a group may pose. ECHA chemical groups have been generated using an iterative approach that begins with chemical queries of relevant structural subfragments in ECHA’s database. All retrieved candidate group members undergo expert review, which takes into account chemistry, available hazard data and uses. The published groups cover ~2200 substances; at present ~70 groups are assessed each year, with publication scheduled shortly after each assessment concludes.
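The idea of codifying a category from structural rules and physical property information can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the fragment labels, category names and threshold values are hypothetical placeholders and do not reproduce any actual NCC definition, and structural matching is reduced to set membership rather than real substructure searching.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

INF = float("inf")


@dataclass(frozen=True)
class Substance:
    """A substance reduced to the inputs a category rule would need."""
    name: str
    fragments: FrozenSet[str]  # hypothetical structural fragment labels
    log_kow: float             # log10 octanol-water partition coefficient
    mw: float                  # molecular weight in g/mol


@dataclass(frozen=True)
class CategoryRule:
    """A category encoded as required fragments plus property ranges."""
    name: str
    required_fragments: FrozenSet[str]
    log_kow_range: Tuple[float, float] = (-INF, INF)
    mw_range: Tuple[float, float] = (-INF, INF)

    def matches(self, s: Substance) -> bool:
        kow_lo, kow_hi = self.log_kow_range
        mw_lo, mw_hi = self.mw_range
        return (
            self.required_fragments <= s.fragments  # all fragments present
            and kow_lo <= s.log_kow <= kow_hi
            and mw_lo <= s.mw <= mw_hi
        )


def assign_categories(s: Substance, rules: Sequence[CategoryRule]) -> List[str]:
    """Return the names of every rule the substance satisfies."""
    return [r.name for r in rules if r.matches(s)]


def coverage(substances: Sequence[Substance], rules: Sequence[CategoryRule]) -> float:
    """Fraction of substances captured by at least one category."""
    if not substances:
        return 0.0
    hits = sum(1 for s in substances if assign_categories(s, rules))
    return hits / len(substances)


# Hypothetical rules and substances; none of these values come from the source.
rules = [
    CategoryRule("category_A", frozenset({"frag_x"}), log_kow_range=(-INF, 5.0)),
    CategoryRule("category_B", frozenset({"frag_y", "frag_z"}), mw_range=(-INF, 1000.0)),
]
substances = [
    Substance("s1", frozenset({"frag_x"}), log_kow=2.0, mw=150.0),
    Substance("s2", frozenset({"frag_x"}), log_kow=6.5, mw=300.0),
    Substance("s3", frozenset({"frag_x", "frag_y", "frag_z"}), log_kow=3.0, mw=400.0),
    Substance("s4", frozenset({"frag_w"}), log_kow=1.0, mw=90.0),
]

for s in substances:
    print(s.name, assign_categories(s, rules))
print("coverage:", coverage(substances, rules))
```

The `coverage` function corresponds to the kind of figure quoted above (the share of an inventory captured by the current categories), computed here on toy data only. A substance may satisfy more than one rule, which is why `assign_categories` returns a list.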
Whilst the principles and criteria underpinning each group are available as narrative in dedicated sections of the public ARN documents, they have not been systematically encoded in a way that would facilitate their algorithmic implementation. As such, it is not currently possible to reproduce the same grouping approach for prospective screening and profiling of other inventories. It is also not possible to automate the allocation of newly registered substances under REACH, or substances notified under C&L or other legislative frameworks in the EU, to already formed groups. Here we present progress on efforts to codify the ARN groupings for their systematic use. The overlap with the NCC categories is also being examined in order to derive a consolidated set of chemical categories. CAS identifiers and names from the ARN groups were mapped to Distributed Structure-Searchable Toxicity (DSSTox) identifiers and linked content from EPA’s CompTox Chemicals Dashboard to retrieve structural information. Of the 2184 records extracted, DSSTox identifiers were available for 1850 substances. A number of substances were reaction mixtures or inorganics, but structural information was available for 1284 substances. This will permit several approaches to be investigated, including the feasibility of deriving maximum common substructure fragments and supervised machine learning approaches, such as EPA’s Generalised Read-Across (GenRA), to predict group membership.
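As a rough illustration of how supervised prediction of group membership from structure could work, the sketch below scores a query substance against already grouped substances using similarity-weighted nearest neighbours. It is written in the spirit of similarity-weighted read-across but is not EPA's GenRA implementation. The assumptions are: fingerprints are represented as plain sets of integer feature identifiers, similarity is the Jaccard (Tanimoto) index, and the function names, group labels and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: set of "on" feature ids


def jaccard(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard (Tanimoto) similarity between two set-based fingerprints."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def predict_group(
    query: Fingerprint,
    labelled: List[Tuple[Fingerprint, str]],
    k: int = 3,
) -> Dict[str, float]:
    """Score each group by the similarity-weighted share of the k nearest neighbours."""
    # Rank all labelled substances by similarity to the query, keep the top k.
    neighbours = sorted(
        ((jaccard(query, fp), group) for fp, group in labelled),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return {}  # no structural overlap with any neighbour: no prediction
    scores: Dict[str, float] = defaultdict(float)
    for sim, group in neighbours:
        scores[group] += sim / total
    return dict(scores)


# Hypothetical training data: fingerprints already allocated to two groups.
labelled = [
    (frozenset({1, 2, 3}), "group_A"),
    (frozenset({1, 2, 4}), "group_A"),
    (frozenset({7, 8, 9}), "group_B"),
    (frozenset({7, 8, 10}), "group_B"),
]

query = frozenset({1, 2, 3, 4})
scores = predict_group(query, labelled, k=3)
best = max(scores, key=scores.get)
print(best, scores)
```

In practice the fingerprints would be computed from the structures retrieved via the DSSTox mapping, and a score threshold would be needed to decide when a substance belongs to no existing group; both are left out of this sketch.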

URLs/Downloads:

DOI: Using machine learning to encode substance groups

POSTER.PDF (PDF, NA pp, 3793.38 KB)

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 06/09/2023
Record Last Revised: 06/15/2023
OMB Category: Other
Record ID: 358118