Science Inventory

Development of models to predict physicochemical properties of PFAS

Citation:

Martin, T., G. Sinclair, C. Ramsland, AND A. Williams. Development of models to predict physicochemical properties of PFAS. 2021 Fall ACS Meeting, Cincinnati, OH, August 22 - 26, 2021. https://doi.org/10.23645/epacomptox.15405891

Impact/Purpose:

Presentation to the American Chemical Society (ACS) Fall 2021 National Meeting, August 2021. The work gathers experimental data and builds models to predict physicochemical properties for PFAS. These properties are important to the program office for registering chemicals, and only limited data for PFAS are available in the literature.

Description:

A vast amount of data is publicly available on the Internet. However, these data are often uncurated, unreferenced, and distributed across many data sources. Additionally, despite the proliferation of data, certain classes of chemicals remain poorly characterized experimentally, notably per- and polyfluoroalkyl substances (PFAS). Our goal was to develop quantitative structure-activity relationship (QSAR) models to predict physicochemical properties for PFAS and so fill these data gaps. Data were extracted from a variety of online sources, including PubChem, eChemPortal, LookChem, OCHEM, EPI Suite, and others, as well as from the open literature. Using Java code, the data were converted to a consistent format and stored in an SQLite database. Each record was mapped to a unique substance ID in EPA’s Distributed Structure-Searchable Toxicology (DSSTox) Database. The substance ID associates each record with a “QSAR-ready” SMILES string, which is then used to generate molecular descriptors. Each data set record consists of an ID value, a property value, and the molecular descriptor values. Records that shared the same two-dimensional InChIKey were merged using the median property value. A five-fold splitting analysis was performed on the overall data set, and records that were poorly predicted were rechecked against the original data. Invalid records (e.g., solubility in the wrong solvent) were removed. After the invalid records were removed, each data set was randomly split into a training set and a prediction set. For each endpoint, two sets of models were built: (1) local models using only the PFAS in the training set and (2) global models using all available compounds. Results were compared for the PFAS in the prediction set. Models were built using methods including random forest, support vector machines, and k-nearest neighbors. Consensus models that average the predictions of the individual methods were also evaluated.
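As an illustration only, three of the data-handling steps described above (merging duplicate structures by median, the random training/prediction split, and consensus averaging) might be sketched as follows. This is not the authors' code. The structure keys, property values, the 20% prediction fraction, the random seed, and all function names are hypothetical assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean, median


def merge_by_key(records):
    """Collapse records that share a structure key, keeping the median value."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return {key: median(values) for key, values in grouped.items()}


def random_split(keys, prediction_fraction=0.2, seed=0):
    """Randomly divide keys into (training, prediction) lists.

    The 20% default and the fixed seed are assumptions for this sketch;
    the abstract does not state the split ratio.
    """
    shuffled = sorted(keys)
    random.Random(seed).shuffle(shuffled)
    n_pred = max(1, round(len(shuffled) * prediction_fraction))
    return shuffled[n_pred:], shuffled[:n_pred]


def consensus(predictions):
    """Average the predictions of the individual methods for one compound."""
    return mean(list(predictions.values()))


# Hypothetical records: (placeholder structure key, property value).
records = [
    ("KEY-A", 1.0), ("KEY-A", 3.0), ("KEY-A", 2.0),  # three sources, one structure
    ("KEY-B", -0.5),
    ("KEY-C", 4.0), ("KEY-C", 6.0),
    ("KEY-D", 0.25),
    ("KEY-E", 1.5),
]

merged = merge_by_key(records)
training_keys, prediction_keys = random_split(merged.keys())

# Hypothetical per-method predictions for a single compound.
example_consensus = consensus({"rf": 1.0, "svm": 2.0, "knn": 3.0})
```

In this sketch, a structure reported by several sources contributes a single merged value, so one outlying measurement has less influence than it would under a mean.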
The views expressed here are those of the authors and do not necessarily represent the views or the policies of the U.S. Environmental Protection Agency.

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 08/26/2021
Record Last Revised: 08/25/2021
OMB Category: Other
Record ID: 352642