Science Inventory

Systematizing Data Gathering for PFAS Toxicity and Property Modeling

Citation:

Sinclair, G., C. Ramsland, T. Martin, AND A. Williams. Systematizing Data Gathering for PFAS Toxicity and Property Modeling. Society of Toxicology 2021 Virtual Annual Meeting, Virtual, March 12 - 26, 2021. https://doi.org/10.23645/epacomptox.14470800

Impact/Purpose:

To provide a systematic way to gather experimental physicochemical property data for PFAS chemicals for use in QSAR model building.

Description:

A vast amount of chemical toxicology and property data is publicly accessible via the Internet. However, these data are often uncurated, unreferenced, and distributed across many data sources. Additionally, despite the proliferation of data, certain classes of chemicals remain poorly characterized experimentally, notably per- and polyfluoroalkyl substances (PFAS). This project seeks to develop a systematic architecture to consolidate existing chemical data for use in quantitative structure-activity relationship (QSAR) modeling. Such a process will increase both the quality and the quantity of data available to model the toxicology and properties of PFAS, and will permit identification of present gaps in data collection. Thirteen chemical data sources were selected, including academic (e.g., PubChem), governmental (e.g., ECHA), and commercial (e.g., LookChem) providers. In addition, three sets of literature data were manually compiled for inclusion. Initially, all data were collected and stored in their original format. Using tailored processing tools developed in Java for the project, these data were translated to a structured intermediate JSON format that retains all original information. They were then standardized to a final JSON format that highlights the specific properties of interest for QSAR modeling and normalizes units, measurement methods, and remarks for each property. The resulting 199,000+ standardized data points were then integrated into a single SQL database. The unique combination of chemical identifiers for each data point can be mapped to a substance ID in the EPA’s Distributed Structure-Searchable Toxicology Database (DSSTox), allowing those data to be associated with a QSAR-ready SMILES string and used to retrieve molecular descriptors for model development. This project furthers understanding of the gaps in available chemical property data.
Of particular concern is the known problem of modeled data that are not disclosed as such and are therefore improperly included in sources of experimental data. For instance, in the EPA’s High Production Volume Information System, only ~230 entries are appropriately labeled as estimated, but up to ~2,000 more include indications in their remarks that the results may be predicted rather than experimentally determined. The consolidated data, curated procedurally for this and other problems, lay the groundwork for a broad effort to model the toxicological and physicochemical properties of PFAS, and the architecture and tools developed in support of that effort will permit ongoing expansion as new data continue to become available.
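
The standardization and remark-screening steps described above might look something like the following minimal Java sketch. It is not the project's actual code. The class and field names, the keyword list used to spot possibly estimated values, and the choice of degrees Celsius as the normalized temperature unit are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical illustration of record screening and unit normalization. */
final class PropertyRecordScreen {

    /** Simplified stand-in for one standardized data point (field names are assumed). */
    static final class DataPoint {
        final String source;
        final String property;
        final double value;
        final String units;
        final String remarks;

        DataPoint(String source, String property, double value, String units, String remarks) {
            this.source = source;
            this.property = property;
            this.value = value;
            this.units = units;
            this.remarks = remarks;
        }
    }

    // Assumed keyword list; a real curation effort would tune this per data source.
    private static final Pattern ESTIMATE_HINT =
            Pattern.compile("\\b(estimated|predicted|modell?ed|qsar)\\b", Pattern.CASE_INSENSITIVE);

    /** Returns true when the free-text remarks hint that the value may not be experimental. */
    static boolean looksEstimated(DataPoint p) {
        return p.remarks != null && ESTIMATE_HINT.matcher(p.remarks).find();
    }

    /** Converts a temperature in C, K, or F to degrees Celsius. */
    static double toCelsius(double value, String units) {
        String u = units.trim().toUpperCase(Locale.ROOT);
        if (u.equals("C")) {
            return value;
        }
        if (u.equals("K")) {
            return value - 273.15;
        }
        if (u.equals("F")) {
            return (value - 32.0) * 5.0 / 9.0;
        }
        throw new IllegalArgumentException("Unsupported temperature unit: " + units);
    }

    /** Collects the data points whose remarks should be reviewed before modeling. */
    static List<DataPoint> needsReview(List<DataPoint> points) {
        List<DataPoint> flagged = new ArrayList<>();
        for (DataPoint p : points) {
            if (looksEstimated(p)) {
                flagged.add(p);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Made-up example records, not real data from any source.
        List<DataPoint> points = Arrays.asList(
                new DataPoint("SourceA", "Melting Point", 300.0, "K", "Measured by capillary method"),
                new DataPoint("SourceB", "Melting Point", 80.0, "F", "Value predicted with a QSAR model"));
        for (DataPoint p : needsReview(points)) {
            System.out.println(p.source + " " + p.property + ": review remarks \"" + p.remarks + "\"");
        }
    }
}
```

Keyword matching of this kind only flags records for review. It cannot confirm that a value is modeled, so flagged entries would still need to be checked against their original source.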

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/26/2021
Record Last Revised: 04/22/2021
OMB Category: Other
Record ID: 351454