Establishing Best Practices for Water Solubility Dataset Curation

Citation:

Lowe, C., G. Sinclair, C. Ramsland, T. Martin, C. Grulke, AND A. Williams. Establishing Best Practices for Water Solubility Dataset Curation. American Chemical Society (ACS) Fall 2021 National Meeting, Virtual, NC, August 22 - 26, 2021. https://doi.org/10.23645/epacomptox.15420534

Impact/Purpose:

Presentation to the American Chemical Society (ACS) Fall 2021 National Meeting, August 2021. The US-EPA Center for Computational Toxicology and Exposure (CCTE) has been generating data and building software applications and web-based chemistry databases for over a decade. To support our efforts to develop new approaches to prioritize chemicals based on potential human health risks, we aggregate and curate data streams of various types for use in prediction models. Despite data collection efforts, there will always be experimental data gaps for certain chemicals. QSAR (quantitative structure-activity relationship) models are often employed to fill these gaps. QSAR models will be developed using a variety of machine learning approaches. The results of these efforts will be of direct benefit to program and regional offices as well as the greater scientific community.

Description:

Many water solubility datasets are currently available in public resources. The ease of accessing these data and the overall quality of the datasets (e.g. machine-readable formatting, inclusion of experimental conditions) are highly variable. A number of issues have been discovered during the assembly, integration, and review of these datasets, including conflicting chemical identifiers, incorrect structural representations, and multicomponent mixtures masquerading as single molecules. Rectifying these discrepancies will be shown to lead to a significant improvement in QSPR (quantitative structure-property relationship) models. Our intention is to develop standard workflows and provide guidance detailing how to correct the observed issues. This workflow will ultimately be extended to the curation of other physicochemical property datasets and, ideally, to environmental fate and transport data and other relevant chemical datasets. The culmination of this work is a curated water solubility dataset covering over 50,000 unique organic compounds from nine online databases, totaling over 100,000 measurements. Machine learning QSPR modeling results will also be presented to show the importance of curating both the chemical identifiers and the solubility values. This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.
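
Two of the issues named in the description, conflicting chemical identifiers and multicomponent mixtures recorded as single molecules, can be illustrated with a minimal Python sketch. This is not the authors' workflow. The record layout (`casrn`, `smiles`, `source`), the function names, and the example records are all assumptions made for illustration. The mixture check relies only on the SMILES convention that `.` separates disconnected components, and the conflict check compares raw SMILES strings, so a real workflow would first standardize structures with a cheminformatics toolkit.

```python
from collections import defaultdict


def is_multicomponent(smiles: str) -> bool:
    """Heuristic: in SMILES, '.' separates disconnected components
    (salts, solvates, mixtures), so its presence suggests a multicomponent record."""
    return "." in smiles


def find_identifier_conflicts(records):
    """Return CAS numbers that are paired with more than one distinct SMILES string.

    Assumption: SMILES are compared as plain text. Two different strings can
    encode the same structure, so this may over-report without canonicalization.
    """
    smiles_by_cas = defaultdict(set)
    for rec in records:
        smiles_by_cas[rec["casrn"]].add(rec["smiles"])
    return {cas: sorted(s) for cas, s in smiles_by_cas.items() if len(s) > 1}


def flag_records(records):
    """Attach a list of curation flags to a copy of each record."""
    conflicts = find_identifier_conflicts(records)
    flagged = []
    for rec in records:
        flags = []
        if is_multicomponent(rec["smiles"]):
            flags.append("multicomponent")
        if rec["casrn"] in conflicts:
            flags.append("identifier_conflict")
        flagged.append({**rec, "flags": flags})
    return flagged


# Hypothetical records, invented for illustration (not taken from the dataset).
EXAMPLE_RECORDS = [
    {"casrn": "71-43-2", "smiles": "c1ccccc1", "source": "db_a"},
    {"casrn": "71-43-2", "smiles": "C1CCCCC1", "source": "db_b"},  # wrong structure
    {"casrn": "127-09-3", "smiles": "CC(=O)[O-].[Na+]", "source": "db_a"},  # salt
]

if __name__ == "__main__":
    for rec in flag_records(EXAMPLE_RECORDS):
        print(rec["casrn"], rec["source"], rec["flags"])
```

Flagging records rather than deleting them keeps the decision about each discrepancy (correct, merge, or exclude) available for later review.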

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 08/26/2021
Record Last Revised: 08/25/2021
OMB Category: Other
Record ID: 352643