Science Inventory

The importance of data curation on QSAR Modeling - PHYSPROP open data as a case study. (QSAR 2016)

Citation:

Mansouri, K., C. Grulke, A. Richard, R. Judson, AND A. Williams. The importance of data curation on QSAR Modeling - PHYSPROP open data as a case study. (QSAR 2016). Presented at QSAR 2016, Miami, FL, June 13 - 17, 2016. https://doi.org/10.23645/epacomptox.5071291

Impact/Purpose:

Slide presentation at the QSAR 2016 meeting that will investigate the impact of data curation on the reliability of QSAR models being developed within the EPA's National Center for Computational Toxicology.

Description:

During the last few decades, many QSAR models and tools have been developed at the US EPA, including the widely used EPISuite. During this period, the arsenal of computational capabilities supporting cheminformatics has broadened dramatically with multiple software packages. These modern tools allow for more advanced techniques in terms of chemical structure representation and storage, as well as enabling automated data-mining and standardization approaches to examine and fix data quality issues.

This presentation will investigate the impact of data curation on the reliability of QSAR models being developed within the EPA's National Center for Computational Toxicology. As part of this work, we have attempted to disentangle the influence of the quality versus the quantity of data, using the Syracuse PHYSPROP database, which is partly used by the EPISuite software. We will review our automated approaches to examining key datasets related to the EPISuite data, both to validate consistency across chemical structure representations (e.g., mol file and SMILES) and identifiers (chemical names and registry numbers) and to standardize data into QSAR-ready formats prior to modeling. Our efforts to quantify and segregate data into quality categories have allowed us to evaluate the models that can be developed from these data slices and to quantify the extent to which efforts to develop high-quality datasets have the expected payoff in predictive performance. The most accurate models that we build will be accessible via our public-facing platform. This abstract does not reflect U.S. EPA policy.
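The idea of cross-checking structure representations and identifiers, then sorting records into quality categories, can be illustrated with a minimal sketch. It is not the authors' method: the function name `quality_category`, the category labels, the agreement thresholds, and the placeholder keys such as `KEY-A` are all hypothetical. It assumes each source (mol file, SMILES, chemical name, registry number) has already been resolved by some cheminformatics toolkit to a comparable structure key, with `None` marking a source that could not be resolved.

```python
from collections import Counter


def quality_category(keys):
    """Assign a hypothetical quality category to one chemical record.

    `keys` maps a source name (e.g. "mol", "smiles", "name", "casrn")
    to a structure key already derived from that source, or None when
    the source could not be resolved to a structure.
    """
    resolved = [k for k in keys.values() if k is not None]
    if not resolved:
        return "unresolved"
    # Size of the largest group of sources that agree on one structure.
    agreeing = Counter(resolved).most_common(1)[0][1]
    if agreeing == len(keys):
        return "high"    # every source points to the same structure
    if agreeing >= 2:
        return "medium"  # at least two sources agree, but not all
    return "low"         # no two sources agree


# Placeholder records; the keys stand in for real structure hashes.
records = [
    {"mol": "KEY-A", "smiles": "KEY-A", "name": "KEY-A", "casrn": "KEY-A"},
    {"mol": "KEY-B", "smiles": "KEY-B", "name": "KEY-C", "casrn": None},
    {"mol": "KEY-D", "smiles": "KEY-E", "name": None, "casrn": None},
    {"mol": None, "smiles": None, "name": None, "casrn": None},
]

categories = [quality_category(r) for r in records]
print(categories)
```

Records grouped this way could then be used to train separate models per quality slice and compare their predictive performance, which is the quality-versus-quantity comparison the abstract describes.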

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 06/17/2016
Record Last Revised: 07/10/2017
OMB Category: Other
Record ID: 336922