Science Inventory

The influence of data curation on QSAR Modeling - examining issues of quality versus quantity of data (American Chemical Society)

Citation:

Mansouri, K., C. Grulke, A. Richard, AND A. Williams. The influence of data curation on QSAR Modeling - examining issues of quality versus quantity of data (American Chemical Society). Presented at American Chemical Society, San Diego, CA, March 13 - 17, 2016. https://doi.org/10.23645/epacomptox.5058481

Impact/Purpose:

Presentation at the American Chemical Society meeting in San Diego, CA.

Description:

This presentation will examine the impact of data quality on the construction of QSAR models being developed within the EPA's National Center for Computational Toxicology. We have developed a public-facing platform to provide access to predictive models. As part of this work we have attempted to disentangle the influence of the quality versus the quantity of data available to develop and validate QSAR models. We will present specific examples of data quality issues underlying the widely used EPISuite software, which was initially developed over two decades ago. Relative to the era of EPISuite development, modern cheminformatics tools offer more advanced capabilities for chemical structure representation and storage, and they enable automated data validation and standardization approaches for examining data quality. This presentation reviews both our manual and automated approaches to examining key datasets related to the EPISuite training and test data, including approaches to validate across chemical structure representations (e.g., mol file and SMILES) and identifiers (chemical names and registry numbers), and approaches to standardize data into QSAR-consumable formats for modeling. Our efforts to quantify and segregate data into quality categories have allowed us to investigate the models that can be developed from these data slices and to examine to what extent efforts to develop large, high-quality datasets have the expected pay-off in terms of prediction performance. This abstract does not reflect U.S. EPA policy.
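As an illustration of the kind of automated identifier check and quality segregation described above, the following is a minimal Python sketch, not the authors' actual workflow. It validates a CAS Registry Number using the standard check-digit rule and then assigns a record to a quality category. The function names, the category labels, and the assumption that a structure key (for example an InChIKey) has already been computed separately from the mol file and from the SMILES are all hypothetical and are not taken from the presentation.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N (2 to 7 digits, 2 digits, 1 check digit).
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(casrn: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a valid check digit."""
    match = CAS_PATTERN.match(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # before the check digit; the weighted sum modulo 10 must equal the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


def quality_category(casrn: str, key_from_mol: str, key_from_smiles: str) -> str:
    """Assign a hypothetical quality category to one record.

    key_from_mol and key_from_smiles are assumed to be structure keys
    computed elsewhere from the two structure representations.
    """
    if not cas_checksum_ok(casrn):
        return "invalid_casrn"
    if key_from_mol != key_from_smiles:
        return "structure_mismatch"
    return "consistent"


if __name__ == "__main__":
    # 7732-18-5 is the CAS RN for water; the keys here are placeholders.
    print(quality_category("7732-18-5", "KEY-A", "KEY-A"))
    print(quality_category("7732-18-5", "KEY-A", "KEY-B"))
    print(quality_category("7732-18-4", "KEY-A", "KEY-A"))
```

Records in each category could then be used as separate data slices for model building, which is one way to compare the effect of data quality against data quantity.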

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 03/17/2016
Record Last Revised: 04/04/2016
OMB Category: Other
Record ID: 311660