Science Inventory

The influence of data curation on QSAR Modeling – examining issues of quality versus quantity of data (SOT)

Citation:

Williams, A., K. Mansouri, A. Richard, AND C. Grulke. The influence of data curation on QSAR Modeling – examining issues of quality versus quantity of data (SOT). Presented at Society of Toxicology, New Orleans, LA, March 13 - 17, 2016. https://doi.org/10.23645/epacomptox.5176573

Impact/Purpose:

Poster presentation at the Society of Toxicology 2016 meeting

Description:

The construction of QSAR models depends critically on the quality of the available data. As part of our efforts to develop public platforms that provide access to predictive models, we have attempted to separate the influence of data quality from that of data quantity on the development and validation of QSAR models. We have focused on the widely used EPISuite software, which was initially developed over two decades ago. Specific quality issues in the EPISuite data include multiple records for the same chemical structure with different measured property values; inconsistency between the structure, chemical name, and CAS registry number within single records; SMILES strings that cannot be converted into chemical structures; hypervalency in the chemical structures; and the absence of stereochemistry for thousands of data records. Compared with the era of EPISuite development, modern cheminformatics tools offer more advanced chemical structure representation and storage, and they enable automated data validation and standardization approaches for examining data quality. This presentation will review both our manual and automated approaches to examining key datasets related to the EPISuite training and test data. These include approaches to cross-validate chemical structure representations (e.g., molfile and SMILES) and identifiers (chemical names and registry numbers), as well as approaches to standardize the data into QSAR-consumable formats for modeling. We have quantified and segregated the data into various quality categories so that we can investigate the models that can be developed from these data slices and examine to what extent efforts to develop large, high-quality datasets have the expected payoff in prediction performance. This abstract does not reflect U.S. EPA policy.
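
Two of the automated checks described above, identifier validation and detection of conflicting duplicate records, can be illustrated with a minimal sketch. This is not the authors' workflow: the record layout (`smiles`, `cas`, `value`), the function names, the flag labels, and the example values are all hypothetical, and structures are grouped by the raw SMILES string rather than by a canonical form, which a real pipeline would obtain from a cheminformatics toolkit. The only fixed rule used is the standard CAS Registry Number check digit.

```python
import re
from collections import defaultdict

# CAS Registry Numbers have the form NNNNNNN-NN-C (2 to 7 digits, 2 digits, check digit).
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the string is a well-formed CAS number with a valid check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


def conflicting_structures(records, tolerance=0.0):
    """Return structure keys whose measured values differ by more than `tolerance`."""
    values_by_structure = defaultdict(list)
    for rec in records:
        # Assumption: the raw SMILES string stands in for a canonical structure key.
        values_by_structure[rec["smiles"]].append(rec["value"])
    return {
        key
        for key, values in values_by_structure.items()
        if max(values) - min(values) > tolerance
    }


def assign_quality(records, tolerance=0.0):
    """Attach a list of quality flags (hypothetical labels) to each record."""
    conflicts = conflicting_structures(records, tolerance)
    labelled = []
    for rec in records:
        flags = []
        if not cas_checksum_ok(rec["cas"]):
            flags.append("bad_cas")
        if rec["smiles"] in conflicts:
            flags.append("conflicting_values")
        labelled.append({**rec, "flags": flags})
    return labelled


# Placeholder records: the property values are invented for illustration only.
EXAMPLE_RECORDS = [
    {"smiles": "O", "cas": "7732-18-5", "value": 1.0},
    {"smiles": "CCO", "cas": "64-17-5", "value": 2.0},
    {"smiles": "CCO", "cas": "64-17-6", "value": 3.5},  # wrong check digit
]
```

Calling `assign_quality(EXAMPLE_RECORDS)` leaves the first record unflagged, marks both `CCO` records as `conflicting_values`, and additionally marks the third as `bad_cas`. Records with an empty flag list would form the highest-quality slice in a scheme of this kind.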

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/16/2016
Record Last Revised: 03/18/2016
OMB Category: Other
Record ID: 311418