Science Inventory

Automated workflows for data curation and standardization of chemical structures for QSAR modeling

Citation:

Mansouri, K., A. McEachran, C. Grulke, A. Richard, R. Judson, AND A. Williams. Automated workflows for data curation and standardization of chemical structures for QSAR modeling. American Chemical Society Spring Meeting, New Orleans, LA, March 18 - 22, 2018.

Impact/Purpose:

Abstract for presentation at the ACS Spring meeting. Here we describe the development of automated KNIME workflows to both assist in the curation of data and to standardize the chemical structures according to a set of standard rules.

Description:

Large collections of chemical structures and associated experimental data are publicly available and can be used to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and the associated experimental data. Here we describe the development of automated KNIME workflows to both assist in the curation of data and to standardize the chemical structures according to a set of standard rules. The publicly available PHYSPROP physicochemical properties and environmental fate datasets were used as case studies to reveal commonly encountered errors and to develop a set of rules to correct them. The workflow first assembles structure–identity pairs using up to four provided chemical identifiers, including chemical names, CASRNs, SMILES, and MolBlocks. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a structure standardization KNIME workflow was used to generate “QSAR-ready” forms prior to calculating molecular descriptors. This workflow performs a series of operations on the 2D structures, including desalting, stripping stereochemistry, standardizing tautomers and nitro groups, correcting valence, neutralizing when possible, and removing duplicates. A machine learning procedure was applied to evaluate the impact of this curation process. The models based on the curated data and standardized structures showed statistically improved predictive performance. These workflows were used to curate and standardize the full list of PHYSPROP datasets that were used to develop the OPERA models available on the EPA’s CompTox Chemistry Dashboard (https://comptox.epa.gov). They were also applied to thousands of other datasets that were used in international consortia such as CERAPP and CoMPARA.
The QSAR-ready workflow was modified to generate “MS-ready” structures to support mass spectrometry non-targeted analysis. All workflows, data, and models are open-source and freely available on GitHub (https://github.com/kmansouri) for further use and integration by the scientific community. This work does not reflect U.S. EPA policy.
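
The abstract does not say how identifier errors were detected inside the KNIME workflows. As one illustration of an automated identifier check, the sketch below validates the check digit of a CASRN using the standard CAS rule: the final digit must equal the weighted sum of the preceding digits, counted from the right, modulo 10. The function names and the record layout (a dict with a "casrn" key) are assumptions made for this example and are not taken from the authors' workflows.

```python
import re

# CASRN layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CASRN_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    match = CASRN_PATTERN.fullmatch(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    weighted_sum = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit


def flag_invalid_casrns(records):
    """Return the records whose 'casrn' field fails validation.

    The record layout is hypothetical and used only for illustration.
    """
    return [record for record in records if not is_valid_casrn(record["casrn"])]


if __name__ == "__main__":
    sample = [
        {"name": "water", "casrn": "7732-18-5"},
        {"name": "caffeine", "casrn": "58-08-2"},
        {"name": "mistyped water", "casrn": "7732-18-4"},
    ]
    for record in flag_invalid_casrns(sample):
        print("Invalid CASRN:", record["name"], record["casrn"])
```

A check like this only catches malformed or mistyped registry numbers. It cannot tell whether a valid CASRN actually belongs to the structure it is paired with, which is the kind of mismatch the identifier cross-comparison described above is meant to reveal.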

URLs/Downloads:

AUTOMATED_WORKFLOWS_ACS2018_AJW_ABSTRACT.PDF (PDF, NA pp, 48.958 KB)

AUTOMATED WORKFLOWS_FINAL.PDF (PDF, NA pp, 2142.233 KB)

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 03/22/2018
Record Last Revised: 05/16/2018
OMB Category: Other
Record ID: 340232