Science Inventory

An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling

Citation:

Mansouri, K., C. Grulke, A. Richard, R. Judson, AND A. Williams. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modeling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH. Taylor & Francis, Inc., Philadelphia, PA, 27(11):911-937, (2016).

Impact/Purpose:

Here we describe the development of a computational workflow, based on the KNIME platform, to curate and correct errors in the structure and identity of chemicals in public datasets used in QSAR modeling, based on comparison of names, SMILES and MolBlock records.

Description:

The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and the associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers: chemical name, CASRN, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats and identifiers, as well as various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest quality subset of the original dataset was compared with that of models built on the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further use and integration by the scientific community.
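As an illustration of the kind of identifier check such a curation step might include, the sketch below validates the check digit of a CASRN and flags records that cannot form a structure-identity pair. It is a minimal Python sketch under stated assumptions, not the published KNIME workflow: the field names (name, casrn, smiles, molblock), the flag labels, and the function names are hypothetical, and the record does not say whether the workflow performs this particular checksum test.

```python
import re
from typing import Dict, List, Optional

# CASRN layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def casrn_is_valid(casrn: str) -> bool:
    """Return True if the string has CASRN layout and a correct check digit."""
    match = CASRN_PATTERN.match(casrn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one,
    # then compare the weighted sum modulo 10 with the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


def flag_record(record: Dict[str, Optional[str]]) -> List[str]:
    """Return hypothetical curation flags for one record (illustrative only)."""
    flags = []
    # A structure-identity pair needs at least one identity field
    # (name or CASRN) and one structure field (SMILES or MolBlock).
    if not (record.get("name") or record.get("casrn")):
        flags.append("no_identity")
    if not (record.get("smiles") or record.get("molblock")):
        flags.append("no_structure")
    casrn = record.get("casrn")
    if casrn and not casrn_is_valid(casrn):
        flags.append("invalid_casrn")
    return flags
```

For example, flag_record({"name": "benzene", "casrn": "71-43-3", "smiles": "c1ccccc1"}) flags the record with "invalid_casrn", because the correct CASRN for benzene is 71-43-2. Checking that the name, SMILES, and MolBlock describe the same structure would additionally require a cheminformatics toolkit and is not attempted here.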

URLs/Downloads:

http://dx.doi.org/10.1080/1062936X.2016.1253611

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 11/25/2016
Record Last Revised: 09/27/2017
OMB Category: Other
Record ID: 337721

Organization:

U.S. ENVIRONMENTAL PROTECTION AGENCY

OFFICE OF RESEARCH AND DEVELOPMENT

NATIONAL CENTER FOR COMPUTATIONAL TOXICOLOGY