Science Inventory

Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls

Citation:

Schymanski, E., M. Stravs, T. Schulze, AND A. Williams. Automated Structure Annotation and Curation for MassBank: Potential and Pitfalls. Presented at the American Chemical Society Spring National Meeting, New Orleans, LA, March 18-22, 2018.

Impact/Purpose:

Abstract for a presentation at the ACS Spring meeting. The presentation reflects on the effectiveness of the original RMassBank concept and on both the potential and the pitfalls of using automated structure annotation with open resources to streamline spectra contributions from external laboratories and users with widely ranging cheminformatics experience.

Description:

The European MassBank server (www.massbank.eu) was founded in 2012 by the NORMAN Network (www.norman-network.net) to provide open access to mass spectra of substances of environmental interest contributed by NORMAN members. The automated workflow RMassBank was developed as part of this effort (Stravs et al. 2013, DOI: 10.1002/jms.3131; https://github.com/MassBank/RMassBank/). This workflow included automated processing of the mass spectral data, as well as automated annotation using the SMILES, names and CAS numbers provided by the user. Cheminformatics toolkits (e.g. Open Babel, rcdk) and web services (e.g. the CACTUS Chemical Identifier Resolver, Chemical Translation Service (CTS), ChemSpider, PubChem) were then used to convert and/or retrieve the remaining information needed to complete the MassBank records (additional names, InChIs, InChIKeys, several database identifiers, mol files), to avoid placing an excessive burden on users and to reduce the chance of errors. To date, approximately 16,000 MS/MS spectra (61 %*) corresponding to 1,269 (18 %*) unique chemicals (*of all open data as of Nov. 2016) have been uploaded to MassBank.EU via RMassBank. Curating the MassBank.EU records, as part of efforts to provide EPA CompTox Dashboard identifiers (DTXSIDs) for each record, revealed several issues in data quality. In addition, the representation of “ambiguous substances”, for example complex surfactant mixtures of various chain lengths and branching or incompletely defined structures of transformation products, is an ongoing challenge. While “ambiguous structures” cannot be represented in the majority of cheminformatics tools, we report on proof-of-concept solutions in this work. This presentation reflects on the effectiveness of the original RMassBank concept and on both the potential and the pitfalls of using automated structure annotation with open resources to streamline spectra contributions from external laboratories and users with widely ranging cheminformatics experience.
Note: this work does not necessarily reflect U.S. EPA policy.
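
Illustrative sketch (not part of the original abstract or of RMassBank): the description mentions user-supplied CAS numbers and retrieved InChIKeys, and data-quality issues found during curation. One simple check that a curation step might run before calling any web service is a format and check-digit test on those identifiers. The function names below are hypothetical; the sketch relies only on the published CAS Registry Number check-digit rule and the fixed 14-10-1 letter layout of a standard InChIKey. It checks form only and cannot tell whether an identifier belongs to the intended structure.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")
# Standard InChIKey layout: blocks of 14, 10 and 1 uppercase letters.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one;
    # the weighted sum modulo 10 must equal the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def looks_like_inchikey(key: str) -> bool:
    """Return True if `key` has the 14-10-1 uppercase-letter InChIKey layout."""
    return INCHIKEY_PATTERN.fullmatch(key.strip()) is not None
```

For example, `is_valid_cas("58-08-2")` (caffeine) passes the check-digit test, while a mistyped `"58-08-3"` fails it. A record that fails either check could be flagged for manual review rather than passed on to automated annotation.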

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 03/22/2018
Record Last Revised: 04/11/2018
OMB Category: Other
Record ID: 340219