Science Inventory

ScrubChem: Cleaning of PubChem Bioassay Data to Create Diverse and Massive Bioactivity Datasets for Use in Modeling Applications (SOT)

Citation:

Harris, Jason Bret, J. Harris, O. Isayev, A. Tropsha, AND R. Judson. ScrubChem: Cleaning of PubChem Bioassay Data to Create Diverse and Massive Bioactivity Datasets for Use in Modeling Applications (SOT). Presented at SOT, Baltimore, Maryland, March 12 - 16, 2017. https://doi.org/10.23645/epacomptox.5176861

Impact/Purpose:

Poster presentation at the SOT 2017 annual meeting that presents a new data deconvolution process.

Description:

The PubChem Bioassay database is a non-curated public repository with bioactivity data from 64 sources, including ChEMBL, BindingDB, DrugBank, Tox21, the NIH Molecular Libraries Screening Program, and various academic, government, and industrial contributors. However, this data is difficult to use in data-driven research, mainly because its 1.2 million assay records lack interoperability and standardization. Extracting this public data into high-quality, computable datasets that are usable for predictive and analytical research presents several big-data challenges, for which ScrubChem is being developed as a manageable solution. Our approach was to use logic-based text and language processing rules to digitally curate and correct the many issues related to the flexible deposition structure of PubChem (e.g., variable placement of biological target information, variable endpoint terminologies, result readouts not distinguished from non-result readouts, improper use of the null vs. zero concept, incorrect spellings). Currently, ScrubChem contains approximately 680 million bioactivity values and related metadata drawn from PubChem and maps this data to over 10,000 biological targets and 2.1 million chemical structures. This work presents case issues identified and resolved through ScrubChem and provides an example dataset for the human androgen receptor, with over 85,000 reference bioactivities, to further illustrate the results of the cleaning process. This work does not necessarily reflect U.S. EPA or University of North Carolina policy.
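
The kind of logic-based cleaning rule described above can be illustrated with a minimal sketch. It is not ScrubChem's actual implementation: the synonym table, the list of null tokens, and the function names normalize_endpoint and parse_value are all hypothetical, chosen only to show how variable endpoint terminology, misspellings, and the null vs. zero distinction might be handled.

```python
import re
from typing import Optional

# Hypothetical synonym table; these mappings are illustrative assumptions,
# not rules taken from ScrubChem.
ENDPOINT_SYNONYMS = {
    "ic50": "IC50",
    "ic 50": "IC50",
    "ic-50": "IC50",
    "ec50": "EC50",
    "potency": "POTENCY",
    "potencey": "POTENCY",  # example of a misspelled deposit
}

# Assumed set of strings a depositor might use to mean "no measurement".
NULL_TOKENS = {"", "na", "n/a", "null", "none", "nd", "-"}


def normalize_endpoint(raw: str) -> Optional[str]:
    """Map a free-text endpoint label to a canonical name, or None if unknown."""
    # Lowercase, trim, and collapse internal whitespace before the lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return ENDPOINT_SYNONYMS.get(key)


def parse_value(raw: Optional[str]) -> Optional[float]:
    """Parse a readout, keeping 'missing' (None) distinct from a measured 0.0."""
    if raw is None:
        return None
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    try:
        return float(text)
    except ValueError:
        # Unparseable text is treated as missing rather than coerced to zero.
        return None
```

In this sketch, a readout of "0" is returned as 0.0, while an empty string or "N/A" is returned as None, so a missing measurement is never silently recorded as zero activity.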

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/16/2017
Record Last Revised: 07/16/2018
OMB Category: Other
Record ID: 339869