Science Inventory

Scrubchem: Building Bioactivity Datasets from Pubchem Bioassay Data (SOT)

Citation:

Harris, Jason Bret AND R. Judson. Scrubchem: Building Bioactivity Datasets from Pubchem Bioassay Data (SOT). Presented at Society of Toxicology Meeting, New Orleans, LA, March 13 - 17, 2016. https://doi.org/10.23645/epacomptox.5176495

Impact/Purpose:

Poster presentation at the SOT 2016 annual meeting. Critical assay annotations have been added to PubChem Bioassay records to coalesce hit calls for chemicals tested in different assays.

Description:

The PubChem Bioassay database is a non-curated public repository with data from 64 sources, including ChEMBL, BindingDB, DrugBank, EPA Tox21, the NIH Molecular Libraries Screening Program, and various other academic, government, and industrial contributors. Extracting this public data into quality datasets that are usable for analytical research presents several big-data challenges, for which we have designed manageable solutions. According to our preliminary work, there are approximately 549 million bioactivity values and related metadata within PubChem that can be mapped to over 10,000 biological targets. However, this data is not ready for use in data-driven research, mainly due to a lack of structured annotations.

We used a pragmatic approach that provides increasing access to bioactivity values in the PubChem Bioassay database. This included restructuring individual PubChem Bioassay files into a relational database (ScrubChem). ScrubChem contains all primary PubChem Bioassay data, which was reparsed, error-corrected (when applicable), enriched with additional data links from other NCBI databases, and improved by adding key biological and assay annotations derived from logic-based language processing rules. The utility of ScrubChem and the curation process were illustrated using an example bioactivity dataset for the androgen receptor protein. This initial work serves as a trial ground for establishing the technical framework for accessing, integrating, curating, analyzing, and making use of such massive bioactivity data. This abstract does not necessarily reflect U.S. EPA policy.
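As a rough illustration of what restructuring assay files into a relational form and coalescing hit calls across assays could look like, the following minimal sketch uses an in-memory SQLite database. The table names, column names, source labels, and all data values are hypothetical and chosen only for illustration; they are not ScrubChem's actual schema or contents.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up assay records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- One row per assay, annotated with its (hypothetical) target.
        CREATE TABLE assay (
            aid    INTEGER PRIMARY KEY,
            source TEXT NOT NULL,
            target TEXT NOT NULL
        );
        -- One row per chemical tested in an assay.
        CREATE TABLE result (
            aid     INTEGER NOT NULL REFERENCES assay(aid),
            cid     INTEGER NOT NULL,
            outcome TEXT NOT NULL CHECK (outcome IN ('active', 'inactive'))
        );
        """
    )
    conn.executemany(
        "INSERT INTO assay VALUES (?, ?, ?)",
        [
            (1, "source_a", "androgen receptor"),
            (2, "source_b", "androgen receptor"),
            (3, "source_a", "other target"),
        ],
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?)",
        [
            (1, 100, "active"),
            (2, 100, "active"),
            (1, 200, "active"),
            (2, 200, "inactive"),
            (3, 100, "inactive"),
        ],
    )
    return conn


def hit_calls(conn, target):
    """Summarize, per chemical, how many assays for a target called it active."""
    return conn.execute(
        """
        SELECT r.cid,
               SUM(r.outcome = 'active') AS n_active,
               COUNT(*)                  AS n_tested
        FROM result AS r
        JOIN assay  AS a ON a.aid = r.aid
        WHERE a.target = ?
        GROUP BY r.cid
        ORDER BY r.cid
        """,
        (target,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for cid, n_active, n_tested in hit_calls(db, "androgen receptor"):
        print(cid, n_active, n_tested)
```

The point of the sketch is that once each assay carries a structured target annotation, results from assays deposited by different sources can be grouped by chemical with a single join, which is not possible while the annotation exists only as free text.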

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/17/2016
Record Last Revised: 05/24/2017
OMB Category: Other
Record ID: 336386