Science Inventory

Developing data provenance approaches in ToxValDB: An IRIS Case Study

Citation:

Webb, A., J. Hope, M. Groover, J. Wall, R. Judson, AND R. Sayre. Developing data provenance approaches in ToxValDB: An IRIS Case Study. Society of Toxicology (SOT) 63rd Annual Meeting and ToxExpo, Salt Lake City, UT, March 10 - 14, 2024. https://doi.org/10.23645/epacomptox.25472164

Impact/Purpose:

Presentation to the Society of Toxicology (SOT) 63rd Annual Meeting and ToxExpo, March 2024. This poster describes the development of a data provenance workflow for ToxValDB that supports improved data quality and transparency. It includes a real-world case study that illustrates the challenges of data curation and provenance. Successful data provenance and lineage mapping in ToxValDB will build confidence in its broad application across the U.S. EPA and its partners and stakeholders, including in modeling platforms and data-driven products.

Description:

Background and Purpose: The U.S. Environmental Protection Agency’s (U.S. EPA) Toxicity Values Database (ToxValDB) is a compendium of toxicology information including human health reference values, cancer slope factors, ecological screening or effect levels, and quantitative dose metrics (e.g., points of departure such as LOAELs and NOAELs) from epidemiological and in vivo experimental animal studies. These data are aggregated from over 50 sources, including federal, state, and international agencies, industry groups, and academic institutions. Data are curated through expert manual and machine-assisted workflows, with assessment-relevant toxicity values and dose metrics standardized to common exposure units (e.g., mg/kg-day) across sources.

A major focus of ongoing ToxValDB efforts has been data provenance, which is crucial for building confidence in the quality of information used in diverse decision-making contexts. Tracking provenance is complex for products with thousands of data points or inputs. For example, one document may report a single finding, effect, or observation, while the details of the experiment that generated the result are described in another, often more detailed, document. Duplicate records are another data provenance issue. Herein, we present two modules from our curation workflow with an accompanying case study to explain data provenance approaches in the ToxValDB context.

Methods: The case study was based on the U.S. EPA’s Integrated Risk Information System (IRIS) Assessment database, which contains chemical toxicity information, study metadata, and derived human health toxicity values (e.g., non-cancer oral Reference Doses [RfDs] or inhalation Reference Concentrations [RfCs]). This dataset was chosen because of its multiple layers of inter- and intra-document lineage connections.

The IRIS dataset was first programmatically curated from IRIS export files into the database following the standard ToxValDB data load process in R. Next, associated IRIS summary reports were downloaded from IRIS chemical webpages via Python scripting. The export files and summaries were stored with accompanying metadata and interrelationships. Documents were associated with database records by IRIS Chemical ID.

Manual curation was conducted to extract additional metadata and records for IRIS summary values. Inter-document lineage captured complex metadata and relationships between documents. Additional records included points of departure used to derive the corresponding RfDs or RfCs, as well as missing study metadata (e.g., subject species, sex, age) captured via intra-record linkages. If study metadata were not provided in reports, data were curated from the original study reference. To facilitate full data provenance, a record lineage was created using a field that shows how records are associated across documents. Data profiling was performed to assess the effect of the new data provenance approach. This included numeric profiling and duplicate detection within and across sources.

Results: We present results showing how analyses using the IRIS data may differ depending on which associated data records are available across the document layers. These results highlight the importance of transparency about which toxicological data are used and where they came from in decision contexts, including hazard prioritization, risk assessment, and science communication.

Conclusions: ToxValDB is publicly available and regularly used to support the evaluation of chemicals under diverse decision contexts across the U.S. EPA, industry, academia, and other governmental and non-governmental entities. Successful data provenance and lineage mapping will build confidence in ToxValDB data for use in broad research applications, modeling efforts, and data-driven products.
This abstract does not necessarily reflect U.S. EPA policy.
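
The record-lineage field and the duplicate detection described in the Methods can be illustrated with a minimal Python sketch. It is not taken from the poster or from the ToxValDB code base: the `Record` class, the `parent_record_id` field, the helper functions, and every identifier and value in the sample data are hypothetical and chosen only for illustration. The sketch assumes each record optionally points to the record it was derived from, and that two records are candidate duplicates when they share chemical, value type, value, and units.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    record_id: str
    source: str        # data source name (illustrative)
    chemical_id: str   # stands in for an identifier such as the IRIS Chemical ID
    document_id: str   # document the value was curated from
    toxval_type: str   # e.g. "RfD" or "NOAEL"
    value: float
    units: str
    parent_record_id: Optional[str] = None  # hypothetical lineage field


def lineage(record_id, records_by_id):
    """Follow parent links from a record back to its origin record."""
    chain, seen = [], set()
    current = record_id
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        chain.append(current)
        current = records_by_id[current].parent_record_id
    return chain


def find_duplicates(records):
    """Group record IDs that share chemical, value type, value, and units."""
    groups = defaultdict(list)
    for r in records:
        key = (r.chemical_id, r.toxval_type, round(r.value, 6), r.units)
        groups[key].append(r.record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def duplicate_scope(ids, records_by_id):
    """Report whether a duplicate group sits within one source or spans several."""
    sources = {records_by_id[i].source for i in ids}
    return "within source" if len(sources) == 1 else "across sources"


# Invented sample data: an RfD in an export file, the point of departure it was
# derived from in a summary report, the same value in the original study, and
# the same RfD repeated by a second source.
records = [
    Record("r1", "SOURCE_A", "chem-001", "export", "RfD", 0.005, "mg/kg-day", "r2"),
    Record("r2", "SOURCE_A", "chem-001", "summary", "NOAEL", 0.5, "mg/kg-day", "r3"),
    Record("r3", "SOURCE_A", "chem-001", "study", "NOAEL", 0.5, "mg/kg-day"),
    Record("r4", "SOURCE_B", "chem-001", "other", "RfD", 0.005, "mg/kg-day"),
]
by_id = {r.record_id: r for r in records}

print(lineage("r1", by_id))
for ids in find_duplicates(records):
    print(ids, duplicate_scope(ids, by_id))
```

In this toy data, the two NOAEL records look like duplicates on their values alone, but the lineage chain shows that one was curated from the other. That is the kind of distinction a lineage field makes visible.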

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/14/2024
Record Last Revised: 03/25/2024
OMB Category: Other
Record ID: 360870