Science Inventory

Data Profiling and Quality Control Pre-Screening of Toxicology Data in ToxValDB

Citation:

Tague, M., A. Brito, R. Judson, E. Rowan, T. Wall, AND R. Sayre. Data Profiling and Quality Control Pre-Screening of Toxicology Data in ToxValDB. SOT, Nashville, TN, March 19 - 23, 2023. https://doi.org/10.23645/epacomptox.22280572

Impact/Purpose:

This project develops a method to scope data in ToxValDB to identify likely errors and duplicates. The methods may be applicable to other toxicology data.

Description:

The U.S. Environmental Protection Agency’s (US EPA) Toxicity Values Database (ToxValDB) is a quantitative database containing toxicology information including reference doses, screening levels, and quantitative values from in vivo toxicology studies. These data are aggregated from over 50 sources, including state, national, and international agencies, industry groups, and academic institutions. Data are incorporated into the database through both manual and machine extraction, with dose metrics (e.g., LOAEL, RfD) and units normalized across the sources.
In the current process, a domain specialist reviews a summary of each extraction into ToxValDB for relevance and completeness. Then, 20% of the records undergo manual review in the Data Accuracy Tool (DAT), a tool designed to facilitate quality assurance workflows. A generalist and a domain specialist scan these records, making changes where needed so that each record matches the source document and logging an audit trail. However, despite this QC process, some errors still occur in the database.
Because errors can arise in manual extraction, in the source documents themselves, and in the normalization of heterogeneous sources, it is crucial to establish quality control metrics by which data records can be automatically identified and flagged for manual review. In addition to these complications, data present in ToxValDB include both hazard values (the species- and route-specific level of exposure to a chemical that can cause harm) and risk values (the likelihood of harm for a given exposure). Inclusion of both kinds of terms may lead to high data heterogeneity. Furthermore, many types of dose metrics only take discrete values, requiring care in data analysis. To this end, we present here a data profiling framework that analyzes the data within ToxValDB to find statistical patterns, assure data accuracy, alleviate downstream errors, and reduce reviewer burden.
Quality control rules were developed in two phases. The first phase screened for “common sense violation” records, such as those that had multiple or incorrect units, or values that did not make physical sense. Second-phase quality control rules continued this logic, considering each combination of chemical, dose metric, and units to determine unusually high or low values, whether by comparison to the distribution of that combination or by using known values for structurally related chemicals as benchmarks. Using these QC rules, an application is being developed to provide the results of these measures and to allow for automated screening of newly entered data, reducing manual work in the future. Any records flagged by these metrics were then assigned to a manual quality control review through DAT. Manual review determines if values are accurate to the original source document; incorrectly entered records are manually replaced. The percentage of data replaced or cleared post-screening will be used to update QC rules.
Of about 320,000 total records, Phase 1 rules have flagged about 17,000 records as duplicates, 4,000 records for data formatting, and 150 for exceedingly large values. Phase 2 rules have flagged about 13,000 further records for outlier values. These records are currently under manual review to remove downstream errors. ToxValDB data are publicly available and are regularly used throughout the US EPA to support the prioritization and estimation of chemical safety values, and are also widely used across industry, academia, and other governmental and non-governmental entities. This abstract does not necessarily reflect U.S. EPA policy.
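
The two-phase screening described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, field names, the `max_value` cutoff, and the use of a robust z-score (median and median absolute deviation on a log10 scale) for the Phase 2 outlier test are all assumptions made for illustration, since the abstract does not specify the actual rules or the ToxValDB schema.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical records; the real ToxValDB schema is not given in the abstract.
records = [
    {"id": 1, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 10.0},
    {"id": 2, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 12.0},
    {"id": 3, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 9.0},
    {"id": 4, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 11.0},
    {"id": 5, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 8.0},
    {"id": 6, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 10000.0},
    {"id": 7, "chemical": "chem_A", "metric": "LOAEL", "units": "mg/kg-day", "value": 10.0},
    {"id": 8, "chemical": "chem_B", "metric": "RfD", "units": "mg/kg, ppm", "value": 5.0},
    {"id": 9, "chemical": "chem_B", "metric": "RfD", "units": "mg/kg-day", "value": -1.0},
]


def phase1_flags(records, max_value=1e6):
    """Phase 1 style "common sense" checks (illustrative rules only)."""
    flags = defaultdict(list)
    seen = set()
    for r in records:
        # Exact repeat of chemical, metric, units, and value counts as a duplicate.
        key = (r["chemical"], r["metric"], r["units"], r["value"])
        if key in seen:
            flags[r["id"]].append("duplicate")
        seen.add(key)
        # Empty units or several units packed into one field.
        units = r["units"].strip()
        if not units or "," in units or ";" in units:
            flags[r["id"]].append("units_format")
        # A dose cannot be zero or negative; max_value is an assumed cutoff.
        if r["value"] <= 0:
            flags[r["id"]].append("non_physical_value")
        elif r["value"] > max_value:
            flags[r["id"]].append("exceedingly_large_value")
    return dict(flags)


def phase2_flags(records, threshold=3.5, min_group=5):
    """Flag outliers within each (chemical, metric, units) combination."""
    groups = defaultdict(list)
    for r in records:
        if r["value"] > 0:  # log scale needs positive values
            groups[(r["chemical"], r["metric"], r["units"])].append(r)
    flagged = []
    for rows in groups.values():
        if len(rows) < min_group:
            continue  # too few values to describe a distribution
        logs = [math.log10(r["value"]) for r in rows]
        med = median(logs)
        mad = median([abs(x - med) for x in logs])
        if mad == 0:
            continue  # mostly identical (e.g. discrete) values: no spread to compare against
        for r, x in zip(rows, logs):
            # 1.4826 scales the MAD to match a standard deviation under normality.
            if abs(x - med) / (1.4826 * mad) > threshold:
                flagged.append(r["id"])
    return flagged


print(phase1_flags(records))
print(phase2_flags(records))
```

Skipping groups whose median absolute deviation is zero is one simple way to respect the caution about dose metrics that take only discrete values; a production rule set would likely need a more careful treatment, as well as the structural-analog benchmarks mentioned above, which this sketch does not attempt.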

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/23/2023
Record Last Revised: 04/14/2023
OMB Category: Other
Record ID: 357609