Science Inventory

After Correcting for “Concept Drift,” Deep-Learning Methods Can Now Achieve Human-Level Performance When Predicting Article Exclusion Reasons During ECOTOXicology Knowledgebase Curation

Citation:

Howard, B., C. Norman, A. Tandon, R. Shah, J. Olker, AND D. Hoff. After Correcting for “Concept Drift,” Deep-Learning Methods Can Now Achieve Human-Level Performance When Predicting Article Exclusion Reasons During ECOTOXicology Knowledgebase Curation. Society of Toxicology (SOT) 62nd Annual Meeting and ToxExpo, Nashville, TN, March 19 - 23, 2023. https://doi.org/10.23645/epacomptox.24749343

Impact/Purpose:

The ECOTOX Knowledgebase is a comprehensive, publicly available application providing chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife. ECOTOX data have been compiled over more than 30 years and currently include over 50,000 references covering over 12,000 chemicals and over 13,000 species. Data from ECOTOX are used for all ecological risk assessments supporting pesticide registrations and re-registrations, all ambient water quality criteria for chemicals published since 1985, site-specific water quality criteria (by EPA Regions, States, and Tribes), and assessments used in emergency response. ECOTOX has established standard operating procedures that meet requirements for Agency systematic reviews of available information for use in Agency decision making. The development and adoption of more efficient literature search and review methods and data mining tools will ultimately lead to more informed predictive tools. This poster describes the development of machine learning methods to automatically identify relevant documents and of a customized software application for screening literature. By increasing efficiency in identifying, obtaining, reviewing, and encoding data for the ECOTOX user interface, we will be able to quickly identify and curate ecotoxicological data to meet Program offices’ needs, as well as for use by States and Tribes to determine thresholds and conduct risk assessments.

Description:

The ECOTOXicology Knowledgebase (ECOTOX) is a comprehensive, publicly available resource providing single-chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife. The database is updated quarterly. To identify relevant references and extract pertinent data, the ECOTOX data curation pipeline employs a methodical process similar to the initial stages of systematic review. This labor-intensive workflow requires curators to regularly evaluate tens of thousands of candidate references, the majority of which are then rejected as not relevant. After the careful review of hundreds of thousands of articles, the ECOTOX database currently (as of December 2022) contains data for 12,714 chemicals and 13,806 species extracted from 53,763 references. This extensive dataset of historical screening decisions has given us the opportunity to develop state-of-the-art neural network classifiers that partially automate title and abstract screening and categorize rejected references (e.g., human health, fate, chemical methods). While initial proof-of-concept results from these models were very encouraging, we recently noticed that the “meanings” of several of the rejection categories have evolved over time due to concept drift, and that certain category labels had been added or removed from current usage. Therefore, to be more representative of future screening tasks, we collected new dual screening decisions for a sample of 5,638 abstracts. Using this refined dataset, we trained a neural network classification model on the modified exclusion categories and demonstrated that it can accurately predict the various reasons an irrelevant article should be excluded from the ECOTOX database. While the performance of the model varies depending on the reason for exclusion, the improved method achieves a micro-averaged F1 score of 0.7535 overall.
Furthermore, since human screeners do not always agree, we can compare the agreement among individual human screeners with the agreement between human screeners and model predictions. The resulting Cohen’s Kappa scores indicate that the model now performs at about the level of an average human screener, with some screeners consistently outperforming the model and others underperforming it. The latest model has been integrated into the ECOTOX version of the SWIFT-Active Screener software and is being used regularly as part of the ECOTOX literature curation pipeline at EPA. So far, almost 400,000 candidate ECOTOX abstracts have been uploaded into Active Screener, and of these, 292,000 were eliminated from screening, saving more than 73% of the effort otherwise required. As we conclude this phase of the project, our focus is shifting to automating screening and data extraction for the full texts of the references that remain after title and abstract screening is complete. This abstract does not necessarily reflect the views or policy of the US EPA.
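
The two evaluation measures named in the description, micro-averaged F1 and Cohen's Kappa, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the toy labels, and the choice to treat exclusion reasons as sets (so that an abstract may carry more than one reason) are all assumptions made for illustration only.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives
    over all items and labels, then compute a single F1 score.

    Each element of y_true / y_pred is a set of labels (assumption:
    an abstract may have more than one exclusion reason).
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa: agreement between two raters, corrected for the
    agreement expected by chance given each rater's label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:           # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical, not from the ECOTOX dataset).
truth = [{"human health"}, {"fate"}, {"chemical methods"}, {"fate", "human health"}]
model = [{"human health"}, {"fate"}, {"fate"}, {"fate"}]
print(f"micro-F1: {micro_f1(truth, model):.4f}")

screener = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
predicted = ["include", "exclude", "include", "include", "exclude", "exclude"]
print(f"kappa (screener vs. model): {cohen_kappa(screener, predicted):.4f}")
```

Computing Kappa for each pair of human screeners and for each screener paired with the model, then comparing the two sets of scores, is one way to judge whether the model behaves like an average screener, as the description reports.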

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/23/2023
Record Last Revised: 12/05/2023
OMB Category: Other
Record ID: 359727