Science Inventory

Using deep learning and active learning methods to streamline literature curation for the ECOTOX Knowledgebase

Citation:

Howard, B., R. Shah, A. Tandon, A. Merrick, J. Olker, C. Elonen, AND D. Hoff. Using deep learning and active learning methods to streamline literature curation for the ECOTOX Knowledgebase. SETAC North America, Toronto, ON, CANADA, November 03 - 07, 2019.

Impact/Purpose:

The ECOTOX Knowledgebase is a comprehensive, publicly available application providing chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife, compiled from over 48,000 references covering over 11,000 chemicals and over 12,000 species. ECOTOX data are used for all ecological risk assessments supporting pesticide registrations and re-registrations, all ambient water quality criteria for chemicals published since 1985, site-specific water quality criteria (by EPA Regions, States, and Tribes), and assessments used in emergency response. ECOTOX has established standard operating procedures that meet requirements for Agency systematic reviews of available information for use in Agency decision making. Presently, the literature review and data extraction processes are completed manually; however, development of more efficient data mining tools will ultimately lead to more informed predictive tools. This presentation describes an effort to use machine learning methods to automatically identify relevant documents and to develop a customized software application for screening literature. By increasing efficiency in identifying, obtaining, reviewing, and encoding data for the user interface of ECOTOX, we will be able to quickly identify and curate ecotoxicological data to meet Program offices’ needs, as well as for use by States and Tribes to determine thresholds and conduct risk assessments.

Description:

The ECOTOXicology Knowledgebase (ECOTOX) is a comprehensive, publicly available knowledgebase providing single-chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife. The ECOTOX database (as of March 2019) contains data for 11,695 chemicals and 12,713 species manually extracted from 48,464 references. The database is updated quarterly. To identify relevant references and extract pertinent data, the ECOTOX data curation pipeline employs a methodical, multi-step process roughly equivalent to the initial stages of systematic review. This labor-intensive workflow requires human curators to regularly evaluate tens of thousands of candidate references, the majority of which are then rejected as not relevant. To streamline this process, we have recently evaluated the feasibility of using machine learning methods to automatically classify documents according to relevance, and to identify the exclusion rationale for those references that are excluded. Using a massive historical database containing hundreds of thousands of manually screened references, we train a deep learning neural language-model classifier to predict the relevance of new candidate references. References designated as excluded are further classified according to exclusion rationale and, using an attention mechanism built into the deep learning classifier, supporting passages are highlighted in the abstract. These models serve as a baseline classification method, subject to human intervention, which is then refined for each chemical-centric batch of new candidate articles according to user feedback within an active learning framework. Our approach is operationalized in the form of a modified version of the SWIFT-Active Screener software application, a collaborative web-based reference screening platform.
We anticipate that deployment of this tool as part of the ECOTOX data curation pipeline will result in more than a 50% reduction in the time spent screening references for relevance.
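
The active learning loop described above can be illustrated with a minimal sketch. It is not the presenters' implementation: a toy bag-of-words naive Bayes scorer stands in for the deep learning classifier, the rule of showing the highest-scoring references first is an assumption (the abstract does not state how references are prioritized), and the names `NaiveBayesScorer`, `screen_batch`, and `curator`, along with the example titles, are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayesScorer:
    """Toy stand-in (assumption) for the deep learning relevance classifier."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def score(self, text):
        """Return an estimate of P(relevant | text) with add-one smoothing."""
        log_odds = math.log((self.n_docs[1] + 1) / (self.n_docs[0] + 1))
        v = len(self.vocab) + 1  # +1 leaves room for unseen tokens
        total1 = sum(self.counts[1].values())
        total0 = sum(self.counts[0].values())
        for tok in tokenize(text):
            p1 = (self.counts[1][tok] + 1) / (total1 + v)
            p0 = (self.counts[0][tok] + 1) / (total0 + v)
            log_odds += math.log(p1 / p0)
        log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


def screen_batch(seed_texts, seed_labels, candidates, curator, batch_size=2):
    """Screen candidates in rounds, retraining on curator feedback each round.

    curator(text) -> 0 or 1 stands in for the human screening decision.
    Returns (text, label) pairs in the order they were shown to the curator.
    """
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(candidates)
    shown = []
    while pool:
        model = NaiveBayesScorer().fit(texts, labels)
        # Assumed prioritization: most likely relevant references first.
        pool.sort(key=model.score, reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        for text in batch:
            label = curator(text)
            texts.append(text)
            labels.append(label)
            shown.append((text, label))
    return shown


# Hypothetical historical (already screened) references: 1 = relevant.
seed_texts = [
    "acute toxicity of copper to fathead minnow",
    "chronic toxicity of atrazine to daphnia",
    "synthesis of copper catalyst for polymer production",
    "human clinical trial of drug dosage",
]
seed_labels = [1, 1, 0, 0]

# Hypothetical new candidate references with simulated curator decisions.
truth = {
    "toxicity of zinc to rainbow trout": 1,
    "polymer catalyst synthesis review": 0,
    "lethal effects of atrazine on daphnia": 1,
    "clinical drug trial in human patients": 0,
}

shown = screen_batch(seed_texts, seed_labels, list(truth), truth.get)
for text, label in shown:
    print(label, text)
```

In this toy run the two relevant references are shown to the curator before the two irrelevant ones. In a real workflow that ordering is what would let screening stop early once few relevant references are expected to remain.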

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 11/07/2019
Record Last Revised: 11/25/2019
OMB Category: Other
Record ID: 347578