Grantee Research Project Results
Final Report: Automated Systematic Reviews for Chemical Risk
EPA Contract Number: 68HERC22C0033
Title: Automated Systematic Reviews for Chemical Risk
Investigators: Minton, Steven
Small Business: InferLink Corporation
EPA Contact: Richards, April
Phase: I
Project Period: December 1, 2021 through May 31, 2022
Project Amount: $100,000
RFA: Small Business Innovation Research (SBIR) Phase I (2022)
Research Category: Small Business Innovation Research (SBIR)
Description:
In this Phase I project we designed a software system to generate systematic reviews of scientific literature on chemical risks. The technology has the potential to greatly improve on currently available tools because it automates the full process of literature review. Unlike existing technologies that simply help humans find and filter relevant articles, our approach uses state-of-the-art neural networks to find, filter, extract, summarize, and aggregate research results.
Risk assessments, and in particular systematic literature reviews, are an important tool to inform decision making and policy. Our system will allow the EPA, and other organizations, to make sound decisions based on accurate and up-to-date chemical risk assessments. The use of AI enables systematic reviews to be produced much more rapidly than is possible with a fully manual approach, so that a great deal more can be accomplished with the same resources. In addition, because the technology can be carefully evaluated on test data, we can be confident (statistically speaking) that any assessments using the technology will be high quality, transparent, consistent, and scientifically defensible.
Our research in Phase I built upon our previous work in natural language processing and machine learning, in which we developed and commercialized software to automate systematic reviews of clinical studies. We showed that this technology can be repurposed and extended to accurately extract chemical risk data from research literature. To do so, we developed a proof-of-concept information extraction system that determines whether an article is an epidemiological study of chemical exposures in humans and, if it is, what chemical exposures, health outcomes, and population sizes were studied. Since extracting and filtering information is generally the most challenging and time-consuming part of generating a systematic review, our work demonstrates that we have a strong foundation for Phase II, where we plan to implement an end-to-end system for automated systematic reviews.
Summary/Accomplishments (Outputs/Outcomes):
A key aspect of our work in Phase I involved generalizing techniques we had developed and used in past work for rapidly training machine learning models. We were able to distill a general methodology for fast training, which represents a significant step forward in unifying several ad hoc strategies that we had previously used. The methodology employs two related techniques, which we refer to as model-assisted annotation and model constellations, to iteratively build a set of related models for extracting data from articles.
After consultation with EPA and commercial stakeholders, we focused on developing a proof-of-principle system to extract information from epidemiological studies of chemical exposures in humans. We used model-assisted annotation to quickly generate a dataset of articles with the chemical exposures, health outcomes, and population sizes labeled. This approach allowed us to reduce what could have been several months of annotation by multiple people to a two-week effort by a single person.
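The report does not describe the mechanics of model-assisted annotation. One common reading of the term is a loop in which the current model proposes labels, a human annotator confirms or corrects them, and the model is retrained on the growing labeled set. The Python sketch below illustrates that reading only; the keyword-based toy "model" and every name in it (train, predict, model_assisted_annotation, review) are hypothetical and are not taken from the InferLink system.

```python
from typing import Callable

def train(labeled: list[tuple[str, bool]]) -> set[str]:
    """Toy stand-in for model training: keep words seen only in positive examples."""
    pos: set[str] = set()
    neg: set[str] = set()
    for text, label in labeled:
        (pos if label else neg).update(text.lower().split())
    return pos - neg

def predict(model: set[str], text: str) -> bool:
    """Suggest a positive label if any word of the text is in the model's word set."""
    return any(word in model for word in text.lower().split())

def model_assisted_annotation(
    seed: list[tuple[str, bool]],
    unlabeled: list[str],
    review: Callable[[str, bool], bool],
    batch_size: int = 2,
) -> tuple[list[tuple[str, bool]], int]:
    """Label `unlabeled` in batches; the model suggests, the reviewer decides.

    Returns the full labeled set and the number of suggestions the reviewer changed.
    """
    labeled = list(seed)
    corrections = 0
    for start in range(0, len(unlabeled), batch_size):
        model = train(labeled)  # retrain on everything labeled so far
        for text in unlabeled[start:start + batch_size]:
            suggestion = predict(model, text)
            final = review(text, suggestion)  # human accepts or overrides
            corrections += final != suggestion
            labeled.append((text, final))
    return labeled, corrections
```

Under this assumption, the time savings come from the annotator reviewing suggestions rather than labeling from scratch, and the correction count should fall as the model improves between batches.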
Using this dataset, we generated a system that could identify, with precision and recall in the 80-99% range, whether a research article was an epidemiological study of chemical exposures in humans and, if it was, what chemical exposures, health outcomes, and population sizes were part of the study. The system achieved these results using the model constellation approach, in which a group of simple models is developed in concert to solve a complex information extraction task.
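The report describes a model constellation only as a group of simple models developed in concert. As a rough illustration of that idea, the Python sketch below combines one study-type filter with three narrow extractors, one per field named in the text. The keyword and regular-expression rules stand in for trained models, and all names, term lists, and patterns are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Extraction:
    is_epi_study: bool
    exposures: list[str] = field(default_factory=list)
    outcomes: list[str] = field(default_factory=list)
    population_sizes: list[int] = field(default_factory=list)

# Illustrative vocabularies only; a real system would learn these from data.
EXPOSURE_TERMS = ("benzene", "lead", "arsenic")
OUTCOME_TERMS = ("leukemia", "asthma", "hypertension")
STUDY_CUES = ("cohort", "case-control", "cross-sectional")

def is_epi_study(text: str) -> bool:
    """Filter model: does the text look like an epidemiological study?"""
    lowered = text.lower()
    return any(cue in lowered for cue in STUDY_CUES)

def find_terms(text: str, terms: tuple[str, ...]) -> list[str]:
    """Extractor model: return the vocabulary terms that appear as whole words."""
    lowered = text.lower()
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", lowered)]

def find_population_sizes(text: str) -> list[int]:
    """Extractor model: numbers directly followed by a population noun."""
    pattern = r"\b(\d[\d,]*)\s+(?:participants|workers|subjects)\b"
    return [int(m.replace(",", "")) for m in re.findall(pattern, text)]

def extract(text: str) -> Extraction:
    """Run the filter first, then the field extractors only on accepted articles."""
    if not is_epi_study(text):
        return Extraction(is_epi_study=False)
    return Extraction(
        is_epi_study=True,
        exposures=find_terms(text, EXPOSURE_TERMS),
        outcomes=find_terms(text, OUTCOME_TERMS),
        population_sizes=find_population_sizes(text),
    )
```

Keeping each component small means each one can be trained and evaluated separately, which is one plausible reason a group of simple models might be preferred over a single large one.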
We evaluated the information extraction system on articles that had been manually selected in real systematic reviews, essentially testing whether we could "replicate" the reviews by extracting the necessary information. We showed that the information was extracted from these articles with 89-99% precision/recall, establishing that the technology has the potential to yield high quality and scientifically defensible data.
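For readers unfamiliar with the metrics, precision is the fraction of extracted items that are correct, and recall is the fraction of correct items that were extracted. The Python sketch below computes both for one field of one article using the standard definitions. The function name and the convention of returning 0.0 for an empty set are choices made for this illustration, and the report does not say how its own figures were aggregated across articles or fields.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Compare extracted items against a manually curated gold set.

    precision = true positives / all predicted items
    recall    = true positives / all gold items
    An empty denominator yields 0.0 (a convention chosen for this sketch).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, if a system extracts three exposures for an article and two of them are among the four exposures recorded in the published review, precision is 2/3 and recall is 2/4.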
As part of our Phase I work, we also gathered requirements and developed a design for an end-to-end system that will use the proof-of-concept models described above to automatically generate systematic reviews. The techniques we have validated, along with the system requirements and design that we established in Phase I, will allow us to effectively implement the system during our Phase II work.
Conclusions:
The work conducted in Phase I of this project has shown that our model-assisted annotation and orchestrated models machine learning techniques, which we had previously applied successfully to clinical studies, can be used efficiently and effectively in new technical domains. We developed a working proof-of-concept system for extracting information from epidemiological studies of chemical exposures in humans. The system was successfully tested by evaluating its recall and precision on test data sets and, in addition, by evaluating it on results from existing systematic reviews. With high-quality information filtering and extraction shown to work, we will be able to summarize and aggregate chemical risk literature to build an end-to-end automated systematic review system in Phase II.
The technology we develop will be valuable to both government and businesses that rely on accurate, science-based chemical risk reviews to make decisions. During Phase I, we met with potential partners and customers to validate the market. In particular, we identified a partner willing to provide matching funds for the Phase II Commercialization Option.
The approach we have planned will allow us to provide a Software as a Service offering to inform decision making based on the collection and analysis of scientific data on chemical risk. Markets that would have a need for our technology include the casualty insurance risk analytics market (which informs underwriting of environmental insurance and related lines) and the Environmental, Social and Governance (ESG) data market (which informs investors and boards about the environmental impact of companies). Both fast-growing markets are estimated to have an as-yet-unrealized potential market size over $1B.
The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.