Grantee Research Project Results
Final Report: Machine Learning Toolkit for Grey Literature Screening
EPA Contract Number: 68HERC23C0033
Title: Machine Learning Toolkit for Grey Literature Screening
Investigators: Mintas, Constantine
Small Business: VISIMO, LLC
EPA Contact: Richards, April
Phase: I
Project Period: December 1, 2022 through May 31, 2023
Project Amount: $99,804
RFA: Small Business Innovation Research (SBIR) Phase I (2023)
Research Category: SBIR - Sustainability, SBIR - Air and Climate, SBIR - Water, SBIR - Homeland Security
Description:
VISIMO's Machine Learning (ML) Toolkit for Grey Literature Screening seeks to reduce bias and improve the accuracy of Systematic Reviews (SRs) for chemical risk assessments, while also decreasing time and resource strain. This solution will enable researchers to filter full documents of all types and formats, identify relationships between documents, determine source relevance, perform improved meta-tagging, and process new inputs in real time, allowing the model to be tailored to the needs of each researcher. During the Phase I effort, VISIMO focused on proving the feasibility of a tool that can screen both academic and grey literature, particularly focusing on the varying formats and types of grey literature that are often barriers to an efficient and accurate SR process.
SRs are an extremely effective method for locating, appraising, and summarizing evidence, and they have significant potential to improve decision-making in chemical risk assessment by increasing the rigor, transparency, and objectivity of risk assessments. The accuracy and effectiveness of an SR depend largely on the comprehensiveness of the evidence included in the review. To reduce the risk of bias, SRs should include both academic literature, defined as peer-reviewed journals that follow a consistent organizational structure (e.g., title, abstract, methods, results, and conclusions), and "grey" literature, or sources located outside of scientific journals (e.g., annual reports, theses, doctoral dissertations, white papers, website articles). While grey literature is important to include, filtering it has long posed a challenge for researchers. Grey documents can be quite extensive and often lack the format and structure normally present in scientific journal articles. As a result, they can be difficult to metatag, and determining source relevance through abstracts and citations poses similar challenges.
Summary/Accomplishments (Outputs/Outcomes):
Phase I research included three major phases: 1) collecting and preprocessing data; 2) developing and building the model components; and 3) testing and fine-tuning the model's parameters. VISIMO partnered with Subject Matter Experts (SMEs) in the field of chemical risk assessments and systematic reviews to assist with the data collection effort: Dr. Joseph Bressler, Associate Professor of Environmental Health and Engineering at Johns Hopkins University, and Megan Kocher, Science and Evidence Synthesis Librarian from the University of Minnesota. The final dataset assembled to train the ML components in the pipeline consisted of 500 academic papers collected from PubMed and 165 various grey literature sources. This volume of data was sufficient both for exploratory research and for ultimately proving the feasibility of our tool. The dataset also represented a typical split of academic and grey literature in an SR, demonstrating its applicability to the SR process.
After data collection, VISIMO began building each of the model's components, which included an embedding algorithm and a relevance sorting component. The embedding algorithm transforms plain English text into a numeric representation designed to capture the context of the language (i.e., semantics, word proximity, etc.). Once the model was built, we tested each component, trained it, and then fine-tuned performance based on metrics such as Topic Coherence and Topic Diversity. Topic coherence measures how similar the words or phrases contained within a given topic are to one another. An example of a topic with high coherence would be the words "lithium," "ion," and "battery." Topic diversity measures how "diverse" the extracted topics are (i.e., the number of unique words across all topics extracted from the documents). This demonstrates whether the topics cover a wide range of information or if they tend to be narrower and more focused. A relatively diverse set of topic words is preferred, and this metric helps determine whether the pipeline is creating clusters that are too specific or too broad. The model underwent iterative training and testing as we worked to optimize and improve the results.
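The topic diversity metric described above can be sketched in a few lines; this is a minimal illustration (the report does not specify the exact formula VISIMO used), computing the fraction of unique words across each topic's top-word list:

```python
def topic_diversity(topics):
    """Fraction of unique words across all topics' top-word lists.

    topics: list of per-topic top-word lists.
    Returns a value in (0, 1]; 1.0 means no word is shared between topics,
    lower values indicate overlapping (less diverse) topics.
    """
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

topics = [
    ["lithium", "ion", "battery"],       # the high-coherence example from the text
    ["exposure", "dose", "toxicity"],
    ["exposure", "risk", "assessment"],  # shares "exposure" with the topic above
]
print(topic_diversity(topics))  # 8 unique words out of 9 total ≈ 0.889
```

A score well below 1.0 flags topics that reuse the same vocabulary, i.e., clusters that may be too broad or redundant, which is how the metric guides fine-tuning in the pipeline.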
Additionally, VISIMO developed wireframes that demonstrated the intended User Interface (UI), showing how a user will upload documents, filter and analyze respective relevance, and provide feedback to the tool. The wireframes were developed in conjunction with end-user customer discovery interviews and were shared with members of the EPA to test assumptions and gather feedback.
Each model component we tested also underwent randomization experiments with its default parameters. This identified an expected mean and standard deviation for each metric. Comparing the results of each optimization experiment to these baseline values indicated whether the change in performance was statistically significant. For example, the preprocessing optimization explored how grouping together pages of the grey literature, which had no clearly defined sections, impacted the coherence of the clusters. Additionally, the embedding parameters were tested to determine which model architecture offered the best performance, which pretrained model offered the best starting point for transfer learning, and how long the transfer learning should be performed to produce the optimal embedding space for the text. The relevance sorting experiment evaluated the impact of weighting the training data by class frequency to ensure relevant recommendations could be produced, even when relevant literature is sparse in the dataset.
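The class-frequency weighting mentioned above can be illustrated with the common inverse-frequency heuristic (n_samples / (n_classes * class_count)); this is a hedged sketch of one standard approach, not necessarily the exact scheme VISIMO implemented:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights.

    Gives sparse classes (e.g. the few "relevant" documents in an SR corpus)
    proportionally larger weight so they contribute as much to training
    as the abundant "not relevant" class.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

# Toy corpus where relevant documents are sparse (10 of 100).
labels = ["not_relevant"] * 90 + ["relevant"] * 10
weights = class_weights(labels)
print(weights)  # relevant class weighted 5.0, not_relevant ≈ 0.556
```

Multiplying each training example's loss by its class weight is one way to keep a relevance model from simply predicting "not relevant" for everything when relevant literature is rare.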
Although the topic diversity and coherence scores are useful in evaluating the intermediate stages of the pipeline, the final output is a list of documents ordered by predicted relevance to a user's search query. The key performance metrics measure the quality of these recommendations by evaluating the frequency of relevant documents. To start receiving recommendations, the user must upload a few documents tagged as relevant or not relevant to their topics. The model then uses the tagged documents to learn which documents are most relevant to the user's search topics and returns a set of documents sorted by predicted relevance to the search queries. While reviewing some of the most recommended documents, the user tags them as either relevant or not. The model then uses this information to improve the next set of recommendations. This process repeats until the model stops finding relevant documents to recommend. Because the relevance sorting algorithm is an iterative process, the evaluation process must also be iterative. Furthermore, we proved that as the researcher provides feedback on the documents, the percentage of relevant documents filtered toward the top of the recommended list increases.
Conclusions:
VISIMO's final metrics indicate that our ML tool produces accurate recommendations of documents relevant to the user's chosen topics. While performance varies between topics, the model consistently reduces the number of irrelevant documents a user must review. Additionally, the model's performance increases with continued interaction from the user, which will further reduce the time required to review documents in SRs. The tool is also input agnostic, allowing easy filtering of both grey and academic literature. The preprocessing and embedding models can accept both literature types and ingest them for the remaining stages of the pipeline to use. These results prove the tool's feasibility and its ability to increase the efficiency of document relevance determination during the SR process.
During the Phase I period of performance, several members of the VISIMO team developed an initial commercialization plan and worked with our TABA provider to refine it. The team gained a deeper understanding of end users, policy considerations, the competitor landscape, and considerations in scaling to a broader user base. Through numerous interviews with policy regulators, EPA members, librarians, multidisciplinary researchers, and risk assessors, members of the VISIMO team captured accurate user stories and key metrics of success to strengthen the viability of this plan. Unlike competing tools on the market, VISIMO's tool can screen full texts rather than solely titles and abstracts, and it can efficiently ingest several types of documents simultaneously, sorting them by relevance based on topic. This will ultimately reduce the time required for researchers to determine the relevance of large collections of grey literature. Though there are other tools that address literature screening in a generalized manner, VISIMO's solution is designed specifically to address the challenges inherent in screening grey literature.
While this tool is initially designed to meet EPA needs for chemical risk assessments, it can be adapted to other disciplines and users (e.g., scientists, researchers, librarians, and industry leaders in many different fields). This tool will increase the accuracy of SRs for chemical risk assessments by reducing bias and enabling the inclusion of a greater volume of sources, which will ultimately result in significant positive environmental impact by supporting the assessment and reduction of risk from various chemicals.
SBIR Phase II:
Machine Learning Toolkit for Academic and Grey Literature Screening

The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.