Science Inventory

Identifying Genes Predictive of Breast Cancer-associated Chemicals through Machine Learning Analysis of High-Throughput Transcriptomic Screening Data across MCF7 Cells

Citation:

Koval, L., L. Everett, J. Harrill, R. Judson, AND J. Rager. Identifying Genes Predictive of Breast Cancer-associated Chemicals through Machine Learning Analysis of High-Throughput Transcriptomic Screening Data across MCF7 Cells. Presented at SOT 2024, Salt Lake City, UT, March 10 - 14, 2024. https://doi.org/10.23645/epacomptox.25565265

Impact/Purpose:

This work is being submitted to the Society of Toxicology (SOT) 2024 conference by outside collaborators making use of EPA-generated high-throughput transcriptomics (HTTr) data.

Description:

Background and Purpose: Breast cancer is a highly prevalent disease estimated to affect 1 in 8 women in the U.S. Breast cancer risk is known to be heavily impacted by an individual’s environment, with studies supporting increased risk due to specific chemical exposures, though critical research gaps remain. First, the majority of chemicals that humans are exposed to have not been evaluated for potential relationships to this disease outcome. Second, the biological mechanisms linking environmental chemical exposures to breast cancer etiology have not been fully established. New approach methodologies (NAMs) such as in vitro high-throughput screening (HTS), -omics technologies, and machine learning are useful tools to address these gaps. This study aims to leverage these NAMs-based tools to prioritize understudied chemicals in the environment based upon in vitro HTS-derived transcriptomic signatures that mimic signatures predictive of known breast cancer-associated chemicals. These prioritized understudied chemicals were hypothesized to target the expression of genes involved in breast cancer etiology, including those regulating cell cycle, cell death, DNA integrity, and endocrine signaling.

Methods: Targeted RNA sequencing data were produced from HTS experiments using human breast cancer MCF7 cells exposed to hundreds of individual chemicals, including those of environmental relevance. These chemicals were binned into categories based upon existing breast cancer data, specifically: chemicals with known associations with breast cancer (breast cancer-associated chemicals [BCs]); chemicals with a demonstrated lack of relationship to breast cancer (non-breast cancer chemicals [NBCs]); and chemicals that remain understudied for this risk (understudied chemicals [UCs]). Random forest (RF) and support vector machine (SVM) models were trained on the transcriptomic data for BCs and NBCs, yielding models that predict whether a chemical’s transcriptomic profile is more similar to BCs or NBCs. The resulting models were then applied to the UCs with the goal of identifying chemicals that alter the same biological mechanisms as BCs. Physicochemical properties were additionally evaluated as predictors alongside transcriptomic data. Biological interpretation was carried out through gene-specific analysis of top-ranking predictor variables as well as pathway-level analysis.

Results: Of the chemicals tested in MCF7 cells, 44 were classified as BCs, 335 were classified as NBCs, and 636 were classified as UCs. The RF and SVM models trained to predict BCs vs. NBCs using transcriptomic signatures achieved overall accuracies of 0.87 and 0.78, respectively. Fifty-seven genes were identified as strong predictors of BCs and NBCs across measures of feature importance for both models. Pathways relevant to DNA damage and endocrine signaling, specifically estrogen receptor-related pathways, were enriched within these predictor genes. Inclusion of the physicochemical properties did not significantly alter overall model accuracy, though it improved balanced accuracy, a metric that addresses potential class imbalance limitations, for both RF and SVM models. Applying the trained models to the UCs identified 24 understudied chemicals predicted to induce similar transcriptional alterations; these chemicals have potential implications for breast cancer risk and warrant further evaluation. The prioritized understudied chemicals include the biocides ethirimol, methylisothiazolinone, and quinalphos as well as the dye Allura Red C.I. 16035.

Conclusion: Collectively, this study addresses a critical gap toward understanding which chemicals in our environment may be impacting breast cancer risk by prioritizing understudied chemicals based on HTS data. Integration of HTTr and machine learning methodologies additionally yielded potential biomarkers of disease progression and elucidated pathways enriched in genes predictive of breast cancer.
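
Illustrative note: the Results distinguish overall accuracy from balanced accuracy because the training classes are uneven (44 BCs vs. 335 NBCs). The minimal sketch below shows why that matters, using only the class sizes reported above. The all-NBC baseline classifier and the helper functions are hypothetical illustrations, not the study's models or code. Balanced accuracy is taken here as the mean of per-class recall, which is its usual definition.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that were labeled correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Class sizes reported in the abstract: 44 BCs and 335 NBCs.
y_true = ["BC"] * 44 + ["NBC"] * 335

# Hypothetical baseline (an assumption for illustration only):
# a classifier that labels every chemical as NBC.
y_pred = ["NBC"] * len(y_true)

print(round(accuracy(y_true, y_pred), 3))  # 0.884
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

Under these assumptions, a classifier that never identifies a single BC still reaches an overall accuracy of about 0.88, while its balanced accuracy is 0.5. This is why balanced accuracy is the more informative of the two metrics when one class is much smaller than the other.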

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 03/14/2024
Record Last Revised: 04/08/2024
OMB Category: Other
Record ID: 361057