Science Inventory

Datamining Relational Databases for Regression Analysis

Citation:

Harten, P., H. Helgen, AND W. Melendez. Datamining Relational Databases for Regression Analysis. QSAR 2021 International Workshop on QSAR in Environmental and Health Sciences, Virtual, NC, June 07 - 10, 2021. https://doi.org/10.23645/epacomptox.15070215

Impact/Purpose:

Presentation to the QSAR 2021 International Workshop on QSAR in Environmental and Health Sciences, June 2021. Innovations in chemical and material design are rapidly changing the landscape of industrial and consumer products, including novel materials, such as engineered nanomaterials (ENMs), which are incorporated into products to enhance their performance. Emerging materials and technologies often have unique physicochemical properties, warranting specialized approaches for evaluating hazard and exposure. EPA staff in the Office of Chemical Safety and Pollution Prevention (OCSPP) are responsible for evaluating notifications or submissions proposing registration of novel materials and new intended uses for existing chemicals under the Toxic Substances Control Act (TSCA) or the Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA). Evaluating novel materials, including ENMs, requires determining the potential risks of their intended uses and the specific testing requirements, if any, needed to support registration. Because ENMs have unique physical and chemical properties and may behave differently than traditional chemicals, the testing and the ultimate registration decisions must accommodate these variations. EPA program offices, faced with applications for novel engineered nanomaterials, need access to relevant data to help predict potential environmental and biological interactions of nanomaterials based on their physicochemical properties and intended uses. A relational database containing the results of Office of Research and Development (ORD) research on the actions of engineered nanomaterials in environmental and biological systems is currently being built. The database captures the chemical and physical parameters of the materials tested, the assays in which they were tested, and the measured results.
A full database would also enable observed results to be predicted for similar nanomaterials using quantitative structure activity relationships and other sophisticated modeling approaches.

Description:

Datamining complicated MySQL relational databases and preparing the data for QSAR analysis in Python can challenge researchers. The task often entails more than a simple dump of records to CSV format in which every row holds all the numerical descriptors and the values to be predicted. Instead, many MySQL scripts and Python routines must be written and run to refine the data sufficiently for machine learning methods. This presentation gives examples of the routines written to datamine EPA’s NaKnowBase, a MySQL database, and organize its contents into a CSV file. These routines can be readily modified for researchers’ own relational databases and QSAR applications. The views expressed in this abstract are those of the authors and do not necessarily represent the views or policies of the US EPA.
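
The kind of reshaping described above can be illustrated with a minimal sketch. It is not the presenters' code and does not use the NaKnowBase schema: the tables `material` and `measurement`, the parameter names, and the helper functions are all hypothetical, and Python's built-in `sqlite3` module stands in for a MySQL connection so the example runs without a server. It shows one common step, turning "long" measurement records (one row per material and parameter) into a "wide" CSV with one row per material, descriptor columns first and the value to be predicted last.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
    conn.executescript(
        """
        CREATE TABLE material (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE measurement (
            material_id INTEGER REFERENCES material(id),
            parameter   TEXT,
            value       REAL
        );
        INSERT INTO material VALUES (1, 'demo_A'), (2, 'demo_B');
        INSERT INTO measurement VALUES
            (1, 'diameter_nm', 20.0),
            (1, 'zeta_potential_mV', -30.0),
            (1, 'viability_pct', 80.0),
            (2, 'diameter_nm', 50.0),
            (2, 'viability_pct', 60.0);
        """
    )
    return conn


def pivot_to_rows(conn, target):
    """Join the tables and pivot long records into one dict per material."""
    query = """
        SELECT m.id, m.name, r.parameter, r.value
        FROM material AS m
        JOIN measurement AS r ON r.material_id = m.id
        ORDER BY m.id
    """
    wide = {}
    for mat_id, name, parameter, value in conn.execute(query):
        wide.setdefault(mat_id, {"material": name})[parameter] = value
    # Keep only materials that have a measured value for the target.
    rows = [row for row in wide.values() if target in row]
    descriptors = sorted({key for row in rows for key in row} - {"material", target})
    header = ["material"] + descriptors + [target]
    return header, rows


def write_csv(header, rows, handle):
    """Write the wide rows as CSV; missing descriptors become empty cells."""
    writer = csv.DictWriter(handle, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    connection = build_demo_db()
    columns, records = pivot_to_rows(connection, target="viability_pct")
    buffer = io.StringIO()
    write_csv(columns, records, buffer)
    print(buffer.getvalue())
```

In this sketch a material with no value for a descriptor gets an empty cell, so the resulting file would still need imputation or row filtering before most regression methods could use it. Against a real MySQL server, the same query logic would run through a MySQL driver's connection and cursor rather than `sqlite3`.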

Record Details:

Record Type: DOCUMENT (PRESENTATION/POSTER)
Product Published Date: 06/10/2021
Record Last Revised: 07/28/2021
OMB Category: Other
Record ID: 352424