Science Inventory

Comparison of supervised vs unsupervised applicability domain measures

Citation:

Martin, T., Nate Charest, AND A. Williams. Comparison of supervised vs unsupervised applicability domain measures. SOT, Nashville, TN, March 19 - 23, 2023. https://doi.org/10.23645/epacomptox.22064690

Impact/Purpose:

Quantitative structure-activity relationship (QSAR) models provide an automated method for estimating all types of chemical-safety-relevant endpoints for data-poor chemicals. To provide robust QSAR models to inform chemical evaluation, it is important to adopt a set of modeling best practices (e.g., the OECD QSAR framework) and to clearly define domain-of-applicability approaches. In addition, there is a need to investigate cheminformatics approaches to model management and versioning to enable real-time model predictions and data provenance. The Output may include development of automated workflows to transform raw experimental data into modeling data sets and then into QSAR models. The endpoints should be consistent with Agency priorities, and may include the prediction of toxicities, in vitro bioactivities (HTT), toxicokinetics (RED), and environmental fate and physicochemical properties to support exposure modeling (RED, ETAM). Where feasible, the predictive performance of models should be compared with that of current models being used by the program offices to ensure fit-for-purpose application. Finally, this Output may include research into the interplay between dataset attributes (e.g., size, noisiness, curation level, source disparities) and model quality (predictive performance) to better estimate the uncertainty of the predictions and to provide guidance for improving QSAR modeling strategies in the future.

Description:

Proper selection of analogs for applicability domain (AD) calculations or read-across predictions is a subject of intense debate. For example, one can define similarity using a complete set of descriptors (i.e., unsupervised learning) or using only the descriptors that appear in a model for a specific endpoint (i.e., supervised learning). Calculations were performed to determine whether AD measures based on supervised learning outperform measures based on unsupervised learning. Different descriptor sets were utilized to determine the optimal descriptor set for unsupervised learning. For example, T.E.S.T. (Toxicity Estimation Software Tool) descriptors include both integer fragment counts and whole-molecule descriptors, whereas ToxPrint descriptors only include chemical fingerprints in terms of binary or integer counts. The performance was evaluated based on the test set prediction accuracy at a fixed prediction coverage (fraction of chemicals inside the applicability domain) for a series of physicochemical properties and toxicity endpoints.
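
The evaluation idea described above can be illustrated with a minimal Python sketch. It is not the presenters' implementation. Everything in it is an assumption made for illustration: fingerprints are represented as sets of on-bit indices, the AD measure is the mean Tanimoto similarity to the k nearest training chemicals, mean absolute error stands in for "prediction accuracy", and the function names (tanimoto, knn_ad_score, error_at_coverage) and toy data are hypothetical. Restricting the comparison to a subset of bits mimics the supervised case; using all bits mimics the unsupervised case.

```python
import math


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_ad_score(query, training, k=3, keep_bits=None):
    """Mean Tanimoto similarity of `query` to its k most similar training fingerprints.

    keep_bits=None compares on every bit (assumed stand-in for the unsupervised case).
    Passing a set restricts the comparison to those bits (assumed stand-in for the
    supervised case, where only descriptors used by the model count).
    """
    if keep_bits is not None:
        query = query & keep_bits
        training = [fp & keep_bits for fp in training]
    sims = sorted((tanimoto(query, fp) for fp in training), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def error_at_coverage(ad_scores, abs_errors, coverage=0.9):
    """Mean absolute error over the `coverage` fraction of test chemicals
    with the highest AD score (those treated as inside the domain)."""
    n_keep = max(1, math.ceil(coverage * len(ad_scores)))
    ranked = sorted(zip(ad_scores, abs_errors), key=lambda pair: pair[0], reverse=True)
    kept = [err for _, err in ranked[:n_keep]]
    return sum(kept) / len(kept)


# Toy data (hypothetical): three training and three test fingerprints,
# plus made-up absolute prediction errors for the test chemicals.
training_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8, 9}]
test_fps = [{1, 2, 3, 4}, {7, 8}, {10, 11}]
test_abs_errors = [0.1, 0.2, 1.5]

unsupervised_scores = [knn_ad_score(fp, training_fps, k=1) for fp in test_fps]
supervised_scores = [
    knn_ad_score(fp, training_fps, k=1, keep_bits={7, 8, 9}) for fp in test_fps
]

# Compare both AD measures at the same fixed coverage.
unsupervised_mae = error_at_coverage(unsupervised_scores, test_abs_errors, coverage=0.5)
supervised_mae = error_at_coverage(supervised_scores, test_abs_errors, coverage=0.5)
```

Under these assumptions, the AD measure that yields the lower error at the same coverage would be judged the better one. The choice of k, the similarity metric, and the error metric would all need to match the actual study before any conclusion could be drawn.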

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 03/23/2023
Record Last Revised: 04/14/2023
OMB Category: Other
Record ID: 357603