Science Inventory

Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?

Citation:

Martin, T. M., P. Harten, D. M. Young, E. N. Muratov, A. Golbraikh, H. Zhu, AND A. Tropsha. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? Journal of Chemical Information and Modeling. American Chemical Society, Washington, DC, 52(10):2570-2578, (2012).

Impact/Purpose:

The main goal was to determine whether using rational design methods to split data sets into training and prediction sets yields more predictive QSAR models or merely gives an overly optimistic estimate of predictive ability. This information is important for the Toxicity Estimation Software Tool (TEST) effort because we wanted to know the best way to split data sets into training and test sets.

Description:

Prior to using a quantitative structure-activity relationship (QSAR) model for external predictions, its predictive power should be established and validated. In the absence of a true external dataset, the best way to validate the predictive ability of a model is statistical external validation, in which the overall dataset is divided into training and test sets. Commonly, this splitting is performed using random division. Rational splitting methods can divide datasets into training and test sets in an intelligent fashion. The purpose of this study was to determine whether rational division methods lead to more predictive models than random division. A special data splitting procedure was used to facilitate the comparison between random and rational division methods. For each toxicity end point, the overall dataset was divided into a modeling set (80% of the overall set) and an external evaluation set (20% of the overall set) using random division. The modeling set was then subdivided into a training set (80% of the modeling set) and a test set (20% of the modeling set), once using rational division methods and once using random division. The Kennard-Stone, minimal test set dissimilarity, and Sphere Exclusion algorithms were used as the rational division methods. The Hierarchical Clustering, Random Forest, and k-nearest neighbor (kNN) methods were used to develop QSAR models based on the training sets. For kNN QSAR, multiple training and test sets were generated and multiple QSAR models were built. The results of this study indicate that models based on rational division methods generate better statistical results for the test sets than models based on random division, but the predictive power of both types of models is comparable.
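The two-level splitting procedure described above can be sketched roughly as follows. This is an illustrative sketch only, not the study's code: the descriptor matrix `X` is synthetic, the helper names `random_split` and `kennard_stone_split` are hypothetical, and the Kennard-Stone step follows the common textbook formulation (Euclidean distance, seeded with the two most distant points), which may differ in detail from the variant used in the paper.

```python
import numpy as np


def random_split(n_items, test_fraction, rng):
    """Randomly split indices 0..n_items-1 into (train_idx, test_idx)."""
    perm = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])


def kennard_stone_split(X, test_fraction):
    """Kennard-Stone style split (textbook formulation, an assumption here).

    The training set is grown one point at a time by picking the point
    whose nearest already-selected neighbour is farthest away; whatever
    is left over becomes the test set.
    """
    n = X.shape[0]
    n_train = n - int(round(n * test_fraction))
    # Full pairwise Euclidean distance matrix (fine for small data sets).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Seed the training set with the two most distant points.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining point to its nearest selected point.
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(sorted(selected)), np.array(sorted(remaining))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # synthetic stand-in for molecular descriptors

# Level 1: random 80/20 split into modeling and external evaluation sets.
modeling_idx, external_idx = random_split(len(X), 0.2, rng)
X_model = X[modeling_idx]

# Level 2: split the modeling set 80/20, once rationally and once randomly.
# The returned indices are positions within X_model, not within X.
ks_train, ks_test = kennard_stone_split(X_model, 0.2)
rnd_train, rnd_test = random_split(len(X_model), 0.2, rng)
```

Because the external evaluation set is drawn randomly before either second-level split is made, models built from the rational and random training sets can be compared on the same held-out compounds.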

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 10/22/2012
Record Last Revised: 03/28/2013
OMB Category: Other
Record ID: 252653