Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology
Fox, Eric W., R. Hill, S. Leibowitz, Tony Olsen, D. Thornbrugh, AND M. Weber. Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. ENVIRONMENTAL MONITORING AND ASSESSMENT. Springer, New York, NY, 189:316, (2017).
This paper is of interest to ecologists and statisticians who use variable selection in the development of random forest models. The paper develops random forest models using variable selection for predicting the biological condition of rivers and streams in the conterminous USA. The models are estimated using stream condition data from the 2008/2009 National Rivers and Streams Assessment (NRSA), and a large suite of 212 landscape features from the EPA’s StreamCat Dataset. A major application is the development of national maps displaying the predicted probability of good biological condition for the population of 1.1 million stream reaches within the NRSA sampling frame. Because the StreamCat Dataset offers so many predictor variables, constructing a defensible model requires assessing how variable selection affects the accuracy and stability of the random forest predictions. In this study, we compare a full variable set random forest model using all 212 StreamCat predictors with a reduced variable set random forest model selected using a backwards elimination approach. We also assess how changes in the number of predictor variables affect the spatial and statistical patterns in the prediction maps. Ultimately, we found that the random forest models were robust to the inclusion of many variables of moderate to low importance, and that variable selection did not lead to any significant improvement in predictive performance. Further, the backwards variable elimination procedure exhibited numerous issues such as over-optimistic accuracy estimates and instabilities in the spatial prediction maps. In addition to validating modeling decisions for the development of the prediction maps, this paper more generally aims to provide methodological insights on random forest modeling with large ecological datasets. The analysis also yields more robust predicted probabilities of stream condition.
This research is conducted under SSWR 3.01B, and is one component of product 3.01B.1: “National maps of watershed integrity and stream condition and report and webinar describing these.”
Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological datasets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used, or stepwise procedures are employed which iteratively add or remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating dataset consists of the good/poor condition of n=1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p=212) of landscape features from the StreamCat dataset. Two types of RF models are compared: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF's internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Further, the backwards elimination procedure tended to select too few variables, and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. The purpose of this work is to elucidate these issues of selection bias and instability to ecological modelers interested in using random forests to develop predictive models with large environmental datasets.
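The distinction the abstract draws between accuracy estimated on data that also informed variable selection and accuracy estimated on validation folds external to selection can be illustrated with a small sketch. This is not the authors' procedure. To stay dependency-free it swaps in a nearest-centroid classifier for the random forest and a simple univariate filter for backwards elimination. The data are synthetic noise, and every function name (`make_noise_data`, `select_top_k`, `fit_predict`, `cv_accuracy`) is hypothetical. Since the predictors carry no signal, any accuracy well above chance when selection sees all rows would reflect selection bias alone.

```python
import random
import statistics


def make_noise_data(n=60, p=200, seed=0):
    """Synthetic pure-noise data: predictors are unrelated to the labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced good (1) / poor (0) labels
    return X, y


def select_top_k(X, y, rows, k):
    """Keep the k predictors with the largest class-mean gap, using only `rows`.

    Assumption: a univariate filter standing in for backwards elimination.
    """
    def score(j):
        good = [X[i][j] for i in rows if y[i] == 1]
        poor = [X[i][j] for i in rows if y[i] == 0]
        return abs(statistics.fmean(good) - statistics.fmean(poor))

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def fit_predict(X, y, train, test, cols):
    """Nearest-centroid classifier on the selected columns (stand-in for an RF)."""
    centroids = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        centroids[c] = [statistics.fmean(X[i][j] for i in members) for j in cols]
    preds = []
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(cols, centroids[c]))
            for c in (0, 1)
        }
        preds.append(min(dist, key=dist.get))
    return preds


def cv_accuracy(X, y, k_vars=10, n_folds=5, external=True, seed=1):
    """K-fold accuracy; `external` controls whether selection sees held-out rows."""
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # External: select on training rows only. Otherwise select on all rows,
        # which leaks information from the validation fold into the model.
        selection_rows = train if external else idx
        cols = select_top_k(X, y, selection_rows, k_vars)
        preds = fit_predict(X, y, train, test, cols)
        correct += sum(pred == y[i] for pred, i in zip(preds, test))
    return correct / n


X, y = make_noise_data()
honest = cv_accuracy(X, y, external=True)
biased = cv_accuracy(X, y, external=False)
print(f"selection inside each fold: {honest:.2f}")
print(f"selection on all data first: {biased:.2f}")
```

Repeating selection inside every fold costs one extra selection run per fold. In exchange, the held-out rows never influence which predictors the model uses.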
Record Details:
Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Organization: U.S. ENVIRONMENTAL PROTECTION AGENCY
OFFICE OF RESEARCH AND DEVELOPMENT
NATIONAL HEALTH AND ENVIRONMENTAL EFFECTS RESEARCH LABORATORY
WESTERN ECOLOGY DIVISION
FRESHWATER ECOLOGY BRANCH