Science Inventory

Random forest models for the probable biological condition of streams and rivers in the USA

Citation:

Fox, Eric W., R. Hill, AND S. Leibowitz. Random forest models for the probable biological condition of streams and rivers in the USA. 2016 International Indian Statistical Association Conference, Corvallis, OR, August 18 - 21, 2016.

Impact/Purpose:

This abstract describes the development of random forest models for the probable biological condition of rivers and streams in the USA. The models are used to create maps of the predicted probability that streams and rivers are in good condition. These maps may be useful for watershed restoration and conservation efforts. We also discuss statistical issues with variable selection techniques for random forest modeling, which are of interest to ecological statisticians using random forests for predictive modeling with large datasets. This is research conducted under SSWR 3.01B and is an extra product.

Description:

The National Rivers and Streams Assessment (NRSA) is a probability-based survey conducted by the US Environmental Protection Agency and its state and tribal partners. It provides information on the ecological condition of the rivers and streams in the conterminous USA, and on the extent to which they support healthy biological condition. An important problem is the prediction of stream condition at new, unsampled locations. Using random forests (Breiman, 2001), we develop a model to predict the probability that a stream is in good (or, conversely, poor) biological condition. The model is fit to categorical response data consisting of 1365 NRSA survey sites and their designation as being in good or poor condition according to the macroinvertebrate Multimetric Index (MMI). The predictor data consist of 212 landscape features from the EPA's Stream-Catchment Dataset (Hill et al., 2015). We find that the random forest model performs well according to several performance metrics (0.75 accuracy, 0.71 sensitivity, 0.78 specificity, and 0.84 AUC). A major statistical issue that we address in this talk is whether to perform variable selection for this application of random forests. On one hand, random forest modeling is known to be robust to the inclusion of many noisy input features; on the other hand, in ecology it is often desirable to build an interpretable model. Along these lines, we compare the full model, which uses all predictors, with a random forest model fit to a reduced variable set selected by backwards elimination. We find that the random forest model with no variable reduction and minimal tuning is surprisingly robust and has cross-validated accuracy similar to that of the reduced-set model. Moreover, the backwards elimination approach can lead to issues with selection bias and instability, which we elaborate on in this talk.
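
For readers unfamiliar with the performance metrics quoted in the description, the following is a minimal sketch of how accuracy, sensitivity, specificity, and AUC can be computed from predicted probabilities of good condition. It is not the authors' code. The function name `classification_metrics`, the 0.5 threshold, the choice of "good" as the positive class, and the toy data are all assumptions made for illustration.

```python
def classification_metrics(y_true, p_good, threshold=0.5):
    """Compute accuracy, sensitivity, specificity, and AUC.

    y_true: observed labels, 1 = good condition, 0 = poor condition
            (treating "good" as the positive class is an assumption).
    p_good: predicted probabilities that each site is in good condition.
    threshold: hypothetical cutoff for calling a site "good".
    """
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_good):
        predicted_good = p >= threshold
        if predicted_good and y == 1:
            tp += 1
        elif predicted_good and y == 0:
            fp += 1
        elif not predicted_good and y == 0:
            tn += 1
        else:
            fn += 1

    # AUC as the probability that a randomly chosen good site receives a
    # higher predicted probability than a randomly chosen poor site
    # (ties count as one half).
    pos = [p for y, p in zip(y_true, p_good) if y == 1]
    neg = [p for y, p in zip(y_true, p_good) if y == 0]
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )

    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # share of good sites found
        "specificity": tn / (tn + fp),   # share of poor sites found
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy data for illustration only; these are not NRSA results.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
p_good = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
metrics = classification_metrics(y_true, p_good)
print(metrics)
```

In a cross-validated comparison such as the one described, the probabilities passed to a function like this would be out-of-fold predictions. If variable selection is used, it would need to be repeated within each fold to avoid the selection bias the abstract mentions.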

Record Details:

Record Type: DOCUMENT (PRESENTATION/SLIDE)
Product Published Date: 08/21/2016
Record Last Revised: 08/26/2016
OMB Category: Other
Record ID: 325450