

Analyzing Data



M.8. Classification and Regression Trees

M.8.1. What is Classification and Regression Tree Analysis?

Classification and regression tree (CART) analysis recursively partitions the observations in a matched data set into progressively smaller groups (De'ath and Fabricius, 2000; Prasad et al., 2006). The data set consists of a dependent variable, which is categorical for classification trees or continuous for regression trees, and one or more independent (predictor) variables. Each partition is a binary split based on a single independent variable.

CART analysis constructs a set of decision rules based on the independent variables. During each recursion, the candidate splits for each independent variable are examined, and the split that maximizes the homogeneity of the two resulting groups with respect to the dependent variable is chosen. A typical output from these analyses is shown below (Figure M.8-1).

CART example
Figure M.8-1. A tree diagram for relative abundance of lithophilous fish (i.e., fish that broadcast spawn on gravel beds) with respect to % sand and fines (% S&F, a measure of the candidate cause fine bedded sediment), and watershed area (WA) and ecoregion (normalization or classification variables). Branches are annotated with the decision rules (e.g., % sand and fines < 22.3). Nodes are annotated with summary statistics for the dependent variable (n = number of observations, x = mean value, MSE = mean squared error). Data set provided by the Minnesota Pollution Control Agency.
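The split-selection step described above can be sketched in a few lines of Python. This is a minimal illustration for a regression tree, not the CADStat implementation. It assumes that homogeneity is measured by the summed within-group squared error and that candidate thresholds are midpoints between adjacent predictor values. The function names, column names, and data values are all hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, response, predictors):
    """Return (predictor, threshold, total_sse) for the best binary split.

    rows is a list of dicts. Candidate thresholds are midpoints between
    adjacent distinct values of each predictor; the split with the
    smallest summed within-group squared error wins.
    """
    best = None
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [r[response] for r in rows if r[name] < threshold]
            right = [r[response] for r in rows if r[name] >= threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (name, threshold, total)
    return best


# Made-up values for illustration only (not the Minnesota data set).
sand = [5, 10, 15, 20, 30, 45, 60, 80]
area = [12, 90, 30, 55, 20, 70, 35, 100]
fish = [40, 44, 38, 42, 15, 12, 10, 8]
sites = [
    {"pct_sand_fines": s, "watershed_km2": a, "pct_lithophils": f}
    for s, a, f in zip(sand, area, fish)
]

print(best_split(sites, "pct_lithophils", ["pct_sand_fines", "watershed_km2"]))
```

With these made-up values the function selects pct_sand_fines with a threshold of 25.0. A full tree would apply the same search again within each of the two resulting groups.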

M.8.2. How Do I Use Classification and Regression Tree Analysis in Stressor Identification?

In general, CART can be applied effectively to the SI process in two ways: in the classification or normalization of data, and in the development of stressor-response relationships using data from other field studies. These two applications are summarized below.

CART for Classification and Normalization

CART analysis is used in data exploration to classify systems that differ due to natural causes (see Classifying Sites and Normalizing Data). Often, classification is needed to more clearly reveal stressor-response relationships. The algorithm used in CART simplifies or “prunes” the tree that contains all possible splits of the data to an optimal tree that contains a sufficient number of splits to describe the data. If environmental parameters intended to classify the data (such as ecoregions) or to normalize the data in relation to naturally occurring gradients (such as stream size) are included in the model, CART may be used to determine the relative importance of those variables for identifying homogeneous groups within the data set.
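The text does not say which pruning criterion is used. One common approach is cost-complexity pruning, which scores each candidate subtree by its error plus a penalty proportional to its number of terminal nodes, and keeps the subtree with the lowest score. The sketch below assumes that criterion. The function name, the penalty values, and the error values are hypothetical.

```python
def select_subtree(candidates, alpha):
    """Return the candidate with the lowest cost-complexity score.

    Each candidate is a dict with "leaves" (number of terminal nodes) and
    "sse" (total within-node squared error). alpha is the penalty per leaf.
    """
    return min(candidates, key=lambda c: c["sse"] + alpha * c["leaves"])


# Hypothetical nested subtrees, from a single node up to the full tree.
candidates = [
    {"leaves": 1, "sse": 1800.0},
    {"leaves": 2, "sse": 50.0},
    {"leaves": 3, "sse": 40.0},
    {"leaves": 8, "sse": 0.0},
]

for alpha in (0, 9, 15):
    chosen = select_subtree(candidates, alpha)
    print(alpha, chosen["leaves"])
```

With no penalty the full tree is retained; larger penalties favor smaller trees. In practice the penalty is usually chosen by cross-validation rather than set by hand.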

Usually, CART users are most interested in the variables selected by the model for the first few splits. In Figure M.8-1, % sand and fines was the variable selected for the first split, but that variable is a potential candidate cause, so it would not be used for classification. Drainage area (watershed area, WA, in Figure M.8-1), a natural variable that would not be a candidate cause, was the variable selected for the second split. Based on this second split, one might investigate classifying these sites by drainage area into sites greater than or less than about 40 km².

Step 4: Evaluating Data from Elsewhere

 Scatter Plot with threshold
Figure M.8-2. Scatter plot of % sand and fines and % of lithophilous fish. Based on the CART analysis depicted in Figure M.8-1, observations with % sand and fines < 22.3% are plotted as closed circles, and observations with % sand and fines > 22.3% are plotted as open circles. Linear regression (solid lines) and 90th-percentile quantile regression (dashed lines) reveal different slopes and intercepts for the two categories.

CART can be used to provide evidence for stressor-response relationships from other field studies by identifying the levels of the candidate cause at which its functional relationship with the biological response changes. If the environmental measurements representing the candidate cause are included in the model, this application may help identify inflection points or nonlinearities in a stressor-response relationship. Apparent change points can then be investigated using other techniques (e.g., regression analysis, conditional probability analysis) to determine whether they represent thresholds or other change points in the stressor-response relationship. For example, the previous CART analysis (Figure M.8-1) identified a split in the data set at % sand and fines = 22.3%. Regression analyses demonstrate that the two groups are best described by different models: the y-intercept of the mean regression line and both the intercept and slope of the 90th-percentile line decreased for sites where the percentage of sand and fines exceeded 22.3% (Figure M.8-2). After the model is derived, it would be interpreted in the same way as the results from regression or quantile regression analyses.
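As a rough illustration of the follow-up step, the sketch below fits a separate ordinary least-squares line to the observations on each side of a CART split point. It covers only the mean regression lines, not the 90th-percentile quantile regression. The function names and data values are hypothetical, and the data are constructed to be exactly linear within each group so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_group(points, threshold):
    """Fit one line below the threshold and one at or above it.

    points is a list of (x, y) pairs; each group needs at least two
    distinct x values. Returns ((slope, intercept), (slope, intercept)).
    """
    below = [(x, y) for x, y in points if x < threshold]
    above = [(x, y) for x, y in points if x >= threshold]
    return fit_line(*zip(*below)), fit_line(*zip(*above))


# Made-up (% sand and fines, % lithophilous fish) pairs, illustration only.
points = [
    (5, 47.5), (10, 45.0), (15, 42.5), (20, 40.0),
    (30, 17.0), (50, 15.0), (70, 13.0), (90, 11.0),
]

below_fit, above_fit = fit_by_group(points, 22.3)
print("below 22.3: slope, intercept =", below_fit)
print("at or above 22.3: slope, intercept =", above_fit)
```

Comparing the two fitted lines shows whether the slope or intercept differs between the groups, which is the kind of evidence summarized in Figure M.8-2.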

M.8.3. Can I Use Classification and Regression Tree Analysis with My Data?

Unlike linear regression techniques, CART analysis does not assume a particular form of relationship between the independent and dependent variables. CART can therefore often be used even in cases where data are not suitable for analysis by linear regression. The objective of CART is to create a decision tree that predicts the characteristics of the population of sites being studied, so the more sites (i.e., examples) presented to the algorithm, the better the resulting tree is likely to predict the characteristics of that population.

The CART algorithm available in CADStat from CADDIS' Get Data Analysis Tools page can be used for either classification or regression trees, and handles mixed models containing both categorical and continuous variables.

M.8.4. Helpful Tips


References

De'ath, G; Fabricius, KE. (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11):3178-3192.

Prasad, AM; Iverson, LR; Liaw, A. (2006) Random forests for modeling the distribution of tree abundances. Ecosystems 9:181-199.

