Analyzing Data
M.8. Classification and Regression Trees
- 1. What is classification and regression tree analysis?
- 2. How do I use classification and regression tree analysis in Stressor Identification?
- 3. Can I use classification and regression tree analysis with my data?
- 4. Helpful tips
- Authors
- M.B. Griffith
- All CADDIS authors, contributors, and reviewers
M.8.1. What is Classification and Regression Tree Analysis?
Classification and regression tree (CART) analysis recursively partitions a data set into progressively smaller groups, using binary splits based on single independent (predictor) variables (De'ath and Fabricius, 2000; Prasad et al., 2006). The dependent variable is categorical for classification trees and continuous for regression trees.
CART analysis constructs a set of decision rules with the independent variables. During each recursion, splits for each independent variable are examined and the split that maximizes the homogeneity of the two resulting groups with respect to the dependent variable is chosen. A typical output from these analyses is shown below (Figure M.8-1).
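The split search described above can be sketched for a regression tree as follows. This is a minimal illustration, not the CADStat implementation: it assumes homogeneity is measured by the within-group sum of squared deviations of the response, and the site data and variable names (`pct_fines`, `drainage_km2`, `taxa_richness`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    rows: List[Dict[str, float]], predictors: List[str], response: str
) -> Optional[Tuple[str, float, float]]:
    """Return (predictor, threshold, total_sse) for the most homogeneous
    binary split, or None if no predictor has two distinct values."""
    best = None
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for low, high in zip(levels, levels[1:]):
            threshold = (low + high) / 2
            left = [r[response] for r in rows if r[name] <= threshold]
            right = [r[response] for r in rows if r[name] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (name, threshold, total)
    return best


# Hypothetical sites, for illustration only.
fines = [5, 10, 15, 20, 30, 40, 50, 60]
drainage = [12, 80, 25, 60, 15, 90, 30, 70]
richness = [20, 22, 21, 19, 10, 9, 11, 8]
rows = [
    {"pct_fines": f, "drainage_km2": d, "taxa_richness": y}
    for f, d, y in zip(fines, drainage, richness)
]

split = best_split(rows, ["pct_fines", "drainage_km2"], "taxa_richness")
print(split)
```

A full tree would apply `best_split` again to each of the two resulting groups until a stopping rule is met.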
M.8.2. How Do I Use Classification and Regression Tree Analysis in Stressor Identification?
In general, CART can be applied effectively to the Stressor Identification (SI) process in two ways: in the classification or normalization of data, and in the development of stressor-response relationships using data from other field studies. These two areas are summarized below.
CART for Classification and Normalization
CART analysis is used in data exploration to classify systems that differ due to natural causes (see Classifying Sites and Normalizing Data). Often, classification is needed to more clearly reveal stressor-response relationships. The algorithm used in CART simplifies or “prunes” the tree that contains all possible splits of the data to an optimal tree that contains a sufficient number of splits to describe the data. CART may be used to determine the relative importance of classifying or normalizing variables for identifying homogeneous groups within the data set, if environmental parameters intended to classify the data (such as ecoregions) or normalize the data in relation to naturally occurring gradients (such as stream size) are included in the model.
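The text does not say which pruning rule is used. One common choice is cost-complexity pruning, which selects the subtree that minimizes its error plus a penalty proportional to its number of terminal groups (leaves). The sketch below assumes that rule; the candidate subtrees, their error values, and the penalty `alpha` are hypothetical.

```python
from typing import List, Tuple


def select_subtree(
    candidates: List[Tuple[int, float]], alpha: float
) -> Tuple[int, float]:
    """Pick the (n_leaves, error) pair that minimizes error + alpha * n_leaves."""
    return min(candidates, key=lambda c: c[1] + alpha * c[0])


# Hypothetical nested subtrees: (number of leaves, within-leaf sum of squares).
candidates = [(1, 250.0), (2, 60.0), (3, 40.0), (4, 36.0), (5, 35.0)]

# A moderate penalty drops the last splits, which barely reduce the error.
chosen = select_subtree(candidates, alpha=10.0)
print(chosen)
```

With no penalty (`alpha=0.0`) the largest tree is always chosen; increasing `alpha` favors smaller trees.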
Usually, CART users are most interested in the variables selected by the model for the first few splits. In Figure M.8-1, % sand and fines was the variable selected for the first split, but because that variable is a potential candidate cause, it would not be used for classification. Drainage area, a natural variable that would not be a candidate cause, was selected for the second split. Based on this second split, one might investigate classifying these sites by drainage area into those greater than or less than about 40 km².
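Once a split on a natural variable suggests a class boundary, the sites can be grouped and the response summarized per group. The sketch below uses the approximate 40 km² drainage-area boundary from the example; the site values and the field names `drainage_km2` and `taxa_richness` are hypothetical.

```python
from statistics import mean
from typing import Dict, List

Site = Dict[str, float]


def classify_by_threshold(
    sites: List[Site], variable: str, threshold: float
) -> Dict[str, List[Site]]:
    """Assign each site to the group at or below, or above, the threshold."""
    groups: Dict[str, List[Site]] = {"at_or_below": [], "above": []}
    for site in sites:
        key = "above" if site[variable] > threshold else "at_or_below"
        groups[key].append(site)
    return groups


# Hypothetical sites, for illustration only.
sites = [
    {"drainage_km2": 8.0, "taxa_richness": 24.0},
    {"drainage_km2": 15.0, "taxa_richness": 22.0},
    {"drainage_km2": 35.0, "taxa_richness": 23.0},
    {"drainage_km2": 55.0, "taxa_richness": 15.0},
    {"drainage_km2": 120.0, "taxa_richness": 13.0},
]

groups = classify_by_threshold(sites, "drainage_km2", 40.0)
# Mean response within each drainage-area class.
group_means = {
    name: mean(s["taxa_richness"] for s in members)
    for name, members in groups.items()
}
print(group_means)
```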
Step 4: Evaluating Data from Elsewhere
CART can be used to provide evidence for stressor-response relationships from other field studies by identifying the levels of the candidate cause at which its functional relationship with the biological response changes. This application may be used to help identify inflection points or nonlinearities in a stressor-response relationship, if the environmental measurements representing the candidate cause are included in the model. Apparent change points then can be investigated using other techniques (e.g., regression analysis, conditional probability analysis) to determine whether they represent thresholds or other change points in the stressor-response relationship. For example, the previous CART analysis (Figure M.8-1) identified a split in the data set at % sand and fines = 22.3%. Regression analyses demonstrate that the two groups are best described by different models: the y-intercept of the mean regression line, and both the intercept and slope of the 90th-percentile line, decreased for sites where the percentage of sand and fines exceeded 22.3% (Figure M.8-2). After the model is derived, it would be interpreted in the same way as the results from regression or quantile regression analyses.
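The follow-up comparison can be sketched by fitting a separate ordinary least-squares line to the sites on each side of the split. Only the 22.3% split value comes from the example above; the data are hypothetical, and the 90th-percentile (quantile regression) line is not covered here.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


SPLIT = 22.3  # % sand and fines, taken from the CART example

# Hypothetical (% sand and fines, biological response) pairs.
data = [
    (5.0, 30.0), (10.0, 29.0), (15.0, 28.0), (20.0, 27.0),
    (30.0, 20.0), (40.0, 18.0), (50.0, 16.0), (60.0, 14.0),
]

below = [(x, y) for x, y in data if x <= SPLIT]
above = [(x, y) for x, y in data if x > SPLIT]

slope_below, intercept_below = fit_line(*map(list, zip(*below)))
slope_above, intercept_above = fit_line(*map(list, zip(*above)))

print("at or below split:", slope_below, intercept_below)
print("above split:      ", slope_above, intercept_above)
```

In this made-up data set the slope is the same in both groups but the intercept is lower above the split, which is the kind of difference that would then be examined with formal regression methods.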
M.8.3. Can I Use Classification and Regression Tree Analysis with My Data?
Unlike linear regression techniques, CART analysis does not assume a particular form of relationship between the independent and dependent variables. Therefore, CART can often be used even in cases where data are not suitable for analysis by linear regression. The objective of CART is to create a decision tree that predicts the characteristics of the population of sites being studied, so the more sites (i.e., examples) presented to the algorithm, the better it will probably predict the characteristics of the population.
The CART algorithm available in CADStat from CADDIS' Get Data Analysis Tools page can be used for either classification or regression trees, and handles mixed models containing both categorical and continuous variables.
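For a classification tree with a categorical predictor, candidate splits group category levels rather than cut at a numeric threshold. The sketch below is not the CADStat algorithm. It assumes Gini impurity as the homogeneity measure and, as a simplification, considers only one-level-versus-rest splits; the ecoregion and condition values are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def gini(labels: List[str]) -> float:
    """Gini impurity: 0.0 when all labels in the group are the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_categorical_split(
    rows: List[Dict[str, str]], predictor: str, response: str
) -> Optional[Tuple[str, float]]:
    """Return (level, weighted_gini) for the best one-level-vs-rest split."""
    best = None
    n = len(rows)
    for level in sorted({r[predictor] for r in rows}):
        inside = [r[response] for r in rows if r[predictor] == level]
        outside = [r[response] for r in rows if r[predictor] != level]
        if not outside:
            continue  # only one level present, so nothing to split
        weighted = (len(inside) * gini(inside) + len(outside) * gini(outside)) / n
        if best is None or weighted < best[1]:
            best = (level, weighted)
    return best


# Hypothetical sites, for illustration only.
observations = (
    [("mountain", "good")] * 3
    + [("piedmont", "good"), ("piedmont", "good"), ("piedmont", "poor")]
    + [("coastal", "poor")] * 3
)
rows = [{"ecoregion": e, "condition": c} for e, c in observations]

result = best_categorical_split(rows, "ecoregion", "condition")
print(result)
```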
M.8.4. Helpful Tips
- Focus on the first few splits of the data.
- Splits in classifying or normalizing variables may identify more homogeneous groups within the data set.
- Splits on candidate cause variables may identify inflection points or nonlinearities in a stressor-response relationship with the biological response.
De'ath, G; Fabricius, KE. (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11):3178-3192.
Prasad, AM; Iverson, LR; Liaw, A. (2006) Random forests for modeling the distribution of tree abundances. Ecosystems 9:181-199.