

Analyzing Data



M.5. Regression Analysis

M.5.1. What is Regression Analysis?

Regression analysis is a method for quantifying the relationship between a dependent (response) variable and one or more independent (explanatory) variables.  These quantitative models can then be used to predict the value of the response variable for new values of the explanatory variables or to estimate the value of an explanatory variable needed to account for a change in the response variable.

Many different types of regression analysis are available. Two distinctions among them are particularly relevant to the analysis of biological and environmental data and are discussed here.

Parametric vs. Nonparametric Regression

Figure M.5-1. EPT taxon richness vs. metals toxic units, a measure of aggregate metals concentrations. The solid line is the linear regression model fit to the data. Dashed lines represent the 95% confidence intervals and dotted lines represent the 95% prediction intervals.

In parametric regression analysis, one explicitly specifies the functional form for the relationship between the response and explanatory variables. Then, regression is used to estimate the best values for the parameters of the model. For example, one might specify a simple linear model as follows:

$$z = b_0 + b_1 x + \varepsilon$$

where z is the observed richness of Ephemeroptera, Plecoptera, and Trichoptera (EPT) taxa at a site, x is a measure of the total metals concentration expressed as total toxic units, and ε is random sampling error. (Data for this example were collected by the EMAP Colorado Streams Assessment.) Then, b0 and b1 are parameters that would be empirically estimated by regression. In this example, regression estimates of the parameter values are b0 = 15.7 and b1 = -7.51 (Figure M.5-1). The region between the dashed lines represents the 95% confidence interval, and the region between the dotted lines is the 95% prediction interval.
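For reference, the estimates and the two kinds of interval described above are commonly computed as follows. This is the standard textbook form for ordinary least squares, assuming independent, normally distributed errors with constant variance; the source does not state how the intervals in Figure M.5-1 were actually computed. The symbols n (number of sites), x̄ and z̄ (sample means), S_xx, s (residual standard error), x_0 (the value of x at which an interval is evaluated), and t (a quantile of Student's t distribution) are introduced here for illustration.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{S_{xx}}, \qquad
b_0 = \bar{z} - b_1 \bar{x} \\[4pt]
s^2 &= \frac{1}{n-2} \sum_{i=1}^{n} \left( z_i - b_0 - b_1 x_i \right)^2 \\[4pt]
\text{95\% confidence interval at } x_0 &: \quad
(b_0 + b_1 x_0) \pm t_{0.975,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \\[4pt]
\text{95\% prediction interval at } x_0 &: \quad
(b_0 + b_1 x_0) \pm t_{0.975,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\end{aligned}
```

The confidence interval describes uncertainty in the fitted mean response, while the prediction interval also includes the scatter of individual observations around that mean. The extra 1 under the square root is why the prediction interval is always the wider of the two.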

Figure M.5-2. Total taxon richness versus percent catchment agriculture in the Mid-Atlantic Highlands. The dashed line shows a linear regression model fit to the data; the solid line shows a nonparametric, loess regression model fit to the data.

In nonparametric regression analysis, both the functional form and the parameter values are estimated from the data. The model is constrained a priori only by a “smoothness” parameter that limits how variable the fitted curve can be. Commonly used nonparametric regression techniques include loess regression and classification and regression trees. More data are generally required for nonparametric regressions because both model parameters and structure are estimated from the data. Nonparametric regressions are especially useful when the functional form of the relationship between response and explanatory variables is not known (Figure M.5-2).
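To make the idea of a smoothness parameter concrete, the following is a minimal sketch of a loess-style smoother: at each point it fits a straight line to the nearest fraction of the data (`span`), weighting closer points more heavily. It is a simplified illustration and not a full loess implementation (no robustness iterations, local linear fits only). The function names and the data values are hypothetical and are not the Mid-Atlantic Highlands data.

```python
def tricube(u):
    """Tricube kernel: weight is 1 at u = 0 and falls to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Local weighted linear fit at x0 using the nearest `span` fraction of points."""
    n = len(xs)
    k = min(n, max(3, int(round(span * n))))       # neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1]     # bandwidth: k-th nearest distance
    if h == 0.0:
        h = 1.0                                    # degenerate window; any width works
    ws = [tricube((x - x0) / h) for x in xs]
    sw = sum(ws)
    if sw == 0.0:
        return sum(ys) / n                         # fallback: plain mean
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:
        return ybar                                # no spread in x; use weighted mean
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)


# Hypothetical data: percent catchment agriculture vs. total taxon richness.
pct_agriculture = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80]
richness = [32, 33, 31, 30, 26, 22, 20, 19, 19, 18]

# A smaller span follows the data more closely; a larger span gives a smoother curve.
smooth_curve = [loess_point(x, pct_agriculture, richness, span=0.5)
                for x in pct_agriculture]
```

Here `span` plays the role of the smoothness parameter: it is the only thing specified in advance, and the shape of the curve comes from the data.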

Simple Linear Regression vs. Generalized Linear Regression

Both simple linear and generalized linear regression model the response through a linear function of the model parameters; in generalized linear regression, this linear function is related to the expected response by a link function. Simple linear regression additionally assumes that the response variable is continuous and normally distributed, and many types of biological data do not satisfy these two assumptions. For example, total taxon richness is a count and takes only non-negative integer values. Similarly, relative abundance is a proportion and is constrained to values between zero and one. Response variables can sometimes be transformed so that their distributions are closer to normal. However, generalized linear regression allows one to model non-normally distributed responses directly and can often provide more realistic and more easily interpreted models of the relationships between biological responses and explanatory variables.

In the example shown in Figure M.5-3, a simple linear model describing the relationship between the EPT proportion of total taxa and total nitrogen predicts negative values of the response variable when log10(total nitrogen) exceeds 4 (total nitrogen above 10,000 μg/L). In contrast, the predicted values of the response variable approach zero asymptotically at high nitrogen concentrations when modeled with generalized linear regression.

Figure M.5-3. EPT proportion of total taxon richness plotted versus log total nitrogen. Solid line: generalized linear regression fit to the data; dashed line: simple linear regression fit.
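The contrast between the two fits can be sketched in a few lines of Python. The example below assumes a binomial generalized linear model with a logit link, fitted by iteratively reweighted least squares; the source does not say which family or link was used for Figure M.5-3, so this is one plausible choice rather than the original analysis. All data values and function names are hypothetical.

```python
import math


def fit_line(xs, ys, ws=None):
    """Weighted least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    if ws is None:
        ws = [1.0] * len(xs)
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def fit_binomial_glm(xs, successes, totals, n_iter=25):
    """Logit-link binomial GLM fitted by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * x for x in xs]                  # linear predictor
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]   # fitted proportions
        ws = [n * m * (1.0 - m) for n, m in zip(totals, mu)]   # working weights
        zs = [e + (s / n - m) / (m * (1.0 - m))          # working response
              for e, s, n, m in zip(eta, successes, totals, mu)]
        b0, b1 = fit_line(xs, zs, ws)
    return b0, b1


# Hypothetical data: log10 total nitrogen, EPT taxa counted, total taxa counted.
log_tn = [2.0, 2.5, 3.0, 3.5, 4.0]
ept = [15, 12, 8, 4, 1]
total = [30, 30, 30, 30, 30]
props = [e / t for e, t in zip(ept, total)]

x_new = 4.5                                   # beyond the observed nitrogen range
lin_b0, lin_b1 = fit_line(log_tn, props)
linear_pred = lin_b0 + lin_b1 * x_new         # can fall below zero

glm_b0, glm_b1 = fit_binomial_glm(log_tn, ept, total)
glm_pred = 1.0 / (1.0 + math.exp(-(glm_b0 + glm_b1 * x_new)))   # stays in (0, 1)
```

Because the logit link maps any value of the linear predictor into the interval from zero to one, the generalized linear model cannot predict an impossible proportion, whereas the straight line eventually does.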

M.5.2. How Do I Use Regression Analysis in Stressor Identification?

Regression analysis can help visualize the relationships between different stressor variables and the biological response. Scatter plots with superimposed regression curves can therefore help identify candidate causes and help determine whether a particular candidate cause exhibits a stressor-response relationship that supports the case for that stressor.

Regression analysis is also used whenever one needs to quantify the relationship between two or more observed variables. As such, regression analysis is the foundation for many more involved analytical tools. For example, regression analysis is used to estimate the distribution of the tolerances of different species to a particular toxicant (see Species Sensitivity Distributions), to help control for natural variability in biological and environmental variables (see Normalizing Data), and to estimate taxon-environment relationships (see Predicting Environmental Conditions from Biological Observations).

Step 3: Evaluating Data from the Case

Regression analysis of data from the case can provide evidence of spatial/temporal co-occurrence by establishing whether a potential stressor is elevated at the impaired site. However, some environmental factors vary strongly due to natural causes, and this natural variation must be accounted for before deciding whether values observed at an impaired site exceed natural expectations. For example, stream temperature in Oregon varies strongly across an elevation gradient, so we use regression analysis to quantify the relationship between stream temperature and elevation at least-impaired reference sites (Figure M.5-4). Then, we can compare the observed temperature at the impaired site with the temperature predicted by the regression model. In the example shown here, the observed stream temperature at Site A is much warmer than predicted and falls outside the 95% prediction interval of the regression model. Thus, the candidate cause, elevated temperature, co-occurs with the biological impairment at this site. Conversely, the observed temperature at Site B falls within the prediction interval, and therefore elevated temperature does not co-occur with the impairment at this site. Temperature may also vary over time, so whether elevated temperature preceded the biological response should also be considered when evaluating spatial/temporal co-occurrence.

Figure M.5-4. Seven-day average maximum temperature (7DAMT) plotted versus elevation at reference sites in Oregon. The solid line shows the linear regression fit to the data; dashed lines show the 95% prediction intervals. Points labeled “A” and “B” show observed 7DAMT and elevation at two hypothetical impaired sites.
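The comparison of an impaired site against reference expectations can be sketched as follows. The example fits a simple linear regression to reference-site data, computes the 95% prediction interval at the elevation of each test site, and reports whether the observed temperature falls outside it. It assumes ordinary least squares with normally distributed errors. The reference data, the two test sites, and the function names are hypothetical, not the Oregon data; the critical value 2.306 is the 0.975 quantile of Student's t distribution with 8 degrees of freedom (10 reference sites minus 2 fitted parameters).

```python
import math


def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, upper) prediction bounds at x0 for a simple linear fit."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))                  # residual standard error
    half = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx)
    pred = b0 + b1 * x0
    return pred - half, pred + half


def outside_interval(observed, bounds):
    """True if the observed value lies outside the (lower, upper) bounds."""
    lower, upper = bounds
    return not (lower <= observed <= upper)


# Hypothetical reference sites: elevation (m) and 7DAMT (degrees C).
ref_elev = [100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, 1900]
ref_temp = [22.5, 21.0, 20.8, 19.1, 18.6, 17.0, 16.9, 15.2, 14.8, 13.1]
T_CRIT = 2.306  # t(0.975, df = 8)

# Hypothetical impaired sites: (elevation, observed 7DAMT).
site_a = (600, 24.5)
site_b = (1400, 16.3)

a_elevated = outside_interval(site_a[1],
                              prediction_interval(ref_elev, ref_temp, site_a[0], T_CRIT))
b_elevated = outside_interval(site_b[1],
                              prediction_interval(ref_elev, ref_temp, site_b[0], T_CRIT))
```

With these made-up values, Site A falls outside the interval and Site B falls inside it, mirroring the pattern described for Figure M.5-4. The prediction interval, rather than the confidence interval, is the relevant one here because a single new observation is being compared with the reference model.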

Step 4: Evaluating Data from Elsewhere

Regression analysis may also be used to generate evidence of stressor-response relationships from laboratory data and from other field studies. In laboratory studies and field manipulations, regression analysis is frequently used to fit a quantitative relationship between observed biological responses and different levels of exposure to a stressor. Examples of these types of models can be found in the Metals Chronic Concentration-Response Gallery and the Metals Species Sensitivity Distribution Gallery, both of which are in the CADDIS Databases section. In some SI cases, these models can then be used to predict the expected magnitude of the biological response at the impaired site, given the observed stressor level. The EPA's Benchmark Dose Software was specifically designed for regression analysis of laboratory data. It fits 17 functions to exposure-response data and provides confidence bounds and multiple statistics for comparing alternative fitted models.

In certain cases, regression analysis of data from other field studies can provide a quantitative model that estimates the level of biological impairment one would expect, given a certain level of stressor. Such a model can then be used to establish whether the degree of impairment observed at the site of interest is comparable to that observed in other field studies. In general, many different factors can influence biological condition at sites in other field studies. Hence, this type of evidence will usually require a multivariate model that incorporates the effects of the most relevant environmental factors.
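A multivariate model of this kind can be sketched as a multiple linear regression with one stressor and one natural covariate. The example below solves the least-squares normal equations directly in plain Python; it is a minimal illustration under the assumption that a linear model is adequate, and the variable choices (metals toxic units and elevation as predictors, EPT richness as the response), the data, and the function names are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + sum(bj * xj)."""
    design = [[1.0] + list(r) for r in rows]         # add intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical field-study data: (metals toxic units, elevation in km) per site.
predictors = [(0.1, 1.2), (0.3, 2.0), (0.5, 1.5), (0.8, 2.4), (1.0, 1.1), (1.3, 2.2)]
ept_richness = [17, 18, 14, 13, 9, 10]

coef = fit_multiple(predictors, ept_richness)

# Expected richness at a hypothetical site of interest (0.6 toxic units, 1.8 km).
expected = coef[0] + coef[1] * 0.6 + coef[2] * 1.8
```

The expected value can then be compared with the richness actually observed at the site of interest. In practice a statistics package would be used so that standard errors and diagnostics are also available.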

M.5.3. Can I Use Regression Analysis with My Data?

Regression analysis can be applied to virtually any type of data. Data for response and explanatory variables should be matched in space and time, although the means by which this matching is accomplished will differ for different data sets and different environmental and biological variables. For example, a single instantaneous measurement of temperature collected at the same time as a biological sample may be “matched” in time, but the more relevant temperature measurement may be a long-term average temperature at the same site.
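One simple way to build a time-averaged explanatory variable is to average the daily record over a window ending on the biological sample date. The sketch below assumes a daily temperature record keyed by date; the function name, the 30-day default window, and the data are hypothetical, and the appropriate window depends on the variable and the organisms involved.

```python
from datetime import date, timedelta


def window_mean(daily, sample_date, days=30):
    """Mean of the daily values over the `days` days ending on sample_date.

    Days missing from the record are skipped; an empty window is an error.
    """
    values = []
    for i in range(days):
        day = sample_date - timedelta(days=i)
        if day in daily:
            values.append(daily[day])
    if not values:
        raise ValueError("no temperature records in the requested window")
    return sum(values) / len(values)


# Hypothetical daily mean stream temperatures (degrees C) for one site.
start = date(2020, 7, 1)
daily_temp = {start + timedelta(days=d): 15.0 + 0.1 * d for d in range(40)}

sample_day = date(2020, 8, 9)                    # date of the biological sample
instantaneous = daily_temp[sample_day]           # value on the sample date only
long_term = window_mean(daily_temp, sample_day)  # 30-day average before sampling
```

Either value is matched in time to the sample, but they answer different questions: the first describes conditions on the sampling day, the second describes the conditions the community experienced in the weeks before it.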

M.5.4. Helpful Tips

