Analyzing Data
M.5. Regression Analysis
- 1. What is regression analysis?
- 2. How do I use regression analysis in Stressor Identification?
- 3. Can I use regression analysis with my data?
- 4. Helpful tips
- Authors
- G.W. Suter, II
- P. Shaw-Allen
- L.L. Yuan
- S.M. Cormier
- All CADDIS authors, contributors, and reviewers
M.5.1. What is Regression Analysis?
Regression analysis is a method for quantifying the relationship between a dependent (response) variable and one or more independent (explanatory) variables. These quantitative models can then be used to predict the value of the response variable for new values of the explanatory variables or to estimate the value of an explanatory variable needed to account for a change in the response variable.
Many different types of regression analysis are available. Distinctions between two types of regression are particularly relevant to the analysis of biological and environmental data and are discussed here.
Parametric vs. Nonparametric Regression
In parametric regression analysis, one explicitly specifies the functional form for the relationship between the response and explanatory variables. Then, regression is used to estimate the best values for the parameters of the model. For example, one might specify a simple linear model as follows:
$$ z = b_0 + b_1 x + \varepsilon $$
where z is the observed richness of Ephemeroptera, Plecoptera, and Trichoptera (EPT) taxa at a site, x is a measure of the total metals concentration expressed as total toxic units, and ε is random sampling error. (Data for this example were collected by the EMAP Colorado Streams Assessment.) Then, b0 and b1 are parameters that would be empirically estimated by regression. In this example, regression estimates of the parameter values are b0 = 15.7 and b1 = -7.51 (Figure M.5-1). The region between the dashed lines represents the 95% confidence interval, and the region between the dotted lines is the 95% prediction interval.
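As a rough illustration of how b0 and b1 might be estimated, the sketch below fits the simple linear model by ordinary least squares. The data values and the names `metals_tu`, `ept_richness`, and `fit_simple_linear` are hypothetical and chosen only for this example; they are not the EMAP Colorado measurements, so the fitted values will differ from those reported above.

```python
# Hypothetical data for illustration only; these are NOT the EMAP Colorado values.
metals_tu = [0.1, 0.3, 0.5, 0.8, 1.0, 1.3, 1.6, 2.0]   # x: total toxic units
ept_richness = [16, 13, 12, 9, 9, 6, 4, 1]              # z: EPT taxon richness


def fit_simple_linear(x, z):
    """Return (b0, b1) that minimize the squared residuals of z = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    z_mean = sum(z) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxz = sum((xi - x_mean) * (zi - z_mean) for xi, zi in zip(x, z))
    b1 = sxz / sxx                 # slope
    b0 = z_mean - b1 * x_mean      # intercept
    return b0, b1


b0, b1 = fit_simple_linear(metals_tu, ept_richness)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# Use 1: predict the response for a new value of the explanatory variable.
new_x = 1.2
print(f"predicted EPT richness at x = {new_x}: {b0 + b1 * new_x:.1f}")

# Use 2: estimate the explanatory value that would account for a given response.
target_richness = 10
print(f"x associated with richness {target_richness}: {(target_richness - b0) / b1:.2f}")
```

The two uses at the end correspond to the two applications described in M.5.1: predicting the response, and solving the fitted line for the explanatory variable.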
In nonparametric regression analysis, both the functional form and the parameter values are estimated from the data. The model is constrained a priori only by a “smoothness” parameter that specifies some maximum degree of variability for the fitted curve. Commonly used nonparametric regression techniques include loess regression and classification and regression trees. More data are generally required for nonparametric regressions because both model parameters and structure are estimated from the data. In many cases the functional form of the relationship between response and explanatory variables is not known, so nonparametric regressions can provide useful information (Figure M.5-2).
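To make the role of the smoothness parameter concrete, the following sketch implements a simplified loess-style smoother: at each target point it fits a weighted straight line to the nearest fraction of the data. This is a minimal illustration, not the full loess algorithm (it omits the robustness iterations), and the function names, the `span` parameter, and the data are assumptions made for this example.

```python
def tricube(u):
    """Tricube kernel: weight is 1 at u = 0 and falls to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_linear_smooth(x, y, x0, span=0.5):
    """Estimate the response at x0 from a weighted linear fit to nearby points.

    span is the fraction of the data used in each local fit. It plays the role
    of the smoothness parameter: larger values give smoother curves.
    """
    n = len(x)
    k = min(n, max(3, int(round(span * n))))       # neighbours in the window
    dists = sorted(abs(xi - x0) for xi in x)
    # Bandwidth is the distance to the k-th nearest point, inflated slightly
    # so that this point keeps a small positive weight.
    h = dists[k - 1] * 1.001 + 1e-12
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:                                   # all weight on one x value
        return y_mean
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    return y_mean + (sxy / sxx) * (x0 - x_mean)


# Hypothetical curved stressor-response data, for illustration only.
stressor = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
response = [20, 19, 19, 17, 13, 9, 6, 5, 5, 4]

for x0 in (1.0, 2.0, 3.0):
    wiggly = local_linear_smooth(stressor, response, x0, span=0.3)
    smooth = local_linear_smooth(stressor, response, x0, span=0.9)
    print(f"x = {x0}: span 0.3 -> {wiggly:.1f}, span 0.9 -> {smooth:.1f}")
```

No functional form is specified anywhere in this code; the shape of the fitted curve comes entirely from the data and from the choice of `span`.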
Simple Linear Regression vs. Generalized Linear Regression
Both simple linear and generalized linear regression assume that the response variable is a linear function of the model parameters. However, simple linear regression also assumes that the response variable is continuous and normally distributed, and many types of biological data do not satisfy these two assumptions. For example, total taxon richness is a counted variable and only occurs as integer, non-negative values. Similarly, relative abundance is a proportion and is constrained to values ranging from zero to one. Response variables can occasionally be transformed to more normal distributions. However, generalized linear regression allows one to directly model non-normally distributed responses and can often provide more realistic and easily interpreted models of the relationships between biological responses and explanatory variables.
In the example shown below, a simple linear model describing the relationship between the EPT proportion of total taxa and total nitrogen predicts negative values of the response variable when log10(total nitrogen) exceeds 4 (10,000 μg/L) (Figure M.5-3). In contrast, the predicted values of the response variable approach zero asymptotically at high nitrogen concentrations when modeled with generalized linear regression.
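The contrast can be sketched in code. The example below assumes a binomial response with a logit link, which is one common choice for proportions; the source does not state which family or link was used for Figure M.5-3. The data and all function names are hypothetical, and the generalized linear model is fitted with a basic Newton-Raphson loop rather than a statistics library.

```python
import math

# Hypothetical data for illustration only (not the data behind Figure M.5-3).
log_tn = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]     # log10 total nitrogen (ug/L)
total_taxa = [30, 28, 25, 24, 20, 18]       # taxa counted at each site
ept_taxa = [15, 12, 8, 5, 2, 1]             # of which EPT taxa


def inv_logit(eta):
    """Map a linear predictor onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = a0 + a1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return y_mean - a1 * x_mean, a1


def fit_binomial_logit(x, successes, trials, iterations=25):
    """Maximum-likelihood fit of logit(p) = c0 + c1 * x by Newton-Raphson."""
    c0, c1 = 0.0, 0.0
    for _ in range(iterations):
        grad0 = grad1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = inv_logit(c0 + c1 * xi)
            w = ni * p * (1 - p)            # binomial variance weight
            grad0 += yi - ni * p
            grad1 += (yi - ni * p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system H * delta = gradient.
        c0 += (h11 * grad0 - h01 * grad1) / det
        c1 += (h00 * grad1 - h01 * grad0) / det
    return c0, c1


proportions = [e / t for e, t in zip(ept_taxa, total_taxa)]
a0, a1 = fit_simple_linear(log_tn, proportions)
c0, c1 = fit_binomial_logit(log_tn, ept_taxa, total_taxa)

for x_new in (3.0, 5.0, 6.0):
    linear_pred = a0 + a1 * x_new
    glm_pred = inv_logit(c0 + c1 * x_new)
    print(f"log10(TN) = {x_new}: linear = {linear_pred:.3f}, logit GLM = {glm_pred:.3f}")
```

With these made-up data the straight line drops below zero when extrapolated to high nitrogen values, while the logit model can only approach zero because the inverse link is bounded between zero and one.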
M.5.2. How Do I Use Regression Analysis in Stressor Identification?
Regression analysis can help visualize the relationships between different stressor variables and the biological response. Scatter plots with superimposed regression curves can therefore help identify candidate causes and help determine whether a particular candidate cause exhibits a stressor-response relationship that supports the case for that stressor.
Regression analysis is also used whenever one needs to quantify the relationship between two or more observed variables. As such, it is the foundational method for many more involved analytical tools. For example, regression analysis is used to estimate the distribution of the tolerances of different species to a particular toxicant (see Species Sensitivity Distributions), to help control for natural variability (see Normalizing Data) in biological and environmental variables, and to estimate taxon-environment relationships (see Predicting Environmental Conditions from Biological Observations).
Step 3: Evaluating Data from the Case
Regression analysis of data from the case can provide evidence of spatial/temporal co-occurrence by establishing whether a potential stressor is elevated at the impaired site. However, some environmental factors vary strongly due to natural causes, and these natural variations must be considered before determining whether values at an impaired site exceed natural expectations. For example, stream temperature in Oregon varies strongly across an elevation gradient, so we use regression analysis to quantify the relationship between stream temperature and elevation at least-impaired reference sites (Figure M.5-4). Then, we can compare the observed temperature at the impaired site with the temperature predicted by the regression model. In the example shown here, the observed stream temperature at Site A is much warmer than predicted and falls outside the 95% prediction interval of the regression model. Thus, the candidate cause, elevated temperature, co-occurs with the biological impairment at this site. Conversely, the observed temperature at Site B falls within the prediction interval, and therefore elevated temperature does not co-occur with the impairment at this site. Temperature may also vary over time, so the possibility that elevated temperature preceded a biological response should be considered when evaluating spatial/temporal co-occurrence.
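The comparison against a prediction interval can be sketched as follows. The reference-site values, the two test temperatures, and the function name `prediction_interval` are hypothetical and are not the Oregon data in Figure M.5-4. The critical value 2.306 is the two-sided 95% Student's t quantile for 8 degrees of freedom (10 reference sites minus 2 fitted parameters); for a different number of sites it would need to be looked up again.

```python
import math

# Hypothetical reference-site data for illustration only.
elevation_m = [100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, 1900]
temp_c = [19.5, 18.9, 17.2, 16.8, 15.1, 14.6, 13.0, 12.2, 11.1, 9.8]


def prediction_interval(x, y, x0, t_crit):
    """Return (predicted, lower, upper) for one new observation at x0.

    Uses the standard simple-linear-regression prediction interval:
    y_hat +/- t_crit * s * sqrt(1 + 1/n + (x0 - x_mean)^2 / Sxx).
    """
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                   # residual standard error
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_mean) ** 2 / sxx)
    predicted = b0 + b1 * x0
    return predicted, predicted - half_width, predicted + half_width


T_CRIT_95_DF8 = 2.306   # t quantile for n - 2 = 8 degrees of freedom

# Two hypothetical impaired sites at the same elevation.
for site, observed in (("Site A", 19.0), ("Site B", 14.0)):
    predicted, lower, upper = prediction_interval(elevation_m, temp_c, 1200, T_CRIT_95_DF8)
    elevated = observed > upper
    print(f"{site}: observed {observed} C, expected {predicted:.1f} C "
          f"[{lower:.1f}, {upper:.1f}], elevated: {elevated}")
```

The prediction interval, rather than the narrower confidence interval, is the relevant band here because the question concerns a single new site rather than the mean temperature across many sites.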
Step 4: Evaluating Data from Elsewhere
Regression analysis may also be used to generate evidence of stressor-response relationships from laboratory data and from other field studies. It is frequently used to fit a quantitative relationship between observed biological responses and different levels of exposure to a stressor in laboratory studies and in field manipulations. Examples of these types of models can be found in the Metals Chronic Concentration-Response Gallery and the Metals Species Sensitivity Distribution Gallery, which are both found in the CADDIS Databases section. In some SI cases, these models can then be used to predict the expected magnitude of the biological response at the impaired site, given the observed stressor level. The EPA's Benchmark Dose Software was specifically designed for regression analysis of laboratory data. It fits 17 functions to exposure-response data and provides confidence bounds and multiple statistics for comparison of alternative fitted models.
In certain cases, regression analysis of data from other field studies can provide a quantitative model that estimates the level of biological impairment one would expect, given a certain level of stressor. Such a model can then be used to establish whether the degree of impairment observed at the site of interest is comparable to that observed in other field studies. In general, many different factors can influence biological condition at sites in other field studies. Hence, this type of evidence will usually require a multivariate model that incorporates the effects of the most relevant environmental factors.
M.5.3. Can I Use Regression Analysis with My Data?
Regression analysis can be applied to virtually any type of data. Data for response and explanatory variables should be matched in space and time, although the means by which this matching is accomplished will differ for different data sets and different environmental and biological variables. For example, a single instantaneous measurement of temperature collected at the same time as a biological sample may be “matched” in time, but the more relevant temperature measurement may be a long-term average temperature at the same site.
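The two matching strategies mentioned above might look like this in practice. The record layouts, site names, dates, and function names are all hypothetical; real data sets would need their own matching rules.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (site, date, stream temperature in degrees C).
temperature_log = [
    ("A", "2005-06-01", 12.0), ("A", "2005-07-01", 15.0), ("A", "2005-08-15", 18.0),
    ("B", "2005-06-01", 10.0), ("B", "2005-08-16", 14.0),
]
# Hypothetical biological samples: (site, date, EPT richness).
bio_samples = [("A", "2005-08-15", 12), ("B", "2005-08-16", 7)]


def match_instantaneous(bio, temps):
    """Pair each biological sample with the temperature taken at the same site and date."""
    lookup = {(site, date): t for site, date, t in temps}
    return [(site, date, metric, lookup.get((site, date))) for site, date, metric in bio]


def match_long_term_mean(bio, temps):
    """Pair each biological sample with the mean of all temperatures at its site."""
    by_site = defaultdict(list)
    for site, _date, t in temps:
        by_site[site].append(t)
    return [
        (site, date, metric, mean(by_site[site]) if by_site[site] else None)
        for site, date, metric in bio
    ]


print(match_instantaneous(bio_samples, temperature_log))
print(match_long_term_mean(bio_samples, temperature_log))
```

Either result can serve as the explanatory variable in a regression; which one is more relevant depends on how the organisms respond to the stressor over time.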
M.5.4. Helpful Tips
- Regression analysis is frequently used to statistically test whether a hypothesized relationship exists between two variables. These statistical tests examine whether parameter values differ from zero by comparing the magnitude of the parameter with the magnitude of the uncertainty in the estimate. In causal analysis we are usually most interested in the accuracy with which a regression model can be used to predict a biological response, given a change in the environmental conditions. Significance tests do not provide information regarding predictive accuracy and are therefore generally not useful in causal analysis.
- The CADDIS Databases section includes examples of regression analyses in a downloadable gallery of laboratory concentration-response relationships for metals and an on-line Field Stressor-Response Association Gallery of regressions of community metrics on levels of sediments and metals.
- Before selecting a method for regression analysis, examine the data to ensure that they meet the assumptions for that method.
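The distinction drawn in the first tip, between testing whether a slope differs from zero and measuring how well a model predicts, can be illustrated with a short sketch. Leave-one-out cross-validation is used here as one common way to estimate predictive accuracy; it is an assumption of this example and is not prescribed by the text above. The data and function names are hypothetical.

```python
import math

# Hypothetical stressor-response data for illustration only.
stressor = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.1, 2.5]
response = [15, 14, 11, 12, 8, 7, 5, 2]


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def slope_t_statistic(x, y):
    """Slope divided by its standard error: the quantity a significance test examines."""
    n = len(x)
    b0, b1 = fit_simple_linear(x, y)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1 / se_b1


def loocv_rmse(x, y):
    """Root-mean-square error when each point is predicted from all the others."""
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        b0, b1 = fit_simple_linear(x_train, y_train)
        squared_errors.append((y[i] - (b0 + b1 * x[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(f"slope t statistic: {slope_t_statistic(stressor, response):.1f}")
print(f"leave-one-out RMSE (response units): {loocv_rmse(stressor, response):.2f}")
```

The t statistic indicates only how distinguishable the slope is from zero, whereas the cross-validated error is expressed in the units of the biological response and speaks directly to how far a prediction might be from an observation.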