

Analyzing Data



M.2. Correlation Analysis

M.2.1. What is Correlation?

Correlation is a method for measuring the degree of association between two variables in a matched data set. The Pearson product-moment correlation coefficient (r) is a unitless value between -1 and 1 that measures the degree of linear association between the variables. Its nonparametric counterpart, the Spearman rank-order correlation coefficient (ρ, rho, pronounced "row"), is computed from the ranks of the data and does not assume that the relationship is linear, only that it is monotonic. Kendall's tau (τ) has the same underlying assumptions as Spearman's rank-order correlation coefficient, but represents the difference between the probability that a pair of observations is concordant (ordered the same way on both variables) and the probability that it is discordant (ordered in opposite ways).
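The three coefficients can be compared on a small data set. The sketch below uses only the Python standard library; the function names and the data are illustrative assumptions rather than part of this guidance, and the Kendall calculation is the simple tau-a form, which makes no correction for ties.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


# Hypothetical matched data: y always increases with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
tau = kendall_tau(x, y)
print(f"r = {r:.3f}, rho = {rho:.3f}, tau = {tau:.3f}")
```

Because y rises with x at every step, the two rank-based coefficients equal 1, while r falls short of 1 because the relationship is curved.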

Figure M.2-1. Examples of different correlations between two variables, x and y: (A) with an r value of -0.8, the band of points indicates a decrease in y with an increase in x; (B) with an r value of 0.1, points are diffusely scattered throughout the plot area; (C) with an r value of 0.3, the points indicate a weak increase in y with an increase in x, or perhaps a nonlinear relationship; and (D) with an r value of 0.8, the band of points indicates an increase in y with an increase in x.

A value of r, ρ or τ is interpreted as follows:

However, as we were all taught, correlation does not prove causation: an observed correlation may be due to confounding or error. Thus, correlation coefficients are only suggestive. In addition, a small Pearson product-moment coefficient may be due to nonlinearity (Figure M.2-1, C) rather than to a lack of association (Figure M.2-1, B). Therefore, scatter plots should be examined for nonlinearity as well as to identify outliers or unduly influential data.
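The effect of nonlinearity on r can be seen in a minimal sketch. The U-shaped data below are invented for illustration: y is completely determined by x, yet the Pearson coefficient over the full range is zero, and only the two halves taken separately show the strong association.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical U-shaped relationship: y depends entirely on x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_all = pearson_r(x, y)            # whole range
r_left = pearson_r(x[:4], y[:4])   # descending arm only
r_right = pearson_r(x[3:], y[3:])  # ascending arm only

print(f"all: {r_all:.2f}, left arm: {r_left:.2f}, right arm: {r_right:.2f}")
```

A scatter plot of these data would show the pattern at once, which is why the coefficient alone should not be taken as evidence that no association exists.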



M.2.2. How Do I Use Correlation in Stressor Identification?

Correlation analysis is used primarily as a data exploration technique to reveal the degree of association in a set of matched data. The matched data may represent:

Correlation results can be presented as a matrix with different categories of data for the rows and columns (Table M.2-1). Analysis of all variables may be performed to reveal all possible relationships. This is particularly important in revealing potentially confounding relationships prior to stressor-response modeling.

Table M.2-1. A correlation matrix for the measures of candidate causes (letters) and biological effects (numbers). The cell values are Pearson correlation coefficients (r).
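A matrix of this kind could be assembled along the following lines. This is a sketch only: the cause labels (A, B), effect labels (1, 2), and all values are made up for illustration and do not reproduce the contents of Table M.2-1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical matched measurements at six sites.
causes = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [4, 1, 6, 2, 5, 3],
}
effects = {
    "1": [12, 10, 8, 6, 4, 2],
    "2": [3, 5, 4, 8, 7, 9],
}

# Rows are candidate causes, columns are biological effects.
matrix = {
    cause: {effect: pearson_r(cv, ev) for effect, ev in effects.items()}
    for cause, cv in causes.items()
}

print("     " + "".join(f"{effect:>8}" for effect in effects))
for cause, row in matrix.items():
    print(f"{cause:<5}" + "".join(f"{row[effect]:>8.2f}" for effect in effects))
```

Adding the causes to the columns as well (or the effects to the rows) would give the all-variables matrix described above, which also shows correlations among the candidate causes themselves.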

Step 2: Listing Candidate Causes

When identifying candidate causes, one may correlate several measurements that relate potential causal stressors to changes in a biological attribute characterizing the impairment. Relatively large correlation coefficients with appropriate signs (positive or negative) can suggest that a candidate causal stressor should be included. However, correlations should not be used to remove a candidate causal stressor from the list if it is mechanistically plausible or if other evidence supports its inclusion. A correlation may be low because the relationship is nonlinear but was assessed with the parametric Pearson correlation coefficient, or because it is obscured by sampling error or confounding variables.

Correlation also may be used to help in choosing measurements of environmental parameters to represent the candidate causal stressor in the analysis. For example, deposited sediment may be represented by several different measurements (e.g., % fines, embeddedness, median particle size). The measure best correlated with the biological attributes used to define the impairment could be used to represent the candidate causal stressor. To select measurements for environmental parameters, regional data are usually used, because regional data sets are larger and capture a broader range of exposures and effects than data sets from the site of the impairment. Other considerations in choosing the best measurement include their relevance to the mechanism and their relative precision.
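As a rough sketch of this screening step, the code below ranks three sediment measures by the absolute value of their correlation with a biological attribute. The variable names, the choice of taxa richness as the attribute, and all values are hypothetical, and Pearson's r is used only for brevity; a rank-based coefficient may be more appropriate for real field data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical regional data: one biological attribute at eight sites.
taxa_richness = [20, 18, 16, 14, 12, 10, 8, 6]

# Hypothetical alternative measures of deposited sediment at the same sites.
sediment_measures = {
    "pct_fines": [5, 10, 15, 20, 25, 30, 35, 40],
    "embeddedness_pct": [10, 30, 20, 40, 35, 60, 50, 70],
    "median_particle_size_mm": [64, 32, 45, 16, 22, 8, 11, 4],
}

# Rank the measures by strength of association, ignoring the sign.
strength = {
    name: abs(pearson_r(values, taxa_richness))
    for name, values in sediment_measures.items()
}
ranked = sorted(strength, key=strength.get, reverse=True)
best = ranked[0]

for name in ranked:
    print(f"{name}: |r| = {strength[name]:.2f}")
```

The ranking is only one input to the choice; relevance to the mechanism and relative precision still need to be weighed separately.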

Step 3: Evaluating Data from the Case

Correlation using data from the case can be used to provide evidence of stressor-response relationships or steps in the causal pathway. Correlation is used primarily as a preliminary analysis for exploring matched data rather than as an alternative to methods such as regression analysis. In addition to revealing potentially causal relationships between biological attributes and candidate causal stressors, correlation can be used to reveal relationships among candidate causal stressors, between candidate causal stressors and natural environmental gradients, or between different steps in the causal pathway (Figure M.2-2). If measurements or observations of critical steps in the causal pathway are matched with measurements for the candidate causal stressor, correlations of those data can reveal whether they are related. For example, suppose low concentration of dissolved oxygen is a candidate causal stressor and a phosphate release is a potential source. Phosphate can stimulate algal growth, and the respiration and decomposition of that algal biomass consume oxygen. A positive correlation of measured phosphate concentrations and algal biomass would therefore provide evidence of the causal pathway from phosphate to low dissolved oxygen.

Figure M.2-2. Diagram suggesting how sequential correlation analyses can support or weaken evidence of a causal pathway or stressor-response.
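A sequential analysis of this kind might be sketched as follows for the phosphate example. The step names, expected signs, and measurements are all hypothetical; each link in the pathway is checked to see whether the sign of its correlation agrees with the sign the pathway would predict.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical matched measurements at six sites.
data = {
    "phosphate_mg_L": [0.02, 0.05, 0.08, 0.12, 0.20, 0.30],
    "algal_biomass": [5, 9, 14, 18, 30, 41],
    "dissolved_oxygen_mg_L": [9.1, 8.6, 7.9, 7.2, 5.8, 4.5],
}

# Each link: (upstream step, downstream step, expected sign of correlation).
pathway = [
    ("phosphate_mg_L", "algal_biomass", +1),
    ("algal_biomass", "dissolved_oxygen_mg_L", -1),
]

results = []
for upstream, downstream, expected_sign in pathway:
    r = pearson_r(data[upstream], data[downstream])
    consistent = r * expected_sign > 0
    results.append((upstream, downstream, r, consistent))
    verdict = "supports" if consistent else "weakens"
    print(f"{upstream} -> {downstream}: r = {r:.2f} ({verdict} the pathway)")
```

A link whose sign disagrees with the expected direction weakens the case for the pathway, but as with any correlation, agreement is suggestive rather than conclusive.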

Note that potentially stronger evidence for causal analysis is provided by regression analysis or other modeling approaches which quantify the change in response over a stressor gradient. Also keep in mind that if two or more candidate causal stressors are correlated, then relationships of either to a biological response may be confounded. For example, if temperature and periphyton production are correlated, then an apparent relationship between temperature and the number of mayfly taxa may be due entirely or in part to changes in trophic resources, rather than to temperature. Hence, knowledge of correlations among candidate causal stressors (e.g., from a correlation matrix) may allow you to more accurately interpret stressor-response models. In addition, correlations among candidate causal stressors must be identified to determine if the assumptions of multivariate techniques such as multiple regression are met.
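One simple way to screen for this problem is to flag pairs of candidate stressors whose correlation exceeds some cutoff. In the sketch below the stressor names, the values, and especially the cutoff of 0.7 are assumptions chosen for illustration; this guidance does not prescribe a threshold.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two matched sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical matched stressor measurements at eight sites.
stressors = {
    "temperature": [12, 14, 15, 17, 19, 21, 22, 24],
    "periphyton_production": [3.1, 3.8, 4.0, 5.2, 5.9, 6.5, 7.4, 8.0],
    "conductivity": [210, 180, 260, 190, 240, 200, 230, 205],
}

THRESHOLD = 0.7  # arbitrary cutoff, assumed for this example only

# Collect every pair of stressors whose |r| reaches the cutoff.
flagged = []
for a, b in combinations(stressors, 2):
    r = pearson_r(stressors[a], stressors[b])
    if abs(r) >= THRESHOLD:
        flagged.append((a, b, r))

for a, b, r in flagged:
    print(f"Possible confounding: {a} and {b} (r = {r:.2f})")
```

For a flagged pair, an apparent relationship between either stressor and the biological response should be interpreted with the other stressor in mind.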



M.2.3. Can I Use Correlation with My Data?

The first requirement for correlation analysis is that the data must be matched. Assuming matched data, there are different methods for computing correlation coefficients. You must check your data to be sure you select the correct method for computing the correlation. Field data often fail tests for normality and homogeneity of variance, so Pearson's correlation coefficient should not be used unless the data can be transformed to meet the requirements for parametric analysis.
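Matching can be checked before any coefficient is computed. The sketch below assumes, hypothetically, that each measurement is keyed by a site identifier; only sites with both a stressor value and a response value are kept as pairs, and the rest are listed so they can be reviewed.

```python
# Hypothetical measurements keyed by site identifier.
stressor = {"S01": 12.0, "S02": 30.5, "S03": 18.2, "S05": 44.0}
response = {"S01": 19, "S02": 11, "S04": 15, "S05": 6}

# Sites present in both data sets form the matched pairs.
matched_sites = sorted(stressor.keys() & response.keys())
pairs = [(stressor[site], response[site]) for site in matched_sites]

# Sites present in only one data set cannot be used for correlation.
unmatched_sites = sorted(stressor.keys() ^ response.keys())

print("Matched pairs:", pairs)
print("Unmatched sites:", unmatched_sites)
```

In practice the key may need to include the sampling date as well as the site, so that values paired together were collected at the same place and time.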



M.2.4. Helpful Tips
