Analyzing Data
M.2. Correlation Analysis
- 1. What is correlation?
- 2. How do I use correlation in Stressor Identification?
- 3. Can I use correlation with my data?
- 4. Helpful tips
- Authors
- G.W. Suter II
- P. Shaw-Allen
- S.M. Cormier
- All CADDIS authors, contributors, and reviewers
M.2.1. What is Correlation?
Correlation is a method for measuring the degree of association between two variables in a matched data set. The Pearson product-moment correlation coefficient (r) is a unitless value between -1 and 1 measuring the degree of linear association between variables. The corresponding nonparametric analysis calculates a Spearman rank-order correlation coefficient (ρ, rho - pronounced "row"), which is computed using the ranks of the data and does not assume that the relationship is linear. Kendall's tau (τ) has the same underlying assumptions as Spearman's rank-order correlation coefficient, but it is based on pairs of observations: it reflects the difference between the probability that a pair is ordered the same way on both variables and the probability that it is ordered in opposite ways.
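As a minimal sketch of the three coefficients described above, the following uses only the Python standard library. The function names and the example data are illustrative assumptions, not part of CADDIS, and the Kendall function is the simple tau-a form with no correction for ties.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks of the data."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Hypothetical matched data: a monotonic but nonlinear relationship
stressor = [1, 2, 3, 4, 5, 6]
response = [v ** 3 for v in stressor]

r = pearson(stressor, response)
rho = spearman(stressor, response)
tau = kendall_tau_a(stressor, response)
print(f"r = {r:.3f}, rho = {rho:.3f}, tau = {tau:.3f}")
```

Because the example relationship is monotonic but curved, the rank-based coefficients reach 1 while r stays below 1.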
A value of r, ρ or τ is interpreted as follows:
- A coefficient of 0 indicates no linear (r) or monotonic (ρ, τ) association between the variables,
- A positive coefficient indicates that as one variable increases the other also increases (Figure M.2-1, D),
- A negative coefficient indicates that as one variable increases, the other decreases (Figure M.2-1, A), and
- Larger absolute values of coefficients indicate stronger associations (e.g., Figure M.2-1, A vs. C).
However, as we were all taught, such correlations do not prove causation and may be due to confounding or error. Thus, correlation coefficients are only suggestive. In addition, small Pearson product-moment coefficients may be due to nonlinearity (Figure M.2-1, C) rather than to a lack of association (Figure M.2-1, B). Therefore, scatter plots should be examined for nonlinearity as well as to identify outliers or unduly influential data.
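The point about nonlinearity can be illustrated with a small sketch. The data below are hypothetical: the response depends entirely on the stressor, but through a symmetric quadratic curve, so the Pearson coefficient comes out at zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y is fully determined by x, but the relationship is a parabola
x = [-3, -2, -1, 0, 1, 2, 3]
y_quadratic = [v ** 2 for v in x]

r_quadratic = pearson(x, y_quadratic)
print(f"Pearson r for y = x^2: {r_quadratic:.3f}")
```

A coefficient near zero here reflects the shape of the relationship, not its absence, which is why a scatter plot should accompany the coefficient.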
M.2.2. How Do I Use Correlation in Stressor Identification?
Correlation analysis is used primarily as a data exploration technique to reveal the degree of association in a set of matched data. The matched data may represent:
- a candidate causal stressor and a biological attribute,
- a pair of candidate causal stressors or pairs of biological attributes,
- a pair of intermediate steps in the causal pathway, or
- another data match that may be causally associated or may confound a causal association.
Correlation results can be presented as a matrix with different categories of data for the rows and columns (Table M.2-1). Analysis of all variables may be performed to reveal all possible relationships. This is particularly important in revealing potentially confounding relationships prior to stressor-response modeling.
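A correlation matrix over all variables might be assembled as in the sketch below. The variable names and values are hypothetical, each list is assumed to hold one value per site in the same site order, and Pearson's r is used only to keep the example short; a rank-based coefficient could be substituted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical matched observations, one value per site, same site order in every list
data = {
    "pct_fines": [12, 25, 31, 44, 58, 63],
    "conductivity": [110, 150, 145, 210, 260, 300],
    "ept_taxa": [18, 15, 14, 10, 7, 5],
}

names = list(data)
# Nested dict: matrix[row][column] holds the coefficient for that pair of variables
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the matrix with variable names as row and column labels
print(" " * 14 + "".join(f"{n:>14}" for n in names))
for a in names:
    print(f"{a:<14}" + "".join(f"{matrix[a][b]:>14.2f}" for b in names))
```

Off-diagonal entries between two stressors (here, the hypothetical `pct_fines` and `conductivity`) are the ones to inspect for potential confounding before stressor-response modeling.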
Step 2: Listing Candidate Causes
When identifying candidate causes, one may correlate several measurements that relate potential causal stressors to changes in a biological attribute characterizing the impairment. Relatively large correlation coefficients with appropriate signs (positive or negative) can suggest that a candidate causal stressor should be included. However, correlations should not be used to remove a candidate causal stressor from the list if it is mechanistically plausible or if other evidence supports its inclusion. The correlation may be low because the relationship is nonlinear and has been judged using the parametric Pearson's correlation coefficient or because it is obscured by sampling error or confounding variables.
Correlation also may be used to help in choosing measurements of environmental parameters to represent the candidate causal stressor in the analysis. For example, deposited sediment may be represented by several different measurements (e.g., % fines, embeddedness, median particle size). The measure best correlated with the biological attributes used to define the impairment could be used to represent the candidate causal stressor. To select measurements for environmental parameters, regional data are usually used, because regional data sets are larger and capture a broader range of exposures and effects than data sets from the site of the impairment. Other considerations in choosing the best measurement include their relevance to the mechanism and their relative precision.
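One way to compare alternative measurements of the same candidate stressor is sketched below. All values are hypothetical regional data, Spearman's ρ is an assumed choice (to avoid requiring linearity), and the simple ranking function assumes no tied values. The largest absolute coefficient is only one input; mechanistic relevance and precision still need to be weighed.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1 (this simple version assumes no tied values)."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]

def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks of the data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical regional data: one biological attribute and three sediment measurements
ept_taxa = [18, 15, 14, 10, 7, 5, 9, 12]
sediment_measures = {
    "pct_fines": [10, 30, 22, 41, 60, 66, 48, 35],
    "embeddedness": [20, 35, 30, 55, 70, 65, 50, 45],
    "median_particle_mm": [32, 16, 20, 8, 4, 6, 18, 12],
}

rho_by_measure = {
    name: spearman(values, ept_taxa) for name, values in sediment_measures.items()
}
# Pick the measurement with the strongest association, regardless of sign
best = max(rho_by_measure, key=lambda name: abs(rho_by_measure[name]))

for name, rho in rho_by_measure.items():
    print(f"{name:<20} rho = {rho:+.3f}")
print("Strongest association:", best)
```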
Step 3: Evaluating Data from the Case
Correlation using data from the case can be used to provide evidence of stressor-response relationships or steps in the causal pathway. Correlation is used primarily as a preliminary analysis for exploring matched data rather than as an alternative to methods such as regression analysis. In addition to revealing potentially causal relationships between biological attributes and candidate causal stressors, correlation can be used to reveal relationships among candidate causal stressors, between candidate causal stressors and natural environmental gradients, or between different steps in the causal pathway (Figure M.2-2). If measurements or observations of critical steps in the causal pathway are matched with measurements for the candidate causal stressor, correlations of those data can reveal whether they are related. For example, if low concentration of dissolved oxygen is a candidate causal stressor and a phosphate release is a potential source, then a positive correlation of measured phosphate concentrations and algal biomass would provide evidence of the causal pathway from phosphate to low dissolved oxygen.
Note that potentially stronger evidence for causal analysis is provided by regression analysis or other modeling approaches which quantify the change in response over a stressor gradient. Also keep in mind that if two or more candidate causal stressors are correlated, then relationships of either to a biological response may be confounded. For example, if temperature and periphyton production are correlated, then an apparent relationship between temperature and the number of mayfly taxa may be due entirely or in part to changes in trophic resources, rather than to temperature. Hence, knowledge of correlations among candidate causal stressors (e.g., from a correlation matrix) may allow you to more accurately interpret stressor-response models. In addition, correlations among candidate causal stressors must be identified to determine if the assumptions of multivariate techniques such as multiple regression are met.
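Screening candidate causal stressors for correlations with one another could look like the following sketch. The data are hypothetical, and the cutoff of 0.7 is an arbitrary assumption chosen for illustration, not a CADDIS recommendation.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical matched measurements of three candidate causal stressors
stressors = {
    "temperature": [14, 16, 17, 19, 21, 22, 24],
    "periphyton": [2.1, 2.8, 3.0, 3.9, 4.4, 4.6, 5.5],
    "conductivity": [220, 180, 260, 200, 240, 190, 230],
}

THRESHOLD = 0.7  # assumed cutoff for illustration only

# Collect every pair of stressors whose absolute correlation meets the cutoff
flagged = []
for a, b in combinations(stressors, 2):
    r = pearson(stressors[a], stressors[b])
    if abs(r) >= THRESHOLD:
        flagged.append((a, b, r))

for a, b, r in flagged:
    print(f"{a} and {b} are correlated (r = {r:.2f}); "
          "a response to either may be confounded by the other")
```

A flagged pair does not show which stressor is responsible; it only signals that a stressor-response relationship for either one needs to be interpreted with the other in mind.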
M.2.3. Can I Use Correlation with My Data?
The first requirement for correlation analysis is that the data must be matched. Assuming matched data, there are different methods for computing correlation coefficients. You must check your data to be sure you select the correct method for computing the correlation. Field data often fail tests for normality and homogeneity of variance, so Pearson's correlation coefficient should not be used unless the data can be transformed to meet the requirements for parametric analysis.
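The matching requirement can be checked before any coefficient is computed. The sketch below assumes that "matched" means both measurements come from the same sampling unit, identified here by a hypothetical site ID; the IDs and values are invented for illustration.

```python
# Hypothetical measurements keyed by site ID
stressor_by_site = {"S01": 4.2, "S02": 6.8, "S03": 5.1, "S05": 7.9}
biology_by_site = {"S01": 17, "S03": 14, "S04": 11, "S05": 8}

# Keep only sites that have both a stressor value and a biological value
matched_sites = sorted(stressor_by_site.keys() & biology_by_site.keys())
x = [stressor_by_site[s] for s in matched_sites]
y = [biology_by_site[s] for s in matched_sites]

# Sites present in only one data set cannot contribute to the correlation
unmatched_sites = sorted(stressor_by_site.keys() ^ biology_by_site.keys())

print("Matched sites:  ", matched_sites)
print("Unmatched sites:", unmatched_sites)
print("Paired values:  ", list(zip(x, y)))
```

Only the paired lists `x` and `y` would then be passed to a correlation function.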
M.2.4. Helpful Tips
- Remember that Pearson correlation coefficients assess the linear relationship between the two variables. Thus, quadratic relationships and untransformed exponential or logarithmic relationships cannot be fairly evaluated using this method. An alternative approach would be to use Spearman's rank-order or Kendall's τ correlation coefficients.
- Outliers may unduly influence the strength of the computed correlation coefficient.
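The influence of a single outlier can be seen in a small sketch with hypothetical data: eight points with essentially no linear association, then the same points plus one extreme observation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data with essentially no linear association
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 2, 6, 3, 5, 4]
r_without = pearson(x, y)

# The same data plus one extreme observation
r_with = pearson(x + [30], y + [40])

print(f"r without the outlier: {r_without:+.2f}")
print(f"r with the outlier:    {r_with:+.2f}")
```

One extreme point is enough to turn a near-zero coefficient into a large one, so the scatter plot should be checked before a strong correlation is taken at face value.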