

Analyzing Data



M.4. Conditional Probability Analysis

M.4.1. What is Conditional Probability Analysis?

Figure M.4-1. A plot of the conditional probability P(Y = 1 | X > Xc), where X is the percent fines in the stream segment substrate and Y = 1 is defined as EPT < 9.

Conditional probability is the probability of some event, given the occurrence of some other event. In other words, it is the probability of Y, given X, or P(Y | X). Our application of conditional probability uses a dichotomous response variable, which requires that you identify a threshold value of a count or continuous response variable that categorizes each observation as either Y=1 or Y=0. For example, you may be interested in sites with low numbers of taxa from the insect orders Ephemeroptera, Plecoptera, and Trichoptera (EPT); if you categorize < 9 EPT taxa as "low" or impaired, then observations with fewer than 9 taxa would be categorized as Y=1 and all others as Y=0.
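The thresholding step can be sketched in a few lines of Python. This is only an illustration: the function name `dichotomize` and the EPT counts are hypothetical, and the threshold of 9 taxa is taken from the example in the text.

```python
from typing import List


def dichotomize(values: List[float], threshold: float) -> List[int]:
    """Return 1 where the response is below the threshold (impaired), else 0."""
    return [1 if v < threshold else 0 for v in values]


# Hypothetical EPT taxa counts at six sites (made-up numbers for illustration).
ept_counts = [3, 12, 8, 9, 15, 5]

# Y = 1 when EPT < 9; a site with exactly 9 taxa is therefore Y = 0.
y = dichotomize(ept_counts, threshold=9)
print(y)  # [1, 0, 1, 0, 0, 1]
```

Note that the comparison is strict, so the choice between "<" and "<=" at the threshold is a decision you need to make explicitly for your own data.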

We use conditional probability analysis (CPA) to answer questions about the probability of observing Y=1 if you also observe a particular condition X. Continuing our example, we might be interested in the probability of observing < 9 EPT taxa when the % fine sediments in the substrate exceeds a given value (Xc), or P(Y=1 | X > Xc). An illustrative graph of this relationship is shown in Figure M.4-1, where the curve represents the probability of observing few EPT taxa (i.e., < 9) when the % fines exceeds a given value. In this example, there is a 0.65 probability of observing < 9 EPT taxa when fine sediments exceed 40%. The figure also shows a probability of 1.0 (100%) of observing < 9 EPT taxa when fine sediments exceed 60%.

Conditional probabilities can be calculated by dividing the joint probability of observing both events by the probability of observing the conditioning event (Equation M.4-1, with a simple illustration for our example in M.4-2).

\[ P(Y \mid X) = \frac{P(X, Y)}{P(X)} \]
Equation M.4-1.
\[ P(\mathrm{EPT} < 9 \mid \%\,\mathrm{fines} > X_c) = \frac{P(\mathrm{EPT} < 9,\ \%\,\mathrm{fines} > X_c)}{P(\%\,\mathrm{fines} > X_c)} \]
Equation M.4-2.
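The ratio described above (joint probability divided by the probability of the conditioning event) can be estimated from paired site data by simple counting. The following is a minimal sketch, assuming equally weighted observations; the function name `conditional_probability` and the data values are hypothetical and are not taken from Figure M.4-1.

```python
from typing import List, Optional


def conditional_probability(x: List[float], y: List[int], xc: float) -> Optional[float]:
    """Estimate P(Y=1 | X > xc) as P(Y=1, X > xc) / P(X > xc) from paired samples.

    Returns None when no observation satisfies X > xc (the ratio is undefined).
    """
    n = len(x)
    n_cond = sum(1 for xi in x if xi > xc)
    if n_cond == 0:
        return None
    n_joint = sum(1 for xi, yi in zip(x, y) if xi > xc and yi == 1)
    p_joint = n_joint / n   # estimate of P(Y=1, X > xc)
    p_cond = n_cond / n     # estimate of P(X > xc)
    return p_joint / p_cond


# Hypothetical paired data: percent fines and impairment flag (1 means EPT < 9).
pct_fines = [5, 10, 20, 30, 45, 50, 65, 70]
impaired = [0, 0, 0, 1, 0, 1, 1, 1]

print(conditional_probability(pct_fines, impaired, 40))  # 0.75
print(conditional_probability(pct_fines, impaired, 60))  # 1.0

# Evaluating the estimate at each observed stressor value gives the points of a
# curve like the one plotted in a conditional probability figure.
curve = [(xc, conditional_probability(pct_fines, impaired, xc))
         for xc in sorted(set(pct_fines))]
```

Estimates near the upper end of the stressor range rest on very few sites, so the right-hand tail of such a curve should be read with caution.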

For our purposes, CPA involves the application of the above analysis technique to biological monitoring data to assist stressor identification in causal analysis. Additional background and detail can be found in Paul and McDonald (2005); note, however, that their paper applies CPA to identifying thresholds of impact, which is a different purpose from stressor identification.

M.4.2. How Do I Use Conditional Probability Analysis in Stressor Identification?

At this time, CPA can be used primarily as an exploratory tool to help develop the list of candidate causes and provide evidence for the relative strength of the candidate cause. Research is underway to identify other uses of CPA in causal analysis, and this Web site will be updated when these other applications have been developed and demonstrated.

Step 2: Listing Candidate Causes.

Although simple methods such as scatter plots and linear correlation are most commonly used to identify potentially causal associations, CPA can also be used to examine the association between a stressor and a response metric. CPA requires no assumptions about the underlying distribution of the data. In this sense, CPA should be considered nonparametric.

CPA can indicate whether there are certain forms of nonlinear relationships between a stressor and a response that are not identified clearly with other exploratory tools, such as relationships where unacceptable conditions exist for stressor values below one level and above another level. Pearson correlation coefficients and linear regressions can only clearly identify linear relationships. For example, a non-significant correlation coefficient indicates a lack of evidence to support a linear relationship, but says nothing about a nonlinear relationship.
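The limitation of the Pearson coefficient can be illustrated with a small, deliberately symmetric example. The data below are hypothetical (a made-up domed response centered on pH 7), and `pearson_r` is a plain textbook implementation written for this sketch rather than a routine from any particular package.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient computed from sums of deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical domed relationship: the index score peaks at neutral pH and
# falls off symmetrically on both sides.
ph = [4, 5, 6, 7, 8, 9, 10]
score = [0, 50, 80, 90, 80, 50, 0]

r = pearson_r(ph, score)
print(r)  # 0.0 for this symmetric example, despite the strong dependence on pH
```

Here the score depends strongly on pH, yet the linear correlation is zero because the low-pH and high-pH effects cancel. Real data will rarely be this symmetric, but the same cancellation can make a domed relationship look weak in a correlation analysis.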

As an example, consider pH as a candidate stressor in a stream, and a benthic macroinvertebrate index (WVSCI) as the response (Figure M.4-2). A scatter plot of the data shows an apparently domed distribution (Figure M.4-2, left graph). CPA plots indicate that large apparent effects on WVSCI scores are observed for pH < 6 (Figure M.4-2, middle graph) and for pH > 8.5 (Figure M.4-2, right graph). These plots show that acceptable WVSCI scores are observed only within a limited pH range, between 6 and 8.5. An observed pH outside this range would add to the evidence that pH is the candidate cause at the impaired location. Also note that for pH < 4 and for pH > 10, there is an estimated 100% probability of observing an unacceptable WVSCI score.

Figure M.4-2. Left graph: scatter plot of the WVSCI benthic macroinvertebrate index versus pH; middle graph: conditional probability plot of P(Y = 1 | X < Xc); right graph: conditional probability plot of P(Y = 1 | X > Xc), where X is pH and Y = 1 is defined as WVSCI < 60 (data courtesy of West Virginia Department of Environmental Protection).
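Evaluating the conditional probability in both directions, as in the middle and right graphs, only requires changing the conditioning event. The sketch below assumes equally weighted sites; the function name `cond_prob`, the pH values, and the index scores are all hypothetical and are not the West Virginia data.

```python
from typing import List, Optional


def cond_prob(x: List[float], y: List[int], xc: float, direction: str) -> Optional[float]:
    """Estimate P(Y=1 | X > xc) when direction is '>' or P(Y=1 | X < xc) when it is '<'."""
    if direction == ">":
        selected = [yi for xi, yi in zip(x, y) if xi > xc]
    elif direction == "<":
        selected = [yi for xi, yi in zip(x, y) if xi < xc]
    else:
        raise ValueError("direction must be '>' or '<'")
    if not selected:
        return None  # no sites satisfy the condition, so the estimate is undefined
    return sum(selected) / len(selected)


# Hypothetical paired observations of pH and a biological index score.
ph = [3.8, 5.2, 5.9, 6.5, 7.0, 7.4, 8.0, 8.8, 9.5, 10.2]
index_score = [20, 45, 62, 70, 85, 80, 65, 50, 58, 15]

# Y = 1 marks an unacceptable score, using the threshold of 60 from the figure.
unacceptable = [1 if s < 60 else 0 for s in index_score]

low_side = cond_prob(ph, unacceptable, 6.0, "<")    # P(Y=1 | pH < 6), 2 of 3 sites
high_side = cond_prob(ph, unacceptable, 8.5, ">")   # P(Y=1 | pH > 8.5), 3 of 3 sites
print(low_side, high_side)
```

Plotting `cond_prob` against a range of `xc` values for each direction would give the two curves, which can then be compared with the scatter plot to see where each tail of the stressor range begins to matter.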

M.4.3. Can I Use Conditional Probability Analysis with My Data?

Two general conditions must be met in order to apply CPA to field data, that is, to compute the conditional probabilities from your data.

  1. There must be a set of matching data with a stressor metric that quantifies the candidate cause, paired with a response metric that quantifies the biological effect. In the example above, these are percent fines in the substrate and EPT taxa richness.
  2. Since CPA requires a dichotomous response variable (i.e., there either is, or is not, an effect), you must identify a threshold value of the response metric that defines unacceptable conditions (e.g., a response value that determines whether a water body is biologically impaired).

How you perform the calculations for your application of CPA depends on whether your sites were sampled as part of a probability design. Probability sampling is based on a randomized selection of sampling sites. A probability sample is selected in an explicit manner that allows statements to be made about the statistical population from which it was selected (Overton, 1990). In contrast, targeted or fixed sites are selected for reasons other than probability. Two key characteristics of a probability sample are that (1) the probability of sampling any element of the statistical population is known (this implies a definition of the statistical population of interest), and (2) the inclusion probability of every element of the population is positive, that is, all elements have a known non-zero probability of being included in the sample of sites (Cochran, 1977; Overton, 1993). The inclusion probability of an element is defined as the probability with which the element is included in the sample.
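One common way to account for a probability design is to weight each site by the inverse of its inclusion probability when forming the ratio. This page does not give the formula, so the sketch below is an assumption about how such a design-weighted estimate might be computed, not a statement of the method used here; the function name `weighted_conditional_probability` and all data values are hypothetical.

```python
from typing import List, Optional


def weighted_conditional_probability(
    x: List[float], y: List[int], inclusion_prob: List[float], xc: float
) -> Optional[float]:
    """Design-weighted estimate of P(Y=1 | X > xc).

    Each site is weighted by 1 / inclusion probability (an assumed weighting
    scheme), so sites that were less likely to be sampled count for more.
    """
    if any(p <= 0 or p > 1 for p in inclusion_prob):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in inclusion_prob]
    denom = sum(w for xi, w in zip(x, weights) if xi > xc)
    if denom == 0:
        return None  # no sampled site satisfies X > xc
    numer = sum(w for xi, yi, w in zip(x, y, weights) if xi > xc and yi == 1)
    return numer / denom


# Hypothetical probability sample of four sites.
pct_fines = [10, 30, 50, 70]
impaired = [0, 1, 1, 0]
incl_prob = [0.5, 0.5, 0.25, 0.5]

print(weighted_conditional_probability(pct_fines, impaired, incl_prob, 20))  # 0.75
```

With equal inclusion probabilities the weights cancel and the result reduces to the simple count-based ratio (2 of 3 sites in this example). For targeted or fixed sites no inclusion probabilities exist, so any estimate describes only the sampled sites and not a wider population.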

M.4.4. Helpful Tips


References

Cochran, WG. (1977) Sampling Techniques. New York, NY: John Wiley & Sons.

Overton, WS. (1993) Probability sampling and population inference in monitoring programs. In: Goodchild, MF; Parks, BO; Steyaert, LT, eds. Environmental Modeling with GIS. New York, NY: Oxford University Press; pp. 470-480.

Overton, WS. (1990) A strategy for use of found samples. Corvallis, Oregon: Department of Statistics, Oregon State University.

Paul, JF; McDonald, ME. (2005) Development of empirical, geographically-specific water quality criteria: a conditional probability analysis approach. J Amer Water Res Assoc 41(5):1211-1223.

