Analyzing Data
M.4. Conditional Probability Analysis
- 1. What is conditional probability analysis?
- 2. How do I use conditional probability analysis in Stressor Identification?
- 3. Can I use conditional probability analysis with my data?
- 4. Helpful tips
- Authors
M.4.1. What is Conditional Probability Analysis?
Conditional probability is the probability of some event, given the occurrence of some other event. In other words, it is the probability of Y, given X, or P(Y | X). Our application of conditional probability uses a dichotomous response variable, which requires that you identify a threshold value of a count or continuous response variable that categorizes each observation as either Y=1 or Y=0. For example, you may be interested in sites with low numbers of taxa from the insect orders Ephemeroptera, Plecoptera, and Trichoptera (EPT); if you categorize < 9 EPT taxa as "low" or impaired, then observations with fewer than 9 taxa would be categorized as Y=1 and all other observations as Y=0.
We use conditional probability analysis (CPA) to answer questions about the probability of observing Y=1 if you also observe a particular condition X. Continuing our example, we might be interested in the probability of observing < 9 EPT taxa when the % fine sediments in the substrate exceeds a given value (Xc), or P(Y=1 | X > Xc). An illustrative graph of this relationship is shown in Figure M.4-1, where the curve represents the probability of observing few EPT taxa (i.e., < 9) when the % fines exceeds a given value. In this example, there is a 0.65 probability of observing < 9 EPT taxa when fine sediments exceed 40%. The same figure shows a probability of 1.0 of observing < 9 EPT taxa when fine sediments exceed 60%.
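A minimal sketch of how such a curve could be computed from paired field observations is shown here. The site values, the function name `conditional_probability`, and the equal weighting of sites are all assumptions made for illustration; they are not data or code from this page.

```python
from typing import List, Sequence, Tuple


def conditional_probability(stressor: Sequence[float],
                            impaired: Sequence[bool],
                            xc: float) -> float:
    """Estimate P(Y=1 | X > xc), weighting every site equally."""
    # Keep only the sites whose stressor value exceeds the cutoff xc.
    exceeding = [y for x, y in zip(stressor, impaired) if x > xc]
    if not exceeding:
        raise ValueError("no sites exceed the cutoff; probability is undefined")
    # Fraction of those sites that are impaired (Y=1).
    return sum(exceeding) / len(exceeding)


# Hypothetical data: percent fines and EPT taxa counts at eight sites.
percent_fines = [5, 12, 25, 33, 41, 48, 55, 70]
ept_taxa = [15, 12, 8, 11, 10, 7, 6, 3]

# Dichotomize the response: Y=1 when a site has fewer than 9 EPT taxa.
low_ept = [n < 9 for n in ept_taxa]

# One point on the curve: P(< 9 EPT taxa | % fines > 40).
p_40 = conditional_probability(percent_fines, low_ept, 40)

# The full curve: one estimate per observed stressor value, skipping the
# largest value because no site exceeds it.
curve: List[Tuple[float, float]] = [
    (xc, conditional_probability(percent_fines, low_ept, xc))
    for xc in sorted(set(percent_fines))[:-1]
]

for xc, p in curve:
    print(f"P(EPT < 9 | fines > {xc}%) = {p:.2f}")
```

Plotting the pairs in `curve` with the cutoff on the horizontal axis gives a graph of the same kind as the one described above. With real data, estimates at the highest cutoffs rest on very few sites and should be read cautiously.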
Conditional probabilities can be calculated by dividing the joint probability of observing both events by the probability of observing the conditioning event (Equation M.4-1, with a simple illustration for our example in M.4-2).
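The equations cited above are not reproduced in this text. In standard notation, the verbal definition corresponds to the first expression below. The second expression is an assumed illustration for the EPT example, estimating each probability by an equally weighted count of sites, where N is the total number of sites and the N in the numerator and denominator cancels:

```latex
P(Y = 1 \mid X > X_c) = \frac{P(Y = 1,\; X > X_c)}{P(X > X_c)}

P(\mathrm{EPT} < 9 \mid \%\,\mathrm{fines} > X_c)
  \approx \frac{\#\{\text{sites with } \mathrm{EPT} < 9 \text{ and } \%\,\mathrm{fines} > X_c\} / N}
               {\#\{\text{sites with } \%\,\mathrm{fines} > X_c\} / N}
  = \frac{\#\{\text{sites with } \mathrm{EPT} < 9 \text{ and } \%\,\mathrm{fines} > X_c\}}
         {\#\{\text{sites with } \%\,\mathrm{fines} > X_c\}}
```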
For our purposes, CPA involves the application of the above analysis technique to biological monitoring data to assist stressor identification in causal analysis. Additional background and detail can be found in Paul and McDonald (2005); however, that paper discusses CPA as applied to identifying thresholds of impact, which is a different purpose than stressor identification.
M.4.2. How Do I Use Conditional Probability Analysis in Stressor Identification?
At this time, CPA can be used primarily as an exploratory tool to help develop the list of candidate causes and provide evidence for the relative strength of the candidate cause. Research is underway to identify other uses of CPA in causal analysis, and this Web site will be updated when these other applications have been developed and demonstrated.
Step 2: Listing Candidate Causes.
Although simple methods such as scatter plots and linear correlation are most commonly used to identify potentially causal associations, CPA can be used to determine the association between a stressor and a response metric. CPA requires no assumptions about the underlying distribution of data. In this sense, CPA should be considered nonparametric.
CPA can indicate if there are certain forms of nonlinear relationships between a stressor and a response that are not identified clearly with other exploratory tools, such as relationships where unacceptable conditions exist for stressor values below one level and above another level. Pearson correlation coefficients and linear regressions can only clearly identify linear relationships. For example, a non-significant correlation coefficient indicates a lack of evidence to support a linear relationship, but says nothing concerning a non-linear relationship.
As an example, consider pH as a candidate stressor in a stream, and a benthic macroinvertebrate index (WVSCI) as the response (Figure M.4-2). A scatter plot of the data shows an apparently domed distribution (Figure M.4-2, left graph). CPA plots indicate that large apparent effects on WVSCI scores are observed for pH < 6 (Figure M.4-2, middle graph) and for pH > 8.5 (Figure M.4-2, right graph). These plots show that acceptable WVSCI scores are observed only within a limited pH range, between 6 and 8.5. An observed pH outside this range would add to the evidence that pH is the candidate cause at the impaired location. Also note that for pH < 4 and for pH > 10, there is an estimated 100% probability of observing an unacceptable WVSCI score.
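A two-sided analysis of this kind can be sketched as follows. The pH values, index scores, impairment threshold of 60, and the helper `prob_impaired` are all made up for illustration and are not the WVSCI data behind the figure. The sketch conditions once on values below a cutoff and once on values above a cutoff, weighting sites equally.

```python
from typing import Optional, Sequence


def prob_impaired(stressor: Sequence[float],
                  impaired: Sequence[bool],
                  cutoff: float,
                  direction: str) -> Optional[float]:
    """Estimate P(Y=1 | X < cutoff) or P(Y=1 | X > cutoff), equal site weights."""
    if direction == "below":
        subset = [y for x, y in zip(stressor, impaired) if x < cutoff]
    elif direction == "above":
        subset = [y for x, y in zip(stressor, impaired) if x > cutoff]
    else:
        raise ValueError("direction must be 'below' or 'above'")
    if not subset:
        return None  # undefined: no sites satisfy the condition
    return sum(subset) / len(subset)


# Hypothetical pH values and index scores with a domed pattern.
ph = [4.5, 5.0, 5.5, 6.5, 7.0, 7.5, 8.0, 9.0, 9.5, 10.0]
score = [30, 35, 50, 75, 85, 80, 70, 45, 40, 25]

# Assumed threshold for this illustration: scores below 60 are unacceptable.
impaired = [s < 60 for s in score]

p_overall = sum(impaired) / len(impaired)             # all sites, no conditioning
p_low = prob_impaired(ph, impaired, 6.0, "below")     # acidic tail
p_high = prob_impaired(ph, impaired, 8.5, "above")    # alkaline tail
p_above_6 = prob_impaired(ph, impaired, 6.0, "above") # mixes good and bad sites

print(p_overall, p_low, p_high, p_above_6)
```

In this made-up data set both tails are entirely impaired, while conditioning in only one direction (pH > 6) averages acceptable mid-range sites with impaired alkaline sites. That is why examining both directions helps with domed relationships.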
M.4.3. Can I Use Conditional Probability Analysis with My Data?
Two general conditions must be met in order to apply CPA to field data, that is, to compute the conditional probabilities from your data.
- There must be a set of matching data with a stressor metric that quantifies the candidate cause, paired with a response metric that quantifies the biological effect. In the example above, these are percent fines in the substrate and EPT taxa richness.
- Since CPA requires a dichotomous response variable (i.e., there either is, or is not, an effect), you must identify a threshold value of the response metric that defines unacceptable conditions (e.g., a response value that determines if a water body is biologically impaired).
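The two conditions above can be sketched as a small data preparation step. The site IDs and measurements are hypothetical, and the threshold of 9 EPT taxa is the one used in the earlier example.

```python
# Hypothetical measurements keyed by site ID; the two data sets need not
# cover exactly the same sites.
percent_fines = {"S1": 12.0, "S2": 45.0, "S3": 63.0, "S4": 30.0}
ept_richness = {"S1": 14, "S2": 8, "S3": 4, "S5": 11}

# Sites with fewer than 9 EPT taxa are treated as impaired (Y=1).
EPT_THRESHOLD = 9

# Condition 1: keep only sites that have both a stressor and a response value.
paired_sites = sorted(percent_fines.keys() & ept_richness.keys())

# Condition 2: convert the response metric to a dichotomous variable.
records = [
    (site, percent_fines[site], int(ept_richness[site] < EPT_THRESHOLD))
    for site in paired_sites
]

for site, fines, y in records:
    print(f"{site}: fines={fines}%, Y={y}")
```

Sites S4 and S5 drop out because each lacks one of the two measurements. The remaining records are the matched stressor and response pairs that the analysis needs.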
How you do the calculations for your application of CPA depends on whether or not your sites were sampled as part of a probability design. Probability sampling is based on a randomized selection of sampling sites. A probability sample is selected in an explicit manner that allows statements to be made for estimates of the statistical population from which it was selected (Overton, 1990). In contrast, targeted or fixed sites are selected for reasons other than probability. Two key characteristics of a probability sample are that (1) the probability of sampling any element of the statistical population is known (this implies a definition of the statistical population of interest), and (2) the inclusion probability of any element of the population is positive, that is, every element has a known non-zero probability of being included in the sample of sites (Cochran, 1977; Overton, 1993). The inclusion probability of an element is the probability with which that element is included in the sample.
- If your sites were selected using a probability design, then their inclusion probabilities can be used to weight the analysis and extrapolate the results to the larger statistical population. For example, if the statistical population was defined as all 1st to 3rd order streams in a watershed, then the results would be representative for all 1st to 3rd order streams in that watershed, not just those stream segments sampled.
- If the probability of inclusion of a stream segment is unknown (which is typical for targeted sites or "found" data), the results of the analyses would be expressed in terms of the stream segments for which you have observations, and equal weighting would be applied to the stream segment data.
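The difference between the two cases can be sketched as follows. Weighting each site by the reciprocal of its inclusion probability is a common design-based convention and is an assumption here; this page does not specify the estimator. The function name and the data are also hypothetical.

```python
from typing import Optional, Sequence


def weighted_conditional_probability(
        stressor: Sequence[float],
        impaired: Sequence[bool],
        xc: float,
        inclusion_probs: Optional[Sequence[float]] = None) -> float:
    """Estimate P(Y=1 | X > xc) with optional survey-design weights."""
    if inclusion_probs is None:
        # Targeted or "found" data: every site gets the same weight.
        weights = [1.0] * len(stressor)
    else:
        if any(p <= 0 or p > 1 for p in inclusion_probs):
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        # Assumed convention: weight = 1 / inclusion probability.
        weights = [1.0 / p for p in inclusion_probs]

    # Weighted total of impaired sites among those exceeding the cutoff.
    numerator = sum(w for x, y, w in zip(stressor, impaired, weights)
                    if x > xc and y)
    # Weighted total of all sites exceeding the cutoff.
    denominator = sum(w for x, w in zip(stressor, weights) if x > xc)
    if denominator == 0:
        raise ValueError("no sites exceed the cutoff; probability is undefined")
    return numerator / denominator


# Hypothetical sites: percent fines, impairment flag, inclusion probability.
fines = [20, 45, 50, 65]
impaired = [False, True, False, True]
incl = [0.5, 0.5, 0.1, 0.25]

p_weighted = weighted_conditional_probability(fines, impaired, 40, incl)
p_equal = weighted_conditional_probability(fines, impaired, 40)
print(p_weighted, p_equal)
```

In this made-up example the unimpaired site with 50% fines has a low inclusion probability, so it stands in for many similar stream segments and pulls the weighted estimate (0.375) well below the equally weighted one (about 0.67). The weighted value describes the statistical population, while the equally weighted value describes only the sampled segments.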
M.4.4. Helpful Tips
- Field data are not part of a controlled experimental design and, as such, do not control for specific stressors. Thus, the conditional probabilities are probabilities given both the stressor being analyzed and other co-occurring stressors in the region.
- The CADDIS Field Stressor-Response Association Gallery page includes examples of conditional probability plots derived from field data.
- As with all analyses of secondary data (data that were not explicitly acquired to meet your project goals), we recommend consulting the original sources to ensure the quality and relevance of influential data.
Cochran, WG. (1977) Sampling Techniques. New York, NY: John Wiley & Sons.
Overton, WS. (1993) Probability sampling and population inference in monitoring programs. In: Environmental Modeling with GIS. Goodchild, MF; Parks, BO; Steyaert, LT, eds. Oxford University Press, New York, NY: pp 470-480.
Overton, WS. (1990) A strategy for use of found samples. Corvallis, Oregon: Department of Statistics, Oregon State University.
Paul, JF; McDonald, ME. (2005) Development of empirical, geographically-specific water quality criteria: a conditional probability analysis approach. J Amer Water Res Assoc 41(5):1211-1223.