Analyzing Data
M.3. Box Plots
- 1. What are box plots?
- 2. How do I use box plots in Stressor Identification?
- 3. Can I use box plots with my data?
- 4. Helpful tips
- Authors
- G.W. Suter II
- P. Shaw-Allen
- S.M. Cormier
- All CADDIS authors, contributors, and reviewers
M.3.1. What are Box Plots?
Figure M.3-1. A sample box plot with different components of the plot labeled.
Box plots, or box-and-whisker plots, provide insight into the distribution of observations within a data set by dividing it into four sections (Figure M.3-1). The box indicates the spread of the central 50% of the data; the median is denoted by a horizontal line through the box. The portion of the box above the median line denotes the 50th-75th percentile range. Likewise, the portion of the box below the median denotes the 25th-50th percentile range. If all data lie within 1.5 times the interquartile range (75th percentile minus the 25th percentile) from either end of the central box, the whiskers represent the full range of the data. If not, the whiskers extend to 1.5 times the interquartile range and more extreme data are plotted as points. Box plots generated by different software may differ in the percentiles used to denote the box-and-whiskers and other features.
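The components described above can be computed directly. The following is a minimal sketch, not the method used by any particular plotting package: the function name `box_plot_stats` and the sample values are hypothetical, quartiles are computed by linear interpolation (one of several percentile definitions in use), and each whisker is assumed to end at the most extreme observation that still lies within 1.5 times the interquartile range of the box, which is a common convention.

```python
import statistics

def box_plot_stats(values, whisker_factor=1.5):
    """Return the box, whisker ends, and outlying points for a list of numbers."""
    data = sorted(values)
    # Quartiles by linear interpolation between observations ("inclusive" method).
    # Other software may use a different percentile definition.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Assumption: whiskers stop at the most extreme observation inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical values, for illustration only.
stats = box_plot_stats([2, 4, 5, 7, 8, 9, 11, 12, 14, 40])
print(stats)
```

In this made-up sample, the value 40 lies more than 1.5 times the interquartile range above the box, so it would be plotted as a separate point and the upper whisker would end at 14.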
Figure M.3-2. Box plots showing symmetrical or skewed data distributions and different types of kurtosis, or relative spread.
Because box plots depict the distribution of observations, they can be useful for identifying appropriate statistical analyses and deciding whether data should be transformed (Figure M.3-2). For example, box plots can show whether the shape of the data distribution is symmetrical or skewed. If the upper box and whisker are approximately the same length as the lower box and whisker (Figure M.3-2, A), then the data are distributed symmetrically. If the upper box and whisker are longer than the lower box and whisker (Figure M.3-2, B and C), then the data are skewed to the right. If the upper box and whisker are shorter than the lower box and whisker (Figure M.3-2, D and E), then the data are skewed to the left. Box plots also reveal the kurtosis, or relative spread, of a distribution: the shorter the box relative to the whiskers and outlying points, the tighter the distribution (Figure M.3-2, B and D). Skewed distributions indicate that the data are not normally distributed and that the variances may not be homogeneous (Figure M.3-2, B, C, D, and E). When analyzing such data, we recommend that you use nonparametric methods and regression models that accommodate nonlinear data. If you must use parametric methods or linear regression, choose a data transformation that suits the type of data (continuous, count, proportion), the skewness of the distribution, and any zero or negative values in the data set.
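The visual comparison of upper and lower box-and-whisker lengths can be approximated numerically. This is a rough sketch under stated assumptions: the function name `shape_from_box` is hypothetical, and the `tolerance` value that decides what counts as "approximately the same length" is an arbitrary choice, not a threshold given in this guidance.

```python
import statistics

def shape_from_box(values, tolerance=0.25):
    """Compare the upper and lower halves of a box plot to describe skewness."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    fence = 1.5 * (q3 - q1)
    # Observations within 1.5 x IQR of the box determine the whisker ends.
    inside = [x for x in data if q1 - fence <= x <= q3 + fence]
    upper = max(inside) - median  # upper box plus upper whisker
    lower = median - min(inside)  # lower box plus lower whisker
    longest = max(upper, lower)
    # Assumption: lengths within `tolerance` of each other count as symmetric.
    if longest == 0 or abs(upper - lower) <= tolerance * longest:
        return "approximately symmetric"
    return "skewed right" if upper > lower else "skewed left"

# Hypothetical values, for illustration only.
print(shape_from_box([1, 2, 3, 4, 5, 6, 7, 8, 9]))
print(shape_from_box([1, 2, 2, 3, 3, 4, 6, 9, 14]))
```

A check like this only screens for asymmetry; it does not replace inspecting the plot or testing distributional assumptions.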
M.3.2. How Do I Use Box Plots in Stressor Identification?
Box plots are used primarily to explore data and describe the distributions of single variables representing a candidate cause, a biological attribute, or an intermediate step in a causal pathway. By classifying the data with respect to another variable, you can also use box plots to examine relationships among a particular set of exposures, responses, or other factors (Figures M.3-3 and M.3-4). For example, a biological attribute such as Ephemeroptera, Plecoptera, and Trichoptera (EPT) richness may be classified in terms of the relative level of a stressor, with a box plot created for each stressor level (Figure M.3-3); comparing these box plots may illustrate differences in the biological attribute between the levels of the stressor. Alternatively, measurements of the stressor can be classified in terms of unimpaired versus impaired conditions, or high versus low levels of a specific biological attribute (Figure M.3-4); comparing these box plots may indicate whether the stressor was consistently higher in the impaired group.
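Classifying one variable by another amounts to grouping the observations before summarizing each group. The sketch below assumes site records arrive as (stressor level, EPT richness) pairs; the function name `summarize_by_group`, the "low"/"high" labels, and the richness values are all hypothetical and serve only to show the grouping step.

```python
import statistics
from collections import defaultdict

def summarize_by_group(records):
    """records: iterable of (group_label, value). Returns {label: (q1, median, q3)}."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    summary = {}
    for label, values in groups.items():
        # Each group needs at least two observations for quartiles to be defined.
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        summary[label] = (q1, median, q3)
    return summary

# Made-up EPT richness values classified by stressor level.
records = [("low", 18), ("low", 21), ("low", 16), ("low", 24), ("low", 20),
           ("high", 9), ("high", 12), ("high", 7), ("high", 14), ("high", 10)]
summary = summarize_by_group(records)
for label, (q1, median, q3) in summary.items():
    print(label, q1, median, q3)
```

Each tuple gives the box for one group, so the groups can be drawn side by side or compared numerically.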
Step 3: Evaluating Data from the Case
When sufficient data from the case are available, box plots can provide evidence for spatial/temporal co-occurrence, stressor-response relationships from the field, and causal pathway. For example, the data set may be divided into groups of data from biologically impaired and unimpaired locations, as illustrated in Figure M.3-4. The location and spread of the box plots can be used to evaluate whether the data provide evidence for or against the stressor as a cause of the impairment (Figure M.3-5). Elevated levels of the stressor co-occur with the impaired biological condition when the overall stressor range of the box plot for data from impaired locations is greater than, and does not overlap, the box plot for data from unimpaired locations (Figure M.3-5, A); some overlap of the whiskers is acceptable because of habitat variables or sampling error. If the box plots overlap extensively (Figure M.3-5, B), elevated levels of the stressor co-occur with both impaired and unimpaired biological conditions, and the association therefore weakens the body of evidence for that candidate cause. If there is moderate overlap (Figure M.3-5, C), co-occurrence is uncertain. Box plots provide strong evidence against the stressor as a candidate cause when distributions do not differ, or when differences are in the wrong direction (Figure M.3-5, D). Similarly, box plots can depict stressor-response relationships from the field when site data are grouped according to stressor intensity and the range in the biological attribute is clearly different at the differing stressor intensities. Box plots can also be used to examine the relative distribution of environmental parameters at intermediate steps of causal pathways. In this application, evidence showing that a parameter necessary for the causal pathway differs in the expected way at impaired sites supports the candidate cause associated with that pathway.
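The overlap judgments described above are made visually, but a rough numeric screen can help keep them consistent across candidate causes. The sketch below is an assumption-laden illustration: the function names, the returned labels, and especially the `extensive` cutoff separating "moderate" from "extensive" overlap are hypothetical, since this guidance gives no numeric threshold. It compares only the boxes (interquartile ranges), on the reading that some whisker overlap is acceptable, and assumes a higher stressor level is the harmful direction.

```python
import statistics

def box(values):
    """Return the (q1, q3) ends of the box for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def co_occurrence(impaired, unimpaired, extensive=0.5):
    """Classify stressor co-occurrence from box overlap (higher stressor = worse)."""
    i_q1, i_q3 = box(impaired)
    u_q1, u_q3 = box(unimpaired)
    if i_q1 > u_q3:
        return "supports"  # impaired box entirely above unimpaired box
    if i_q3 < u_q1:
        return "weakens: wrong direction"
    # Boxes overlap; express the overlap as a fraction of the narrower box.
    overlap = min(i_q3, u_q3) - max(i_q1, u_q1)
    narrower = min(i_q3 - i_q1, u_q3 - u_q1)
    # Assumption: `extensive` is an arbitrary cutoff chosen for this sketch.
    if narrower == 0 or overlap / narrower >= extensive:
        return "weakens: extensive overlap"
    return "uncertain: moderate overlap"

# Hypothetical stressor measurements, for illustration only.
print(co_occurrence([30, 35, 40, 45, 50], [5, 10, 15, 20, 25]))
print(co_occurrence([20, 25, 30, 35, 40], [8, 14, 20, 26, 32]))
```

Any cutoff used this way should be set before looking at the results and reported alongside the plots.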
Step 4: Evaluating Data from Elsewhere
Box plots can also provide stressor-response relationships from other field studies. Categorical stressor-response relationships from field data collected at other sites (e.g., from across a region) can be compared to data from the impaired site. In a typical application, box plots are created for high and low values of the stressor across a region so that the degree of impairment can be compared.
The following characteristics of paired box plots created from regional data suggest that stressor-response evidence from the field supports the candidate cause. Lacking these characteristics weakens the evidence.
- 1. Boxes do not overlap.
- 2. Boxes are small.
- 3. Measures of the biological attribute at impaired locations fall within the box for the extreme level predicted by the causal hypothesis.
- 4. The level of the candidate causal stressor at the impaired locations falls within the appropriate regional extreme range.
- 5. Criteria 1-4 are true for most, if not all, of the biological responses that define the impairment.
Specifically:
Figure M.3-6. Left panel: Two box plots categorized by either the biological response or stressor intensity. Measurement from an impaired assessment site is indicated by circle; the filled circle supports and the open circle weakens the candidate cause. Right panel: The two box plots overlap and do not resolve differences between the categories selected for comparison and generally weaken the evidence regardless of levels observed at the impaired site.
- 1. Boxes do not overlap (Figure M.3-6, left panel), indicating that the stressor and the impairment of interest co-occur elsewhere in the region. Conversely, if the distributions overlap (Figure M.3-6, right panel), the stressor-response relationship is poorly resolved by the field data, which weakens the evidence for that cause.
- 2. Boxes are smaller for well-defined categories of exposure or effect. If the stressor is important, you expect a relatively consistent effect; however, this criterion is weaker than the first criterion.
- 3. Measures of the biological attribute at impaired locations fall within the box for the extreme level predicted by the causal hypothesis. For example, if the number of taxa at the locations from the impaired site falls within the interquartile range in number of taxa for locations with extremely high stressor levels (Figure M.3-7, A), evidence for that stressor as a causal agent is strengthened.
- 4. The intensity or level of the stressor at impaired locations falls within the regional extreme range. If measures of the impairment at the site under investigation are similar to the levels of the biological attributes at an extreme level of the stressor (i.e., criterion 3, above, is met), then the level of the stressor should be extreme at the impaired location (Figure M.3-7, A).
- 5. Criteria 1-4 are true for most or all of the biological responses that define the impairment. That is, if the impairment is defined by changes in multiple biological attributes and if the stressor is responsible for the impairment, that stressor should be associated with most or all of the biological responses. If this condition is not true, then the changes in biological attributes must be different effects with different causes. For example, in the Little Scioto case study, the larger biomass of fish at the site where the impairment occurs is attributed to deepening of the channel, and the reduced number of invertebrate taxa is attributed to siltation.
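Some of the criteria above can be checked mechanically for one biological response at a time. The following sketch covers the first, third, and fourth criteria only; the second (small boxes) is a relative judgment with no threshold given here, so it is left to visual inspection. All names, the sample values, and the idea of representing the "regional extreme range" by a single `high_stressor_cutoff` are assumptions made for illustration, and the sketch takes the high-stressor category as the extreme level predicted by the causal hypothesis.

```python
import statistics

def interquartile_box(values):
    """Return the (q1, q3) ends of the box for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def check_regional_criteria(response_low, response_high, site_response,
                            site_stressor, high_stressor_cutoff):
    """Check criteria 1, 3, and 4 for one biological response.

    response_low / response_high: regional values of the biological attribute
    at sites with low / high stressor levels.
    """
    low_q1, low_q3 = interquartile_box(response_low)
    high_q1, high_q3 = interquartile_box(response_high)
    return {
        "1_boxes_do_not_overlap": low_q3 < high_q1 or high_q3 < low_q1,
        "3_site_response_in_extreme_box": high_q1 <= site_response <= high_q3,
        # Assumption: "extreme" means at or above the cutoff that defined
        # the high-stressor category.
        "4_site_stressor_is_extreme": site_stressor >= high_stressor_cutoff,
    }

# Hypothetical regional taxa counts and site measurements.
result = check_regional_criteria(
    response_low=[18, 21, 16, 24, 20],
    response_high=[9, 12, 7, 14, 10],
    site_response=11, site_stressor=85, high_stressor_cutoff=60,
)
print(result)
```

Repeating the check for each biological response that defines the impairment gives a simple tally for the fifth criterion.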
M.3.3. Can I Use Box Plots with My Data?
It is not difficult to use box plots to display and compare the distributions of data sets. However, this method of displaying data and comparing candidate causes and effects requires matching data and presumes a monotonically increasing or decreasing stressor-effect relationship. For example, if conditions are optimal at an intermediate level of the stressor and sub-optimal at either extreme, comparisons of box plots for the two categories would be uninformative or even misleading.
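The problem with a non-monotonic relationship can be seen in a small constructed example. The values below are not measurements: the response is generated from a hypothetical curve that peaks at an intermediate stressor level, and the split into "low" and "high" halves is an arbitrary break point chosen for the illustration.

```python
import statistics

# Hypothetical stressor levels 1..10 with a response that is best in the middle.
stressor = list(range(1, 11))
response = [30 - (s - 5.5) ** 2 for s in stressor]

# Split into two categories at an arbitrary break point.
low_group = [r for s, r in zip(stressor, response) if s <= 5]
high_group = [r for s, r in zip(stressor, response) if s > 5]

low_median = statistics.median(low_group)
high_median = statistics.median(high_group)
print(low_median, high_median)
```

Both categories have the same median even though the response depends strongly on the stressor, so paired box plots of these two groups would suggest no relationship.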
M.3.4. Helpful Tips
- Box plots are easily constructed and may clearly display results, but regression analysis or quantile regression are generally more powerful statistical techniques for revealing causal relationships.
- When gathering evidence for a causal assessment, do not double count your evidence by including box plots with the same data grouped by impairment and by stressor intensity.
- In preparing box plots, you are implicitly assuming that all sites are equally weighted. For data in which samples are weighted unequally (e.g., probabilistic sampling designs), cumulative distribution functions are more appropriate for examining the distribution of values.
- While box plots are more commonly used, you may want to consider other graphical exploratory data analysis techniques.
- Be cautious when splitting data into high stressor/low stressor or impaired/unimpaired categories. The break points you select will affect data interpretation.
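For unequally weighted samples, the weighted cumulative distribution function mentioned in the tips above can be sketched as follows. The function name `weighted_cdf` and the sample values and weights are hypothetical; in a real probabilistic survey the weights would come from the sampling design.

```python
def weighted_cdf(values, weights):
    """Return [(value, cumulative proportion)] using per-sample weights."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    cdf = []
    for value, weight in pairs:
        cumulative += weight
        # Proportion of total weight at or below this value.
        cdf.append((value, cumulative / total))
    return cdf

# Hypothetical measurements and design weights.
cdf = weighted_cdf([3, 1, 2], [1, 2, 1])
print(cdf)
```

Here the smallest value carries half of the total weight, so the curve reaches 0.5 at the first point instead of one third as it would with equal weights.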