Final Report: Statistical Modeling of Waterborne Pathogen Concentrations

EPA Grant Number: R827952
Title: Statistical Modeling of Waterborne Pathogen Concentrations
Investigators: Stedinger, Jery, Ruppert, David
Institution: Cornell University
EPA Project Officer: Hahn, Intaek
Project Period: January 21, 2000 through January 20, 2003
Project Amount: $305,493
RFA: Environmental Statistics (1999)
Research Category: Environmental Statistics, Health, Ecosystems


The U.S. Environmental Protection Agency (EPA) strives to maintain high-quality drinking water supplies by setting treatment standards for the removal of pathogens (especially Cryptosporidium and Giardia) from raw waters. Because new regulations for pathogen treatment will be national in scope, EPA must develop a good understanding of the distribution of pathogen concentrations where people are at risk of infection. Cryptosporidium is found in most streams throughout the country. When and why large concentrations occur is not well understood, nor is the probability distribution of pathogen concentrations. To determine the risk of infection, EPA must understand this variability in concentrations at specific sites around the country where raw water is treated for household and industrial use.

The research focused on the development of statistical methods to describe environmental distributions of microorganisms, such as the protozoa Giardia lamblia and Cryptosporidium parvum, to support health risk analyses, dynamic risk assessment, and the corresponding water treatment decisions. Statistical procedures are needed that use data on waterborne pathogens from many sites to characterize the distribution of microorganisms over time at one place and from site to site. The analysis is complicated because the recovery rate in the laboratory analyses varies from sample to sample, and because environmental concentrations are relatively low, so that a majority of samples yield zero counts. Thus the analysis of the available national Information Collection Rule (ICR) Cryptosporidium parvum datasets poses a significant challenge because of the imprecision associated with laboratory measurements and the frequent occurrence of zero counts, which provide less information on concentrations than do larger counts.

Summary/Accomplishments (Outputs/Outcomes):

In natural waters, pathogen concentrations vary over time and space. The research project developed a general statistical methodology for modeling environmental pathogen concentrations in natural waters. A hierarchical model of pathogen concentrations captures site and regional random effects as well as random laboratory recovery rates. Recovery rates were modeled by a Generalized Linear Mixed Model (GLMM), in which the volume analyzed served as a covariate that explained variations in laboratory recovery rates.
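The structure described above can be sketched as a small generative simulation. This is only an illustration under stated assumptions, not the model fitted in the project: the Poisson counting step, the logistic form of the recovery model, the function names, and every numerical value below are hypothetical choices made for the example.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method (suitable for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_regions=3, sites_per_region=4, samples_per_site=12, seed=1):
    """Simulate pathogen counts from a hierarchical model with random recovery rates.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    log_mean_conc = math.log(0.05)            # assumed overall mean, organisms per liter
    sd_region, sd_site, sd_time = 0.5, 0.8, 1.0
    records = []
    for region in range(n_regions):
        region_effect = rng.gauss(0.0, sd_region)       # regional random effect
        for site in range(sites_per_region):
            site_effect = rng.gauss(0.0, sd_site)       # site random effect
            for _ in range(samples_per_site):
                volume = rng.uniform(1.0, 10.0)         # liters analyzed
                # Assumed recovery model: logistic in log-volume plus sample-level noise.
                logit_r = -1.5 - 0.4 * math.log(volume) + rng.gauss(0.0, 0.7)
                recovery = 1.0 / (1.0 + math.exp(-logit_r))
                conc = math.exp(log_mean_conc + region_effect + site_effect
                                + rng.gauss(0.0, sd_time))   # time-varying concentration
                count = sample_poisson(volume * recovery * conc, rng)
                records.append((region, site, volume, recovery, conc, count))
    return records


records = simulate_counts()
zero_fraction = sum(1 for r in records if r[-1] == 0) / len(records)
print(f"{len(records)} samples, fraction of zero counts: {zero_fraction:.2f}")
```

With low concentrations and small recovery rates, most simulated samples yield zero counts, which is the feature of the data that makes inference difficult.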

Two classes of pathogen concentration models are differentiated according to their ultimate purpose: water quality prediction or health risk analysis. Water quality prediction can employ variables such as pH, temperature, flow rate, or turbidity levels that can be measured daily, whereas health risk analysis uses only covariates that are predictable, such as urban development, cattle population, and season. A fully Bayesian analysis using Markov Chain Monte Carlo (MCMC) simulation was developed for statistical inference with either model. The applicability of this methodology was illustrated by the analysis of a national survey of Cryptosporidium parvum concentrations, in which 93 percent of the observations were zero counts.

An initial effort used a Generalized Linear Model (GLM) to describe the laboratory recovery rates for both Cryptosporidium and Giardia based upon an EPA ICR spiking study. In general, recovery rates are small and highly variable for both, though they are larger and less variable for Giardia. The analysis revealed that turbidity or volume analyzed as a covariate is statistically significant for Cryptosporidium but not for Giardia. On the other hand, laboratory effects are appreciable for Giardia but not for Cryptosporidium. Because recovery rates for Cryptosporidium and Giardia are small and highly variable, ignoring recovery rates in an analysis of environmental concentrations would underestimate concentrations and exaggerate variability.
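One way to see why ignoring recovery distorts the analysis is a short calculation under an assumed counting model, which is an illustration and not taken from the report. Suppose a sample of volume V from water with concentration c yields a count Y that, given the recovery rate r, is Poisson with mean Vrc, and that r varies from sample to sample with mean \mu_r < 1 and variance \sigma_r^2. The symbols Y, V, c, r, \mu_r, and \sigma_r^2 are introduced here only for this sketch.

```latex
\begin{align}
E\!\left[\frac{Y}{V} \,\middle|\, c\right] &= c\,\mu_r , \\
\operatorname{Var}\!\left(\frac{Y}{V} \,\middle|\, c\right)
  &= \frac{E[\operatorname{Var}(Y \mid r, c)] + \operatorname{Var}(E[Y \mid r, c])}{V^2}
   = \frac{c\,\mu_r}{V} + c^2 \sigma_r^2 .
\end{align}
```

The first line shows that the naive estimate Y/V understates the concentration by the factor \mu_r. In the second line, the term c^2 \sigma_r^2 is variation caused by the laboratory recovery process; an analysis that ignores recovery would attribute it to variation in the environmental concentration.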

Hierarchical models of pathogen concentrations that capture variation through site and regional random effects, as well as random laboratory recovery rates, fall in the general class of GLMMs. Such models can be evaluated using a fully Bayesian statistical analysis that employs MCMC simulation for statistical inference. The performance of such MCMC simulations was assessed for these problems. For datasets with many small and zero counts, mixing of the MCMC chains can be very poor, resulting in very low numerical efficiency. Several reparameterizations were explored to improve the numerical performance of the MCMC algorithm, particularly the mixing rate. For some parameters, dramatic performance improvements were possible using orthogonalization and centering of covariates, hierarchical centering of random effects, and gamma instead of log-normal random effects. The gamma random effects allowed analytical integration of the time random effects, yielding a negative binomial distribution for observed counts at a site, conditional upon the recovery rate and the site mean. The details are reported in Crainiceanu et al. (2002), which summarizes this effort.
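The analytical integration of gamma random effects mentioned above can be checked numerically. The sketch below assumes a Poisson count whose mean is a fixed quantity (standing in for volume times recovery rate times site mean) multiplied by a gamma random effect with mean one. The parameterization, the function names, and the values of `size` and `mean` are assumptions made for this example, not the report's own specification.

```python
import math


def poisson_pmf(y, mean):
    """Poisson probability of count y for a positive mean."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))


def gamma_pdf(x, shape, rate):
    """Gamma density at x > 0 with the given shape and rate."""
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(x)
                    - rate * x - math.lgamma(shape))


def neg_binomial_pmf(y, size, mean):
    """Negative binomial probability of count y, parameterized by size and mean."""
    log_p = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
             + size * math.log(size / (size + mean))
             + y * math.log(mean / (size + mean)))
    return math.exp(log_p)


def mixed_pmf(y, size, mean, upper=20.0, steps=40000):
    """Integrate the Poisson pmf over a Gamma(size, size) random effect (midpoint rule)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h                      # random effect value, mean one overall
        total += poisson_pmf(y, lam * mean) * gamma_pdf(lam, size, size)
    return total * h


size, mean = 2.0, 0.3                            # hypothetical values
for y in range(5):
    print(y, mixed_pmf(y, size, mean), neg_binomial_pmf(y, size, mean))
```

The two columns should agree to within the error of the numerical integration. In an MCMC setting, replacing the per-sample random effects with the closed-form negative binomial likelihood removes those latent variables from the sampler, which is one reason such a reparameterization can improve mixing.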

Crainiceanu et al. (2003), our final paper, develops a fully Bayesian modeling framework for understanding the variation in environmental pathogen concentrations across sites and across time. The hierarchical model captures site and regional effects, and includes environmental covariates, such as flow rate, and physical and land-use characteristics of different basins. The statistical model is applicable even when the historical pathogen counts are subject to sizeable variation in recovery rates and include many small counts, zero counts, and missing data. The methodology is also applied to laboratory recovery rates, where the data are discrete counts whose mean is explained by covariates including log-volume analyzed and laboratory effects.

Even though both pathogen concentration models are relatively complex, they were easily analyzed with WinBUGS, a standard package for the numerical evaluation of the posterior distribution of a Bayesian model using MCMC simulation. Overall, this research project demonstrated that hierarchical Bayesian models are a flexible and numerically feasible general statistical methodology for describing environmental concentrations of pathogens and other microbiological organisms.

Journal Articles on this Report:

Crainiceanu CM, Stedinger JR, Ruppert D, Behr CT. Modeling the US national distribution of waterborne pathogen concentrations with application to Cryptosporidium parvum. Water Resources Research 2003;39(9):1235. R827952 (Final)

Supplemental Keywords:

drinking water, risk assessment, Bayesian analysis, hydrology, Information Collection Rule, Cryptosporidium parvum, generalized linear mixed models, GLMMs, hierarchical empirical Bayesian models, RFA, Scientific Discipline, Economic, Social, & Behavioral Science Research Program, Environmental Chemistry, Health Risk Assessment, Environmental Microbiology, Environmental Statistics, Ecological Risk Assessment, health risk analysis, ecosystem assessment, multiple response variables, Bayesian method, computer models, waterborne pathogen concentrations, statistical models, data analysis, innovative statistical models, Cryptosporidium, Giardia lamblia, generalized linear models
