Final Report: Combining Environmental Data Sets

EPA Grant Number: R829095C001
Subproject: this is subproject number 001, established and managed by the Center Director under grant R829095
(EPA does not fund or establish subprojects; EPA awards and manages the overall grant for this center).

Center: Space-Time Aquatic Resources Modeling and Analysis Program (STARMAP)
Center Director: Urquhart, N. Scott
Title: Combining Environmental Data Sets
Investigators: Hoeting, Jennifer A.; Breidt, F. Jay; Davis, Richard A.; Gitelman, Alix I.; Johnson, Devin S.; Ritter, Kerry J.; Stevens, Don L.
Institution: Colorado State University, Oregon State University, Southern California Coastal Water Research Project Authority, University of Alaska - Fairbanks
EPA Project Officer: Packard, Benjamin H
Project Period: October 1, 2001 through September 30, 2006
RFA: Research Program on Statistical Survey Design and Analysis for Aquatic Resources (2001)
Research Category: Aquatic Ecosystems, Ecological Indicators/Assessment/Restoration, Water and Watersheds, Water, Ecosystems


The objectives are to develop approaches for spatio-temporal design and modeling to further understanding of aquatic resources; and develop, test, and implement alternative spatial sampling designs for near-coastal systems.

Summary/Accomplishments (Outputs/Outcomes):

This project expanded the analysis and interpretation tools available to aquatic scientists and the statisticians who assist them, especially tools that use spatial or temporal models and tools that use hierarchical (Bayesian) methods. Specifically, this project:

  • Adapted spatial statistical models to accommodate the branching nature of stream networks;
  • Implemented statistical computing tools to support the selection of predictor variables in an aquatic context;
  • Developed hierarchical methods for the analysis of categorical responses, of the sort resulting from macroinvertebrate studies in streams;
  • Developed hierarchical methods for the analysis of ordered categorical responses, of the sort resulting from studies to monitor stream health;
  • Developed and demonstrated hierarchical analysis methods for assigning causes of effects in aquatic systems like stream networks;
  • Produced a textbook that surveys modern methods in computational statistics;
  • Expanded temporal methods for identifying structural breaks;
  • Investigated the uncertainty associated with contour curves developed from spatial statistical models;
  • Developed sampling plans for near-coastal systems;
  • Expanded components of variance tools for characterizing both temporal and spatial variability; and
  • Trained future statisticians, some of whom are already working in environmental fields.

Summary of Work

Project 1 involved a substantial number of researchers working in a large variety of areas. This section summarizes the output of Project 1.

Methodology for Statistical Modeling of Spatially-referenced, Aquatic, and Other Environmental Data. Principal investigator (PI) Hoeting produced a textbook, Computational Statistics, under this cooperative agreement, co-authored by G.H. Givens. This graduate-level textbook surveys a wide variety of topics in modern statistical computing and computational statistics, including optimization, integration (including Markov chain Monte Carlo [MCMC] methods), bootstrapping, and smoothing. The book has been adopted widely for teaching and is being used by statisticians and non-statisticians alike as a reference book on methods for computational statistics. The book includes a Web page with related software and numerous examples and homework problems. Computational Statistics has become a bestselling textbook for Wiley (publisher) and is currently in its fourth printing.

Investigator Gitelman’s “Isomorphic Chain Graphs for Modeling Spatial Dependence in Ecological Data” by Gitelman and Herlihy (published in Environmental and Ecological Statistics) was an important contribution toward developing causal inference models. In this paper, they extend Bayesian belief network models (also called acyclic directed graphs) to accommodate correlation through space.

In conjunction with this work, Investigator Gitelman was invited to serve as the guest associate editor for a special forum on the application of Bayesian belief networks to natural resource management problems organized by the Canadian Journal for Forest Research. She was invited to participate in a section on models for multi-scale analysis at the 2005 Annual Meeting of the Ecological Society of America in Montreal and to review a collection of papers for inclusion in EcoHealth, a Springer publication in environmental science (to appear).

PI Hoeting, PI Davis, Space-Time Aquatic Resources Modeling and Analysis Program (STARMAP)-funded student Andrew Merton, and S.E. Thompson (Pacific Northwest National Laboratory) developed important methodology for model selection in regression-like spatial models (called geostatistical models). In a paper that appeared in Ecological Applications these co-authors show that ignoring spatial correlation can result in models being selected that are not reflective of the variables that generated the data. This work also produced freely available software that has already been used in several publications to select explanatory variables for geostatistical models.

In closely allied work, Davis and Merton derived the limiting behavior for the maximum likelihood estimator of the range parameter in an exponential covariance function under various scenarios including infill and increasing domain asymptotics. Limit behavior for the case when the sampling strategy within blocks was clustered, regular, or random was also considered. It was shown that for the exponential case in one dimension, all three sampling paradigms produced asymptotically equivalent estimates.
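As a rough illustration of the estimation problem described above, the following sketch evaluates the Gaussian likelihood for a zero-mean process with an exponential covariance function in one dimension and picks the range parameter from a grid. It is a minimal sketch, not the investigators' procedure: the function names (`exp_cov`, `neg_log_lik`, `fit_range`), the parameterization C(h) = sigma2 * exp(-|h| / range), the known variance, and the grid search are all assumptions made for the example.

```python
import numpy as np

def exp_cov(locs, sigma2, range_par):
    """Exponential covariance matrix, C(h) = sigma2 * exp(-|h| / range_par).

    This parameterization is an assumption of the sketch.
    """
    h = np.abs(locs[:, None] - locs[None, :])
    return sigma2 * np.exp(-h / range_par)

def neg_log_lik(range_par, locs, z, sigma2=1.0):
    """Negative Gaussian log-likelihood for a zero-mean process (sigma2 held fixed)."""
    cov = exp_cov(locs, sigma2, range_par)
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol, z)            # whitened observations
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    n = len(z)
    return 0.5 * (log_det + alpha @ alpha + n * np.log(2.0 * np.pi))

def fit_range(locs, z, grid):
    """Grid-search estimate of the range parameter (hypothetical helper)."""
    values = [neg_log_lik(r, locs, z) for r in grid]
    return float(grid[int(np.argmin(values))])

# Example: regular sampling on [0, 10] with a true range of 2.
locs = np.linspace(0.0, 10.0, 40)
rng = np.random.default_rng(0)
z = np.linalg.cholesky(exp_cov(locs, 1.0, 2.0)) @ rng.standard_normal(40)
estimate = fit_range(locs, z, np.linspace(0.5, 5.0, 10))
```

Replacing the regular `locs` with clustered or random locations gives a simple way to compare sampling patterns by simulation, in the spirit of the comparison described above.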

Investigator Gitelman, in collaboration with PI Hoeting and STARMAP-funded student Irvine (at Oregon State University [OSU]), has submitted a manuscript that examines the properties of spatial covariance estimates under varying amounts of spatial dependence, different sampling designs, and different sample sizes. This work demonstrated that large sample sizes are needed to accurately model spatially-dependent data and that the sampling pattern can affect the quality of parameter estimates.

Investigator Gitelman and STARMAP-funded student Megan Dailey developed new methodology reported in the paper “Habitat selection models to account for seasonal persistence in radio telemetry data,” published in Environmental and Ecological Statistics. In this work, the authors build a flexible hierarchical model for seasonal persistence and seasonal changes in habitat selection, all fit using Bayesian modeling. The model was applied in an attempt to understand habitat selection behaviors in fish in Oregon streams.

PI Hoeting worked with ecologists to investigate various factors affecting the accuracy of predicted species distributions in a paper that appeared in Ecological Applications in 2005 (co-authors G. Reese, K. Wilson, and C. Flather). In related work on species distributions, Hoeting served as the discussant for ground-breaking work on Bayesian models for species distributions (conference and journal article in Bayesian Analysis in 2006). In another paper that appeared in Ecological Applications in 2006, PI Hoeting (and co-authors M. Farnsworth, N.T. Hobbs, and M.W. Miller) developed models to examine how animal movement can be linked to the spread of disease. In a manuscript that appeared in Biometrics in 2003, PI Hoeting and STARMAP-funded student D. Johnson developed new models for capture-recapture data. While not directly related to aquatic resource monitoring, the models that appeared in these publications will help ecologists further understand where species live, why they live there, and how diseases impact these organisms. This work, in conjunction with aquatic resource monitoring, will help us further understand entire ecosystems.

STARMAP-funded student D.S. Johnson, PI Hoeting, and N.L. Poff (a researcher on U.S. Environmental Protection Agency [EPA]-funded Science To Achieve Results [STAR] project R828636) developed new models for monitoring aquatic resource data. The new models are for multiple variable, categorical data where researchers are interested in modeling the proportion of observations in each group as well as the relationships between these groups and various explanatory variables. These models were used to investigate the relationship between fish traits and environmental predictors of the presence of fish with these traits, using the EPA Environmental Monitoring and Assessment Program (EMAP) Mid-Atlantic Highlands Assessment (MAHA) data set. This work appeared in a recent book Bayesian Statistics and its Applications (2006).

Confidence Bounds for Map Contours. Josh French, a graduate student formerly supported by STARMAP, is working with Richard Davis to devise confidence bands around level curves for a spatial field. The idea is that from a map produced via kriging one can display level curves of the predicted conditional mean. However, error bounds for these level curves are not easy to construct or even to define in a probabilistic sense. Davis and French are developing procedures that allow one to put confidence bands around level curves, displayed as small rectangular boxes that have a preset confidence probability. In other words, these boxes can be constructed so that we are 95 percent confident the process takes the desired threshold value somewhere in the box. The boxes are then linked together to give a “confidence set” for the level curve. This work is still ongoing, illustrating that the effect of STARMAP’s funding from EPA is extending past its funding period.

Structural Breaks in Time Series. PI Davis worked with colleague Thomas Lee and postdoctoral fellow Gabriel Rodriguez-Yam, exploring the problem of detecting structural breaks in a time series. The key idea was to assume that the nonstationary time series can be well represented by piecewise autoregressions (AR). The principle of minimum description length (MDL) was used to assess the quality of fit for various structural break locations and the genetic algorithm was used to find near optimal minima of the MDL. The number of structural breaks, their locations, and the orders of the respective piecewise AR models were assumed unknown. This paradigm seemed to work well in a variety of applications. This research was published in the Journal of the American Statistical Association.
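The idea of scoring candidate break locations with a description-length criterion can be sketched as follows. This is a simplified illustration, not the published procedure: it uses an MDL-style penalized cost written for this example, fixes the AR order, allows at most one break, and replaces the genetic algorithm with an exhaustive search over break locations. The function names and the `min_len` parameter are hypothetical.

```python
import numpy as np

def ar_resid_var(x, p):
    """Least-squares AR(p) fit with intercept; returns the residual variance."""
    y = x[p:]
    cols = [np.ones(len(y))] + [x[p - k:len(x) - k] for k in range(1, p + 1)]
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(np.mean(resid ** 2))

def segment_cost(x, p):
    """MDL-style cost of one segment: parameter penalty plus Gaussian fit term.

    The exact form of the penalty is an assumption of this sketch.
    """
    n = len(x)
    penalty = 0.5 * (p + 2) * np.log(n)
    fit = 0.5 * n * np.log(2.0 * np.pi * ar_resid_var(x, p))
    return penalty + fit

def best_single_break(x, p=1, min_len=20):
    """Exhaustive search for at most one break; returns (break index or None, cost)."""
    n = len(x)
    best_tau, best_cost = None, segment_cost(x, p)   # baseline: no break
    for tau in range(min_len, n - min_len + 1):
        cost = np.log(n) + segment_cost(x[:tau], p) + segment_cost(x[tau:], p)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau, best_cost

# Example: the noise scale jumps from 1 to 5 at index 150.
rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(0, 1, 150), rng.normal(0, 5, 150)])
tau_hat, cost_hat = best_single_break(series)
```

With several breaks and unknown AR orders the search space grows too quickly for exhaustive search, which is why a stochastic optimizer such as the genetic algorithm mentioned above becomes attractive.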

Methodology for Sampling. Investigator Ritter and associates of the Southern California Coastal Water Research Project (SCCWRP) developed spatially distributed sampling plans suited to wastewater oceanic outfalls. Their project investigated cost-effective ways to distribute sample points in a near-coastal system to support estimation of the spatial pattern of various analytes and macroinvertebrate indices. The work is summarized in a paper that appeared in Environmental and Ecological Statistics. Ritter applied these methods to produce a design which was implemented by the San Diego Metropolitan Wastewater District.

Some of Director Urquhart’s investigations of components of variance concerned temporal and spatial matters so are reported under this Project. When field visits to aquatic sites are widely distributed in space, and occasionally in time, components of variance can be used to capture most of the spatial and temporal variation but not to model its form. Nevertheless such characterizations have proved useful in evaluating the likely (statistical) power to detect trend. The developed methodology was used to compare temporal or revisit designs relative to their power to detect trend. Some are decidedly better than others, a fact which was communicated to various interested clients.
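To show how components of variance feed into a power calculation for trend, the sketch below uses a deliberately simplified setting that is an assumption of this example rather than the methodology developed under the project: the same `n_sites` sites are revisited every year, the annual mean has variance `var_year + var_resid / n_sites`, a linear trend is fitted to the annual means, and power comes from a two-sided normal approximation. The function name `trend_power` and all parameter names are hypothetical.

```python
from statistics import NormalDist

def trend_power(slope, years, n_sites, var_year, var_resid, alpha=0.05):
    """Approximate power to detect a linear trend in annual means.

    Assumes an always-revisit design and a normal approximation; both are
    simplifications made for this sketch.
    """
    t_bar = (years - 1) / 2
    sxx = sum((t - t_bar) ** 2 for t in range(years))
    # The year effect does not average out across sites; the residual does.
    se_slope = ((var_year + var_resid / n_sites) / sxx) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(slope) / se_slope
    std_normal = NormalDist()
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Example: compare a 10-year and a 20-year survey for the same trend.
power_10 = trend_power(0.05, years=10, n_sites=30, var_year=0.1, var_resid=1.0)
power_20 = trend_power(0.05, years=20, n_sites=30, var_year=0.1, var_resid=1.0)
```

In this simplified model, adding sites reduces only the residual term, so a large year-to-year component limits what additional sites can buy; comparing real revisit designs would require the fuller variance structure described above.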

Training Future Environmental Statisticians

A significant outcome of this project was the training of a large number of students in environmental statistics. A number of these students are now working in fields directly related to environmental statistics in the United States. The students involved in Project 1 include six researchers who have completed their degrees and four students whose degrees are in progress. Researchers who have completed their degrees include:

  • Devin Johnson, Ph.D. from Colorado State University (CSU) in Statistics, 2003. Thesis title: Bayesian Analysis of State-Space Models for Discrete Compositions. After working for 2 years on the faculty at the University of Alaska at Fairbanks, Johnson is now a statistician at the National Marine Mammal Laboratory, Alaska Fisheries Science Center, National Oceanic and Atmospheric Administration (NOAA). Advisor: PI Hoeting.
  • Steve Jensen, M.S. from OSU in Statistics, Brief Introduction to Reversible Jump MCMC for Bayesian Networks and an Application (advisor: Investigator Gitelman). Jensen currently works for a small graphical modeling software company, which includes environmental applications.
  • Brett Kellum, M.S. from CSU in Statistics, 2003. Analysis and Modeling of Acid Neutralizing Capacity in the Mid-Atlantic Highlands Area. Advisor: PI Hoeting.
  • Andrew Merton, Ph.D. from CSU in Statistics, 2006, Geostatistical Models: Model Selection and Parameter Estimation under Infill and Expanding Domain Asymptotics, Advisors: PIs Hoeting and Davis. Merton is currently working as a contract employee for the National Park Service. He is developing sampling plans for inventory and monitoring of vegetation, water quality, and a number of other factors for a number of National Parks in the Western United States.
  • Julia Smith, M.S. from CSU in Statistics, 2005, Modeling and Predicting Median Substrate Size in Oregon and Washington Streams Utilizing Geographic Information Systems. This work was joint work with Brian Bledsoe, another EPA STAR-grant researcher. Ms. Smith teaches high school advanced placement statistics in Anchorage, Alaska, transmitting ideas of environmental statistics to future generations of environmental scientists.
  • Sarah Williams, M.S. from CSU, 2006. A Comparison of Variance Estimates of Stream Network Resources. (Advisor: Director Urquhart.)

Degrees currently in progress:

  • Stephanie Fitchett, current M.S. student, CSU. Anticipated graduation date: 2007. Fitchett is extending work by Erin Peterson (STARMAP Project 3 [R829095C003]) to develop user-friendly guidelines for GIS users as well as geostatistical models to be used to analyze stream monitoring data. (Advisors: Director Urquhart and PI Hoeting.)
  • Joshua French, current Ph.D. student, CSU. Anticipated graduation date: 2008. French is developing models for uncertainty in contour lines for spatial maps, including contours for oceanic pollutants. (Advisor: PI Davis.)
  • Megan Dailey Higgs, current Ph.D. student, CSU. Anticipated graduation date: 2007. Higgs is currently working on Bayesian models for ordered categorical spatial data and categorical habitat data. These models are being developed for analysis of a number of EPA data sets and will be particularly useful to model data on stream health and to predict health at unobserved sites. Advisor: PI Hoeting.
  • Kathryn Irvine, current Ph.D. student, OSU. Anticipated graduation date: 2007. Irvine’s Ph.D. work is in two separate but related topics. She examined behavior of parameter estimates in geostatistical models in a paper currently submitted to a peer-reviewed journal. For a related paper, Irvine was awarded second place honorable mention in the student paper competition sponsored by the Section on Statistics in the Environment of the American Statistical Association in 2006. Her current Ph.D. work extends the work of Gitelman and Herlihy on several features of graphical models that will allow simple-to-understand graphical model diagrams explaining relationships for aquatic data. In a position similar to Merton’s described above, Irvine is also currently working as a contract employee for the National Park Service developing survey and monitoring plans for National Parks in the Western United States.

Another outcome of the project is the postdoctoral training offered to Man Sik Park, who worked on the project from late 2005 through fall 2006. This work produced two manuscripts, both of which have been submitted to peer-reviewed journals, and one working paper. Of these three papers, two propose new ways to examine spatio-temporal data, and one proposes new models for spatially-referenced data. The last paper, related to Gaussian Markov random fields, can be used to analyze very large data sets and is an improvement over existing methods.

The investigators on this project were widely scattered throughout the Western United States. The project and the related STARMAP and Designs and Models for Aquatic Resource Surveys (DAMARS) workshops allowed for additional interactions that would not have been possible without EPA funding. These links have led, and will continue to lead, to the development of new methodology to collect and analyze aquatic data. Interactions between Ritter and Urquhart; Hoeting and Gitelman; and Merton, Theobald (Project 3), Urquhart, Ver Hoef, and Peterson are just a few examples of the new linkages that developed under this funding. All these interactions produced new methodology for the analysis of aquatic resource data.

Significance of Accomplishments

A major accomplishment of Project 1 was the training of statisticians in environmental problems. As described above, a number of these students are already using the knowledge gained from working on the STARMAP project in various government agencies and private businesses.

The model selection component of the research accomplishments serves as a warning to scientists in selecting covariates in geospatial models. This process needs to be conducted in concert with the modeling of the error term. Often the error term can be used as a proxy for missing covariates or can be used as a correction factor for incorrectly selected covariates.

The asymptotic theory for the exponential covariance function is useful to the scientist for selecting optimal sampling strategies with the goal of producing the most efficient parameter estimates. In the exponential case, it is difficult to beat a uniform sampling plan. This might change for the Matern covariance function, which is a topic of future study.

The structural break detection research has shown great promise in segmenting a time series into stationary segments. The strategy developed, called AutoPARM for Automatic Piecewise AR Modeling, is a general procedure that overcomes many of the limitations and defects of previously proposed procedures. In addition, very few assumptions are made in this framework. We intend to explore versions of AutoPARM that would apply more directly in the geospatial context.

Ritter of the SCCWRP investigated cost-effective ways to distribute sample points in near-coastal systems to support estimation of the spatial pattern of various analytes and macroinvertebrate indices. The resulting design was implemented by the San Diego Metropolitan Wastewater District. This design will likely serve as a prototype for many similar studies along the California coast, developed in collaboration with the SCCWRP.

The work on power to detect trend in aquatic surveys by Urquhart has been and is being used by designers of both aquatic and terrestrial surveys to make effective use of limited resources. This work is widely cited in publications concerned with the design of environmental surveys.

Stakeholders and Users of Results

Since Project 1 covered a wide variety of topics, the stakeholders and users of the results also vary widely.

The textbook Computational Statistics has been adopted by a number of universities including Stanford University, The Ohio State University, University of Minnesota, Bowling Green State University, and others. The book is also sold internationally. It is anticipated that as more universities adopt the textbook, this work will be disseminated throughout the United States and used to educate statisticians, with particular emphasis on the education of future environmental statisticians due to the ecological examples used in the book.

This textbook has resulted in four short courses, to date. These courses have served a wide audience of U.S.-based statisticians, including a number of statisticians working for federal and state government agencies, including several from environmental agencies. The courses were presented to:

  • Statisticians attending the Joint Statistical Meetings, August 2006, Seattle, WA.
  • Statisticians affiliated with the Alaska chapter of the American Statistical Association (participants were mainly employed by U.S. Fish and Wildlife Service, Alaska Fish & Game Department, and affiliated with the University of Alaska at Fairbanks).
  • Statisticians attending the Joint Statistical Meetings, August 2005, Minneapolis.
  • Statisticians attending the American Statistical Association Section on Statistics and the Environment and the University of Chicago Center for Integrating the Statistical and Environmental Sciences, October 2004, Chicago.

We expect that the modeling tools that we have developed will be useful to a wide range of users of statistics in the environmental sciences, geosciences, biology, and medical professions. A current example of the structural break work deals with recorded sounds in the National Parks. There is interest in segmenting a long audio stream of recordings obtained from microphones strategically placed in many of the National Parks. One of the goals of this data collection process was to allow the National Park Service to measure and monitor noise pollution in the national parks. One measure of the pollution is the proportion of unnatural (man-made) sound heard in the parks. AutoPARM, software developed by the investigators, can be used to help segment the audio tapes into homogeneous pieces of natural and unnatural (e.g., snowmobile and jet plane noise) sounds. After segmenting the sounds, we attempt to classify the individual pieces into various categories of known sound types. This research is of value to various government agencies, including the National Park Service.

With regard to Ritter’s work, the San Diego Metropolitan Wastewater District used Ritter’s design. Other stakeholders include the people and other living organisms who utilize near-coastal waters and adjacent beaches. Her work shows how to design cost-effective studies of the consequences of oceanic wastewater outfalls.

The immediate users of Urquhart’s work are agency personnel who have to design ecological surveys; these include state-level environmental scientists in many states, including Oregon, Wisconsin, and Maryland. Longer term we all benefit from better and more defensible information being gathered in cost-effective ways.

How Products Will Further Science/Management of Resources

The statistical methodology, software, textbook, papers, short courses, and talks have supported, and will continue to support, the design of aquatic environmental studies and the analysis of the resulting data in diverse contexts. These tools and demonstrations will support the more accurate and defensible analysis of diverse environmental variables.

The training of statisticians versed in methods for modeling aquatic data has already resulted in new statisticians working in positions related to ecology and aquatic resources.

The design results of Ritter’s work will allow better management of wastewater outfalls. The associated current work of French demonstrates the uncertainty associated with the sorts of contours often computed by GIS applications, and used in regulatory statements.

Listing of Specific Communications Related to Combining Environmental Data Sets

The complete list of outputs from STARMAP, including those originating from Project 1, is available on the Web.

Journal Articles:

No journal articles submitted with this report: View all 78 publications for this subproject

Supplemental Keywords:

geostatistical modeling, computational statistics, latent processes, spatial covariance functions, model selection, sampling design, RFA, Scientific Discipline, Ecosystem Protection/Environmental Exposure & Risk, Aquatic Ecosystems & Estuarine Research, Aquatic Ecosystem, Environmental Monitoring, EMAP, ecosystem monitoring, statistical survey design, spatial and temporal modeling, aquatic ecosystems, water quality, Environmental Monitoring and Assessment Program, modeling ecosystems

  • Main Center Abstract and Reports:

    R829095    Space-Time Aquatic Resources Modeling and Analysis Program (STARMAP)

    Subprojects under this Center:
    R829095C001 Combining Environmental Data Sets
    R829095C002 Local Inferences from Aquatic Studies
    R829095C003 Development and Evaluation of Aquatic Indicators
    R829095C004 Extension of Expertise on Design and Analysis to States and Tribes
    R829095C005 Integration and Coordination for STARMAP