Grantee Research Project Results
Final Report: Using Multilevel Statistical Models to Address Representativeness and Data at Different Spatial and Temporal Scales
EPA Grant Number: R826763
Title: Using Multilevel Statistical Models to Address Representativeness and Data at Different Spatial and Temporal Scales
Investigators: Berk, Richard; Ambrose, Richard
Institution: University of California - Los Angeles
EPA Project Officer: Hahn, Intaek
Project Period: October 1, 1998 through September 30, 2000
Project Amount: $414,149
RFA: Regional Scale Analysis and Assessment (1998)
Research Category: Aquatic Ecosystems, Ecological Indicators/Assessment/Restoration
Objective:
The objectives of the research project were to develop extensions of multi-level statistical models and appropriate software to address how to use data from a set of research sites to determine when generalizations are appropriate.
Summary/Accomplishments (Outputs/Outcomes):
Multi-level statistical models are characterized by analyses undertaken simultaneously at different levels of aggregation or spatial/temporal scales. For example, one might study several reaches of a stream at each of a number of research sites. Or one might study several transects in each of several forests. The basic idea in multi-level models is to have a regression equation characterizing relationships at the smaller, or micro, level, and then have one or more of the regression coefficients at the micro level be a function of predictors at the macro level. At the micro level, for instance, taxa richness may be a function of stream velocity (among other predictors). Then at the macro level, the regression coefficient linking stream velocity to taxa richness may be a function of proximity of the stream to land used for agriculture. Thus, one can address how the relationship between stream velocity and taxa richness varies (or not) in different locations, here with locale characterized by proximity to land used for agriculture. That is, one can learn when to generalize over sites and when not to generalize over sites. One also can learn how different temporal and/or spatial scales are linked.
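The micro/macro structure described above can be illustrated with a small simulation. The following is a minimal sketch only, assuming simulated data and a naive two-stage ("slopes as outcomes") fit; it is not the estimation procedure developed under this grant. All names (simulate_sites, two_stage_fit, gamma0, gamma1) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_sites(n_sites=40, n_reaches=30, gamma0=1.0, gamma1=-0.5, seed=0):
    """Simulate micro-level data whose slope depends on a macro-level predictor."""
    rng = np.random.default_rng(seed)
    # Macro level: one value per site (e.g., proximity to agricultural land).
    z = rng.uniform(0.0, 4.0, size=n_sites)
    # Each site's slope is a linear function of z plus site-level noise.
    slopes = gamma0 + gamma1 * z + rng.normal(0.0, 0.1, size=n_sites)
    # Micro level: one row per site, one column per reach (e.g., stream velocity).
    x = rng.normal(0.0, 1.0, size=(n_sites, n_reaches))
    noise = rng.normal(0.0, 0.5, size=(n_sites, n_reaches))
    y = 2.0 + slopes[:, None] * x + noise  # e.g., taxa richness
    return x, y, z

def two_stage_fit(x, y, z):
    """Stage 1: OLS slope within each site. Stage 2: regress slopes on z."""
    # np.polyfit with degree 1 returns [slope, intercept].
    site_slopes = np.array([np.polyfit(xj, yj, 1)[0] for xj, yj in zip(x, y)])
    gamma1_hat, gamma0_hat = np.polyfit(z, site_slopes, 1)
    return gamma0_hat, gamma1_hat, site_slopes

x, y, z = simulate_sites()
gamma0_hat, gamma1_hat, site_slopes = two_stage_fit(x, y, z)
print(f"macro intercept ~ {gamma0_hat:.2f}, macro slope ~ {gamma1_hat:.2f}")
```

A macro slope estimate near zero would suggest that the micro-level relationship generalizes across sites; an estimate far from zero suggests that it varies with the macro-level predictor.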
These sorts of relationships also can be formulated as interaction effects within conventional regression models. However, the usual estimation procedures used for those models will not properly characterize the uncertainty in the results, so that the usual confidence intervals and hypothesis tests will be wrong. Special estimation procedures are required. Such procedures are well known and widely available in existing software.
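To illustrate the interaction-effect formulation, the sketch below fits a single pooled ordinary least squares regression with a product term between a micro-level and a macro-level predictor. It is a hedged example on simulated data; the function name interaction_ols and all numeric settings are hypothetical. It yields point estimates only and, consistent with the caution above, provides no valid standard errors because it ignores the site-level variation in the slopes.

```python
import numpy as np

def interaction_ols(x, z, y):
    """Pooled OLS of y on [1, x, z, x*z].

    x and y have shape (sites, units); z has shape (sites,).
    Returns coefficients in the order: intercept, x, z, x*z.
    """
    zz = np.broadcast_to(z[:, None], x.shape)  # repeat each site's z per unit
    design = np.column_stack(
        [np.ones(x.size), x.ravel(), zz.ravel(), (x * zz).ravel()]
    )
    coef, *_ = np.linalg.lstsq(design, y.ravel(), rcond=None)
    return coef

rng = np.random.default_rng(1)
n_sites, n_units = 40, 30
z = rng.uniform(0.0, 4.0, size=n_sites)            # macro-level predictor
x = rng.normal(size=(n_sites, n_units))            # micro-level predictor
# Site slopes depend on z, with extra site-level noise that OLS ignores.
site_slope = 1.0 - 0.5 * z + rng.normal(0.0, 0.1, size=n_sites)
y = 2.0 + site_slope[:, None] * x + rng.normal(0.0, 0.5, size=(n_sites, n_units))

coef = interaction_ols(x, z, y)
print("coefficient on x*z:", round(float(coef[3]), 2))
```

The coefficient on the product term plays the role of the macro-level slope, which is why the two formulations describe the same relationship.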
Our goal was to extend multi-level models to more complicated and realistic situations. The first extension was to allow for spatial autocorrelation in the residuals of multi-level models. The problem addressed is that more proximate spatial units at the micro level can be expected to have disturbances that are more alike than spatial units at the micro level that are more distant from one another. Thus, transects that are closer together likely will have disturbances that are more alike than transects that are farther apart. Failing to take this spatial autocorrelation into account generally will lead to biased standard errors, and hence, inaccurate confidence intervals and hypothesis tests.
Formally, a good solution to this problem for linear regression can be found in a classic paper by J. Keith Ord. For the usual sorts of regression models, one constructs a matrix capturing the "distance" between all micro units within each macro unit (e.g., transects within sites) and builds that information into the estimation process. We initially adopted this approach, introduced it into a multi-level formulation, and applied it to two data sets. One data set was collected to study biodiversity in streams located in Ventura County, California, and the other was collected to study the impact of marine preserves on biodiversity and total fish biomass in coral reefs in the Philippines.
The results were disappointing. First, there was essentially no theory or empirical research in ecology or related disciplines to inform in sufficient detail the construction of the distance matrix. One difficulty was that it was not clear how to measure distance: given ocean currents, for example, the transport of nutrients occurs more readily between some locales than others. Another difficulty was that there are a number of different functions of that distance that could have been used in the distance matrix (e.g., exponential decay with increasing distance) and, again, no guidance from the scientific literature. It is our sense that similar problems are common for a wide variety of environmental applications.
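The choices discussed above can be made concrete with a short sketch that builds one possible distance matrix. It assumes straight-line (Euclidean) distance and exponential decay, which are exactly the kind of unverified choices the paragraph describes; the function name distance_weights, the decay_range parameter, and the row normalization are hypothetical and are not taken from the project's software.

```python
import numpy as np

def distance_weights(coords, decay_range=1.0, row_normalize=True):
    """Return pairwise distances and exponential-decay weights.

    coords: array of shape (n_units, 2) holding planar coordinates.
    decay_range: assumed scale controlling how fast the weights fall off.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # Euclidean distance (assumption)
    weights = np.exp(-dist / decay_range)      # exponential decay (assumption)
    np.fill_diagonal(weights, 0.0)             # a unit is not its own neighbor
    if row_normalize:
        weights = weights / weights.sum(axis=1, keepdims=True)
    return dist, weights

# Four hypothetical micro units (e.g., transects) within one site.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
dist, w = distance_weights(coords, decay_range=1.5)
print(np.round(w, 3))
```

Changing the distance measure or the decay function changes every entry of the matrix, which illustrates why the lack of scientific guidance on these choices is consequential.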
Second, except for very simple and somewhat unrealistic models, the numerical methods used in the estimation did not perform well. There were several technical reasons, but a key obstacle was that the regression coefficients and the distance matrix were "competing" for the same information. This was because the predictors necessarily also contained spatial information. Micro units that were closer also were likely to be more similar in the values of predictors than micro units that were farther apart. For instance, such predictors could include composition of the streambed and the amount of shading from trees along the banks. Because of the competition for spatial information, the output from the statistical models tended to be very unstable. Small changes in the model or the data could introduce large changes in the output.
Finally, we planned to move beyond multi-level linear models to multi-level generalized linear models. That way, we would be able to include popular procedures such as logistic regression for binary outcomes and Poisson regression for count data. Unfortunately, the Ord approach led to effectively intractable mathematical problems when applied to generalized linear models.
These three difficulties forced us to reconsider the entire project and the usual philosophy by which spatial modeling is undertaken. To begin, we suspect that for spatial regression models, far too much is made of the exact form of the distance matrix. With scant scientific guidance about how the distance matrix should be formulated, any one of several competing formulations may be applicable, but there is no way to know which. In addition, the distance matrix by itself is rarely of much scientific interest. Its usual role is to allow for more accurate estimates of the regression coefficients that are the real focus of scientific concern. In statistical parlance, the distance matrix represents a set of "nuisance parameters." At a deeper level, George Box's famous dictum that "all models are wrong, but some are useful," applies. Given the current state of subject-matter knowledge, it is naive to aim for the "right" model. And in the absence of the right model, many of the usual statistical concerns become relatively unimportant. In particular, confidence intervals and tests no longer have much probative value. Rather, one should develop models that are descriptively informative, relatively simple, and that capture in broad-brush strokes the essential features of the empirical world at hand.
These and other considerations naturally led us to consider methods by which the distance matrix could be well approximated in a manner that eliminated much of the instability produced by taking the Ord approach at face value. Two methods seem to be especially effective. One method extracts the eigenvectors of the distance matrix and uses the first few to adjust for spatial autocorrelation. This still requires, however, that a distance matrix be specified. The second method constructs simple functions of the spatial coordinates (e.g., longitude and latitude) and uses these to adjust for spatial autocorrelation. For example, one might include longitude, latitude, and their product. Analysis of real data and many simulations indicate that both methods work well, although the second method is somewhat simpler to implement. Moreover, one can in both cases improve the approximation of the distance matrix as much as desired by using more of its eigenvectors or more complicated functions of the spatial coordinates. That is, one can make the approximations arbitrarily close to the specified distance matrix (proofs available upon request), although at some point the instabilities reappear. Finally, we have developed novel algorithms for estimating multi-level linear models with spatial autocorrelation that have been implemented in our software. The formal properties of these procedures also have been derived.
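The two adjustment methods described above can be sketched as extra regressors in an ordinary least squares fit. This is an illustrative sketch on simulated data, not the algorithms implemented in the project's software. Several choices are assumptions: the eigenvectors are taken from an exponential-decay matrix of Euclidean distances, "the first few" is read as those with the largest eigenvalues, and all function names and numeric settings are hypothetical.

```python
import numpy as np

def coordinate_terms(lon, lat):
    """Simple functions of the coordinates: longitude, latitude, their product."""
    lon_c, lat_c = lon - lon.mean(), lat - lat.mean()
    return np.column_stack([lon_c, lat_c, lon_c * lat_c])

def leading_eigenvectors(sym_matrix, k):
    """Eigenvectors of a symmetric matrix for the k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(sym_matrix)  # eigh returns ascending order
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]

def slope_on_x(x, y, spatial_terms=None):
    """OLS coefficient on x, optionally adjusting for spatial regressors."""
    cols = [np.ones_like(x), x]
    if spatial_terms is not None:
        cols.append(spatial_terms)
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(2)
n = 400
lon = rng.uniform(-1.0, 1.0, size=n)
lat = rng.uniform(-1.0, 1.0, size=n)
# A smooth spatial surface affects the outcome, and the predictor x also
# carries spatial information, so an unadjusted fit is distorted.
surface = lon - lat + 3.0 * lon * lat
x = 2.0 * lon * lat + rng.normal(size=n)
y = 1.0 + 2.0 * x + surface + rng.normal(0.0, 0.5, size=n)  # true slope is 2

# Assumed distance-based matrix: exponential decay of Euclidean distance.
coords = np.column_stack([lon, lat])
dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
kernel = np.exp(-dist)

eig = leading_eigenvectors(kernel, k=6)
b_naive = slope_on_x(x, y)
b_coord = slope_on_x(x, y, coordinate_terms(lon, lat))
b_eig = slope_on_x(x, y, eig)
print("unadjusted:", b_naive, "coordinates:", b_coord, "eigenvectors:", b_eig)
```

In this simulation the spatial surface is an exact function of longitude, latitude, and their product, so the coordinate adjustment is well suited by construction; with real data, either method is an approximation whose quality depends on how many terms or eigenvectors are used.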
The work continues beyond this grant. With our new approach, we now can easily turn to multi-level generalized linear models with spatial autocorrelation. All of the pieces are now in place. It is important to emphasize again, however, that we have in important ways reformulated the manner in which the modeling is approached; we are no longer seeking the right model but rather a useful model.
Journal Articles:
No journal articles submitted with this report: View all 1 publications for this project
Supplemental Keywords:
modeling, analytical, statistical inference, external validity, hierarchical models, RFA, Economic, Social, & Behavioral Science Research Program, Ecosystem Protection/Environmental Exposure & Risk, Regional/Scaling, Environmental Statistics, data synthesis, regional environmental data, risk assessment, non-linear functional forms, ecosystem assessment, representativeness studies, multiple response variables, survey data, environmental risks, multilevel statistical model, hierarchical statistical inference, satellite data, modeling, external validity, statistical models, regional scale impacts, data analysis, spatial-temporal methods, spatial and temporal scales, representativeness, multiple response variable, data models, hierarchical statistical analysis, innovative statistical models, regional survey data, remotely sensed data, statistical methods
Progress and Final Reports:
Original Abstract
The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.