Grantee Research Project Results
Final Report: Addressing Temporal Correlation, Incomplete Source Profile Information, and Varying Source Profiles in the Source Apportionment of Particulate Matter
EPA Grant Number: R832160
Title: Addressing Temporal Correlation, Incomplete Source Profile Information, and Varying Source Profiles in the Source Apportionment of Particulate Matter
Investigators: Christensen, William F.; Reese, C. Shane
Institution: Brigham Young University
EPA Project Officer: Chung, Serena
Project Period: December 1, 2004 through November 30, 2007
Project Amount: $238,721
RFA: Source Apportionment of Particulate Matter (2004)
Research Category: Air Quality and Air Toxics, Particulate Matter, Air
Objective:
Most pollution source apportionment studies utilize ambient measurements that are gathered consecutively in time, so the measurements are temporally correlated. Nevertheless, most source apportionment (SA) approaches neither account for the impact of this correlation on statistical estimation and inference nor exploit the additional information available in correlated data. Additional complications arise in SA studies when only partial source profile information is available and when the source profiles evolve or vary over the measurement period. The proposed research has three objectives in addressing these issues:
- Address both the challenges and advantages presented by temporally correlated ambient data, and address the opportunity for improved source contribution estimates when the temporal resolution of ambient measures is improved.
- Develop the iterated confirmatory factor analysis (ICFA) approach, which can utilize partial source profile information and take on aspects of CMB analysis, confirmatory factor analysis (CFA), and exploratory factor analysis (EFA) by assigning varying degrees of constraint to each element of the estimated source profile matrix during the estimation process.
- Develop a Bayesian hierarchical model for source apportionment, and present an approach for evaluating not only the change in source contributions over time, but also the change in source profiles.
Summary/Accomplishments (Outputs/Outcomes):
We discuss our findings in three areas: ICFA (objective #2); Bayesian receptor modeling (objectives #1 and #3); and supplemental receptor modeling tools for SA of PM data. Throughout the discussion, we refer to the manuscripts that contain complete details on our work.
I. Iterated Confirmatory Factor Analysis (ICFA).
The ICFA approach has been developed and evaluated more thoroughly. In recent years, there has been increased interest in more flexible receptor modeling approaches which assume little knowledge about the nature of the pollution source profiles, but are still able to produce nonnegative and physically realistic estimates of pollution source contributions. Confirmatory factor analysis can yield a physically interpretable and uniquely estimable solution, but requires that at least some of the rows of the source profile matrix be known. In Christensen, Schauer, and Lingwall (2006), we discuss the iterated confirmatory factor analysis (ICFA) approach. ICFA can take on aspects of chemical mass balance analysis, exploratory factor analysis, and confirmatory factor analysis by assigning varying degrees of constraint to the elements of the source profile matrix when iteratively adapting the hypothesized profiles to conform to the data. Christensen, Schauer, and Lingwall (2006) also provides motivation for the Bayesian modeling carried out as part of this grant. ICFA is illustrated using PM2.5 data from Washington, D.C., and a simulation study illustrates the relative strengths of ICFA, chemical mass balance approaches, and positive matrix factorization (PMF).
II. Bayesian Receptor Modeling.
We have developed a Bayesian approach for multivariate receptor modeling. Approaches for receptor modeling are chosen based on the existence and quality of a priori source profile information. The simplest pollution source apportionment methods require profiles and are based on linear regression. In this context, the model is referred to as a chemical mass balance (CMB) model. At the other end of the spectrum, one might have no a priori information about source profiles. These situations require factor analysis methods to estimate both profiles and contributions, and the model is commonly referred to as a multivariate receptor model. Bayesian approaches can utilize prior distributions of varying degrees of vagueness/specificity on source profiles and source contribution amounts. This effectively allows Bayesian approaches to be employed at any point on the spectrum.
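To make the CMB end of the spectrum concrete, the following minimal sketch treats the profiles as known and estimates source contributions by weighted least squares. The profile values, species count, and function names are hypothetical and chosen only for illustration; the sketch does not enforce nonnegativity and is not the EPA-CMB8.2 implementation.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def cmb_contributions(profiles, ambient, sigma):
    """Weighted least squares estimate of source contributions.

    profiles[i][j]: mass fraction of species i in emissions from source j
    ambient[i]:     measured ambient concentration of species i
    sigma[i]:       measurement uncertainty of species i
    """
    n_species = len(ambient)
    n_sources = len(profiles[0])
    w = [1.0 / s ** 2 for s in sigma]
    # Normal equations: (F' W F) s = F' W c
    lhs = [[sum(w[i] * profiles[i][j] * profiles[i][k] for i in range(n_species))
            for k in range(n_sources)] for j in range(n_sources)]
    rhs = [sum(w[i] * profiles[i][j] * ambient[i] for i in range(n_species))
           for j in range(n_sources)]
    return solve_linear_system(lhs, rhs)


# Hypothetical profiles (rows: species, columns: sources) and noise-free data.
profiles = [[0.30, 0.02],
            [0.05, 0.40],
            [0.10, 0.10]]
true_contrib = [8.0, 3.0]  # ug/m3
ambient = [sum(f * s for f, s in zip(row, true_contrib)) for row in profiles]
estimate = cmb_contributions(profiles, ambient, sigma=[0.1, 0.1, 0.1])
```

With noise-free data the estimate recovers the contributions used to generate the measurements. With real data, plain least squares can return negative contributions, which is one reason constrained and Bayesian alternatives are of interest.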
During the first year of the project, we began by developing a Bayesian approach near the CMB end of the spectrum. This new approach is a Bayesian alternative to the “effective variance (EV) solution” proposed by Watson, Cooper, and Huntzicker (1984) and implemented in the widely used EPA-CMB8.2 software (EPA, 2004). There are several advantages to the new approach. First, while the EV solution is tied to the given profiles regardless of their appropriateness for the airshed of interest, the Bayesian approach allows the data to move the profile values away from the potentially erroneous initial values in a coherent manner. The Britt and Luecke (1973) solution is a generalization of the EV solution and allows for evolution of the profiles during the iterative estimation process, but has been shown to be highly unstable when even moderate degrees of profile uncertainty exist (Christensen and Gunst, 2004). A second advantage of the Bayesian approach over the EV approach is that the effective variance solution is based upon several simplistic assumptions that will not be valid in practice (Christensen and Gunst, 2004). In contrast, the Bayesian CMB is unrestricted by such assumptions and allows for the utilization of lognormal or other realistic (non-Gaussian) error distributions.
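For reference, the effective variance solution is commonly written as an iteratively reweighted least squares problem. The notation below is an assumption introduced here for illustration and does not appear in the report: c_i is the ambient concentration of species i, F_ij the fraction of species i in source j, s_j the contribution of source j, and the sigma terms their stated uncertainties.

```latex
% Mass balance model for I species and J sources
c_i = \sum_{j=1}^{J} F_{ij}\, s_j + e_i, \qquad i = 1, \dots, I

% Effective variance for species i at iteration k (diagonal of V_e^{(k)})
V_{e,i}^{(k)} = \sigma_{c_i}^{2} + \sum_{j=1}^{J} \bigl(s_j^{(k)}\bigr)^{2}\, \sigma_{F_{ij}}^{2}

% Weighted least squares update, repeated until the contributions stabilize
\mathbf{s}^{(k+1)} = \Bigl(F^{\top} \bigl(V_e^{(k)}\bigr)^{-1} F\Bigr)^{-1} F^{\top} \bigl(V_e^{(k)}\bigr)^{-1} \mathbf{c}
```

In this form the profile matrix F stays fixed throughout the iteration; profile uncertainty changes only the weights. This is the sense in which the solution remains tied to the given profiles.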
In Lingwall, Christensen, and Reese (2008), we propose a simple, fully Bayesian approach for multivariate receptor modeling that allows for flexible and consistent incorporation of a priori information. The model uses a generalization of the Dirichlet distribution as the prior distribution on source profiles that allows great flexibility in the specification of prior information. Heavy-tailed lognormal distributions are used as priors on source contributions to match the nature of particulate concentrations. A simulation study based on the Washington, DC airshed shows that the model compares favorably to Positive Matrix Factorization (PMF), a standard analysis approach used for pollution source apportionment.
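The prior structure described above can be sketched as follows. This is a minimal illustration that assumes the Connor-Mosimann form of the generalized Dirichlet distribution, sampled by stick-breaking; the exact generalization and the hyperparameter values used in the manuscript may differ, and every number below is hypothetical.

```python
import random


def sample_generalized_dirichlet(alphas, betas, rng):
    """Draw one composition from a Connor-Mosimann generalized Dirichlet.

    Uses stick-breaking: each Beta(alpha_k, beta_k) draw takes a share of
    whatever mass is left, and the final component receives the remainder.
    """
    remaining = 1.0
    profile = []
    for a, b in zip(alphas, betas):
        share = rng.betavariate(a, b)
        profile.append(remaining * share)
        remaining *= 1.0 - share
    profile.append(remaining)
    return profile


rng = random.Random(0)
# Hypothetical hyperparameters for a four-species source profile.
profile = sample_generalized_dirichlet([2.0, 1.0, 3.0], [5.0, 4.0, 2.0], rng)
# Lognormal prior draw for one day's source contribution (ug/m3).
contribution = rng.lognormvariate(1.0, 0.8)
```

Each sampled profile is nonnegative and sums to one, so every prior draw is a physically meaningful composition. The separate Beta parameters allow tighter prior information on some species than on others.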
The proposed Bayesian approach provides a useful alternative to other methods used in multivariate receptor modeling. The fully Bayesian approach is attractive because it easily incorporates a wide range of a priori information into analysis and gives full distributional results rather than just point estimates for source profiles and contributions. The novel use of heavy-tailed lognormal distributions for the source contributions and for the distribution of the particulate measurements is scientifically satisfying. The use of a Generalized Dirichlet distribution for source profiles allows for great flexibility in multivariate specification of prior information about emission sources while constraining the solution to be physically meaningful.
The Bayesian approach allows us to consistently incorporate the a priori information into an analysis rather than adjusting results after a model has been fit or introducing target transformations a posteriori. In simulation, the approach has been found to compete favorably with PMF. The full distributional results obtained from the Bayesian approach give the researcher a great deal of flexibility in addressing questions associated with potentially complex functions of estimated parameters. For example, Figure 1 shows the complete distribution for each day’s estimate of the secondary sulfate source. We can also answer complex questions of interest that are not easily addressed in a traditional pollution source apportionment (PSA) framework. For instance, one might be interested in the number of exceedance days for a specific source. Reducing the exceedance days for auto/diesel emissions may be a sub-goal related to the larger aim of reducing the number of PM2.5 threshold exceedance days. Let κ be the number of days (out of the total of 100) in which the auto/diesel source exceeds 10 μg/m3. Figure 2 gives the probability distribution for κ given the data. If we consider the posterior median as a point estimate for the auto/diesel source contribution, only three of the 100 study days have point estimates in excess of 10 μg/m3. But Figure 2 gives a more complete understanding of this variable: the expected number of auto/diesel exceedance days is roughly 3.3, and the probability that the number of auto/diesel exceedance days surpasses 4 days is roughly 16%.
Figure 1. Posterior distributions for daily secondary sulfate formation. Posterior medians are shown in black.
Figure 2. Probability distribution for the number of days (out of the total of 100) in which the auto/diesel source exceeds 10 μg/m3.
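The exceedance-day calculation can be sketched directly from posterior samples: count the exceedance days within each MCMC draw, then tabulate the counts across draws. The draws below are hypothetical and far fewer than a real analysis would use; the function name is introduced here for illustration only.

```python
from collections import Counter


def exceedance_distribution(draws, threshold):
    """Posterior distribution of the number of days above `threshold`.

    draws[m][t] is the sampled source contribution on day t in MCMC draw m.
    """
    counts = Counter(sum(1 for value in draw if value > threshold) for draw in draws)
    n = len(draws)
    return {k: counts[k] / n for k in sorted(counts)}


# Four hypothetical posterior draws over five days (ug/m3).
draws = [
    [12.0, 4.0, 9.0, 11.0, 3.0],   # 2 days above 10
    [9.5, 4.2, 8.0, 10.5, 2.9],    # 1 day above 10
    [11.0, 3.8, 10.2, 12.0, 3.1],  # 3 days above 10
    [10.1, 4.1, 9.9, 10.4, 3.0],   # 2 days above 10
]
dist = exceedance_distribution(draws, threshold=10.0)
expected_days = sum(k * p for k, p in dist.items())
prob_more_than_two = sum(p for k, p in dist.items() if k > 2)
```

Because the count is computed within each draw, the resulting distribution reflects the joint uncertainty across days. Counting how many daily point estimates exceed the threshold cannot provide this.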
In the final year of our grant period, we addressed the issue of temporally evolving source profiles in the context of the Bayesian multivariate receptor model. The majority of previous approaches to multivariate receptor modeling make the following two key assumptions: (1) measurements of pollution concentrations are independent and (2) pollution source profiles are constant through time. Despite these assumptions, the existence of temporal correlation among pollution concentrations and time-varying source profiles is commonly accepted. In Heaton, Reese, and Christensen (2008) an approach to multivariate receptor modeling is developed in which the temporal structure of pollution measurements is accounted for by modeling source profiles as a time-dependent Dirichlet process. The Dirichlet process (DP) pollution model is first evaluated using several simulated data sets and then applied to a physical data set of chemical species concentrations measured at the U.S. Environmental Protection Agency’s St. Louis-Midwest supersite. The DP model is found to be preferable to more traditional receptor models because the DP model requires fewer assumptions for its use, is physically justifiable, more accurately estimates model parameters, and is flexible enough to estimate complex quantities through the employment of Markov chain Monte Carlo parameter estimation techniques. While Heaton, Reese, and Christensen (2008) does not comprehensively compare the DP model to PMF, early simulations indicate distinct advantages to using the DP model over PMF. For example, the DP model performs at least as well as PMF under the assumption of constant source profiles, a key assumption for the use of PMF. The DP model also has the added flexibility of incorporating time-varying profiles and outperforms PMF when time-varying profiles are present. 
Thus, the DP model is preferred to PMF in that it requires fewer assumptions for its use, it often has better performance in simulations, and it facilitates distributional analysis of model parameters rather than mere point estimates.
An interesting feature of the DP model is the ability to identify seasonal changes in source profiles. For example, Figure 3 below displays the time plot of the six most prominent elements of the zinc smelter source. Notice that in Figure 3 the percentage of chlorine (Cl) changes from season to season. Chlorine seems to be more prevalent in the winter than in the summer. The average profile value for chlorine is 0.043 in summer and 0.082 in winter. Using the DP model, the average value of chlorine in the zinc smelter profile over time is 0.06, compared to 0.05 when using PMF. Thus, the PMF estimate of the chlorine element of the zinc smelter profile appears to be a seasonal average, while the DP model identifies seasonal trends. The seasonal variation in chlorine raises the question of whether the source profile as observed at the source physically changes, or whether the source profile is seasonally invariant and the observed decrease in chlorine results from the removal of chlorine by atmospheric processing during summer months. This seasonal phenomenon has posed a problem for traditional approaches to PSA. The analysis of the St. Louis data set by Lee, Hopke, and Turner (2006) removed chlorine from the data set to avoid this phenomenon while, as previously mentioned, Lingwall and Christensen (2007) found a yearly average when estimating a constant profile. The DP model is flexible enough to capture such important atmospheric phenomena as the seasonal variations in chlorine.
Figure 3. Time plot of the six largest elements of the zinc smelter profile as identified by the DP model. The dashed line represents the PMF estimate of the profile element.
III. Supplemental Receptor Modeling Tools for SA of PM Data.
Finally, we have developed several “supplementary” tools which have assisted us in our larger goals of developing a comprehensive approach for receptor modeling. Three supplemental issues have been explored during the course of our research: (IIIa) the optimal use and limitations of positive matrix factorization (PMF), (IIIb) the clustering of profile vectors for source identification, and (IIIc) source identification tools.
IIIa. The Optimal Use and Limitations of Positive Matrix Factorization. The basic PMF approach of Paatero and Tapper (1994) has proven to be a seminal tool in source apportionment research. Implemented in PMF2 (Paatero, 1998) and more recently in EPA-PMF1.1 (Eberly, 2005), the PMF algorithm has become widely used. In order to facilitate better comparison of competing approaches, a great deal of effort was spent in assessing and optimizing the performance of PMF using synthetic (but reasonably realistic) data. Specifically, because our research focuses on the incorporation of a priori information via a Bayesian hierarchical model, it became necessary to fully appreciate the ability of the current methods for incorporating such information. In Lingwall and Christensen (2007), the performance of profile element pulling (or “Fkeying”) and source profile targeting (or “Gkeying”) is considered and the potential of improving source contribution estimates is discussed. Recommendations to users of PMF are made and an illustration of PMF using St. Louis Supersite data (May 2001-May 2003) is presented. The use of source profile targeting shows much promise, both for incorporating well-established knowledge about pollution sources and as a tool for incremental exploratory analysis of the data.
Christensen and Schauer (2008a) consider the impact of species uncertainty on the solution stability of positive matrix factorization. Statistical measures for evaluating the similarity of different source apportionment solutions are proposed. The sensitivity of positive matrix factorization (PMF) to small perturbations in species measurement uncertainty estimates is examined. When considering each of PMF's source contribution estimates averaged across days, the effect of perturbations in the uncertainties is very small. However, depending on the pollution source type, the daily source contribution estimates can be surprisingly unstable when subjected to the small perturbations considered here. The stability of source profile estimates in our simulation varies greatly between sources. These findings confirm the notion that source apportionment results should be interpreted with caution. The process used for evaluation is a tool that may be used to assess the stability of solutions in source apportionment studies.
In related work, Christensen and Schauer (2008b) consider the least squares regression concept of influence as applied to the measured species in pollution source apportionment studies. We propose a new, iterative method for specifying the relative influence of groups of species in positive matrix factorization (PMF) and we evaluate the relative influence of elements and speciated organic compounds on source apportionment solutions. In a sample data set containing elements, speciated organic compounds, organic carbon, elemental carbon, and secondary inorganic ions measured at the St. Louis-Midwest Supersite, a subset of 40 elements and ions has roughly 28 times the influence of the subset of 38 organic species. By manipulating the collective influence of elements and organic species in a comprehensive data set, one can mimic an “elements-only analysis,” an “organics-only analysis,” or any hybrid of these two extremes. The up- or down-weighting of species influence can be used to explore the different types of sources that can be resolved from a large data set.
IIIb. Clustering Profile Vectors for Source Identification. This tool relates to the differentiability of pollution source profiles. This problem manifests itself in multicollinearity problems in regression-like methods such as the EV and Bayesian CMB approaches. In factor analysis problems, the issue is manifested in the difficulty of resolving closely related profiles (e.g., gasoline and diesel emissions). Dillner, Schauer, Christensen, and Cass (2005) use cluster analysis to group particle size distribution vectors and then use these clusters to identify pollution sources. We extend the work of Dillner, et al. (2005) to incorporate profile uncertainty vectors in a cluster analysis. In Christensen, Dillner, Schauer, and Reese (2007), it is noted that when the profile uncertainty vectors associated with each profile vector are available, one almost always benefits from using a newly proposed modified Mahalanobis distance metric instead of the standard Euclidean distance. Although the illustration in the manuscript discusses the clustering of particle size distribution vectors, we discuss the natural application to the clustering of source profile vectors in traditional source apportionment settings.
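The idea of an uncertainty-aware distance can be illustrated with a simple sketch. The exact form of the modified Mahalanobis metric is given in the manuscript and is not reproduced in this report; the version below, which scales each squared difference by the sum of the two reported variances, is an assumption made only to show why uncertainty information changes which vectors look similar.

```python
import math


def euclidean_distance(x, y):
    """Standard Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def uncertainty_scaled_distance(x, y, sx, sy):
    """Distance with each component scaled by its combined uncertainty.

    sx and sy hold the standard deviations reported for x and y.
    Assumed form (independent components); not taken from the manuscript.
    """
    return math.sqrt(sum((a - b) ** 2 / (u ** 2 + v ** 2)
                         for a, b, u, v in zip(x, y, sx, sy)))


# Hypothetical two-component profiles: the first component is poorly
# measured, the second is measured precisely.
x, sx = [0.50, 0.10], [0.10, 0.005]
y, sy = [0.40, 0.12], [0.10, 0.005]
d_euclid = euclidean_distance(x, y)
d_scaled = uncertainty_scaled_distance(x, y, sx, sy)
```

In this example the Euclidean distance is driven by the poorly measured first component. The scaled distance is instead dominated by the small but well-resolved difference in the second component.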
IIIc. Source Identification Tools. An important precursor to conducting a pollution source apportionment analysis is the identification of potential sources. A priori information about potential sources is particularly important for the Bayesian source apportionment approaches being developed as a part of this grant work. Some approaches such as PMF (Paatero and Tapper, 1994; Paatero, 1998) can be used in a purely exploratory fashion and do not require such a priori information. However, Lingwall and Christensen (2007) and Christensen, Schauer, and Lingwall (2006) indicate that with approaches like PMF and Iterated Confirmatory Factor Analysis, proper utilization of a priori information can substantially improve the accuracy of contribution and profile estimates.
In research conducted by Basil Williams (undergraduate) and Drs. Christensen and Reese, traditional approaches for synthesizing meteorological data with ambient pollutant measurements are being reconsidered and expanded. The figure below represents new approaches for identifying zinc sources near the St. Louis Supersite. The plot on the left shows a “weighted rose” diagram with petals representing the abundance of zinc associated with days in which the daily vector mean wind direction is in the given angle class. The longest petal is centered at 210 degrees and has a length of roughly 4.5. This means that when winds are coming from 210 degrees, zinc levels tend to be 4.5 times as high as the overall average zinc level. The plot on the right uses AERMOD dispersion modeling software to predict concentration levels from any site within a 20 km by 20 km grid. Correlating the daily predictions of concentration levels based on hourly meteorological data with the actual time series for a pollutant such as lead can be useful in identifying important sites. The plot shows a major source in the area of a known zinc smelter and also gives evidence of a smaller source in the direction of a known steel production facility at 10 degrees. These are the two largest area sources according to the local Toxic Release Inventory.
Figure 3. New source identification methods under development
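The petal lengths of a weighted rose, as described above, can be computed as the mean concentration within each wind-direction class divided by the overall mean. The sketch below assumes 30-degree classes centered on multiples of 30 and uses hypothetical daily data; the class width used in the actual analysis is not stated in this report.

```python
def weighted_rose(directions, concentrations, bin_width=30):
    """Relative abundance per wind-direction class.

    Returns {class center in degrees: mean concentration in class / overall mean}.
    """
    overall_mean = sum(concentrations) / len(concentrations)
    sums, counts = {}, {}
    for deg, conc in zip(directions, concentrations):
        # Classes are centered on multiples of bin_width (0, 30, ..., 330).
        center = int((deg + bin_width / 2) // bin_width) * bin_width % 360
        sums[center] = sums.get(center, 0.0) + conc
        counts[center] = counts.get(center, 0) + 1
    return {c: (sums[c] / counts[c]) / overall_mean for c in sorted(sums)}


# Hypothetical daily vector-mean wind directions (degrees) and zinc levels.
directions = [205, 215, 20, 100, 350]
zinc = [9.0, 9.0, 1.0, 0.5, 0.5]
rose = weighted_rose(directions, zinc)
```

A petal length above 1 marks a direction from which concentrations are higher than average. In this toy data set the 210-degree class has a petal length of 2.25.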
In Williams, Christensen, and Reese (2008), we develop a method for identifying pollution source directions using Bayesian regression and an assimilation of deterministic and stochastic models. This is an important part of the pollution source apportionment (PSA) problem, which entails identifying and describing pollution sources and their contributions. The interpretation of PSA frequently requires the identification of source directions, often as a post-analysis check to ensure that the contribution estimates are reasonable. Although other simple approaches have been developed for source direction identification, this is the first that develops a statistically rigorous approach for the estimation of direction uncertainty. MCMC is used to evaluate the complex relationship among observed pollutant concentrations, available meteorological information, and unknown source direction parameters. The method is flexible enough to identify multiple source directions for cases in which a species or source type of interest is emitted at more than one location, and Reversible Jump MCMC is used to evaluate the appropriate number of sources. Finally, a deterministic dispersion model is incorporated into the statistical model and evaluated at each iteration of the MCMC to more accurately describe the dispersion process.
Five pollutants in the St. Louis area were analyzed using a variety of methods in order to identify the direction of the respective dominant pollution sources. A model based on the kernel of the von Mises density function was used to regress concentration on wind direction and wind speed. The simplicity of the model allows for ease of interpretation, relatively fast computation of the posterior distributions, and the easy insertion of additional sources into the model. In most cases, the von Mises-based model succeeds at generating credible intervals containing the “true” source direction, or clusters of “true” source directions, identified in the Toxic Release Inventory. We also used a model based on the EPA-endorsed computational dispersion model AERMOD, which, despite requiring considerably more computational resources, allows a much more sophisticated, phenomenologically justifiable model. The AERMOD-based model incorporates significantly more meteorological data than the von Mises-based model allows, so the credible intervals of the estimated source direction are much more precise. We also used Reversible Jump MCMC to choose the appropriate number of sources to include in the model of each element. In most cases, the data supported the use of the two-source model. Figure 4 illustrates the strengths of the approach, showing the estimated direction and a 95% credible interval for the predominant copper source direction.
Figure 4. Estimated model for copper concentration as a function of wind direction, using the medians of the parameter posterior distributions. The solid line represents the direction of the primary source of copper in the area (a local copper smelter) and the dashed lines represent the 95% credible interval for the direction of the primary copper source.
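The shape of a von Mises-based mean function can be sketched as follows. This is a simplified, single-source illustration that depends on wind direction only; the manuscript's model also includes wind speed and is fit by MCMC, neither of which is shown here. The parameter names and values are hypothetical, and the kernel is rescaled so that it equals 1 at the source direction.

```python
import math


def von_mises_kernel_mean(wind_dir_deg, source_dir_deg, baseline, amplitude, kappa):
    """Expected concentration as a function of wind direction.

    The kernel exp(kappa * (cos(theta - mu) - 1)) equals 1 when the wind blows
    from the source direction mu and decays as the wind turns away. The factor
    exp(-kappa) relative to the usual von Mises kernel is absorbed in `amplitude`.
    """
    delta = math.radians(wind_dir_deg - source_dir_deg)
    return baseline + amplitude * math.exp(kappa * (math.cos(delta) - 1.0))


# Hypothetical parameters: source at 210 degrees, background 1, peak excess 4.
peak = von_mises_kernel_mean(210, 210, baseline=1.0, amplitude=4.0, kappa=3.0)
opposite = von_mises_kernel_mean(30, 210, baseline=1.0, amplitude=4.0, kappa=3.0)
```

Larger values of kappa give a narrower peak around the source direction. A second source can be represented by adding another kernel term with its own direction, amplitude, and concentration parameter.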
Conclusions:
From this funded research, we have gained additional insight about the complex interactions between a priori subject matter knowledge, partial profile information, and temporally correlated air quality data. We have proposed and evaluated new approaches for optimizing existing source apportionment software and developed and illustrated new Bayesian approaches for receptor modeling of PM data. Throughout our research, we have emphasized the importance of accounting for all sources of variability via a comprehensive source apportionment paradigm. We also emphasize the challenge of using source apportionment output to answer real and often very complex questions. Instead of merely giving policy makers a daily estimate of a particular source contribution, we have focused on developing approaches that are flexible enough to answer complex questions like “what is the probability that combined industrial source emissions will exceed 5 μg/m3 at least 10 days each year?” Initial findings indicate that using Bayesian hierarchical modeling to integrate air quality data, expert opinion, and partial source profile information will provide more accurate and useful information for policy makers seeking to effectively promote public health. The approaches and results from our study have been carefully documented in published (or publicly available) manuscripts.
Journal Articles on this Report (8):
- Christensen WF, Schauer JJ, Lingwall JW. Iterated confirmatory factor analysis for pollution source apportionment. Environmetrics 2006;17(6):663-681.
- Christensen WF, Dillner AM, Schauer JJ, Reese CS. Clustering composition vectors using uncertainty information. Environmetrics 2007;18(8):859-869.
- Christensen WF, Schauer JJ. Impact of species uncertainty perturbation on the solution stability of positive matrix factorization of atmospheric particulate matter data. Environmental Science & Technology 2008;42(16):6015-6021.
- Christensen WF, Schauer JJ. Quantifying and manipulating species influence in positive matrix factorization. Chemometrics and Intelligent Laboratory Systems 2008;94(2):140-148.
- Heaton MJ, Reese CS, Christensen WF. Incorporating time-dependent source profiles using the Dirichlet distribution in multivariate receptor models. Technometrics 2010;52(1):67-79.
- Lingwall JW, Christensen WF. Pollution source apportionment using a priori information and positive matrix factorization. Chemometrics and Intelligent Laboratory Systems 2007;87(2):281-294.
- Lingwall JW, Christensen WF, Reese CS. Dirichlet based Bayesian multivariate receptor modeling. Environmetrics 2008;19(6):618-629.
- Williams B, Christensen WF, Reese CS. Pollution source direction identification: embedding dispersion models to solve an inverse problem. Environmetrics 2011;22(8):962-974.
Supplemental Keywords:
receptor model, chemical mass balance model, Bayesian analysis, statistics, modeling, decision making, air quality models, Air, Ecosystem Protection/Environmental Exposure & Risk, RFA, Scientific Discipline, Air Quality, Atmospheric Sciences, Environmental Chemistry, Environmental Engineering, Environmental Monitoring, Monitoring/Modeling, particulate matter, Bayesian hierarchical model, aerosol analyzers, air quality model, air sampling, airborne particulate matter, analytical chemistry, area of influence analysis, atmospheric chemistry, atmospheric dispersion models, atmospheric measurements, chemical characteristics, chemical speciation sampling, emissions monitoring, environmental measurement, iterated confirmatory factor analysis, model-based analysis, modeling studies, particle size measurement, particulate matter mass, particulate organic carbon, real-time monitoring, source apportionment, source receptor based methods, speciation
Progress and Final Reports:
Original Abstract
The perspectives, information and conclusions conveyed in research project abstracts, progress reports, final reports, journal abstracts and journal publications convey the viewpoints of the principal investigator and may not represent the views and policies of ORD and EPA. Conclusions drawn by the principal investigators have not been reviewed by the Agency.