2011 Progress Report: Improving Particulate Matter Source Apportionment for Health Studies: A Trained Receptor Modeling Approach with Sensitivity, Uncertainty and Spatial Analyses
EPA Grant Number: R833866
Title: Improving Particulate Matter Source Apportionment for Health Studies: A Trained Receptor Modeling Approach with Sensitivity, Uncertainty and Spatial Analyses
Investigators: Russell, Armistead G. , Klein, Mitchel , Mulholland, James , Sarnat, Stefanie Ebelt , Sarnat, Jeremy , Tolbert, Paige
Current Investigators: Russell, Armistead G. , Klein, Mitchel , Marmur, Amit , Mulholland, James , Sarnat, Stefanie Ebelt , Sarnat, Jeremy , Tolbert, Paige
Institution: Georgia Institute of Technology , Emory University
EPA Project Officer: Ilacqua, Vito
Project Period: December 1, 2008 through November 30, 2012 (Extended to November 30, 2013)
Project Period Covered by this Report: December 1, 2010 through November 30, 2011
Project Amount: $899,956
RFA: Innovative Approaches to Particulate Matter Health, Composition, and Source Questions (2007)
Research Category: Health Effects , Particulate Matter , Air
As discussed in detail in the 2010 progress report, the main objectives of this research are to test four hypotheses derived from ongoing source apportionment (SA)-based epidemiologic and air quality modeling studies:
- A receptor-based approach, trained using an ensemble of model results (including receptor and emissions-based models), can be developed that neither introduces excessive day-to-day variability nor inhibits an appropriate level of it.
- The method can be applied to long-term data sets for use in acute health effect studies.
- The method can be used to temporally interpolate between observations (e.g., for data available every third day) and spatially interpolate between urban and rural monitors.
- Uncertainties can be propagated from SA model inputs to health analysis outputs, with outputs most sensitive to source profile inputs.
To test the hypotheses, a three-step chemical mass balance (CMB) approach has been developed for particulate matter (PM) SA that utilizes an ensemble of both source- and receptor-based approaches to train a CMB method for use in longer term applications. These three steps include:
- Averaging SA results, using weights based on method uncertainty, from four receptor models and one chemical transport model, the Community Multiscale Air Quality (CMAQ) model, to develop ensemble-based source impacts.
- Using the weighted source impacts (from Step 1) in an application of CMB with the Lipschitz Global Optimizer (CMB-LGO) to calculate nine ensemble-based source profiles (EBSPs); the source profiles developed include gasoline vehicles (GV), diesel vehicles (DV), dust (DUST), biomass burning (BURN), coal combustion (COAL), secondary organic carbon (SOC), SULFATE, NITRATE, and AMMONIUM.
- Using the EBSPs on a longer term data set of observations to develop improved source impacts.
As detailed in the 2010 progress report, we have focused on using Step 1 of the ensemble method to gain new insights into the uncertainties of ensemble results and of source apportionment methods. Uncertainty is one of the least understood aspects of source apportionment: uncertainties in daily source impacts and overall method uncertainties have not been well characterized. Furthermore, uncertainties are often estimated using different approaches, intrinsic to each SA method, which makes inter-comparison of SA method uncertainties difficult. In 2011, we refined the method developed in 2010 by making two major changes to Step 1. The first is that the ensemble is now conducted without CMB-RG because this method and CMB-LGO are highly correlated. The new ensemble comprises four methods (CMB-LGO, PMF, CMB-MM and CMAQ). Because most CMB analyses do not use CMB-LGO, we performed a sensitivity analysis using CMB-RG in lieu of CMB-LGO. The second major change is that we included a mixed weighting case in which the initial ensemble uses equal weighting but the updated ensemble uses the new SA method uncertainties, as explained in detail below. We also validated the ensemble results by comparing them with SOC estimates from an independent method and determined that mixed weighting is the most appropriate way to conduct the ensemble.
We have performed the ensemble method for July 2001, to represent summer, and January 2002, to represent winter, in a manner similar to Lee, et al. (2009). Three features of this work distinguish it from that of Lee, et al. (2009). First, we performed source apportionment using CMB-RG, CMB-LGO, and PMF on a data set for the Jefferson St. (JST) SEARCH site in Atlanta, GA from January 1, 1999 through December 31, 2004. Missing data were treated in the same manner as in Marmur, et al. (2005). We excluded several fitting species (Al, As, Ba, Sb, Sn, and Ti) because they were below the detection limit on the vast majority of days. Second, we focused on ensemble averaging using no weights (N = 0) and inverse uncertainty squared weights (N = 2), where the weights are 1/σ^N and σ is the daily source impact uncertainty; Lee, et al. (2009) focused on weights using 1/σ. Finally, ensemble average uncertainties take into account the covariance of source impacts from the five SA methods.
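The 1/σ^N weighting described above can be illustrated with a minimal sketch for a single source on a single day. The function name `ensemble_average` and the example numbers are hypothetical and are not taken from the project; the sketch covers only the weighted mean and omits the covariance terms used for the ensemble uncertainty.

```python
def ensemble_average(impacts, sigmas, n=2):
    """Weighted mean of one day's source impact estimates from several SA methods.

    impacts: per-method source impact estimates for one source and one day
    sigmas:  per-method daily uncertainties (same units as impacts, all > 0)
    n:       weighting exponent; n=0 gives equal weights, n=2 gives 1/sigma^2
    """
    if not impacts or len(impacts) != len(sigmas):
        raise ValueError("impacts and sigmas must be non-empty and equally long")
    if any(s <= 0 for s in sigmas):
        raise ValueError("uncertainties must be positive")
    weights = [1.0 / s ** n for s in sigmas]  # 1/sigma^N weights
    return sum(w * x for w, x in zip(weights, impacts)) / sum(weights)


# Illustrative values only (not project data).
equal = ensemble_average([1.0, 2.0, 3.0], [0.5, 1.0, 2.0], n=0)
inverse_square = ensemble_average([1.0, 2.0, 3.0], [0.5, 1.0, 2.0], n=2)
```

With N = 2, the estimate from the method with the smallest uncertainty dominates the average, whereas N = 0 reduces to a simple mean.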
We have developed a two-step method for determining source impact uncertainties. First, we average the five individual SA methods and determine uncertainties of the ensemble by using propagation of errors. Next, we estimate an updated uncertainty for each SA method to be equal to the root mean square error (RMSE) between that SA method and the ensemble. Subsequently, we estimate an updated uncertainty for the ensemble using these new SA uncertainties by propagation of errors. Three cases of weighting were examined: an equal weighting case (i.e., N = 0), an inverse square weighting case (N = 2) and a mixed case. In the mixed case, we estimate the initial ensemble average using equal weighting. In the second step, however, the updated ensemble average uses inverse square weighting, with the RMSE between each SA method and the ensemble serving as that method's uncertainty.
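A minimal sketch of the mixed weighting case for one source, under simplifying assumptions, might look as follows. The function name `mixed_weighting_ensemble` and the input layout (a dict of per-method daily impacts) are hypothetical, the RMSE is computed over all supplied days, and the propagation-of-errors step for the ensemble uncertainty is not shown.

```python
import math


def mixed_weighting_ensemble(method_impacts):
    """Two-step ensemble for one source.

    method_impacts: dict mapping SA method name -> list of daily source impacts
                    (all lists cover the same days, in the same order).
    Returns (updated_ensemble, rmse_by_method).
    """
    names = list(method_impacts)
    n_days = len(method_impacts[names[0]])

    # Step 1: initial ensemble with equal weighting (N = 0).
    initial = [
        sum(method_impacts[m][d] for m in names) / len(names)
        for d in range(n_days)
    ]

    # Updated method uncertainty: RMSE between each method and the initial ensemble.
    rmse = {}
    for m in names:
        squared = [(method_impacts[m][d] - initial[d]) ** 2 for d in range(n_days)]
        rmse[m] = math.sqrt(sum(squared) / n_days)
        if rmse[m] == 0.0:
            raise ValueError(f"method {m!r} matches the ensemble exactly; weight undefined")

    # Step 2: updated ensemble with inverse-square weighting (N = 2) using the RMSEs.
    weights = [1.0 / rmse[m] ** 2 for m in names]
    updated = [
        sum(w * method_impacts[m][d] for w, m in zip(weights, names)) / sum(weights)
        for d in range(n_days)
    ]
    return updated, rmse
```

Because each method's RMSE is a single number, every day receives the same set of weights, which is the constant absolute uncertainty behavior described in the next paragraph.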
One major consequence of setting the updated source impact uncertainties to the RMSE for the five individual SA methods is that the updated daily uncertainty for each source and method is the same regardless of the magnitude of the source impact. Thus, whereas traditional SA results often have daily relative uncertainties that are constant, our work results in constant daily absolute uncertainties. We calculate updated uncertainties this way because squared errors between each individual method and the ensemble do not, in general, correlate well with source impact, based on linear regression results.
Ensemble averaging reduces the number of zero impact days, provides results for every day of the data set, and reduces variability by averaging out excessively high and low source impact days. The ensemble avoids performing poorly for any particular source, a major limitation of traditional SA methods. For both seasons and all weighting cases, the ensemble has the lowest estimated relative uncertainty when averaged across all sources (i.e., the average of the overall relative uncertainties for each source). The choice of weighting does not significantly change the overall relative uncertainties (taken here to mean the root mean square average of daily source impact uncertainties divided by average source impacts) in the ensemble averages for primary sources and SOC.
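The definition of overall relative uncertainty given above can be written as a short calculation. This is a sketch only; the function name `overall_relative_uncertainty` and the example values are hypothetical.

```python
import math


def overall_relative_uncertainty(daily_sigmas, daily_impacts):
    """Root mean square of daily uncertainties divided by the mean daily impact."""
    if not daily_sigmas or not daily_impacts:
        raise ValueError("inputs must be non-empty")
    rms_sigma = math.sqrt(sum(s * s for s in daily_sigmas) / len(daily_sigmas))
    mean_impact = sum(daily_impacts) / len(daily_impacts)
    return rms_sigma / mean_impact


# Illustrative values only: a constant absolute uncertainty of 0.5 on each of four
# days with a mean impact of 1.0 gives an overall relative uncertainty of 0.5 (50%).
example = overall_relative_uncertainty([0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 2.0, 0.0])
```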
In summer, the ensemble, using inverse square weighting, has the lowest overall relative uncertainties (i.e., RMSE divided by average source impact) for BURN (49%), COAL (45%), and SOC (42%) and has the second lowest overall relative uncertainties for GV (77%), DV (36%) and DUST (62%). With equal weighting, the ensemble has the lowest overall relative uncertainties for DV (38%), DUST (48%) and BURN (35%), and has the second lowest uncertainties for GV (65%), COAL (39%) and SOC (40%). With mixed weighting, the ensemble has the lowest overall relative uncertainties for DV (36%), DUST (55%), BURN (33%), and SOC (29%). CMB-LGO has the lowest overall relative uncertainty for GV, and CMAQ has the lowest for COAL. The ensemble overall relative uncertainties in winter generally are higher than in summer. Also, source impacts in winter vary more between methods than in summer, leading to greater RMSEs between the SA methods and the ensemble.
We compared our SOC estimates with those of Pachon, et al. (2010), who compared the regression method, the EC Tracer Method, CMB-RG and PMF for estimating SOC at JST. They found that both CMB-RG and PMF have high overall uncertainties that ranged from 47% to 56% for CMB-RG and 59% to 120% for PMF in summer and winter, respectively. The regression method estimated SOC to be 1.68 ± 0.14 μg m⁻³ and 0.80 ± 0.11 μg m⁻³ in July 2001 and January 2002, respectively, and had the lowest overall relative uncertainty. Our results for July 2001 were comparable to the regression method’s average impact and overall uncertainty, but our estimated uncertainties are higher for January 2002. The correlation of the ensemble-based SOC with the regression-based SOC is very encouraging because the regression method includes ozone concentrations, which are not used in any of the receptor models included in the ensemble.
We used the ensemble results to determine new source profiles, one set representative of summer and one representative of winter. To determine the source profiles, we ran CMB-LGO in “reverse,” where the source impacts were treated as the known quantity (i.e., the ensemble averages) and the source profiles were treated as the unknown. We then ran CMB-LGO for a data set at JST from 8/31/98–12/31/07 using these new EBSPs and using measurement-based source profiles (MBSPs) (Marmur, et al., 2005). The long-term source impacts obtained with these two sets of source profiles are highly correlated for all sources except biomass burning and, to a lesser extent, SOC. In addition, using EBSPs resulted in an approximately 20% reduction of the chi-squared statistic. Zero impact days for diesel vehicles also were reduced using EBSPs. However, zero impact days for SOC increased using EBSPs, although the majority of these zero impact days were in winter (October–March).
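The idea of running a mass balance in "reverse" can be sketched for a single species. Assuming the usual CMB relation, in which the concentration of a species on day t is the sum over sources of (profile fraction × source impact), the known ensemble impacts over several days give an overdetermined linear system for the profile fractions. The sketch below solves it by ordinary least squares through the normal equations. This is a stand-in for illustration: it is not the Lipschitz Global Optimizer used by CMB-LGO, it applies no measurement-uncertainty weighting and no non-negativity constraint, and the names `solve_linear` and `fit_profile_row` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: source impacts are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_profile_row(impacts_by_day, conc_by_day):
    """Least-squares profile fractions of one species across J sources.

    impacts_by_day: K rows (days), each holding J known source impacts
    conc_by_day:    K observed concentrations of the species
    """
    j = len(impacts_by_day[0])
    # Normal equations: (S^T S) f = S^T c
    sts = [[sum(row[p] * row[q] for row in impacts_by_day) for q in range(j)]
           for p in range(j)]
    stc = [sum(row[p] * c for row, c in zip(impacts_by_day, conc_by_day))
           for p in range(j)]
    return solve_linear(sts, stc)
```

Repeating the fit for every fitting species would yield one full profile per source; a production method would also need to keep the fractions physically meaningful (non-negative and summing to no more than one).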
Our health research partners at Emory University used these SA results in their health models. Preliminary results for Atlanta indicate that health impacts do not change significantly when using SA results derived from EBSPs.
The ensemble method then was applied to St. Louis Supersite (STL-SS) data. In this case, four receptor models and CMAQ were used to quantify the sources of PM2.5 impacting the STL-SS between June 2001 and May 2003. The receptor models utilized two independent datasets, one that included ions and trace elements and a second that incorporated 1-in-6 day organic molecular marker data. The ensemble method offered several improvements over the five individual SA techniques. Primarily, the ensemble method calculated source impacts on days when individual models either did not converge to a solution or did not have adequate input data to develop source impact estimates. Additionally, the ensemble method resulted in fewer days on which major emissions sources (e.g., secondary organic carbon and diesel vehicles) were estimated to have either a zero or negative impact on PM2.5 concentrations at the STL-SS. When compared with a traditional CMB approach using MBSPs, the ensemble method was associated with better fit statistics, including reduced chi-squared values and improved PM2.5 mass reconstruction.
The main driver for such similar results is that the new source profiles are very similar to the MBSPs. This is due in part to the small number of ensemble days, which is limited by the number of days for which CMAQ and CMB-MM results are available. To address this, we are developing a Bayesian approach to ensemble averaging. In the previous approach, the uncertainty of each method was treated as a constant (i.e., the RMSE). In the Bayesian approach, we treat each method’s uncertainty as itself having uncertainty. That is, the RMSE represents the average uncertainty, which can take on values from a distribution formulated in a Bayesian context. Using this approach, method uncertainties can be sampled N times from the formulated distribution. This gives N sets of uncertainties that can be used for N realizations of ensemble results for any given day. Thus, if there are K days of method results, then there can be K*N ensemble results. There are two major consequences of this. First, ensemble results are more variable than those from ensemble averaging using the RMSEs, because the Bayesian formulation results in different weights for each ensemble average. Second, having K*N ensemble averages results in a distribution of K*N source profiles. For each day in the long-term time series, M source profiles can be chosen from the distribution of K*N source profiles, resulting in a distribution of M source impacts for each day. The Bayesian formulation of SA method uncertainties, with subsequent random sampling from distributions, obviates the need for propagation of errors in estimating uncertainties. Further, random sampling and multiple realizations of ensemble averages, source profiles and final source impacts result in distributions that automatically propagate uncertainties.
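The sampling step of the Bayesian approach can be sketched as follows. The report does not specify the distribution of method uncertainties, so this sketch assumes, purely for illustration, a lognormal distribution whose mean equals each method's RMSE, and it assumes inverse-square weighting for each realization. The function name `sample_ensembles` and the `spread` parameter are hypothetical.

```python
import math
import random


def sample_ensembles(method_impacts_by_day, rmse_by_method, n_draws, spread=0.3, seed=0):
    """Return K*N ensemble averages for one source.

    method_impacts_by_day: K rows (days), each with one impact per SA method
    rmse_by_method:        RMSE-based uncertainty of each SA method
    n_draws:               N, the number of uncertainty samples per day
    spread:                log-scale standard deviation (assumed, not from the report)
    """
    rng = random.Random(seed)
    results = []
    for day in method_impacts_by_day:
        for _ in range(n_draws):
            # Draw each method's uncertainty from a lognormal with mean equal to its RMSE.
            sigmas = [
                rng.lognormvariate(math.log(r) - 0.5 * spread ** 2, spread)
                for r in rmse_by_method
            ]
            weights = [1.0 / s ** 2 for s in sigmas]
            results.append(sum(w * x for w, x in zip(weights, day)) / sum(weights))
    return results
```

Each day contributes N different ensemble averages because the weights change with every draw, which is the source of the added variability noted above.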
We will apply the entire ensemble method to data sets from monitoring stations located in Yorkville, GA, and South Dekalb (Atlanta, GA) to assess regional differences and demonstrate applicability to other locations. Finally, we will use geospatial techniques to develop representative source impacts that can be used in epidemiologic modeling for metropolitan Atlanta, GA.
In addition, we will focus on assessing variability, applying this method to other locations, and using it in epidemiologic modeling. Using a central-difference metric as a measure of variability, we will compare SA results obtained with EBSPs and MBSPs over the 9.5-year data set. We also will conduct time series filtering using a Fourier transform method to better understand variability. We will apply the procedure to a simulated JST data set that mimics other data sets, which typically have speciated PM2.5 data only every 3 or 6 days, and develop a method to interpolate data for days without measurements.
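The report does not define the central-difference metric. One possible form, shown here only as an assumption, is the mean absolute central difference of a daily source impact series; the function name `central_difference_variability` is hypothetical.

```python
def central_difference_variability(series):
    """Mean of |x[t+1] - x[t-1]| / 2 over the interior days of a daily series."""
    if len(series) < 3:
        raise ValueError("need at least three consecutive daily values")
    diffs = [abs(series[t + 1] - series[t - 1]) / 2.0 for t in range(1, len(series) - 1)]
    return sum(diffs) / len(diffs)
```

Under this definition a constant series scores zero, and larger values indicate greater day-to-day change, so the metric could be compared between source impacts derived from EBSPs and from MBSPs.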